|
Hi,
we agreed that we would want to have a test set for 4.6 but so far we haven't made any progress on it (as far as I know). I want to try to get this work started by posting here a list of questions I have about the new test set. Please add your own questions and answer any questions you can (no need to try to answer all questions).
- Why do the current tests fail? Is it only because of different floating point rounding or are there other problems? What's the best procedure to find out why a test fails?
- Which tests should be part of the new test set?
- Should the current tests all be part of the new test set?
- How should the new test be implemented? Should the comparison with the reference value be done in C (within mdrun), ctest script, python or perl?
- Should the new test execute mdrun for each test? Or should we somehow (e.g. python wrapper or within mdrun) load the binary only once and run many tests per execution?
- What are the requirements for the new test set? E.g. how easy should it be to see what's wrong when a test fails? Should the test support being run under valgrind? Other?
- Do we have any other bugs which have to be solved before the test can be implemented? E.g. is the problem with shared libraries solved? Are there any open redmine issues related to the new test set?
- Should we have a policy that everyone who adds a feature also has to provide tests covering those features?
- Should we have a conference call to discuss the test set? If yes when?
- Should we agree that we won't release 4.6 without the test set to give it a high priority?
|
|
Hi, all-
My opinion:

I think there should probably be two classes of sets -- fast fully automated sets, and more sophisticated content validation sets.

For the fast fully automated test, I would suggest:
-testing a large range of input .mdps, tops and gros for whether they run through grompp and mdrun. Not checking whether the output is correct or not, because that is very hard to automate -- just testing whether it runs for, say, 20-100 steps or so.
-testing input files for backwards compatibility.
-testing whether tools run on typical data sets.

All of these tasks can be easily automated, with unambiguous results (does it run to completion yes/no).

Such a test set can (and should) be run by people editing the code by themselves, and should also be tested using something like jenkins to verify that it passes the tests on multiple platforms, either on commit, or more likely as part of a nightly test process.

Longer term, we should look more at validating code at a physical level. Clearly testing energy conservation is a good idea for integrators; it's fairly sensitive. I think we need to discuss a bit more about how to evaluate energy conservation. This actually can take a fair amount of time, and I'm thinking this is something that perhaps should wait for 5.0. For thermostats and barostats, I'm working on a very sensitive test of ensemble validity. I'll email a copy to the list when it's ready to go (1-2 weeks?), and this is something that can also be incorporated in an integrated testing regime, but again, these sorts of tests will take multiple hours, not seconds. That sort of testing level can't be part of the day to day build.

> - Why do the current tests fail? Is it only because of different floating
> point rounding or are there other problems? What's the best procedure to
> find out why a test fails?

I think there are a lot of reasons that the current tests that do diffs to previous results can fail -- floating point rounding is definitely an issue, but there can be small changes in algorithms that can affect things, or differences in the output format. Perhaps the number of random number calls is changed, and thus different random numbers are used for different functions.

> - Should the current tests all be part of the new test set?

I'm not so sure about this -- I think we should think a little differently about how to implement them.

> - How should the new test be implemented? Should the comparison with the
> reference value be done in C (within mdrun), ctest script, python or perl?

I think that a script would be better. I think we should isolate the test itself from mdrun.

> - Should the new test execute mdrun for each test? Or should we somehow
> (e.g. python wrapper or within mdrun) load the binary only once and run
> many test per execution?

I think executing mdrun for each test is fine, unless it slows things down drastically. You could imagine a smaller (10-20) and a larger (1000's) set of inputs that can be run.

> - What are the requirements for the new test set? E.g. how easy should it
> be to see whats wrong when a test fails?

For the first set of tests, I can imagine that it would be nice to be able to look at the outputs of the tests, and diff different outputs corresponding to different code versions to help track down where the changes were. But I'm suspicious about having the evaluation of these tests decided automatically at first. I agree such differences should EVENTUALLY be automated, but I'd prefer several months of investigation and discussion before deciding exactly what "correct" is.

> Should the test support being run
> under valgrind? Other?

Valgrind is incredibly slow and can fail for weird reasons -- I'm not sure it would add much to do it under valgrind.

> - Do we have any other bugs which have to be solved before the test can be
> implemented? E.g. is the problem with shared libraries solved? Are there
> any open redmine issues related to the new test set?

I have no clue.

> - Should we have a policy that everyone who adds a feature also has to
> provide tests covering those features?

Yes.

> - Should we have a conference call to discuss the test set? If yes when?

No idea.

> - Should we agree that we won't release 4.6 without the test set to give it
> a high priority?

I'm OK with having a first pass of the fully automated test set that I describe above (does it run on all input it's supposed to). I think we can have a beta release even if the test set isn't finished, as that can be fine tuned while beta bugs are being found.

I DON'T think we should have any test set that starts to look at more complicated features right now -- it will take months to get that working, and we need to get 4.6 out of the door on the order of weeks, so we can move on to the next improvements. 4.6 doesn't have to be perfectly flawless, as long as it's closer to perfect than 4.5.

Best,
~~~~~~~~~~~~
Michael Shirts
Assistant Professor
Department of Chemical Engineering
University of Virginia
[hidden email]
(434)-243-1821
|
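A minimal sketch of the "does it run to completion yes/no" check described above. It assumes each test case is simply a sequence of command lines (for example one grompp call followed by one mdrun call); the function names `runs_to_completion` and `run_suite` are hypothetical, and the actual GROMACS command lines and their options are deliberately left to the caller.

```python
import subprocess
import sys


def runs_to_completion(commands, timeout_s=600):
    """Return True if every command in the sequence exits with status 0.

    `commands` is a list of argument lists that are run in order. The
    first failure stops the sequence, as in a grompp -> mdrun pipeline.
    """
    for argv in commands:
        try:
            result = subprocess.run(argv, capture_output=True, timeout=timeout_s)
        except (OSError, subprocess.TimeoutExpired):
            # A missing binary or a hung run counts as a failure.
            return False
        if result.returncode != 0:
            return False
    return True


def run_suite(cases):
    """Run each named case and return a {name: True/False} mapping."""
    return {name: runs_to_completion(commands) for name, commands in cases.items()}
```

A nightly job could call `run_suite` on a dictionary of cases and report every name that maps to `False`. The output of the runs is not inspected at all, which is what keeps the result unambiguous.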
|
Personally, I think that the most important tests are those that validate the code. See, for example, the tip4p free-energy code bug:

http://www.mail-archive.com/gmx-users@.../msg18846.html

I think that it is this type of error, which can silently lead to the wrong values, that we really need a test set in order to catch. The gmx-users list is not such a bad way to catch errors that lead to grompp failure.

It is not immediately clear, however, who would be responsible for developing such a test (those who added tip4p, those who added optimization loops, or those who added the free energy code). I recall seeing a post indicating that developers would be required to test their code prior to incorporation, but with so many usage options in mdrun I think that it will be essential to figure out how to define that requirement more precisely.

Finally, this is not to suggest that such a test set needs to be incorporated into version 4.6. I do appreciate the need to get intermediate versions out the door.

Chris.
|
|
Hi, all-
> Personally, I think that the most important tests are those that
> validate the code. See, for example, the tip4p free-energy code bug:
>
> http://www.mail-archive.com/gmx-users@.../msg18846.html
>
> I think that it is this type of error, which can silently lead to the
> wrong values, that we really need a test set in order to catch. The
> gmx-users list is not such a bad way to catch errors that lead to
> grompp failure.
>
> It is not immediately clear, however, who would be responsible for
> developing such a test (those who added tip4p, those who added
> optimization loops, or those who added the free energy code). I recall
> seeing a post indicating that developers would be required to test
> their code prior to incorporation, but with so many usage options in
> mdrun I think that it will be essential to figure out how to define
> that requirement more precisely.

I agree that these errors are much harder to catch, and these are the sorts of tests that can be developed on a longer time scale.

I strongly suspect that some of these bugs will actually cause run failures for some combination of input parameters, so a wide enough set of "does it run" tests will catch them -- if it's doing something wrong, there is a decent chance it will cause a catastrophic failure for some set of options.

Other bugs will be almost impossible to catch looking at Gromacs alone, because they will do things that are physically consistent, but different from what one thinks they should be doing. I'm in discussions with some other chemical engineers about how to perform this type of validation -- basically, the best bet is to have databases of results of particular molecular models obtained with many different software packages. NIST has some interest in this (for example, they have a database of Lennard-Jonesium results), so hopefully something comes out of this sort of collaboration in the next few years. Obviously, not something that can be prepared for 4.6.

Best,
Michael Shirts
|
|
In reply to this post by Roland Schulz
Hi,
I agree test sets are very important. Having good tests will make development and especially the process of accepting contributions much easier.

Now that we have the new, by default, energy conserving loops, I realize that energy conservation is extremely useful for validation. I think that having tests that check energy conservation and particular energy values of particular (combinations of) functionality will catch a lot of problems. The problem is that MD is chaotic and with non energy-conserving setups the divergence is extremely fast. With energy conservation, running 20 steps with nstlist=10 and checking the conserved energy plus a few terms would be enough for testing most modules, I think. We still want some more extended tests, but that could be a separate set.

So setting up a framework for the simple tests should not be too hard. Then we need to come up with a set of tests and reference values.

Cheers,
Berk

On 02/05/2012 04:56 AM, Roland Schulz wrote: Hi,
|
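One way the comparison against reference values could look, as a sketch only. It assumes the conserved energy and a few energy terms have already been extracted from a short run into plain dictionaries and lists; the names `compare_terms` and `relative_drift` and the default tolerance are illustrative assumptions, not an existing GROMACS interface.

```python
import math


def compare_terms(reference, computed, rel_tol=1e-5):
    """Return the names of energy terms that differ beyond rel_tol.

    A term that is missing from `computed` is reported as a mismatch.
    Note that a reference value of exactly 0 only matches exactly,
    because no absolute tolerance is applied here.
    """
    mismatches = []
    for name, ref_value in reference.items():
        value = computed.get(name)
        if value is None or not math.isclose(value, ref_value, rel_tol=rel_tol):
            mismatches.append(name)
    return mismatches


def relative_drift(conserved):
    """Relative change of the conserved energy from first to last sample."""
    first, last = conserved[0], conserved[-1]
    return abs(last - first) / abs(first)
```

A test would then pass when `compare_terms` returns an empty list and `relative_drift` stays below a chosen threshold. Suitable tolerances per term would have to come out of the discussion about reference values.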
|
Hi,
It will certainly be easier to have tests with close-to-perfect conservation.

However, it's too late in the cycle to decide that we're not going to release anything without tests. We should still do them, but I would like to see two separate parts:

1) Low-level tests that specifically check the output for several sets of input for a *module*, i.e. calling routines rather than running a simulation. The point is that this will isolate errors either to a specific module, or to modules it depends on. Since those modules should have tests too, it will be a 5-min job to find what file+routine the bug is likely to be in.

2) Higher-level tests that check whether this feature appears to work in practice in a simulation. The point of these tests is mostly to make sure other new features don't break our module.

Right now we have a bit of (2), but almost no (1).

Cheers,
Erik

On Feb 6, 2012, at 4:17 PM, Berk Hess wrote:
-- Erik Lindahl <[hidden email]> Professor of Theoretical & Computational Biophysics Department of Theoretical Physics & Swedish e-Science Research Center Royal Institute of Technology, Stockholm, Sweden Tel1: +46855378029 Tel2: +468164675 Cell: +46734618050 -- gmx-developers mailing list [hidden email] http://lists.gromacs.org/mailman/listinfo/gmx-developers Please don't post (un)subscribe requests to the list. Use the www interface or send it to [hidden email]. |
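The low-level tests in (1) could look roughly like the sketch below: a routine is called directly with a few inputs whose correct answer is known analytically, without running a simulation. The Lennard-Jones function here is a self-contained stand-in written for this example, not a GROMACS routine, and Python's `unittest` is used only for illustration.

```python
import unittest


def lj_energy(r, sigma, epsilon):
    """Lennard-Jones pair energy 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


class LennardJonesTest(unittest.TestCase):
    def test_zero_at_sigma(self):
        # The potential crosses zero at r = sigma.
        self.assertAlmostEqual(lj_energy(0.3, 0.3, 1.5), 0.0)

    def test_minimum_depth(self):
        # The minimum sits at r = 2**(1/6) * sigma with depth -epsilon.
        r_min = 2.0 ** (1.0 / 6.0) * 0.3
        self.assertAlmostEqual(lj_energy(r_min, 0.3, 1.5), -1.5)

    def test_repulsive_inside_sigma(self):
        self.assertGreater(lj_energy(0.25, 0.3, 1.5), 0.0)
```

Saved as a file, this can be run with `python -m unittest`. A failure points at one routine, which is the isolation property that makes such tests cheap to debug.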
|
In reply to this post by Berk Hess-4
Hi,
On Fri, Feb 10, 2012 at 12:45 PM, Erik Lindahl <[hidden email]> wrote:
Yes. It will also help if the test set is always up to date. Jenkins should guarantee that by forcing developers to change the expected result whenever a different algorithm doesn't have a binary identical result.
But I think we have to somehow give it a high priority and make sure a large percentage of us developers are contributing to this task. I don't think it can be accomplished by just a few people. When a large part of the basic code is rewritten in C++, we should already have the tests to know when we create regression bugs.
What framework should be used to write these unit tests? Should those be written using GoogleTest, like the tests written by Teemu? This would mean that the tests only compile with C++ but I don't think this would be a problem.
How should we run these integration tests? Should we run them similarly to how the current test set is run? I.e., have scripts which run pdb2gmx, grompp, mdrun and parse the output for the results and errors. If so, do we want to base it on the existing perl scripts, use some existing external framework or write some new scripts from scratch?
Roland
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov 865-241-1537, ORNL PO BOX 2008 MS6309 |
|
In reply to this post by Roland Schulz
On Mon, Feb 6, 2012 at 10:17 AM, Berk Hess <[hidden email]> wrote:
What tests do you have in mind for these simple tests? Would these be all integration tests? Roland
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov 865-241-1537, ORNL PO BOX 2008 MS6309 |
|
In reply to this post by Roland Schulz
On Sun, Feb 5, 2012 at 3:53 PM, Shirts, Michael (mrs5pt) <[hidden email]> wrote:
> Hi, all-
Yes, having that set of inputs is needed. Should we start a wiki page to start listing all the inputs we want to include? Or what would be the best way to collaboratively create this set of inputs?
Well, even if the tests take ~2000 CPU-hours, I think we (maybe even Erik by himself) have the resources to run this weekly.
I think a wrong reference value is better than no reference value. Even a wrong reference value would allow us to detect if e.g. different compilers give significantly different results (maybe some give the correct value). Also it would help to avoid adding additional bugs. Of course we shouldn't release the test set to the outside before we are relatively sure that it is actually correct.
I have the current C++ tests (those written by Teemu) running under valgrind in Jenkins. It wasn't very hard to write a few suppression rules to make valgrind not report any false positives. Now Jenkins can automatically fail the build if the code has any memory errors. Obviously one wouldn't run any of the long running tests with valgrind. But for the unit tests I think it might be very useful to catch bugs.
> I DON'T think we should have any test set that starts to look at more
My reason for delaying the 4.6 release would not be to improve the 4.6 release. I agree with you we probably can't guarantee that the reference values are correct in time anyhow, so we probably wouldn't even want to ship the tests with 4.6. My worry is that as soon as 4.6 is out the focus is on adding new cool features instead of working on these boring tasks we should do, because they help us in the long run. E.g. if we had agreed that we don't have a 4.6 release, the C++ conversion would most likely be much further along. And I don't see how we can create an incentive mechanism to work on these issues without somehow coupling it to releases.
Roland
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov 865-241-1537, ORNL PO BOX 2008 MS6309 |
|
In reply to this post by Roland Schulz
> A while ago I talked about this with Nicu who works with Siewert Jan
> Marrink. He is a software engineer by training and suggested the
> following. For each import parameter you take extreme values (e.g.
> timestep 5 fs and 0.001 fs) and a random value in between.

I'm not sure this makes sense, because lots of things will fail at 5 fs that SHOULD fail (though perhaps we need to make the failure more graceful). And it's not clear that simulations run between 1 and 0.001 fs add any information. So I think that each parameter is a bit more complicated.

Best,
Michael Shirts
|
|
In reply to this post by Roland Schulz
> Yes having that set of inputs is needed. Should we start a wiki page to
> start listing all the inputs we want to include? Or what would be the best
> way to collaborative creating this set of inputs?

A wiki page is good. I can commit to spending an hour or so this weekend discussing each parameter and what issues can come up. I can start putting together a variety of .mdp and .top files. To what extent can we start with the older test suite, and modify it?

> Well even if the tests take ~2000CPU-hours, I think we (maybe even Erik by
> himself) have the resources to run this weekly.

We can definitely come up with a large number of physical tests that can come in under that amount of time.

>> For the first set of tests, I can imagine that it would be nice to be able
>> to look at the outputs of the tests, and diff different outputs
>> corresponding to different code versions to help track down changes were.
>> But I'm suspicious about making the evaluation of these tests decided on
>> automatically at first. I agree such differences should EVENTUALLY be
>> automated, but I'd prefer several months of investigation and discussion
>> before deciding exactly what "correct" is.
>
> I think a wrong reference value is better than no reference value. Even a
> wrong reference value would allow us to detect if e.g. different compilers
> give significant different results (maybe some give the correct value).
> Also it would help to avoid adding additional bugs. Of course we shouldn't
> release the test set to the outside before we are relative sure that it
> actually correct.

I'm just saying that we shouldn't, at first, have automatic failures if the reference values change. I very much agree with SAVING the results with each build so that changes can be tracked down.

> I have the current C++ tests (those written by Teemu) running under
> valgrind in Jenkins. It wasn't very hard to write a few suppression rules
> to make valgrind not report any false positives. Now Jenkins
> can automatically fail the build if the code has any memory errors.
> Obviously one woudn't run any of the long running tests with valgrind. But
> for the unit tests I think it might be very useful to catch bugs.

In that case, running some subset of the cases under valgrind certainly makes sense.

> My reason for delaying the 4.6 release would not be to improve the 4.6
> release. I agree with you we probably can't guarantee that the reference
> value are correct in time anyhow, so we probably wouldn't even want to ship
> the tests with 4.6. My worry is that as soon as 4.6 is out the focus is on
> adding new cool features instead of working on these boring tasks we should
> do, because they help us in the long run. E.g. if we would have agreed that
> we don't have a 4.6 release, the C++ conversion would most likely be much
> further along. And I don't see how we can create an incentive mechanism to
> work on these issues without somehow coupling it to releases.

But if we talk about releases without deciding on anything, then everybody keeps developing new stuff. At some point, we need to agree on what conditions 4.6 will satisfy, and get it out the door, so everyone that is developing new stuff has to do it within the context of master.

Best,
Michael Shirts
|
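A sketch of the "save the results with each build, but do not fail automatically" idea from the reply above. It assumes each build produces a flat mapping from result name to number; the JSON layout and the function names are assumptions made for this example.

```python
import json
import math


def save_results(path, results):
    """Store one build's results as JSON so later builds can diff against it."""
    with open(path, "w") as handle:
        json.dump(results, handle, indent=2, sort_keys=True)


def load_results(path):
    """Read back a mapping written by save_results."""
    with open(path) as handle:
        return json.load(handle)


def changed_values(previous, current, rel_tol=1e-6):
    """List (name, old, new) for results that moved between two builds.

    Results that exist in only one build are listed with None on the
    missing side. Nothing is raised: the list is a report, not a verdict.
    """
    changes = []
    for name in sorted(set(previous) | set(current)):
        old, new = previous.get(name), current.get(name)
        if old is None or new is None or not math.isclose(old, new, rel_tol=rel_tol):
            changes.append((name, old, new))
    return changes
```

A build job could print the list from `changed_values` and archive the JSON file, leaving the decision about which value is "correct" to a person until the reference values are trusted.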
|
In reply to this post by Erik Lindahl
> 1) Low-level tests that specifically check the output for several sets of
> input for a *module*, i.e. calling routines rather than running a simulation.
> The point is that this will isolate errors either to a specific module, or to
> modules it depends on. However, when those modules too should have tests it
> will be a 5-min job to find what file+routine the bug is likely to be in.

I'm not exactly sure how this works. If we test modules, we have to be writing a bunch of new code that interacts with the modules directly, and so we may miss things that happen in actual simulation cases. I sort of favor just actually running grompp and mdrun, because the errors that occur will be the errors that people actually see. I haven't found an error yet that is particularly hard to isolate to a given file pretty quickly once it is identified.

Perhaps for particular things (testing that dozens of inner loops give consistent numbers) this makes sense, but I'm not sure it makes sense for everything. I'm not sure how you _just_ test pressure control, for example.

Best,
Michael Shirts
|
|
In reply to this post by Shirts, Michael (mrs5pt)
On 2012-02-11 17:22, Shirts, Michael (mrs5pt) wrote:
>> A while ago I talked about this with Nicu who works with Siewert Jan
>> Marrink. He is a software engineer by training and suggested the
>> following. For each import parameter you take extreme values (e.g.
>> timestep 5 fs and 0.001 fs) and a random value in between.
>
> I'm not sure this make sense, because lots of things will fail at 5 fs that
> SHOULD fail (Though perhaps we need to make the failure more graceful). And
> it's not clear that simulations run between 1 and 0.001 fs add any
> information. So I think that each parameter is a bit more complicated.

That's beside the point. The point is there are 100^10 different mdp parameter combinations. The proposed setup ensures that all settings are tested, although obviously not all combinations.

And about time steps, we previously had problems where a shorter time step gave worse energy conservation (also beside the point).

--
David van der Spoel, Ph.D., Professor of Biology
Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone: +46184714205.
[hidden email] http://folding.bmc.uu.se
|
|
> That's beside the point. The point is there are 100^10 different mdp
> parameter combinations. The proposed setup makes that all settings are
> tested, although obviously not all combinations.

Agreed. But picking extremes and random values is not necessarily the way to do it, because of the 100^10, maybe 0.4995*100^10 are SUPPOSED to crash, and another 0.4995*100^10 are essentially duplicating each other. Which leaves roughly 100^8 options to test, but random/extreme selection won't necessarily pick out the right fraction.

A variant of this that will likely work is, for EACH parameter, to pick out the 1-5 options that really represent something different and should give valid results, then randomly select from the 100^5 combinations, with rules to filter out invalid combinations.

I can't seem to edit the wiki, so I can't start a page on this, but perhaps we should start with the mdout.mdp, and start dissecting exactly what combinations of options need to be tested, with which different topologies, and see how many combinations we actually get. Roland (or other person), perhaps under "Building and Testing" there should be a third section where we discuss this?

> And about time steps, we previously had problems where a shorter time
> step gave worse energy conservation (also beside the point).

But that will require longer physical simulations to detect.

Best,
Michael Shirts
|
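The selection scheme proposed in this reply (a few representative options per parameter, random combinations, rules that filter out invalid ones) can be sketched as below. The option table is a tiny illustrative subset, and the rule in `is_valid` is a made-up placeholder rather than an actual GROMACS constraint; the real table and rules would come out of dissecting mdout.mdp.

```python
import random

# Illustrative subset only; a real table would be derived from mdout.mdp.
OPTIONS = {
    "integrator": ["md", "sd"],
    "tcoupl": ["no", "berendsen", "nose-hoover"],
    "pcoupl": ["no", "berendsen", "parrinello-rahman"],
}


def is_valid(combo):
    """Placeholder rule for this sketch: pressure coupling needs a thermostat."""
    return not (combo["pcoupl"] != "no" and combo["tcoupl"] == "no")


def sample_combinations(options, valid, count, seed=0, max_tries=10000):
    """Draw up to `count` distinct valid combinations, reproducibly.

    Each parameter is drawn independently, duplicates are skipped and
    combinations rejected by `valid` are dropped.
    """
    rng = random.Random(seed)
    names = sorted(options)
    seen, picked = set(), []
    for _ in range(max_tries):
        if len(picked) == count:
            break
        values = tuple(rng.choice(options[name]) for name in names)
        if values in seen:
            continue
        seen.add(values)
        combo = dict(zip(names, values))
        if valid(combo):
            picked.append(combo)
    return picked
```

Fixing the seed keeps the selected set stable between runs, so a failure can be reproduced. Rejection sampling is used so that the full product of options never has to be enumerated.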
|
In reply to this post by Roland Schulz
Hi
I have a few comments and suggestions that I hope will help. I have been designing software and building products for 30 years in industry in the area of compilers, optimizers, databases, imaging and other software where testing is a critical issue. In many cases this involved thousands or even millions of copies. I have been working on performance improvements to GROMACS and the difficulties of knowing if I broke something has been a big issue.
1) Unit and system test suites are developed as products go through release cycles and as bugs are fixed. Trying to build tests after a release or near the end of a release cycle does not work because of limited resources, the desire to get the release out, or work on future releases.
2) The important thing for 4.6 is to catch the big things and priority 1 bugs through the GROMACS community before it goes out, and to put in place a simple, manageable testing process going forward. I would not release it with a priority 1 bug (e.g. if Reaction Field is broken I would not release 4.6).
3) If conventions for a test framework can be established quickly, and the developers of 4.6 functionality and recent critical bug fixes have tests from their own work that can reasonably be adapted to the new test framework, those tests should be added.
4) The foundation of test suites is the unit/module level. If the base is not sound then the system/simulation will not be sound. At the module level it is almost always possible to determine what the 'right' answer is with relatively small datasets in short amounts of time for many parameter sets. A set of pass/fail test programs at module level is a good long term goal. It seems unrealistic to me to try to go back and build module level tests for GROMACS with limited resources, money and time. It is more realistic to add tests as bugs are fixed and new features are added.
5) Keep the test framework very simple. There are a number of 'test environments' and products available. I have used a lot of them in commercial product development, and I would not recommend any of them for GROMACS.
The US National Institute of Standards and Technology has test suites for compilers, databases and many other software technologies. They consist of numerous programs that test functionality at a module level, where each test program performs multiple tests sequentially, logs the results as text and returns a program exit status indicating pass/fail. Test programs are typically grouped by functional area and run as a set of scripts. The tests that fail are clearly identified in the log, and reading the test program code is simple because the tests are executed sequentially. Sometimes there are input/output files that are validated against a reference file. This simple model is used extensively for ANSI Standards compliance. Using this model, unit/module level testing would be easy, and application/simulation level tests could be run as scripts using diffs with configurable 'tolerances', with the same style of logging and pass/fail return method.
What would be needed: test program and test script templates, conventions for test logging, individual program/script exit status (pass/fail), conventions for dataset naming, and configurable tolerances for higher level application/simulation tests. The process would also need to be managed.
6) In industry, decisions about how to test functionality at the module level are distributed among the developers who build new features and fix bugs. For future releases of GROMACS, developers should be required to add regression test programs that verify, on a pass/fail basis, all new functionality and priority 1 bug fixes.
Developers usually have the basic logic of such test programs in their own test programs, simulations, files and criteria that they use to validate their own development. Their test work is usually thrown away. It should not be too time consuming to use this code as the basis of regression tests in the future if developers understand that it is a requirement for code submission. Tests must be designed, and the developers implementing new features and fixing bugs should take primary responsibility for how to test their changes. Unfortunately this takes time, costs money, and slows the development and publication process.
7) It is important to get a second opinion about how something important is to be fixed or how a new feature is to be added, and to have someone 'sign off' on another person's work and test program/script prior to incorporation into a release. It would be good to integrate this 'signoff' into the process.
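The test program model described in point 5 could look roughly like the sketch below: one program runs several tests in sequence, logs each result as a line of text, and reports pass/fail through its exit status. Everything here is an illustrative assumption rather than existing GROMACS code: `harmonic_energy` is a hypothetical module under test, the reference values are made up, and the tolerances are arbitrary defaults.

```python
import math

def harmonic_energy(k, x, x0):
    """Hypothetical module under test: energy of a harmonic spring."""
    return 0.5 * k * (x - x0) ** 2

# (name, callable, reference value); the values are illustrative only.
TESTS = [
    ("harmonic_at_minimum", lambda: harmonic_energy(1000.0, 0.1, 0.1), 0.0),
    ("harmonic_displaced", lambda: harmonic_energy(1000.0, 0.12, 0.1), 0.2),
]

def run_tests(tests, rel_tol=1e-6, abs_tol=1e-12):
    """Run every test in order; return (exit_status, log_lines)."""
    log = []
    failures = 0
    for name, func, reference in tests:
        try:
            value = func()
        except Exception as exc:
            # A crash in one test is logged and does not stop the others.
            log.append(f"FAIL {name}: raised {exc!r}")
            failures += 1
            continue
        # Configurable tolerance instead of an exact comparison.
        if math.isclose(value, reference, rel_tol=rel_tol, abs_tol=abs_tol):
            log.append(f"PASS {name}: {value!r}")
        else:
            log.append(f"FAIL {name}: got {value!r}, expected {reference!r}")
            failures += 1
    log.append(f"{len(tests) - failures} passed, {failures} failed")
    return (1 if failures else 0), log

def main():
    """Print the log and return the status a script would exit with."""
    status, log = run_tests(TESTS)
    print("\n".join(log))
    return status
```

Run as a standalone program, this would end with `sys.exit(main())` so that a calling script or build server only needs to look at the exit status, while the text log shows which test failed.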
Regards,
David

On Sat, Feb 11, 2012 at 7:35 AM, David van der Spoel <[hidden email]> wrote:
-- gmx-developers mailing list [hidden email] http://lists.gromacs.org/mailman/listinfo/gmx-developers Please don't post (un)subscribe requests to the list. Use the www interface or send it to [hidden email]. |
|
In reply to this post by Shirts, Michael (mrs5pt)
Hi Michael,
On 2012-02-11 18.16, Shirts, Michael (mrs5pt) wrote:
> I can't seem to edit the wiki, so I can't start a page on this,

You should be able to do it now.

Cheers,
Rossen
|
|
In reply to this post by Erik Lindahl
On Sat, Feb 11, 2012 at 11:31 AM, Shirts, Michael (mrs5pt) <[hidden email]> wrote:
One could read virial values from a data file and pass them to the pressure control, and then check that the pressure control behaves as it should. I think unit tests would be very useful for the planned conversion to C++. We will have to convert individual modules one at a time, and I assume that the overall program will often be broken. In that case unit tests might enable us to test individual modules without the overall program working.
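A minimal sketch of such a unit test is given here, under loud assumptions: the data are made-up numbers in arbitrary consistent units, `pressure_from_virial` is a toy scalar form of the pressure, and `berendsen_scaling` is the textbook weak-coupling (Berendsen-style) box scaling formula used as a stand-in. None of this is the actual GROMACS pressure coupling code; the point is only that recorded inputs plus a known 'right' behaviour make a pass/fail test.

```python
import io

# Stand-in for a data file of recorded values (made-up numbers).
DATA = """\
# ekin  virial  volume  expected
100.0   59.5    27.0    hold
100.0   19.0    27.0    grow
100.0   100.0   27.0    shrink
"""

def read_cases(handle):
    """Parse whitespace-separated rows, skipping '#' comments and blanks."""
    cases = []
    for line in handle:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        ekin, virial, volume, expected = line.split()
        cases.append((float(ekin), float(virial), float(volume), expected))
    return cases

def pressure_from_virial(ekin, virial, volume):
    """Toy scalar pressure, P = 2 (Ekin - Xi) / (3 V)."""
    return 2.0 * (ekin - virial) / (3.0 * volume)

def berendsen_scaling(pressure, ref_pressure=1.0, compressibility=4.5e-5,
                      dt=0.002, tau_p=1.0):
    """Textbook weak-coupling box scaling factor (stand-in, not GROMACS)."""
    return (1.0 - compressibility * dt / tau_p
            * (ref_pressure - pressure)) ** (1.0 / 3.0)

def classify(mu, tol=1e-12):
    """Translate a scaling factor into the expected box behaviour."""
    if abs(mu - 1.0) <= tol:
        return "hold"
    return "grow" if mu > 1.0 else "shrink"

def check_cases(cases):
    """Return a list of failure messages; an empty list means pass."""
    failures = []
    for ekin, virial, volume, expected in cases:
        mu = berendsen_scaling(pressure_from_virial(ekin, virial, volume))
        got = classify(mu)
        if got != expected:
            failures.append(f"virial={virial}: expected {expected}, got {got}")
    return failures
```

With a real file, `read_cases(open(path))` would replace the `io.StringIO(DATA)` wrapper; the checked property (the box shrinks when the pressure is below the reference and grows when it is above) does not depend on the rest of the program running.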
Roland
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309
|
|
Roland, Erik, et al.
I am not sure if my posting was well understood so I will give an example.
When one has an extremely complex piece of software that has few tests, and there is a constant need to add new features, there are only a couple of ways of testing it. It cannot be tested well at the higher levels, only at the module level.
For example, I would liken the testing of GROMACS to that of a high level language compiler designed with complex syntax, many modules and many options. Compilers have language features and syntax that require runtime support and function libraries. They also have supporting library routines or support programs. No one would ever try to make a test suite for high level use cases. What is done is to create separate test programs that exercise language syntax that can be tested without direct calls to internal modules or directly to library routines (e.g. nonbonded, etc.). Separate test programs are written for things like object file creation. These tests have 'right' answers. No attempt is made to test 'use cases' except in the case of priority 1 bugs. This reduces the scope of testing to a manageable and defined level. It requires breaking the program into functional groups, internal routines, external routines and support programs. This is not difficult to do with GROMACS. Like a compiler, GROMACS has .mdp options that amount to 'language' syntax, internal code routines, external programs and file formats. This is the test approach that I think should be taken.
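To illustrate what a syntax-level test with 'right' answers might look like for .mdp input, here is a small sketch. The function `parse_mdp` is a hypothetical stand-in written for this example (it is not grompp and does not validate option names or values); it only assumes the familiar `key = value` layout with `;` starting a comment. The cases are illustrative.

```python
def parse_mdp(text):
    """Hypothetical stand-in parser: 'key = value' lines, ';' comments."""
    options = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split(";", 1)[0].strip()
        if not line:
            continue
        if "=" not in line:
            raise ValueError(f"line {lineno}: expected 'key = value'")
        key, value = (part.strip() for part in line.split("=", 1))
        if not key:
            raise ValueError(f"line {lineno}: missing option name")
        options[key] = value
    return options

# (name, input text, expected result); each case has one right answer.
CASES = [
    ("plain option", "integrator = md", {"integrator": "md"}),
    ("trailing comment", "nsteps = 100 ; short run", {"nsteps": "100"}),
    ("comment and blank lines", "; header\n\ndt = 0.002\n", {"dt": "0.002"}),
    ("multi-valued option", "ref_t = 300 300", {"ref_t": "300 300"}),
]

def run_cases(cases):
    """Run the cases in order; return (exit_status, log_lines)."""
    log = []
    failures = 0
    for name, text, expected in cases:
        try:
            got = parse_mdp(text)
        except ValueError as exc:
            log.append(f"FAIL {name}: {exc}")
            failures += 1
            continue
        if got == expected:
            log.append(f"PASS {name}")
        else:
            log.append(f"FAIL {name}: got {got!r}, expected {expected!r}")
            failures += 1
    return (1 if failures else 0), log
```

Each functional group (input syntax, file formats, support programs) could get its own table of cases like this, which keeps the scope of every test program small and its failures easy to read.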
In cases where I have worked on compilers that were 'legacy' and mature products, we prioritized removing all serious P1 and P2 bugs and adding new features. When new features were added, developers started adding higher level tests. In response to serious P1 bugs in 'broken generated programs' (e.g. simulations), user-provided code (or simulations) was added to the test suite. I once inherited such a compiler. We had many new features to add, many bugs, little documentation, virtually no test suite and a short time to release. With 5 people we added all the new features in 5 months, fixed all the priority 1 and 2 bugs, and built test programs that exercised or called directly all key core modules.
I hope this helps.
Cheers
David
On Sat, Feb 11, 2012 at 7:45 PM, Roland Schulz <[hidden email]> wrote:
