
New Test Set

Roland Schulz
Hi,

We agreed that we would want to have a test set for 4.6, but so far we haven't made any progress on it (as far as I know). I want to try to get this work started by posting here a list of questions I have about the new test set. Please add your own questions and answer any questions you can (no need to try to answer all questions).

- Why do the current tests fail? Is it only because of different floating point rounding or are there other problems? What's the best procedure to find out why a test fails?
- Which tests should be part of the new test set? 
- Should the current tests all be part of the new test set?
- How should the new tests be implemented? Should the comparison with the reference values be done in C (within mdrun), a ctest script, Python or Perl?
- Should the new test execute mdrun for each test? Or should we somehow (e.g. a Python wrapper or within mdrun) load the binary only once and run many tests per execution?
- What are the requirements for the new test set? E.g. how easy should it be to see what's wrong when a test fails? Should the test support being run under valgrind? Other?
- Do we have any other bugs which have to be solved before the test can be implemented? E.g. is the problem with shared libraries solved? Are there any open redmine issues related to the new test set?
- Should we have a policy that everyone who adds a feature also has to provide tests covering those features?
- Should we have a conference call to discuss the test set? If yes when?
- Should we agree that we won't release 4.6 without the test set to give it a high priority?
Roland

--
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309

Re: New Test Set

Shirts, Michael (mrs5pt)
Hi, all-

My opinion:

I think there should probably be two classes of sets -- fast fully automated
sets, and more sophisticated content validation sets.

For the fast fully automated test, I would suggest:
-testing a large range of input .mdp, .top and .gro files for whether they run
through grompp and mdrun. Not checking whether the output is correct or
not, because that is very hard to automate -- just testing whether it runs
for, say, 20-100 steps or so.
-testing input files for backwards compatibility.
-testing whether tools run on typical data sets.

All of these tasks can be easily automated, with unambiguous results (does
it run to completion yes/no).
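
For illustration, a minimal sketch of such a pass/fail driver (in Python; the directory layout is hypothetical: one directory per case holding grompp.mdp, conf.gro and topol.top with a small nsteps, and the 4.x-style grompp/mdrun binaries are assumed to be on the PATH):

    import glob, os, subprocess, sys

    def run(cmd, cwd, logfile):
        # append stdout/stderr to the per-case log, return the exit status
        with open(logfile, "a") as log:
            return subprocess.call(cmd, cwd=cwd, stdout=log, stderr=log)

    failures = []
    for case in sorted(glob.glob("tests/*")):
        log = os.path.join(case, "test.log")
        ok = (run(["grompp", "-f", "grompp.mdp", "-c", "conf.gro",
                   "-p", "topol.top", "-o", "topol.tpr"], case, log) == 0
              and run(["mdrun", "-s", "topol.tpr", "-deffnm", "test"], case, log) == 0)
        print("%-50s %s" % (case, "PASS" if ok else "FAIL"))
        if not ok:
            failures.append(case)
    sys.exit(1 if failures else 0)

Jenkins (or a nightly cron job) only needs to look at the exit status.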

Such a test set can (and should) be run by people editing the code by
themselves, and should also be run using something like Jenkins to verify
that it passes on multiple platforms, either on commit, or more
likely as part of a nightly test process.

Longer term, we should look more at validating code at a physical level.
Clearly testing energy conservation is a good idea for integrators; it's
fairly sensitive.  I think we need to discuss a bit more about how to
evaluate energy conservation.  This actually can take a fair amount of time,
and I'm thinking this is something that perhaps should wait for 5.0.  For
thermostats and barostats, I'm working on a very sensitive test of ensemble
validity. I'll email a copy to the list when it's ready to go (1-2 weeks?),
and this is something that can also be incorporated in an integrated testing
regime, but again, these sorts of tests will take multiple hours, not
seconds.  That sort of testing level can't be part of the day-to-day build.

> - Why do the current tests fail? Is it only because of different floating
> point rounding or are there other problems? What's the best procedure to
> find out why a test fails?

I think there are a lot of reasons that the current tests that do diffs to
previous results can fail -- floating point rounding is definitely an issue,
but there can be small changes in algorithms that can affect things, or
differences in the output format.  Perhaps the number of random number calls
is changed, and thus different random numbers are used for different
functions.

> - Should the current tests all be part of the new test set?

I'm not so sure about this -- I think we should think a little differently
about how to implement them.

> - How should the new test be implemented? Should the comparison with the
> reference value be done in C (within mdrun), ctest script, python or perl?

I think that script would be better.  I think we should isolate the test
itself from mdrun.

> - Should the new test execute mdrun for each test? Or should we somehow
> (e.g. python wrapper or within mdrun) load the binary only once and run
> many test per execution?

I think executing mdrun for each test is fine, unless it slows things down
drastically.  You could imagine a smaller (10-20) and a larger (1000s) set of
inputs that can be run.

> - What are the requirements for the new test set? E.g. how easy should it
> be to see what's wrong when a test fails?

For the first set of tests, I can imagine that it would be nice to be able
to look at the outputs of the tests, and diff outputs
corresponding to different code versions to help track down where changes
came from.  But I'm suspicious about making the evaluation of these tests
automatic at first.  I agree such differences should EVENTUALLY be
automated, but I'd prefer several months of investigation and discussion
before deciding exactly what "correct" is.

> Should the test support being run
> under valgrind? Other?

Valgrind is incredibly slow and can fail for weird reasons -- I'm not sure
it would add much to do it under valgrind.

> - Do we have any other bugs which have to be solved before the test can be
> implemented? E.g. is the problem with shared libraries solved? Are there
> any open redmine issues related to the new test set?

I have no clue.

> - Should we have a policy that everyone who adds a feature also has to
> provide tests covering those features?

Yes.

> - Should we have a conference call to discuss the test set? If yes when?
No idea.

> - Should we agree that we won't release 4.6 without the test set to give it
> a high priority?

I'm OK with having a first pass of the fully automated test set that I
describe above (does it run on all input it's supposed to).  I think we can
have a beta release even if the test set isn't finished, as that can be
fine-tuned while beta bugs are being found.

I DON'T think we should have any test set that starts to look at more
complicated features right now -- it will take months to get that working,
and we need to get 4.6 out the door on the order of weeks, so we can move
on to the next improvements.  4.6 doesn't have to be perfectly flawless, as
long as it's closer to perfect than 4.5.

Best,
~~~~~~~~~~~~
Michael Shirts
Assistant Professor
Department of Chemical Engineering
University of Virginia
[hidden email]
(434)-243-1821





Re: New Test Set

Chris Neale
Personally, I think that the most important tests are those that  
validate the code. See, for example, the tip4p free-energy code bug:

http://www.mail-archive.com/gmx-users@.../msg18846.html

I think that it is this type of error, which can silently lead to the  
wrong values, that we really need a test set in order to catch. The  
gmx-users list is not such a bad way to catch errors that lead to  
grompp failure.

It is not immediately clear, however, who would be responsible for  
developing such a test (those who added tip4p, those who added  
optimization loops, or those who added the free energy code). I recall  
seeing a post indicating that developers would be required to test  
their code prior to incorporation, but with so many usage options in  
mdrun I think that it will be essential to figure out how to define  
that requirement more precisely.

Finally, this is not to suggest that such a test set needs to be  
incorporated into version 4.6. I do appreciate the need to get  
intermediate versions out the door.

Chris.


Quoting "Shirts, Michael (mrs5pt)" <[hidden email]>:

> Hi, all-
>
> My opinion:
>
> I think there should probably be two classes of sets -- fast fully automated
> sets, and more sophisticated content validation sets.
>
> For the fast fully automated test, I would suggest:
> -testing a large range of input .mdps, tops and gros for whether they run
> through grompp and mdrun. Not performing whether the output is correct or
> not, because that is very hard to automate -- just testing whether it runs
> for, say 20-100 steps or so.
> -testing input files for backwards compatibility.
> -testing whether tools run on typical data sets.
>
> All of these tasks can be easily automated, with unambiguous results (does
> it run to completion yes/no).
>
> Such a test set can (And should) be run by people editing the code by
> themselves, and should also be tested using something like jenkins to verify
> that it passes the tests on multiple platforms, either on commit, or more
> likely as part of a nightly test process.
>
> Longer term, we should look more at validating code at a physical level.
> Clearly testing energy conservation is a good idea for integrators; it's
> fairly sensitive.  I think we need to discuss a bit more about how to
> evaluate energy conservation.  This actually can take a fair amount of time,
> and I'm thinking this is something that perhaps should wait for 5.0.  For
> thermostats and barostats, I'm working on a very sensitive test of ensemble
> validity. I'll email a copy to the list when it's ready to go (1-2 weeks?),
> and this is something that can also be incorporated in an integrated testing
> regime, but again, these sorts of tests will take multiple hours, not
> seconds.   That sort of testing level can't be part of the day to day build.
>
>> - Why do the current tests fail? Is it only because of different floating
>> point rounding or are there other problems? What's the best procedure to
>> find out why a test fails?
>
> I think there are a lot of reasons that the current tests that do diffs to
> previous results can fail -- floating point rounding is definitely an issue,
> but there can be small changes in algorithms that can affect things, or
> differences in the output format.  Perhaps the number of random number calls
> is changed, and thus different random numbers are used for different
> functions.
>
>> - Should the current tests all be part of the new test set?
>
> I'm not so sure about this -- I think we should think a little differently
> about how to implement them.
>
>> - How should the new test be implemented? Should the comparison with the
>> reference value be done in C (within mdrun), ctest script, python or perl?
>
> I think that script would be better.  I think we should isolate the test
> itself from mdrun.
>
>> - Should the new test execute mdrun for each test? Or should we somehow
>> (e.g. python wrapper or within mdrun) load the binary only once and run
>> many test per execution?
>
> I think executing mdrun for each test is fine, unless it slows things down
> drastically.  You could imagine a smaller (10-20) and larger (1000's) of
> inputs that can be run.
>
>> - What are the requirements for the new test set? E.g. how easy should it
>> be to see whats wrong when a test fails?
>
> For the first set of tests, I can imagine that it would be nice to be able
> to look at the outputs of the tests, and diff different outputs
> corresponding to different code versions to help track down changes were.
> But I'm suspicious about making the evaluation of these tests decided on
> automatically at first.  I agree such differences should EVENTUALLY be
> automated, but I'd prefer several months of investigation and discussion
> before deciding exactly what "correct" is.
>
>> Should the test support being run
>> under valgrind? Other?
>
> Valgrind is incredibly slow and can fail for weird reasons -- I'm not sure
> it would add much to do it under valgrind.
>
>> - Do we have any other bugs which have to be solved before the test can be
>> implemented? E.g. is the problem with shared libraries solved? Are there
>> any open redmine issues related to the new test set?
>
> I have no clue.
>
>> - Should we have a policy that everyone who adds a feature also has to
>> provide tests covering those features?
>
> Yes.
>
>> - Should we have a conference call to discuss the test set? If yes when?
> No idea.
>
>> - Should we agree that we won't release 4.6 without the test set to give it
>> a high priority?
>
> I'm OK with having a first pass of the fully atomated test set that I
> describe above (does it run on all input it's supposed to)  I think we can
> have a beta release even if the test set isn't finished, as that can be fine
> tuned while beta bugs are being find.
>
> I DON'T think we should have any test set that starts to look at more
> complicated features right now -- it will take months to get that working,
> and we need to get 4.6 out of the door on the order of weeks, so we can move
> on to the next improvements.  4.6 doesn't have to be perfectly flawless, as
> long as it's closer to perfect than 4.5.
>
> Best,
> ~~~~~~~~~~~~
> Michael Shirts
> Assistant Professor
> Department of Chemical Engineering
> University of Virginia
> [hidden email]
> (434)-243-1821
>
>
>
> --
> gmx-developers mailing list
> [hidden email]
> http://lists.gromacs.org/mailman/listinfo/gmx-developers
> Please don't post (un)subscribe requests to the list. Use the
> www interface or send it to [hidden email].
>





Re: New Test Set

Shirts, Michael (mrs5pt)
Hi, all-

> Personally, I think that the most important tests are those that
> validate the code. See, for example, the tip4p free-energy code bug:
>
> http://www.mail-archive.com/gmx-users@.../msg18846.html
>
> I think that it is this type of error, which can silently lead to the
> wrong values, that we really need a test set in order to catch. The
> gmx-users list is not such a bad way to catch errors that lead to
> grompp failure.
>
> It is not immediately clear, however, who would be responsible for
> developing such a test (those who added tip4p, those who added
> optimization loops, or those who added the free energy code). I recall
> seeing a post indicating that developers would be required to test
> their code prior to incorporation, but with so many usage options in
> mdrun I think that it will be essential to figure out how to define
> that requirement more precisely.

I agree that these kinds of errors are much harder to catch, and these are the sorts
of tests that can be developed on a longer time scale.  I strongly suspect
that some of these bugs will actually cause run failures for some
combination of input parameters, so a wide enough set of "does it run" tests
will catch them -- if it's doing something wrong, there is a decent chance
it will cause a catastrophic failure for some set of options.

Other bugs will be almost impossible to catch looking at Gromacs alone,
because they will do things that are physically consistent, but do
something other than what one thinks they should be doing. I'm in discussions
with some other chemical engineers about how to perform this type of
validation -- basically, the best bet is to have databases of results for
particular molecular models obtained with lots of different software packages.  NIST
has some interest in this (for example, they have a database of
Lennard-Jonesium results), so hopefully something comes out of this sort of
collaboration in the next few years.  Obviously, this is not something that can be
prepared for 4.6.

Best,
~~~~~~~~~~~~
Michael Shirts
Assistant Professor
Department of Chemical Engineering
University of Virginia
[hidden email]
(434)-243-1821




Re: New Test Set

Berk Hess-4
In reply to this post by Roland Schulz
Hi,

I agree test sets are very important.
Having good tests will make development and especially the process of accepting contributions much easier.

Now that we have the new, by default, energy conserving loops, I realize that energy conservation
is extremely useful for validation. I think that having tests that check energy conservation and particular
energy values of particular (combinations of) functionality will catch a lot of problems.
The problem is that MD is chaotic, and with non-energy-conserving setups the divergence is extremely fast.
With energy conservation, running 20 steps with nstlist=10 and checking the conserved energy + a few terms
would be enough for testing most modules, I think.
We still want some more extended tests, but that could be a separate set.

So setting up a framework for the simple tests should not be too hard.
Then we need to come up with a set of tests and reference values.
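
As a rough sketch of what such a check could look like (the g_energy invocation is real 4.x usage, but the term labels, tolerance and reference handling below are only placeholders, not an agreed design):

    import subprocess

    def read_xvg(fname):
        # skip the xmgrace header lines (# and @), return rows of floats
        return [[float(x) for x in line.split()]
                for line in open(fname) if not line.startswith(("#", "@"))]

    def check_short_run(edr="ener.edr", max_drift=1e-3, ref_potential=None, rel_tol=1e-4):
        # dump two terms with g_energy; column 1 is time, the rest are the selected terms
        p = subprocess.Popen(["g_energy", "-f", edr, "-o", "energy.xvg"],
                             stdin=subprocess.PIPE)
        p.communicate(b"Potential\nConserved-En.\n")
        rows = read_xvg("energy.xvg")
        drift = abs(rows[-1][2] - rows[0][2])   # conserved quantity, last minus first
        ok = drift < max_drift
        if ref_potential is not None:           # optional comparison to a reference value
            ok = ok and abs(rows[0][1] - ref_potential) < rel_tol * abs(ref_potential)
        return ok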

Cheers,

Berk





Re: New Test Set

Erik Lindahl
Hi,

It will certainly be easier to have tests with close-to-perfect conservation.

However, it's too late in the cycle to decide that we're not going to release anything without tests. We should still do them, but I would like to see two separate parts:

1) Low-level tests that specifically check the output for several sets of input for a *module*, i.e. calling routines rather than running a simulation. The point is that this will isolate errors either to a specific module, or to modules it depends on. And when those modules have tests too, it will be a 5-min job to find which file+routine the bug is likely to be in.

2) Higher-level tests that check whether this feature appears to work in practice in a simulation. The point of these tests is mostly to make sure other new features don't break our module.

Right now we have a bit of (2), but almost no (1).

Cheers,

Erik




--
Erik Lindahl <[hidden email]>
Professor of Theoretical & Computational Biophysics
Department of Theoretical Physics & Swedish e-Science Research Center
Royal Institute of Technology, Stockholm, Sweden
Tel1: +46855378029  Tel2: +468164675  Cell: +46734618050



Re: New Test Set

Roland Schulz
In reply to this post by Berk Hess-4
Hi,

On Fri, Feb 10, 2012 at 12:45 PM, Erik Lindahl <[hidden email]> wrote:
Hi,

It will certainly be easier to have tests with close-to-perfect conservation.

Yes. It will also help if the test set is always kept up to date. Jenkins should guarantee that by forcing developers to update the expected result whenever a changed algorithm doesn't give a binary-identical result.

However, it's too late in the cycle to decide that we're not going to release anything without tests. We should still do them, but I would like to see two separate parts:
But I think we have to somehow give it a high priority and make sure a large percentage of us developers are contributing to this task. I don't think it can be accomplished by just a few people. When a large part of the basic code is rewritten in C++, we should already have the tests in place to know when we introduce regression bugs.

1) Low-level tests that specifically check the output for several sets of input for a *module*, i.e. calling routines rather than running a simulation. The point is that this will isolate errors either to a specific module, or to modules it depends on. However, when those modules too should have tests it will be a 5-min job to find what file+routine the bug is likely to be in.
What framework should be used to write these unit tests? Should they be written using GoogleTest, like the tests written by Teemu? This would mean that the tests only compile with C++, but I don't think this would be a problem.

2) Higher-level tests that check whether this feature appears to work in practice in a simulation. The point of these tests is mostly to make sure other new features don't break our module.
How should we run these integration tests? Should we run them similarly to how the current test set is run, i.e. have scripts which run pdb2gmx, grompp and mdrun and parse the output for results and errors? If so, do we want to base it on the existing Perl scripts, use some existing external framework, or write new scripts from scratch?

Roland

 
--
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309


Re: New Test Set

Roland Schulz
In reply to this post by Roland Schulz


On Mon, Feb 6, 2012 at 10:17 AM, Berk Hess <[hidden email]> wrote:
Hi,

I agree test sets are very important.
Having good tests will make development and especially the process of accepting contributions much easier.

Now that we have the new, by default, energy conserving loops, I realize that energy conservation
is extremely useful for validation. I think that having tests that check energy conservation and particular
energy values of particular (combinations of) functionality will catch a lot of problems.
The problems is that MD is chaotic and with non energy-conserving setups the divergence is extremely fast.
With energy conservation running 20 steps with nstlist=10, checking the conserved energy + a few terms
would be enough for testing most modules, I think.
We still want some more extended tests, but that could be a separate set.

So setting up a framework for the simple tests should not be too hard.
What tests do you have in mind for these simple tests? Would these be all integration tests? 

Roland









--
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309


Re: New Test Set

Roland Schulz
In reply to this post by Roland Schulz


On Sun, Feb 5, 2012 at 3:53 PM, Shirts, Michael (mrs5pt) <[hidden email]> wrote:
Hi, all-

My opinion:

I think there should probably be two classes of sets -- fast fully automated
sets, and more sophisticated content validation sets.

For the fast fully automated test, I would suggest:
-testing a large range of input .mdps, tops and gros for whether they run
through grompp and mdrun. Not performing whether the output is correct or
not, because that is very hard to automate -- just testing whether it runs
for, say 20-100 steps or so.

Yes, having that set of inputs is needed. Should we start a wiki page to list all the inputs we want to include? Or what would be the best way to collaboratively create this set of inputs?
 

Longer term, we should look more at validating code at a physical level.
Clearly testing energy conservation is a good idea for integrators; it's
fairly sensitive.  I think we need to discuss a bit more about how to
evaluate energy conservation.  This actually can take a fair amount of time,
and I'm thinking this is something that perhaps should wait for 5.0.  For
thermostats and barostats, I'm working on a very sensitive test of ensemble
validity. I'll email a copy to the list when it's ready to go (1-2 weeks?),
and this is something that can also be incorporated in an integrated testing
regime, but again, these sorts of tests will take multiple hours, not
seconds.   That sort of testing level can't be part of the day to day build.
Well, even if the tests take ~2000 CPU-hours, I think we (maybe even Erik by himself) have the resources to run this weekly.


> - What are the requirements for the new test set? E.g. how easy should it
> be to see what's wrong when a test fails?

For the first set of tests, I can imagine that it would be nice to be able
to look at the outputs of the tests, and diff different outputs
corresponding to different code versions to help track down changes were.
But I'm suspicious about making the evaluation of these tests decided on
automatically at first.  I agree such differences should EVENTUALLY be
automated, but I'd prefer several months of investigation and discussion
before deciding exactly what "correct" is.

I think a wrong reference value is better than no reference value. Even a wrong reference value would allow us to detect if e.g. different compilers give significantly different results (maybe some give the correct value). It would also help us avoid adding additional bugs. Of course we shouldn't release the test set to the outside before we are relatively sure that it is actually correct.
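
The comparison itself is cheap either way; a sketch of a tolerance-based check against stored reference values (term names and numbers below are made up for illustration):

    def compare_to_reference(results, reference, rel_tol=1e-6, abs_tol=1e-9):
        # results/reference: dicts mapping term name -> value;
        # returns a list of (term, got, expected) for every mismatch
        mismatches = []
        for term, expected in reference.items():
            got = results.get(term)
            if got is None or abs(got - expected) > max(abs_tol, rel_tol * abs(expected)):
                mismatches.append((term, got, expected))
        return mismatches

    reference = {"Potential": -21034.52, "Pressure": 1.3}   # made-up reference values
    current   = {"Potential": -21034.55, "Pressure": 1.3}
    for term, got, expected in compare_to_reference(current, reference):
        print("FAIL %s: got %s, reference %s" % (term, got, expected))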
 
> Should the test support being run
> under valgrind? Other?

Valgrind is incredibly slow and can fail for weird reasons -- I'm not sure
it would add much to do it under valgrind.
I have the current C++ tests (those written by Teemu) running under valgrind in Jenkins. It wasn't very hard to write a few suppression rules to make valgrind not report any false positives. Now Jenkins can automatically fail the build if the code has any memory errors. Obviously one wouldn't run any of the long-running tests with valgrind, but for the unit tests I think it might be very useful for catching bugs.
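
For reference, the invocation meant here is nothing more than the following (valgrind's options are real; the suppression file and test binary names are placeholders):

    import subprocess, sys

    # memory errors make valgrind exit non-zero, which Jenkins turns into a failed build;
    # the suppression file hides the known false positives
    cmd = ["valgrind", "--leak-check=full", "--error-exitcode=1",
           "--suppressions=gromacs.supp", "./unit-tests"]
    sys.exit(subprocess.call(cmd))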

 
I DON'T think we should have any test set that starts to look at more
complicated features right now -- it will take months to get that working,
and we need to get 4.6 out of the door on the order of weeks, so we can move
on to the next improvements.  4.6 doesn't have to be perfectly flawless, as
long as it's closer to perfect than 4.5.

My reason for delaying the 4.6 release would not be to improve the 4.6 release. I agree with you that we probably can't guarantee that the reference values are correct in time anyhow, so we probably wouldn't even want to ship the tests with 4.6. My worry is that as soon as 4.6 is out, the focus will be on adding new cool features instead of working on these boring tasks we should do because they help us in the long run. E.g. if we had agreed not to have a 4.6 release, the C++ conversion would most likely be much further along. And I don't see how we can create an incentive mechanism to work on these issues without somehow coupling it to releases.

Roland

 








--
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309


Re: New Test Set

Shirts, Michael (mrs5pt)
In reply to this post by Roland Schulz
> A while ago I talked about this with Nicu who works with Siewert Jan
> Marrink. He is a software engineer by training and suggested the
> following. For each input parameter you take extreme values (e.g.
> timestep 5 fs and 0.001 fs) and a random value in between.

I'm not sure this makes sense, because lots of things will fail at 5 fs that
SHOULD fail (though perhaps we need to make the failure more graceful).  And
it's not clear that simulations run between 1 and 0.001 fs add any
information.  So I think that each parameter is a bit more complicated.

Best,
~~~~~~~~~~~~
Michael Shirts
Assistant Professor
Department of Chemical Engineering
University of Virginia
[hidden email]
(434)-243-1821




Re: New Test Set

Shirts, Michael (mrs5pt)
In reply to this post by Roland Schulz

> Yes having that set of inputs is needed. Should we start a wiki page to
> start listing all the inputs we want to include? Or what would be the best
> way to collaborative creating this set of inputs?

A wiki page is good.  I can commit to spending an hour or so this weekend
discussing each parameter and what issues can come up. I can start putting
together a variety of .mdp and .top files.

To what extent can we start with the older test suite, and modify it?

> Well even if the tests take ~2000CPU-hours, I think we (maybe even Erik by
> himself) have the resources to run this weekly.

We can definitely come up with a large number of physical tests that can
come in under that amount of time.

>> For the first set of tests, I can imagine that it would be nice to be able
>> to look at the outputs of the tests, and diff different outputs
>> corresponding to different code versions to help track down changes were.
>> But I'm suspicious about making the evaluation of these tests decided on
>> automatically at first.  I agree such differences should EVENTUALLY be
>> automated, but I'd prefer several months of investigation and discussion
>> before deciding exactly what "correct" is.
>>
>
> I think a wrong reference value is better than no reference value. Even a
> wrong reference value would allow us to detect if e.g. different compilers
> give significant different results (maybe some give the correct value).
> Also it would help to avoid adding additional bugs. Of course we shouldn't
> release the test set to the outside before we are relative sure that it
> actually correct.

I'm just saying that we shouldn't, at first, have automatic failures if the
reference values change.  I very much agree with SAVING the results with
each build so that changes can be tracked down.
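
A sketch of the "save rather than fail" idea: archive each build's test outputs under the commit that produced them, so later diffs are trivial (the directory names are placeholders):

    import os, shutil, subprocess

    def archive_results(output_dir="test-output", archive_root="test-archive"):
        # label the archived copy with the git commit that produced it
        sha = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).strip().decode()
        dest = os.path.join(archive_root, sha)
        shutil.copytree(output_dir, dest)
        return dest

Comparing two code versions is then an ordinary diff between two archived directories.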
 
> I have the current C++ tests (those written by Teemu) running under
> valgrind in Jenkins. It wasn't very hard to write a few suppression rules
> to make valgrind not report any false positives. Now Jenkins
> can automatically fail the build if the code has any memory errors.
> Obviously one woudn't run any of the long running tests with valgrind. But
> for the unit tests I think it might be very useful to catch bugs.

In that case, running some subset of the cases under valgrind certainly makes
sense.

> My reason for delaying the 4.6 release would not be to improve the 4.6
> release. I agree with you we probably can't guarantee that the reference
> value are correct in time anyhow, so we probably wouldn't even want to ship
> the tests with 4.6. My worry is that as soon as 4.6 is out the focus is on
> adding new cool features instead of working on these boring tasks we should
> do, because they help us in the long run. E.g. if we would have agreed that
> we don't have a 4.6 release, the C++ conversion would most likely be much
> further along. And I don't see how we can create an incentive mechanism to
> work on these issues without somehow coupling it to releases.

But if we talk about releases without deciding on anything, then everybody
keeps developing new stuff.  At some point, we need to agree on what
conditions 4.6 will satisfy, and get it out the door, so everyone who is
developing new stuff has to do it within the context of master.

Best,
~~~~~~~~~~~~
Michael Shirts
Assistant Professor
Department of Chemical Engineering
University of Virginia
[hidden email]
(434)-243-1821


Re: New Test Set

Shirts, Michael (mrs5pt)
In reply to this post by Erik Lindahl
> 1) Low-level tests that specifically check the output for several sets of
> input for a *module*, i.e. calling routines rather than running a simulation.
> The point is that this will isolate errors either to a specific module, or to
> modules it depends on. However, when those modules too should have tests it
> will be a 5-min job to find what file+routine the bug is likely to be in.

I'm not exactly sure how this works.  If we test modules, we have to be
writing a bunch of new code that interacts with the modules directly, and so
we may miss things that happen in actual simulation cases.  I sort of favor
just actually running grompp and mdrun, because the errors that occur will
be the errors that people actually see. I haven't found an error yet that is
particularly hard to isolate to a given file pretty quickly once it is
identified.  Perhaps for particular things (testing that dozens of
inner loops give consistent numbers) this makes sense, but I'm not sure it
makes sense for everything.  I'm not sure how you _just_ test pressure
control, for example.

Best,
~~~~~~~~~~~~
Michael Shirts
Assistant Professor
Department of Chemical Engineering
University of Virginia
[hidden email]
(434)-243-1821




Re: New Test Set

David van der Spoel
In reply to this post by Shirts, Michael (mrs5pt)
On 2012-02-11 17:22, Shirts, Michael (mrs5pt) wrote:

>> A while ago I talked about this with Nicu who works with Siewert Jan
>> Marrink. He is a software engineer by training and suggested the
>> following. For each input parameter you take extreme values (e.g.
>> timestep 5 fs and 0.001 fs) and a random value in between.
>
> I'm not sure this make sense, because lots of things will fail at 5 fs that
> SHOULD fail (Though perhaps we need to make the failure more graceful).  And
> it's not clear that simulations run between 1 and 0.001 fs add any
> information.  So I think that each parameter is a bit more complicated.
>
That's beside the point. The point is that there are 100^10 different mdp
parameter combinations. The proposed setup ensures that all settings are
tested, although obviously not all combinations.

And about time steps, we previously had problems where a shorter time
step gave worse energy conservation (also beside the point).





--
David van der Spoel, Ph.D., Professor of Biology
Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone: +46184714205.
[hidden email]    http://folding.bmc.uu.se

Re: New Test Set

Shirts, Michael (mrs5pt)
> That's beside the point. The point is there are 100^10 different mdp
> parameter combinations. The proposed setup makes that all settings are
> tested, although obviously not all combinations.

Agreed.  But picking extremes and random values is not necessarily the way
to do it, because of the 100^10, maybe 0.4995*100^10 are SUPPOSED to crash and
another 0.4995*100^10 are essentially duplicating each other.  Which leaves
100^8 options to test, but random/extreme selection won't necessarily pick
out the right fraction.

A variant of this that will likely work is: for EACH parameter, pick out the
1-5 options that really represent something different and should give valid
results, then randomly select from the 100^5 combinations, with rules to
filter out invalid combinations.
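
A sketch of that selection scheme (the parameters, the values and the validity rule below are only illustrative, not a proposed test matrix):

    import itertools, random

    # for each parameter, a handful of options that really represent something different
    options = {
        "integrator":  ["md", "sd"],
        "coulombtype": ["PME", "Reaction-Field", "Cut-off"],
        "tcoupl":      ["no", "v-rescale", "nose-hoover"],
        "constraints": ["none", "h-bonds", "all-bonds"],
    }

    def valid(combo):
        # rules that filter out combinations grompp would (rightly) reject;
        # hypothetical example: sd does its own temperature coupling
        return not (combo["integrator"] == "sd" and combo["tcoupl"] != "no")

    keys = sorted(options)
    combos = [dict(zip(keys, values))
              for values in itertools.product(*(options[k] for k in keys))]
    test_set = random.sample([c for c in combos if valid(c)], 20)
    for combo in test_set:
        print("  ".join("%s=%s" % (k, combo[k]) for k in keys))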

I can't seem to edit the wiki, so I can't start a page on this, but perhaps
we should start with the mdout.mdp, and start dissecting exactly what
combinations of options need to be tested, with which different topologies,
and see how many combinations we actually get.

Roland (or other person), perhaps under "Building and Testing" there should
be a third section where we discuss this?

> And about time steps, we previously had problems where a shorter time
> step gave worse energy conservation (also beside the point).

But that will require longer physical simulations to detect.

Best,
~~~~~~~~~~~~
Michael Shirts
Assistant Professor
Department of Chemical Engineering
University of Virginia
[hidden email]
(434)-243-1821




Re: New Test Set

David Bowman
In reply to this post by Roland Schulz
Hi,
I have a few comments and suggestions that I hope will help. I have been designing software and building products in industry for 30 years, in the areas of compilers, optimizers, databases, imaging and other software where testing is a critical issue. In many cases this involved thousands or even millions of copies. I have been working on performance improvements to GROMACS, and the difficulty of knowing whether I broke something has been a big issue.

1) Unit and system test suites are developed as products go through release cycles and as bugs are being fixed. Trying to build tests after a release or near the end of a release cycle does not work because of limitations on resources, the desire to get the release out or because of work on future releases.

2) The important thing for 4.6 is to catch the big things and priority 1 bugs through the GROMACS community before it goes out and put in place a simple manageable process going forward for testing. I would not release it with a priority 1 bug. (e.g. if Reaction Field is broken I would not release 4.6).

3) If conventions for a test framework can quickly be established and the developers of 4.6 functionality and recent critical bug fixes have tests that they have used for their work that may reasonably be adapted to the new test framework they should be added.

4) The foundation of test suites is the unit/module level. If the base is not sound then the system/simulation will not be sound. At the module level it is almost always possible to determine what the 'right' answer is with relatively small datasets in short amounts of time for many parameter sets. A set of pass/fail test programs at module level is a good long term goal. It seems unrealistic to me to try to go back and build module level tests for GROMACS with limited resources, money and time. It is more realistic to add tests as bugs are fixed and new features are added.

5) Keep the test framework very simple. There are a number of 'test environments' and products available. I have used a lot of them in commercial product development; I would not recommend the use of any of them for GROMACS.

The US National Institute of Standards and Technology has test suites for compilers, databases and many other software technologies consisting of numerous programs that test functionality at a module level, where each test program performs multiple tests sequentially, logs the results as text and returns a program exit code indicating pass/fail. Test programs are typically grouped by functional area and run as a set of scripts. The tests that fail are clearly identified in the log, and reading the test program code is simple because the tests are executed sequentially. Sometimes there are input/output files that are validated against a reference file. This simple model is used extensively for ANSI Standards compliance.

Using this model, unit/module level testing would be easy, and application/simulation level tests could be run as scripts with diffs against configurable 'tolerances', with the same style of logging and pass/fail return method.

What would be needed: test program and test script templates, conventions for test logging, individual program/script exit codes (pass/fail), conventions for dataset naming, and configurable tolerances for higher level application/simulation tests (a small template along these lines is sketched after point 7 below). The process also would need to be managed.
 
6) Decisions about how to test functionality at the module level are, in industry, distributed among the developers that work to build initial or new features and to fix bugs. For future releases of GROMACS, developers should be required to add regression test programs that will verify, on a pass/fail basis, all new functionality and priority 1 bug fixes.

Developers usually have the basic logic of such test programs in their own test programs, simulations, files and criteria that they use to validate their own development. That test work is usually thrown away. It should not be too time consuming to use this code as the basis of regression tests in the future if developers understand that it is a requirement for code submission. Tests must be designed, and the developers implementing new features and fixing bugs should take primary responsibility for how to test changes. Unfortunately this takes time, costs money, and slows the development and publication process.
 
7) It is important to try to get a second opinion about how something important is to be fixed or a new feature is to be added and have someone 'sign off' on another person's work and test program/script prior to incorporation into a release. It would be good to integrate this 'signoff' into the process.
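
As mentioned under point 5, a test program template in that style can be very small; the checks and the tolerance below are placeholders, and the only conventions are the PASS/FAIL log lines and the exit status:

    import sys

    def approx_equal(a, b, tol=1e-6):
        # configurable tolerance, relative to the larger magnitude
        return abs(a - b) <= tol * max(abs(a), abs(b), 1.0)

    def check_example_energy():
        # placeholder for a real module-level computation and its reference value
        return approx_equal(2.0 + 2.0, 4.0)

    def check_example_sorting():
        return sorted([3, 1, 2]) == [1, 2, 3]

    def main():
        checks = [("example-energy", check_example_energy),
                  ("example-sorting", check_example_sorting)]
        failed = 0
        for name, check in checks:              # tests run sequentially
            try:
                ok = check()
            except Exception:
                ok = False
            print("%s  %s" % ("PASS" if ok else "FAIL", name))
            failed += 0 if ok else 1
        return 1 if failed else 0               # exit status signals overall pass/fail

    if __name__ == "__main__":
        sys.exit(main())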

Regards,
David
 
 
   
 
On Sat, Feb 11, 2012 at 7:35 AM, David van der Spoel <[hidden email]> wrote:
On 2012-02-10 20:17, Roland Schulz wrote:



Yes, having that set of inputs is needed. Should we start a wiki page to
list all the inputs we want to include? Or what would be the
best way to collaboratively create this set of inputs?

A while ago I talked about this with Nicu, who works with Siewert Jan Marrink. He is a software engineer by training and suggested the following. For each input parameter you take extreme values (e.g. timestep 5 fs and 0.001 fs) and a random value in between. Then there would be 3^N different parameter combinations for N parameters, which probably is way too many combinations, even if N were only 20. Therefore you now pick a subset of, say, 200 or 1000 out of these 3^N possible tests, and this becomes the test set. With such a set-up it is quite easy to see that we'd at least test the extreme values, which is where things can possibly go wrong. A few of these tests would actually be prohibited by grompp too, but in all likelihood not nearly enough.

At the time when Nicu & I discussed this we even considered publishing this, since I am not aware of another scientific code that has such rigorous testing tools.



   Longer term, we should look more at validating code at a physical level.
   Clearly testing energy conservation is a good idea for integrators; it's
   fairly sensitive.  I think we need to discuss a bit more about how to
   evaluate energy conservation.  This actually can take a fair amount
   of time,
   and I'm thinking this is something that perhaps should wait for 5.0.
     For
   thermostats and barostats, I'm working on a very sensitive test of
   ensemble
   validity. I'll email a copy to the list when it's ready to go (1-2
   weeks?),
   and this is something that can also be incorporated in an integrated
   testing
   regime, but again, these sorts of tests will take multiple hours, not
   seconds.   That sort of testing level can't be part of the day to
   day build.

Well even if the tests take ~2000CPU-hours, I think we (maybe even Erik
by himself) have the resources to run this weekly.



    > - What are the requirements for the new test set? E.g. how easy
   should it
    > be to see whats wrong when a test fails?

   For the first set of tests, I can imagine that it would be nice to
   be able
   to look at the outputs of the tests, and diff different outputs
   corresponding to different code versions to help track down changes
   were.
   But I'm suspicious about making the evaluation of these tests decided on
   automatically at first.  I agree such differences should EVENTUALLY be
   automated, but I'd prefer several months of investigation and discussion
   before deciding exactly what "correct" is.


I think a wrong reference value is better than no reference value. Even
a wrong reference value would allow us to detect if e.g. different
compilers give significant different results (maybe some give the
correct value). Also it would help to avoid adding additional bugs. Of
course we shouldn't release the test set to the outside before we are
relative sure that it actually correct.

    > Should the test support being run
    > under valgrind? Other?

   Valgrind is incredibly slow and can fail for weird reasons -- I'm
   not sure
   it would add much to do it under valgrind.

I have the current C++ tests (those written by Teemu) running under
valgrind in Jenkins. It wasn't very hard to write a few suppression
rules to make valgrind not report any false positives. Now Jenkins
can automatically fail the build if the code has any memory errors.
Obviously one woudn't run any of the long running tests with valgrind.
But for the unit tests I think it might be very useful to catch bugs.

   I DON'T think we should have any test set that starts to look at more
   complicated features right now -- it will take months to get that working,
   and we need to get 4.6 out of the door on the order of weeks, so we can
   move on to the next improvements. 4.6 doesn't have to be perfectly
   flawless, as long as it's closer to perfect than 4.5.


My reason for delaying the 4.6 release would not be to improve the 4.6
release. I agree with you that we probably can't guarantee that the reference
values are correct in time anyhow, so we probably wouldn't even want to
ship the tests with 4.6. My worry is that as soon as 4.6 is out, the
focus will be on adding cool new features instead of working on these boring
tasks we should do because they help us in the long run. E.g. if we
had agreed that we wouldn't have a 4.6 release, the C++ conversion
would most likely be much further along. And I don't see how we can
create an incentive mechanism to work on these issues without somehow
coupling it to releases.

Roland


   Best,
   ~~~~~~~~~~~~
   Michael Shirts
   Assistant Professor
   Department of Chemical Engineering
   University of Virginia
   [hidden email]
   (434)-243-1821










--
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309




--
David van der Spoel, Ph.D., Professor of Biology
Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone: +46184714205.
[hidden email]    http://folding.bmc.uu.se


Re: New Test Set

Rossen Apostolov-3
In reply to this post by Shirts, Michael (mrs5pt)
Hi Michael,

On 2012-02-11 18.16, Shirts, Michael (mrs5pt) wrote:
> I can't seem to edit the wiki, so I can't start a page on this,
You should be able to do it now.

Cheers,
Rossen

Re: New Test Set

Roland Schulz
In reply to this post by Erik Lindahl


On Sat, Feb 11, 2012 at 11:31 AM, Shirts, Michael (mrs5pt) <[hidden email]> wrote:
> 1) Low-level tests that specifically check the output for several sets of
> input for a *module*, i.e. calling routines rather than running a simulation.
> The point is that this will isolate errors either to a specific module, or to
> modules it depends on. However, since those modules should have tests too, it
> will be a 5-min job to find what file+routine the bug is likely to be in.

I'm not exactly sure how this works.  If we test modules, we have to be
writing a bunch of new code that interacts with the modules directly, and so
we may miss things that happen in actual simulation cases.  I sort of favor
just actually running grompp and mdrun, because the errors that occur will
be the errors that people actually see. I haven't found an error yet that is
particularly hard to isolate to a given file pretty quickly once it is
identified. Perhaps for particular things (testing that dozens of
inner loops give consistent numbers) this makes sense, but I'm not sure it
makes sense for everything.  I'm not sure how you _just_ test pressure
control, for example.
One could read virial values from a data file, pass them to the pressure control, and then check that the pressure control behaves as it should.
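
Just to illustrate the shape of such a "golden data" test, here is a rough Python sketch. It uses the textbook Berendsen scaling formula as a stand-in for whatever routine would actually be under test, and the numbers are invented; in a real test the inputs would be read from a data file and the reference value stored next to it, in proper GROMACS units.

def scalar_pressure(ekin, virial, volume):
    """Scalar pressure from the virial: P = 2/(3V) * (Ekin - Xi)."""
    return 2.0 / (3.0 * volume) * (ekin - virial)

def berendsen_scaling(pressure, p_ref, dt, tau_p, compressibility):
    """Berendsen box-scaling factor mu = (1 - dt/tau_p * kappa * (P_ref - P))**(1/3)."""
    return (1.0 - dt / tau_p * compressibility * (p_ref - pressure)) ** (1.0 / 3.0)

def test_berendsen_scaling():
    # Invented "golden" inputs; the units are only meant to be mutually consistent.
    ekin, virial, volume = 3.0e4, 1.5e4, 125.0
    p = scalar_pressure(ekin, virial, volume)
    mu = berendsen_scaling(p, p_ref=1.0, dt=0.002, tau_p=1.0, compressibility=4.5e-5)
    mu_reference = 1.0000024   # reference value stored alongside the test input
    assert abs(mu - mu_reference) < 1e-6, (mu, mu_reference)

test_berendsen_scaling()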

I think unit tests would be very useful for the planned conversion to C++. We will have to convert individual modules one at a time, and I assume that the overall program will be broken often. In that case unit tests might enable us to test individual modules without the overall program working.

Roland
 

Best,
~~~~~~~~~~~~
Michael Shirts
Assistant Professor
Department of Chemical Engineering
University of Virginia
[hidden email]
(434)-243-1821










--
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309


Re: New Test Set

David Bowman
Roland, Erik, et al.,
I am not sure if my posting was well understood so I will give an example.
 
When one has an extremely complex piece of software that has few tests and a need to constantly add new features, there are only a couple of ways of testing it. It cannot be well tested at the higher levels, only at the module level.
 
For example, I would liken the testing of GROMACS to testing a high-level language compiler with complex syntax, many modules and many options. Compilers have language features and syntax that require runtime support and function libraries. They also have supporting library routines or support programs. No one would ever try to build a test suite out of high-level use cases. What is done instead is to create separate test programs that exercise the language syntax without direct calls to internal modules, or that call library routines directly (e.g. nonbonded, etc.). Separate test programs are written for things like object file creation. These tests have 'right' answers. No attempt is made to test 'use cases' except in the case of priority 1 bugs. This reduces the scope of testing to a manageable and defined level. It requires breaking the program into functional groups: internal routines, external routines and support programs. This is not difficult to do with GROMACS. GROMACS has .mdp options that amount to 'language' syntax, internal code routines, external programs and file formats, just like a compiler. This is the test approach that I think should be taken.
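
If it helps, those "functional groups" could literally be written down as data that drives a test runner; the grouping below is only an invented illustration in Python, not a proposal for the actual categories:

# Invented illustration: each group exercises one family of .mdp "syntax"
# with a small input and a stored right answer, analogous to a compiler
# test suite grouped by language feature.
FUNCTIONAL_GROUPS = {
    "integrators":    ["md", "sd", "bd", "steep"],
    "thermostats":    ["berendsen", "nose-hoover", "v-rescale"],
    "barostats":      ["berendsen", "parrinello-rahman"],
    "electrostatics": ["cut-off", "pme", "reaction-field"],
}

def enumerate_cases(groups):
    """Yield (group, option) pairs; each pair would map to one small test input."""
    for group, options in sorted(groups.items()):
        for option in options:
            yield group, option

for group, option in enumerate_cases(FUNCTIONAL_GROUPS):
    print("%-15s %s" % (group, option))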
 
In cases where I have worked on compilers that were 'legacy' and mature products, we prioritized the removal of all serious P1 and P2 bugs and adding new features. When new features were added, developers started adding higher-level tests. In response to serious P1 bugs in 'broken generated programs' (e.g. simulations), user-provided code (or simulations) was added to the test suite. Once I inherited such a compiler: we had many new features to add, many bugs, little documentation, virtually no test suite and a short time to release. With 5 people we added all the new features in 5 months, fixed all the priority 1 and 2 bugs, and built test programs that exercised or called directly all key core modules.
 
I hope this helps.
 
Cheers
David
 