Parallelize storage reading/writing #429
I do not use this personally, so no strong opinion either way. I did have a small remark on the proposal:
You probably do not want to use a lossy format like …
Thank you so much for your input, @sroet. I wasn't aware of that, and I think it's a fair concern. We may want to save the checkpoint in … We also need to think about how to handle the restraint reweighting, which relies on the solute trajectory. A solution that would let us store it in low precision would be to add the option to compute and store distances and energies at runtime. Actually, this feature was transferred here from YANK with the …
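As a rough illustration of the compute-at-runtime idea mentioned above (the function and argument names below are hypothetical, not YANK's API): the restraint distance can be computed from the full-precision positions while they are still in memory, so only a scalar per iteration needs to be stored and the reweighting no longer depends on the precision of the solute trajectory.

```python
import numpy as np

def restraint_distance(positions_nm, atom_i, atom_j):
    """Distance between two restrained atoms, computed from the full-precision
    positions available at runtime (i.e. before any lossy trajectory write)."""
    delta = np.asarray(positions_nm[atom_i]) - np.asarray(positions_nm[atom_j])
    return float(np.linalg.norm(delta))

# The scalar could then be appended to the main storage each iteration, e.g.
# distances[iteration, replica_index] = restraint_distance(positions, i, j)
```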
That would be ideal if it doesn't become a huge pain to maintain. You may find, however, that you need to refactor the storage layer API, which would require more effort to update the NetCDFStorageLayer. In particular, the current storage layer API has some issues with either (1) being ultra-specific in exposing interfaces for particular variables rather than general ones, or (2) missing interfaces for some variables. It might be useful to first evaluate whether the storage layer API needs to be refactored before implementing a new storage class.
One more consideration: ideally, for maximum performance, data to be written to disk would be handed off to another thread that flushes it asynchronously. That would allow the simulation to continue on the GPU while the disk write is pending. We want to do this in a way that allows us to robustly resume from the last checkpoint.
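A minimal sketch of that kind of asynchronous writer, assuming only a generic `write_frame` callable (everything here is illustrative, not existing openmmtools code): the queue decouples the simulation loop from disk I/O, and the sentinel plus `join()` give a well-defined point at which the last checkpoint is known to be on disk, which is what robust resuming needs.

```python
import queue
import threading

def start_async_writer(write_frame):
    """Spawn a background thread that flushes queued frames to disk."""
    frame_queue = queue.Queue()

    def _worker():
        while True:
            frame = frame_queue.get()
            if frame is None:  # sentinel: shut down at a checkpoint boundary
                frame_queue.task_done()
                break
            write_frame(frame)  # blocking disk I/O happens off the main thread
            frame_queue.task_done()

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    return frame_queue, thread

# Usage sketch: the simulation thread only enqueues and keeps driving the GPU.
# frame_queue.put(positions)   # cheap for the simulation loop
# frame_queue.put(None)        # request shutdown
# frame_queue.join()           # only now mark the iteration as safely checkpointed
```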
Thank you very much for the tips, @kyleabeauchamp.
That's a good point. I think I've seen netCDF APIs that allow writing in parallel. I'll check whether they present extra difficulties in terms of deployment/maintenance. This may help greatly with the trajectory I/O. For the data that doesn't fit the numerical array type (such as object serialization of thermo states, MCMC moves, and metadata), I'd still suggest a refactoring; our current serialization strategy to netcdf for generic types is quite limited and cumbersome. One of the advantages I see in splitting the trajectory over multiple files in a common format (including AMBER NetCDF) is that people will be able to just load the replica trajectories in VMD or PyMol instead of having to go through the expensive …
It looks like … The only advantage of individual trajectories at this point would be easy visualization. For reference: https://unidata.github.io/netcdf4-python/netCDF4/index.html#section13.
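For context, this is roughly what the parallel I/O interface linked above looks like; the file and variable names are placeholders, and it assumes netCDF4/libnetcdf built against MPI (which is exactly the build-variant issue raised below). Each MPI rank writes its own replica's slice of the same file, so writes never overlap.

```python
# Sketch only: requires mpi4py plus netCDF4/libnetcdf compiled with parallel support.
from mpi4py import MPI
from netCDF4 import Dataset

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

nc = Dataset("replicas.nc", "w", parallel=True, comm=comm, info=MPI.Info())
nc.createDimension("replica", comm.Get_size())
nc.createDimension("frame", None)  # unlimited, grows with the simulation

energies = nc.createVariable("energies", "f8", ("frame", "replica"))
energies.set_collective(True)  # appending along an unlimited dimension must be collective
energies[0, rank] = 0.0        # each rank writes only its own column
nc.close()
```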
Ah! But conda-forge/libnetcdf is built only with …
I took a look at what it would take to get an mpich variant built on the netcdf4 feedstock. It looks quite easy, and they seem open to external contributions, so I'm thinking that right now we just want to leave the format untouched and try parallel writing plus playing with the chunk size. I think this will give us a huge speedup for the least effort. If there are no objections, I'd close this issue and open a new one about implementing parallel netcdf4. We can always re-open it if it turns out we can improve things from there with one of these strategies.
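Playing with the chunk size only requires passing `chunksizes` when the variables are created. The layout below is hypothetical (dimension names and sizes are made up), but it shows the knob being discussed: chunking by frame so that one frame across all replicas is a single contiguous block on disk.

```python
from netCDF4 import Dataset

n_replicas, n_atoms = 24, 5000  # illustrative sizes

nc = Dataset("storage.nc", "w")
nc.createDimension("frame", None)
nc.createDimension("replica", n_replicas)
nc.createDimension("atom", n_atoms)
nc.createDimension("spatial", 3)

# One chunk per frame (all replicas together): reading a replica trajectory or a
# state trajectory then touches the same number of chunks per frame.
positions = nc.createVariable(
    "positions", "f4", ("frame", "replica", "atom", "spatial"),
    chunksizes=(1, n_replicas, n_atoms, 3),
    zlib=True,  # compression is still available because this file is not opened in parallel
)
nc.close()
```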
Can we discuss Monday? I'm not sure parallel NetCDF4 will solve all our problems, and especially not the problems of processing NetCDF files to extract trajectories from them afterwards, or of reducing file sizes. It may not solve our asynchronous-write-thread issue either. And it will introduce problems down the road if we move away from mpich as our sole parallelization strategy.
Sure! I just found out that parallel netcdf supports neither compression nor chunk caching. This might be OK for the checkpoint trajectory, but it might degrade performance for the solute trajectory if only a few MPI processes are used. Because we already know replicas are not going to overwrite each other, splitting the solute trajectory into multiple files (xtc or netcdf files with lossy compression) might be a good strategy after all. It would also make it possible to read/write with pure Python threads without running into HDF5 locking problems.
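A sketch of the per-replica split being discussed, using MDTraj (the directory layout and file names are invented here): each replica appends to its own xtc file, so plain Python threads never contend for a single HDF5 lock.

```python
import os
import mdtraj as md

def open_replica_writers(n_replicas, directory="solute_trajectories"):
    """Open one lossy, solute-only xtc writer per replica (hypothetical layout)."""
    os.makedirs(directory, exist_ok=True)
    return [
        md.formats.XTCTrajectoryFile(os.path.join(directory, f"replica_{i}.xtc"), "w")
        for i in range(n_replicas)
    ]

# Each writer can then be driven from its own thread without HDF5 locking:
# writers[replica_index].write(xyz=solute_positions_nm)
```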
From the usability point of view, one of the comments I received on my previous projects is that the number of files generated by the software was too high. This is particularly true for new users - they see lots of files and get overwhelmed, because they want to find "their results" easily and (at first) do not care about contextual data meant for reproducibility. If we anticipate this to be a problem here (for example, one YAML file per MC move), I have found a combination of solutions that mitigates this "shock" factor. With decreasing relevance: …
I am aware that this might not be a problem for established users willing to devote some time to understanding what they are doing, but it could help those who only want to try out whether this thing works or not.
@jaimergp and I had a brief chat. If there are no objections, the current plan is to have @jaimergp implement splitting the solute-only trajectory over multiple … EDIT: By "re-evaluating the other possible solutions" I mean re-evaluating whether splitting the object serialization of, for example, thermo states and MCMC moves into multiple files is worth the effort, and whether it is a good idea to use parallel netcdf for the checkpoint trajectory and the other netcdf-stored arrays.
For the record, we are going to merge #434, which splits solute-only trajectories out of the main netcdf file and saves them into separate …
Are you still working on this? Without parallel NetCDF, the storage file is such a pain in the neck. @jaimergp, I can see how storing everything in the same file can be advantageous. At the same time, it complicates more advanced analyses. In general, access to the storage file is too slow. Just a few examples from our recent Hamiltonian replica exchange simulations of membranes: …
In my opinion, a lot of things would be much easier if each replica wrote its own trajectory file. The energies, mixing statistics, thermodynamic states, and (maybe) checkpoints could still live in a (much smaller) netcdf file. This would allow easier access to all the information. I would also not mind implementing some of the refactoring if you agree that it would make things better. Or am I missing something obvious?
I totally agree about the current pain of using the single NetCDF file, and am hoping we can split both the checkpoint files and solute-only files into separate XTC or DCD files, leaving only the smaller numpy arrays in the NetCDF file, without too much pain. Longer term, we would love to switch to some sort of distributed database that can handle multiple calculations streaming to it at once, but we haven't started to design this yet. |
@jaimergp implemented the parallel XTC files and we have now merged them into a separate feature branch (…). If you want to try the current state, let me know and I can update it with the new code from …
For this, the bulk of the calculation is in imaging the trajectory, I believe. Having parallel xtc files means we'll have to penalize reading the trajectory along a state in favor of replica trajectories. The netcdf file instead allows blocking the file by frame, which means reading state or replica trajectories is roughly equally expensive, so this issue may turn out to be quite complicated performance-wise.
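To make that trade-off concrete, here is a rough sketch of how a state trajectory could be reassembled once the frames live in per-replica xtc files (the `states` variable name and file layout are assumptions, not the current reporter's schema). Note that it has to open and walk every replica file, which is exactly the cost being weighed against the frame-blocked netcdf layout.

```python
import mdtraj as md
import numpy as np
from netCDF4 import Dataset

def extract_state_trajectory(state_index, n_replicas, topology,
                             directory="solute_trajectories"):
    # Hypothetical bookkeeping file: states[frame, replica] = thermodynamic state index.
    with Dataset("bookkeeping.nc") as nc:
        states = np.asarray(nc.variables["states"][:])

    # Reading along a state requires loading every replica trajectory.
    replica_trajs = [
        md.load(f"{directory}/replica_{i}.xtc", top=topology) for i in range(n_replicas)
    ]

    frames = []
    for frame_index in range(states.shape[0]):
        replica_index = int(np.where(states[frame_index] == state_index)[0][0])
        frames.append(replica_trajs[replica_index][frame_index])
    return frames[0].join(frames[1:])
```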
I will write down here some of the ideas that came out of our talk with @Olllom. There are performance issues when resuming calculations: all of the replicas accessing the monolithic NetCDF file at the same time becomes a bottleneck. It's not clear whether this is due to (A) read operations being blocking, or (B) saturating the I/O bandwidth of the machine. @Olllom, could you check this? If option (A) turns out to be the problem, we could devise a short-term quick "fix" before the DAG-aware refactoring. This could consist of an optional keyword (disabled by default) that would write all the data needed for resuming calculations into separate files (format to be determined), in addition to the NetCDF file. We would make sure there are no performance problems with this fast-access alternative (e.g. multiple files, memory cache, etc.). Would you be OK with this approach, @Olllom? If the I/O bandwidth is being saturated, well, I don't think there's much else we could do...
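As a sketch of what such an opt-in resume cache could look like (file names, contents, and the flag itself are all hypothetical): each replica dumps the small amount of data it needs to resume into its own file, in addition to the main NetCDF storage, so resuming never has to contend for the monolithic file.

```python
import os
import numpy as np

def write_resume_cache(iteration, replica_index, state_index, energies_row,
                       directory="resume_cache"):
    """Duplicate the per-replica data needed to resume into a small, fast file."""
    os.makedirs(directory, exist_ok=True)
    np.savez(
        os.path.join(directory, f"replica_{replica_index}.npz"),
        iteration=iteration,
        state_index=state_index,
        energies=np.asarray(energies_row),
    )

def read_resume_cache(replica_index, directory="resume_cache"):
    """Each replica resumes from its own file instead of the monolithic NetCDF."""
    data = np.load(os.path.join(directory, f"replica_{replica_index}.npz"))
    return int(data["iteration"]), int(data["state_index"]), data["energies"]
```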
I'm planning to make some changes in the `Reporter` to split the two monolithic `netcdf` files into more manageable chunks. This is what I'm thinking now:

- Solute-only trajectory: one `xtc` file per replica.
- Checkpoint trajectory: one `xtc` file per replica.
- … (`System`) files.
- `metadata`: one or many YAML files.
- Everything else (`any_numeric_array_of_variable_dimension`): a single `netcdf` file for all.

I think splitting over multiple small files (whose directory structure is hidden by the `Reporter` class) means reading operations will be faster. Moreover, we'll be able to parallelize writing on disk, which is currently a big bottleneck for multi-replica methods.

Question: Should we keep the old reporter around, maybe renamed to `NetCDFReporter`, to allow reading data generated with previous versions, or do we anticipate that installing a previous version of `openmmtools`/`yank` will suffice for our needs?
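A rough sketch of what a reporter facade over that layout might look like; the class name, methods, and file names below are purely illustrative, not the planned API. The point is that the directory structure stays an implementation detail behind the class.

```python
import os
import yaml
from netCDF4 import Dataset

class MultiFileReporter:
    """Illustrative facade: per-replica xtc files, YAML metadata, and one
    netcdf file for the remaining numeric arrays."""

    def __init__(self, directory):
        self._directory = directory
        os.makedirs(directory, exist_ok=True)
        arrays_path = os.path.join(directory, "arrays.nc")
        mode = "a" if os.path.exists(arrays_path) else "w"
        self._arrays = Dataset(arrays_path, mode)  # everything that stays numeric

    def replica_trajectory_path(self, replica_index, solute_only=True):
        subdir = "solute" if solute_only else "checkpoint"
        return os.path.join(self._directory, subdir, f"replica_{replica_index}.xtc")

    def write_metadata(self, metadata):
        with open(os.path.join(self._directory, "metadata.yaml"), "w") as f:
            yaml.safe_dump(metadata, f)

    def close(self):
        self._arrays.close()
```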