Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support 4-site water models #70

Open
mattwthompson opened this issue Feb 21, 2025 · 1 comment
Open

Support 4-site water models #70

mattwthompson opened this issue Feb 21, 2025 · 1 comment
Labels
feature New feature or request

Comments

@mattwthompson
Copy link
Member

Is your feature request related to a problem?

I sometimes encounter PDB files with the dummy atoms/virtual sites of water included. For example:

ATOM  11750  O   WAT  2896     -13.545 -10.314 -15.861  1.00  0.00
ATOM  11751  H1  WAT  2896     -14.018  -9.482 -15.868  1.00  0.00
ATOM  11752  H2  WAT  2896     -14.076 -10.896 -16.405  1.00  0.00
ATOM  11753 EPW  WAT  2896     -13.674 -10.283 -15.931  1.00  0.00

Describe the solution you'd like

Be able to load these

Describe alternatives you've considered

Yell at my computer

Additional context

@j-wags
Copy link
Member

j-wags commented Feb 24, 2025

Relevant discussion from internal slack channel

Matt Thompson

any hope of loading a PDB with 4-site water?
ATOM 11750 O WAT 2896 -13.545 -10.314 -15.861 1.00 0.00
ATOM 11751 H1 WAT 2896 -14.018 -9.482 -15.868 1.00 0.00
ATOM 11752 H2 WAT 2896 -14.076 -10.896 -16.405 1.00 0.00
ATOM 11753 EPW WAT 2896 -13.674 -10.283 -15.931 1.00 0.00

Josh Mitchell

Ooh good question. We'd just want to throw away the virtual site right? This would be a relatively easy feature to add I think, could have a dummy_atom attribute on AtomDefinition that would match the same way as a normal atom but then not get included in the topology

Or just a list of names in an attribute on ResidueDefinitions whose records are discarded when they're encountered

Matt Thompson

I think we’d want to trash it, since the toolkit doesn’t represent things that aren’t atoms
MDTraj and OpenMM just let you have an “atom” with an atomic number of 0, which is something I advocated the toolkit not allow and I stand by this even if it causes people headaches (edited)

In this case I think there’s i.e. some re-indexing that would need to happen which might cause some things to get hairy

Josh Mitchell

The original serial number and pdb index would still be stored in atom metadata for the remaining atoms, so it wouldn't be too bad i don't think

Matt Thompson

I also assume an answer to the question is “not right now no lol”

Josh Mitchell

Yeah would definitely need some new code in Pablo. Could you add an issue? It would be a great feature that would be very easy to add

Chapin Cavender

Throwing away the virtual sites is how I do this for the protein benchmarks with 4-site water.
https://github.com/openforcefield/proteinbenchmark/blob/9d6b2828a98d1d45e265cf75ca4b61de5eb60b38/proteinbenchmark/system_setup.py#L596-L632

system_setup.py
if water_model in {"opc", "tip4p-fb"}:
# Create a new OpenMM topology without water virtual sites
no_vsite_topology = app.Topology()
no_vsite_topology.setPeriodicBoxVectors(
openmm_topology.getPeriodicBoxVectors()
Show more
https://github.com/[openforcefield/proteinbenchmark](https://github.com/openforcefield/proteinbenchmark?email_source=slack)?email_source=slack|openforcefield/proteinbenchmarkopenforcefield/proteinbenchmark | Added by GitHub

Jeffrey Wagner

I’m not sure we should support this at all. Silently throwing away any data from a PDB file seems really dangerous. Especially if there are existing tools in our ecosystem that handle this with already defined behavior, it may be best for users to rely on things like Chapin’s workaround rather than introduce dangerous behavior into our own ecosystem.

Josh Mitchell

We already throw away tonnes of information:

  • All records except CRYST1, CONECT, HETATM, TER, MODEL, and ENDMDL (including useful info like secondary structure, anisotropic b factors...)
  • Alt ids other than '' and 'A'
  • Models other than the first
  • Formal charges (because in practice they're unreliable)
    We could very easily support virtual sites in a residue definition, but not actually provide any residues that have them - this would allow the user to load residues with virtual sites only when explicitly desired by specifying their own residues. There could be a very simple api point like "CcdCache.with_dummy_atoms(resname, atomnames)"

Matt Thompson

I should be able get my little toy notebook with a band-aid like Chapin’s solution. I think long-term it’ll be necessary to take this case seriously. I see tradeoffs with automatically stripping them out or erroring out at the first EPW - whatever is done I think it should be done carefully since this is not an esoteric example

Jennifer A Clark

I'm not sure about virtual sites for molecules other than TIP4P or with simulation engines other than LAMMPS, but my experience with implementing TIP4P in LAMMPS is that I only need the 3 sites and the 4th is defined by the pair-style so not only is it redundant to include but also an extra step because I need to remove it.

Jeffrey Wagner

I’m still hesitant about throwing out some ATOM/HETATM lines but keeping others, but I do see the value to users. I’d be interested to see a prototype of Josh’s proposed workaround that

  • throws away the vsites
  • requires users to put something non-default in the loading command that acknowledges that vsites are in the input and will be thrown away

Matt Thompson

maybe some missing context here is that we made a design decision a while back to not include virtual sites in the toolkit’s representation. The short of it is that the toolkit is for representing atoms and molecules and virtual sites aren’t well-defined without a force field, so they should exist only after parametrization. Interchange has a lot of sludge for bridging the toolkit topology and force field information into structures that engines like, which unfortunately is not at all standardized so it’s hard to deal with in a consistent way. I haven’t looked into this much for LAMMPS, but OpenMM, GROMACS, and (I think) Amber each add virtual sites as discrete particles into their topological representations, so the “post-parametrization” state (Interchange and various things hidden inside of it) necessarily represents things differently than the “pre-parametrization” state (openff.toolkit.Topology and ForceField). It’s messy but tractable, and at least isolates the complexity to Interchange and the interoperability chaos
This puts the Pablo developers in a tough situation when virtual sites are shoved into PDB files (the scope of the question here - just loading a PDB file I found that lists out water with 4 particles per molecule): they can’t really be loaded into the toolkit since there’s nowhere for them to go, it’s iffy to ignore them since they can hold some information and re-indexing etc. is a headache, and just erroring out on those files reduces the functionality of the tool. I’m not pushing for a course change in the current project plan, but this is feedback that I hope will be integrated into future plans

@Yoshanuikabundi Yoshanuikabundi added feature New feature or request and removed feature New feature or request labels Feb 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants