Parse CONECT records in PDB files to extract explicit bond connectivity for HETATM molecules (ligands, cofactors, etc.)
Store explicit bonds on Data::Geometry and use them in the layer factory instead of OpenBabel distance-based bond perception
Deduplicate symmetric CONECT entries (PDB specifies each bond in both directions)
Problem
The PDB parser ignored CONECT records entirely. Bond connectivity for ligands and non-protein HETATM molecules was inferred purely from interatomic distances via OpenBabel's ConnectTheDots. This can produce incorrect bond orders and topologies for metal complexes, unusual geometries, and molecules where distance alone is ambiguous.
Changes
src/Data/Geometry.h / Geometry.C
Added addBond(int i, int j) — stores a normalised (i < j), deduplicated bond by 0-based atom index
Added hasBonds() and bonds() accessors
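As an illustration of the normalise-and-deduplicate behaviour described for addBond, here is a minimal standalone sketch. The class name BondList and the choice of std::set are assumptions made for this example; the actual storage inside Data::Geometry may differ.

```cpp
#include <set>
#include <utility>

// Hypothetical stand-in for the bond storage added to Data::Geometry.
class BondList {
public:
    // Store a bond between two 0-based atom indices.
    // Returns false for self-bonds, negative indices, or duplicates.
    bool addBond(int i, int j) {
        if (i == j || i < 0 || j < 0) return false;
        if (i > j) std::swap(i, j);            // normalise so that i < j
        return m_bonds.emplace(i, j).second;   // the set rejects duplicates
    }

    bool hasBonds() const { return !m_bonds.empty(); }

    const std::set<std::pair<int, int>>& bonds() const { return m_bonds; }

private:
    std::set<std::pair<int, int>> m_bonds;
};
```

Because each pair is ordered before insertion, the two directions of a symmetric CONECT entry (1→3 and 3→1) collapse into a single stored bond.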
src/Parser/PdbParser.h
Added m_serialToAtom: maps PDB atom serial number → (Geometry*, atom index) for HETATM atoms
Added m_conectPairs: accumulates raw serial-number pairs from CONECT lines
src/Parser/PdbParser.C
Parses atom serial number (cols 7–11) for each ATOM/HETATM record
Records the serial→atom mapping before appending each non-water HETATM atom
New CONECT branch collects up to four bonded serials per line
After the main parse loop, resolves serial pairs into Geometry::addBond calls; cross-geometry bonds (e.g. protein–ligand) are skipped as they cannot be represented in the current layer model
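The CONECT branch can be sketched as a fixed-column read: the source serial sits in columns 7–11 and up to four bonded serials follow in columns 12–16, 17–21, 22–26 and 27–31. The helper names readSerial and parseConect are hypothetical, and the sketch uses std::string rather than the Qt string types the parser presumably uses.

```cpp
#include <string>
#include <utility>
#include <vector>

// Read a 5-column integer field starting at `offset` (0-based).
// Returns -1 when the field is blank, missing, or not numeric.
int readSerial(const std::string& line, std::size_t offset) {
    if (offset >= line.size()) return -1;
    const std::string field = line.substr(offset, 5);
    const std::size_t first = field.find_first_not_of(' ');
    if (first == std::string::npos) return -1;
    const std::size_t last = field.find_last_not_of(' ');
    int value = 0;
    for (std::size_t k = first; k <= last; ++k) {
        if (field[k] < '0' || field[k] > '9') return -1;
        value = value * 10 + (field[k] - '0');
    }
    return value;
}

// Collect (source, target) serial pairs from one CONECT line.
// Non-CONECT lines yield an empty vector.
std::vector<std::pair<int, int>> parseConect(const std::string& line) {
    std::vector<std::pair<int, int>> pairs;
    if (line.compare(0, 6, "CONECT") != 0) return pairs;
    const int source = readSerial(line, 6);            // cols 7-11
    if (source < 0) return pairs;
    for (std::size_t offset = 11; offset <= 26; offset += 5) {
        const int target = readSerial(line, offset);   // cols 12-31
        if (target >= 0) pairs.emplace_back(source, target);
    }
    return pairs;
}
```

The returned pairs are still raw serial numbers. Resolving them to atom indices, and dropping pairs whose two serials map to different geometries, would happen in a separate pass after the main loop, as described above.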
src/Layer/LayerFactory.C
When geometry.hasBonds(), adds explicit OB bonds and calls PerceiveBondOrders only (skips ConnectTheDots)
Falls back to the existing distance-based perception when no explicit bonds are present
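The branching in the layer factory might look roughly like the sketch below. It is written as a template so it compiles without OpenBabel; Mol is assumed to provide the OBMol member functions AddBond(begin, end, order), ConnectTheDots() and PerceiveBondOrders(). The function name perceiveBonds and the single-bond placeholder order are assumptions, not the code in LayerFactory.C.

```cpp
#include <utility>
#include <vector>

// Mol is assumed to expose the OBMol calls used below.
template <class Mol>
void perceiveBonds(Mol& mol,
                   const std::vector<std::pair<int, int>>& explicitBonds) {
    if (explicitBonds.empty()) {
        // No CONECT data: fall back to distance-based connectivity.
        mol.ConnectTheDots();
    } else {
        for (const auto& bond : explicitBonds) {
            // Bonds are stored 0-based; OpenBabel atom indices start at 1.
            // Order 1 is a placeholder until orders are perceived.
            mol.AddBond(bond.first + 1, bond.second + 1, 1);
        }
    }
    // Bond orders are perceived in both cases once the topology is set.
    mol.PerceiveBondOrders();
}
```

The index shift is the detail most worth checking in review: the bonds on Data::Geometry are 0-based, so passing them to OpenBabel unchanged would connect the wrong atoms.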
Limitations / Future Work
CONECT records that reference ATOM serial numbers (e.g. disulfide bonds) are not yet resolved — protein chain atoms are not tracked in m_serialToAtom
Cross-geometry bonds (protein–ligand covalent attachments) are silently ignored
Bond order from CONECT is not encoded (PDB format does not carry it); OpenBabel's PerceiveBondOrders is still used to assign orders once topology is fixed
Here's a prioritized breakdown of improvements, roughly in order of impact:
High Impact
CONECT record parsing
The parser currently ignores CONECT records entirely, so bond connectivity for ligands and HETATM molecules is never read. Bonds are inferred by distance instead, which is error-prone for metal complexes, unusual geometries, and disulfide bridges. PdbParser.C has no CONECT branch at all.
MODEL/ENDMDL support for trajectories
NMR structures and MD snapshots use MODEL/ENDMDL blocks. The parser only breaks on END/ENDMDL but doesn't track model indices, so multi-model files collapse into one geometry. Alanine-path.pdb in the test suite documents this limitation.
Occupancy and B-factor storage
These fields are read off the line but never stored or propagated. B-factors are essential for visualising crystallographic quality and disorder; occupancy is needed for correctly handling partially-occupied sites.
Medium Impact
Insertion code handling
PDB residue numbering can include insertion codes (e.g. 47A, 47B) to avoid renumbering legacy structures. The parser ignores the iCode column (col 27), which can cause residues with insertion codes to collide or be dropped.
SSBOND records
Disulfide bonds are declared explicitly in SSBOND records but are not parsed — they'd need to be inferred from CONECT or distance, which is less reliable.
Alternate conformers
Only conformer 'A' is kept (if (!alternateLocation.isEmpty() && alternateLocation != 'A') continue). Supporting multiple conformers would be useful for NMR ensembles and crystal structures with disorder.
Lower Impact / Housekeeping
Remove the SecStruc.dat side-effect — writing a file to the same directory as the input is surprising behaviour and will silently fail on read-only paths.
Remove qDebug() from production code — there's a qDebug() << "!!!Secondary structure order..." left in the secondary-structure pass (line 315 of PdbParser.C).
Metadata extraction — HEADER, TITLE, EXPDTA records are skipped. Surfacing the PDB entry ID, title, and experimental method would improve file identification.
Better error reporting — the parser uses a goto error pattern with minimal messages; structured validation errors would help diagnose malformed files.