Add CONECT record parsing to PDB parser #37

Open
Leaf-Lin wants to merge 1 commit into nutjunkie:master from Leaf-Lin:feature/pdb_improve_claude
@Leaf-Lin
Summary

  • Parse CONECT records in PDB files to extract explicit bond connectivity for HETATM molecules (ligands, cofactors, etc.)
  • Store explicit bonds on Data::Geometry and use them in the layer factory instead of OpenBabel distance-based bond perception
  • Deduplicate symmetric CONECT entries (PDB specifies each bond in both directions)

Problem

The PDB parser ignored CONECT records entirely. Bond connectivity for ligands and non-protein HETATM molecules was inferred purely from interatomic distances via OpenBabel's ConnectTheDots. This produces incorrect bond orders and topologies for metal complexes, unusual geometries, and molecules where distance alone is ambiguous.

Changes

src/Data/Geometry.h / Geometry.C

  • Added addBond(int i, int j) — stores a normalised (i < j), deduplicated bond by 0-based atom index
  • Added hasBonds() and bonds() accessors
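
A minimal sketch of how the normalised, deduplicated bond storage could look. `BondStore` is a hypothetical stand-in for the relevant part of `Data::Geometry`, not the code in this PR. Rejecting self-bonds and negative indices, and returning a `bool` from `addBond`, are assumptions made for the illustration.

```cpp
#include <set>
#include <utility>

// Hypothetical stand-in for the bond storage added to Data::Geometry.
class BondStore {
public:
    // Stores the bond as (min, max) so that (i, j) and (j, i) collapse
    // into a single entry. Returns true only if the bond is new.
    bool addBond(int i, int j)
    {
        if (i == j || i < 0 || j < 0) return false;  // assumption: reject invalid pairs
        if (i > j) std::swap(i, j);                  // normalise so that i < j
        return m_bonds.insert({i, j}).second;        // std::set drops duplicates
    }

    bool hasBonds() const { return !m_bonds.empty(); }

    const std::set<std::pair<int, int>>& bonds() const { return m_bonds; }

private:
    std::set<std::pair<int, int>> m_bonds;  // 0-based atom indices
};
```

Using an ordered set keyed on the normalised pair means the symmetric CONECT entries need no separate deduplication pass.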

src/Parser/PdbParser.h

  • Added m_serialToAtom: maps PDB atom serial number → (Geometry*, atom index) for HETATM atoms
  • Added m_conectPairs: accumulates raw serial-number pairs from CONECT lines
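
The two members above could be resolved into bonds roughly as follows. This is a sketch under assumptions: `Geometry`, `AtomRef` and `resolveConectPairs` are hypothetical names, and standard containers are used here instead of whatever container types the parser actually uses.

```cpp
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical minimal geometry holding normalised, deduplicated bonds.
struct Geometry {
    std::set<std::pair<int, int>> bonds;
    void addBond(int i, int j)
    {
        if (i == j) return;
        if (i > j) std::swap(i, j);
        bonds.insert({i, j});
    }
};

// What a PDB serial number maps to: the owning geometry and a 0-based index.
struct AtomRef {
    Geometry* geometry;
    int index;
};

// Turns raw serial-number pairs into addBond calls. Pairs that reference an
// untracked serial, or that span two geometries, are skipped.
// Returns the number of pairs that were resolved.
int resolveConectPairs(const std::map<int, AtomRef>& serialToAtom,
                       const std::vector<std::pair<int, int>>& conectPairs)
{
    int resolved = 0;
    for (const auto& pair : conectPairs) {
        auto a = serialToAtom.find(pair.first);
        auto b = serialToAtom.find(pair.second);
        if (a == serialToAtom.end() || b == serialToAtom.end()) continue;  // untracked serial
        if (a->second.geometry != b->second.geometry) continue;            // cross-geometry bond
        a->second.geometry->addBond(a->second.index, b->second.index);
        ++resolved;
    }
    return resolved;
}
```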

src/Parser/PdbParser.C

  • Parses atom serial number (cols 7–11) for each ATOM/HETATM record
  • Records the serial→atom mapping before appending each non-water HETATM atom
  • New CONECT branch collects up to four bonded serials per line
  • After the main parse loop, resolves serial pairs into Geometry::addBond calls; cross-geometry bonds (e.g. protein–ligand) are skipped as they cannot be represented in the current layer model
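
For illustration, the CONECT branch could read its fields by fixed column position, roughly as sketched below. The function names are hypothetical and `std::string` is used rather than the parser's own string type. The column layout follows the PDB format: source serial in cols 7–11, bonded serials in cols 12–16, 17–21, 22–26 and 27–31.

```cpp
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Reads a fixed-width integer field; returns -1 if the field is blank,
// missing, or not numeric.
int parseSerialField(const std::string& line, std::size_t start, std::size_t width)
{
    if (start >= line.size()) return -1;
    const std::string field = line.substr(start, width);
    if (field.find_first_not_of(' ') == std::string::npos) return -1;
    char* end = nullptr;
    const long value = std::strtol(field.c_str(), &end, 10);
    if (end == field.c_str()) return -1;  // no digits consumed
    return static_cast<int>(value);
}

// Returns (source, bonded) serial pairs for one CONECT line, or an empty
// vector for any other record type.
std::vector<std::pair<int, int>> parseConectLine(const std::string& line)
{
    std::vector<std::pair<int, int>> pairs;
    if (line.compare(0, 6, "CONECT") != 0) return pairs;

    const int source = parseSerialField(line, 6, 5);  // cols 7-11
    if (source < 0) return pairs;

    for (std::size_t k = 0; k < 4; ++k) {             // up to four bonded serials
        const int target = parseSerialField(line, 11 + 5 * k, 5);
        if (target >= 0) pairs.push_back({source, target});
    }
    return pairs;
}
```

Slicing by column rather than splitting on whitespace matters for large files, where five-digit serials fill their fields and run together with no separating space.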

src/Layer/LayerFactory.C

  • When geometry.hasBonds(), adds explicit OB bonds and calls PerceiveBondOrders only (skips ConnectTheDots)
  • Falls back to the existing distance-based perception when no explicit bonds are present

Limitations / Future Work

  • CONECT records that reference ATOM serial numbers (e.g. disulfide bonds) are not yet resolved — protein chain atoms are not tracked in m_serialToAtom
  • Cross-geometry bonds (protein–ligand covalent attachments) are silently ignored
  • Bond order from CONECT is not encoded (PDB format does not carry it); OpenBabel's PerceiveBondOrders is still used to assign orders once topology is fixed

@Leaf-Lin
Author

Here's a prioritized breakdown of improvements, roughly in order of impact:

High Impact

  1. CONECT record parsing
    The parser currently ignores CONECT records entirely. This means bond connectivity for ligands and HETATM molecules is never read; bonds are probably inferred by distance instead, which is error-prone for metal complexes, unusual geometries, and disulfide bridges. PdbParser.C has no CONECT branch at all.

  2. MODEL/ENDMDL support for trajectories
    NMR structures and MD snapshots use MODEL/ENDMDL blocks. The parser only breaks on END/ENDMDL but doesn't track model indices, so multi-model files collapse into one geometry. Alanine-path.pdb in the test suite documents this limitation.

  3. Occupancy and B-factor storage
    These fields are read off the line but never stored or propagated. B-factors are essential for visualising crystallographic quality and disorder; occupancy is needed for correctly handling partially-occupied sites.

Medium Impact

  1. Insertion code handling
    PDB residue numbering can include insertion codes (e.g. 47A, 47B) to avoid renumbering legacy structures. The parser ignores the iCode column (col 27), which can cause residues with insertion codes to collide or be dropped.

  2. SSBOND records
    Disulfide bonds are declared explicitly in SSBOND records but are not parsed — they'd need to be inferred from CONECT or distance, which is less reliable.

  3. Alternate conformers
    Only conformer 'A' is kept (if (!alternateLocation.isEmpty() && alternateLocation != 'A') continue). Supporting multiple conformers would be useful for NMR ensembles and crystal structures with disorder.
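
To illustrate item 1, one way to keep residues such as 47A and 47B distinct is to make the insertion code part of the residue key. `ResidueKey` and `parseResidueKey` below are hypothetical names, not part of the existing parser. The columns follow the PDB format: chain ID in col 22, residue number in cols 23–26, iCode in col 27.

```cpp
#include <cstdlib>
#include <string>
#include <tuple>

// Hypothetical residue identifier that includes the insertion code.
struct ResidueKey {
    char chain;
    int number;
    char insertionCode;

    bool operator<(const ResidueKey& other) const
    {
        return std::tie(chain, number, insertionCode)
             < std::tie(other.chain, other.number, other.insertionCode);
    }
    bool operator==(const ResidueKey& other) const
    {
        return std::tie(chain, number, insertionCode)
            == std::tie(other.chain, other.number, other.insertionCode);
    }
};

// Reads chain ID (col 22), residue number (cols 23-26) and insertion code
// (col 27) from an ATOM/HETATM line. Missing fields keep their defaults.
ResidueKey parseResidueKey(const std::string& line)
{
    ResidueKey key{' ', 0, ' '};
    if (line.size() > 21) key.chain = line[21];
    if (line.size() > 22) key.number = std::atoi(line.substr(22, 4).c_str());
    if (line.size() > 26) key.insertionCode = line[26];
    return key;
}
```

Because the key has a strict ordering, it can be used directly in a `std::map` or `std::set`, so two residues that share a number but differ in insertion code no longer collide.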

Lower Impact / Housekeeping

  1. Remove the SecStruc.dat side-effect — writing a file to the same directory as the input is surprising behaviour and will silently fail on read-only paths.

  2. Remove qDebug() from production code — there's a qDebug() << "!!!Secondary structure order..." left in the secondary-structure pass (line 315 of PdbParser.C).

  3. Metadata extraction — HEADER, TITLE, EXPDTA records are skipped. Surfacing the PDB entry ID, title, and experimental method would improve file identification.

  4. Better error reporting — the parser uses a goto error pattern with minimal messages; structured validation errors would help diagnose malformed files.


1 participant