Ghost rows from empty KG node labels matching empty AUTO labels #4

@turbomam

Summary

The label_to_curie lookup in auto_terms_table.py doesn't filter out empty or blank node names from the merged KG. This produces ghost rows: rows with no id, no label, and no original_spans, but with a spurious kg_id match.

Example

This row in auto_terms_by_microbe_with_kg_ids.csv:

UNKNOWN_MICROBE,,,SAMN08731602:HGPBHO_47720,1,,0,0,1,0

Notice: empty id, empty label, empty original_spans — but kg_id=SAMN08731602:HGPBHO_47720 and in_kg=1. This is not a real AUTO term. It's a phantom match.

Root cause

Where SAMN08731602:HGPBHO_47720 actually comes from

This is a Bakta genome annotation gene node in the kg-microbe merged knowledge graph. Here's the trail:

  1. The Bakta transform processes genome annotations from NCBI BioSamples
  2. For each gene, it creates a composite ID via create_gene_id(): f"{samn_id}:{locus_tag}" -> SAMN08731602:HGPBHO_47720
  3. This particular gene is a "hypothetical protein" -- it has no gene symbol, so its name column in merged-kg_nodes.tsv is empty

You can verify this yourself:

grep "HGPBHO_47720" merged-kg_nodes.tsv

Output:

SAMN08731602:HGPBHO_47720	biolink:Gene		hypothetical protein		Graph

Notice the empty name column (3rd field) -- the description says "hypothetical protein" but the name is blank.
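
If grep isn't handy, the same check can be sketched in Python. This is a minimal sketch, assuming the nodes file is tab-separated with the name in the third column, as in the output above. The helper name empty_name_ids and the second sample row (EXAMPLE:gene_1) are hypothetical and not taken from the KG.

```python
import csv
import io


def empty_name_ids(tsv_file, name_index=2):
    """Yield the id (first column) of every row whose name column is blank."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) > name_index and row[name_index].strip() == "":
            yield row[0]


# First row is the grep output from this issue; the second row is made up
# to show a node that does have a name.
SAMPLE = (
    "SAMN08731602:HGPBHO_47720\tbiolink:Gene\t\thypothetical protein\t\tGraph\n"
    "EXAMPLE:gene_1\tbiolink:Gene\tdnaA\texample gene with a symbol\t\tGraph\n"
)

found = list(empty_name_ids(io.StringIO(SAMPLE)))
print(found)
```

To run it against the real file, pass an open file handle instead of the in-memory sample, for example open("merged-kg_nodes.tsv", newline="").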

How it becomes a ghost row

In auto_terms_table.py, this block builds the lookup:

label_to_curie = (
    nodes[[kg_label_col, kg_id_col]]
    .drop_duplicates()
    .assign(_label_norm=lambda x: x[kg_label_col].str.strip().str.lower())
    .set_index("_label_norm")[kg_id_col]
    .to_dict()
)

The problem: every node with an empty (or whitespace-only) name normalizes to the key "", and to_dict() keeps only the last value it sees for a duplicate key, so the dict ends up with "" -> "SAMN08731602:HGPBHO_47720". Then, when an OntoGPT entity has a missing or empty label, the lookup label_to_curie.get(normalize_auto_label(x), "") matches it to this Bakta gene ID.

There are thousands of Bakta gene nodes with empty names in the KG (SAMN08731602 alone has 9,388 genes), so which ID lands in the empty-string slot depends only on row order.
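
The collision can be reproduced on a toy table. This is a minimal sketch, assuming the column names id and name; every row except the one from this issue is made up for illustration.

```python
import pandas as pd

# Toy stand-in for merged-kg_nodes.tsv. Only the SAMN ID comes from this issue;
# the EXAMPLE:* rows are hypothetical.
nodes = pd.DataFrame(
    {
        "id": ["EXAMPLE:gene_a", "SAMN08731602:HGPBHO_47720", "EXAMPLE:taxon_1"],
        "name": ["", "  ", "Escherichia coli"],
    }
)
kg_id_col, kg_label_col = "id", "name"

# Same pipeline as in auto_terms_table.py, with no filtering of blank labels.
label_to_curie = (
    nodes[[kg_label_col, kg_id_col]]
    .drop_duplicates()
    .assign(_label_norm=lambda x: x[kg_label_col].str.strip().str.lower())
    .set_index("_label_norm")[kg_id_col]
    .to_dict()
)

# Both blank names collapse to the key "", and the later row overwrites the earlier one.
print(label_to_curie)
print(label_to_curie.get("", ""))
```

Any entity whose normalized label is the empty string now resolves to whichever blank-name node came last in the table.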

Suggested fix

Filter out empty labels before building the dict:

label_to_curie = (
    nodes[[kg_label_col, kg_id_col]]
    .drop_duplicates()
    .assign(_label_norm=lambda x: x[kg_label_col].str.strip().str.lower())
    .query('_label_norm != ""')          # <-- ADD THIS LINE
    .set_index("_label_norm")[kg_id_col]
    .to_dict()
)
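
One possible variant of this fix also drops missing values. This is a sketch under an assumption about how the nodes file is loaded: if it is read with pandas defaults, blank names may arrive as NaN rather than "", and a != "" comparison alone would let those rows through. The function name build_label_to_curie, the default column names, and the EXAMPLE:* rows are hypothetical.

```python
import pandas as pd


def build_label_to_curie(nodes, kg_label_col="name", kg_id_col="id"):
    """Build a normalized-label -> id dict, skipping blank and missing labels."""
    labelled = (
        nodes[[kg_label_col, kg_id_col]]
        .drop_duplicates()
        .assign(_label_norm=lambda x: x[kg_label_col].str.strip().str.lower())
    )
    # Keep only rows whose normalized label is present and non-empty.
    keep = labelled["_label_norm"].notna() & (labelled["_label_norm"] != "")
    return labelled[keep].set_index("_label_norm")[kg_id_col].to_dict()


nodes = pd.DataFrame(
    {
        "id": [
            "EXAMPLE:gene_a",
            "EXAMPLE:gene_b",
            "SAMN08731602:HGPBHO_47720",
            "EXAMPLE:taxon_1",
        ],
        "name": ["", None, "  ", "Escherichia coli"],
    }
)

fixed = build_label_to_curie(nodes)
print(fixed)
```

With the blank and missing names filtered out, the dict has no empty-string key left for an empty AUTO label to hit.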

You might also want to skip the kg_id assignment for entities with empty labels:

df["kg_id"] = df["label"].apply(
    lambda x: label_to_curie.get(normalize_auto_label(x), "") if normalize_auto_label(x) else ""
)
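
The guard can be exercised on its own, even against a lookup that still contains the bad "" key. This is a minimal sketch: the normalize_auto_label below is a hypothetical stand-in (strip and lowercase, empty string for non-strings), since the real helper lives in auto_terms_table.py and may behave differently. lookup_kg_id and the EXAMPLE:* ID are also made up.

```python
import pandas as pd


def normalize_auto_label(value):
    """Hypothetical stand-in for the real helper in auto_terms_table.py."""
    if not isinstance(value, str):
        return ""
    return value.strip().lower()


def lookup_kg_id(label, label_to_curie):
    """Return the KG id for a label, or "" when the label is blank or unknown."""
    key = normalize_auto_label(label)
    return label_to_curie.get(key, "") if key else ""


# Deliberately includes the problematic empty-string key to show the guard holds.
label_to_curie = {
    "": "SAMN08731602:HGPBHO_47720",
    "escherichia coli": "EXAMPLE:taxon_1",
}

df = pd.DataFrame({"label": ["Escherichia coli ", "", None, "unknown term"]})
df["kg_id"] = df["label"].apply(lambda x: lookup_kg_id(x, label_to_curie))
print(df)
```

Normalizing once and reusing the key also avoids calling normalize_auto_label twice per row.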

How to verify

New to the command line? See docs/shell-guide.md (#5) for how to open a terminal, install prerequisites, and use grep.

Prerequisites check

Before running the commands below, open a terminal and make sure these three tools work:

git --version      # should print git version 2.x.x
python --version   # should print Python 3.x.x (try python3 if python fails)
grep --version     # should print grep (GNU grep) 3.x or grep (BSD grep)

If grep is not found on Windows: you probably have GitHub Desktop but not Git for Windows. They're different things. GitHub Desktop is a GUI app that bundles its own private copy of Git -- it does NOT put grep or other Unix tools on your system PATH. Install Git for Windows separately (keep the default options during installation, especially "Git from the command line and also from 3rd-party software"), then close and reopen your terminal.

Verification commands

# cd to wherever your CSV and merged-kg files are
cd ~/Downloads   # or wherever you have them

# 1. Count ghost rows (kg_id present but no label)
grep -c ",,SAMN" auto_terms_by_microbe_with_kg_ids.csv

# 2. Show the ghost rows
grep ",,SAMN" auto_terms_by_microbe_with_kg_ids.csv

# 3. Count empty-name nodes in the merged KG
#    (this needs awk -- if it doesn't work in PowerShell,
#    open Git Bash instead: Win key -> type "Git Bash" -> Enter)
awk -F'\t' '$3 == ""' merged-kg_nodes.tsv | wc -l

# 4. After applying the fix, verify no ghost rows remain
python auto_terms_table.py
grep -c ",,SAMN" auto_terms_by_microbe_off_claude_20251031.csv
# Should print 0
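
For anyone without grep, the ghost-row count can also be sketched in Python. This assumes the CSV has a header row with columns named label and kg_id (those names appear in this issue, but the exact header is an assumption). The function name ghost_rows, the microbe column name, and the second sample row are hypothetical.

```python
import csv
import io


def ghost_rows(csv_file):
    """Return rows that carry a kg_id but have a blank label."""
    reader = csv.DictReader(csv_file)
    return [
        row
        for row in reader
        if (row.get("kg_id") or "").strip() and not (row.get("label") or "").strip()
    ]


# Hypothetical, shortened header; the first data row mirrors the ghost row above.
SAMPLE = (
    "microbe,id,label,kg_id,in_kg\n"
    "UNKNOWN_MICROBE,,,SAMN08731602:HGPBHO_47720,1\n"
    "EXAMPLE_MICROBE,AUTO:example,example term,,0\n"
)

ghosts = ghost_rows(io.StringIO(SAMPLE))
print(len(ghosts))

# Against a real file:
# with open("auto_terms_by_microbe_with_kg_ids.csv", newline="") as fh:
#     print(len(ghost_rows(fh)))
```

Unlike the grep pattern, this check does not depend on the kg_id starting with SAMN, so it would also catch ghost rows pointing at other empty-name nodes.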
