Summary
The label_to_curie lookup in auto_terms_table.py doesn't filter out empty/blank node names from the merged KG, which causes ghost rows — rows with no id, no label, no original_spans, but a spurious kg_id match.
Example
This row in auto_terms_by_microbe_with_kg_ids.csv:
UNKNOWN_MICROBE,,,SAMN08731602:HGPBHO_47720,1,,0,0,1,0
Notice: empty id, empty label, empty original_spans — but kg_id=SAMN08731602:HGPBHO_47720 and in_kg=1. This is not a real AUTO term. It's a phantom match.
Root cause
Where SAMN08731602:HGPBHO_47720 actually comes from
This is a Bakta genome annotation gene node in the kg-microbe merged knowledge graph. Here's the trail:
- The Bakta transform processes genome annotations from NCBI BioSamples
- For each gene, it creates a composite ID via create_gene_id(): f"{samn_id}:{locus_tag}" -> SAMN08731602:HGPBHO_47720
- This particular gene is a "hypothetical protein" -- it has no gene symbol, so its name column in merged-kg_nodes.tsv is empty
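For illustration only, the composite-ID step can be sketched as below. The f-string is the one quoted above; the function signature and argument names are assumptions, and the real create_gene_id() in the Bakta transform may take different inputs.

```python
def create_gene_id(samn_id: str, locus_tag: str) -> str:
    """Hypothetical sketch: join a BioSample accession and a locus tag into one node ID."""
    return f"{samn_id}:{locus_tag}"


# The ghost row's kg_id has exactly this shape.
gene_id = create_gene_id("SAMN08731602", "HGPBHO_47720")
print(gene_id)  # SAMN08731602:HGPBHO_47720
```

Because the ID is built from the accession and locus tag alone, a gene node gets a valid ID even when it has no name.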
You can verify this yourself:
grep "HGPBHO_47720" merged-kg_nodes.tsv
Output:
SAMN08731602:HGPBHO_47720 biolink:Gene hypothetical protein Graph
Notice the empty name column (3rd field) -- the description says "hypothetical protein" but the name is blank.
How it becomes a ghost row
In auto_terms_table.py, this block builds the lookup:
label_to_curie = (
    nodes[[kg_label_col, kg_id_col]]
    .drop_duplicates()
    .assign(_label_norm=lambda x: x[kg_label_col].str.strip().str.lower())
    .set_index("_label_norm")[kg_id_col]
    .to_dict()
)
The problem: every node with an empty name normalizes to the same key, "", so the dict ends up with a single "" entry that points at whichever empty-name node was indexed last (here "" -> "SAMN08731602:HGPBHO_47720"). Then when an OntoGPT entity has a missing or empty label, the lookup label_to_curie.get(normalize_auto_label(x), "") matches it to this Bakta gene ID.
There are thousands of Bakta gene nodes with empty names in the KG (SAMN08731602 alone has 9,388 genes), so which gene lands in the empty-string slot depends only on row order.
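A minimal reproduction of the collision on toy data may help. The column names "id" and "name" and the EXAMPLE:GENE_0001 node are made up for illustration; only SAMN08731602:HGPBHO_47720 comes from the real KG.

```python
import pandas as pd

# Toy stand-in for merged-kg_nodes.tsv (assumed column names "id" and "name").
nodes = pd.DataFrame(
    {
        "id": ["NCBITaxon:562", "EXAMPLE:GENE_0001", "SAMN08731602:HGPBHO_47720"],
        "name": ["Escherichia coli", "", ""],
    }
)

# Same chain as in auto_terms_table.py, with the column names hard-coded.
label_to_curie = (
    nodes[["name", "id"]]
    .drop_duplicates()
    .assign(_label_norm=lambda x: x["name"].str.strip().str.lower())
    .set_index("_label_norm")["id"]
    .to_dict()
)

# Both empty-name rows collapse onto the "" key; the later row overwrites the earlier one.
print(label_to_curie[""])  # SAMN08731602:HGPBHO_47720

# An entity whose label normalizes to "" now "matches" a gene node.
ghost_kg_id = label_to_curie.get("", "")
```

Reordering the two empty-name rows changes which ID is returned, which is why the ghost kg_id looks arbitrary.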
Suggested fix
Filter out empty labels before building the dict:
label_to_curie = (
    nodes[[kg_label_col, kg_id_col]]
    .drop_duplicates()
    .assign(_label_norm=lambda x: x[kg_label_col].str.strip().str.lower())
    .query('_label_norm != ""')  # <-- ADD THIS LINE
    .set_index("_label_norm")[kg_id_col]
    .to_dict()
)
You might also want to skip the kg_id assignment for entities with empty labels:
df["kg_id"] = df["label"].apply(
    lambda x: label_to_curie.get(normalize_auto_label(x), "") if normalize_auto_label(x) else ""
)
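Both guards can be tried together on toy data, as in the sketch below. The normalize_auto_label() here is a hypothetical stand-in (the project's real helper may behave differently), the EXAMPLE:* nodes are invented, and the boolean .loc mask is used as an equivalent of the .query() line. The added fillna("") is an assumption that covers the case where blank names are loaded as NaN rather than "".

```python
import pandas as pd


def normalize_auto_label(value):
    """Hypothetical stand-in: strip and lowercase; missing values become ""."""
    if not isinstance(value, str):
        return ""
    return value.strip().lower()


nodes = pd.DataFrame(
    {
        "id": ["NCBITaxon:562", "SAMN08731602:HGPBHO_47720", "EXAMPLE:blank", "EXAMPLE:missing"],
        "name": ["Escherichia coli", "", "   ", None],
    }
)

label_to_curie = (
    nodes[["name", "id"]]
    .drop_duplicates()
    .assign(_label_norm=lambda x: x["name"].fillna("").str.strip().str.lower())
    .loc[lambda x: x["_label_norm"] != ""]  # drop empty, whitespace-only, and missing names
    .set_index("_label_norm")["id"]
    .to_dict()
)

# Entities with empty or missing labels never reach the dict lookup.
df = pd.DataFrame({"label": ["escherichia coli ", "", None]})
df["kg_id"] = df["label"].apply(
    lambda x: label_to_curie.get(normalize_auto_label(x), "") if normalize_auto_label(x) else ""
)
print(df["kg_id"].tolist())  # ['NCBITaxon:562', '', '']
```

Either guard alone removes the ghost rows in this toy case; keeping both protects against empty keys being reintroduced on one side later.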
How to verify
New to the command line? See docs/shell-guide.md (#5) for how to open a terminal, install prerequisites, and use grep.
Prerequisites check
Before running the commands below, open a terminal and make sure these three tools work:
git --version # should print git version 2.x.x
python --version # should print Python 3.x.x (try python3 if python fails)
grep --version # should print grep (GNU grep) 3.x or grep (BSD grep)
If grep is not found on Windows: you probably have GitHub Desktop but not Git for Windows. They're different things. GitHub Desktop is a GUI app that bundles its own private copy of Git -- it does NOT put grep or other Unix tools on your system PATH. Install Git for Windows separately (keep the default options during installation, especially "Git from the command line and also from 3rd-party software"), then close and reopen your terminal.
Verification commands
# cd to wherever your CSV and merged-kg files are
cd ~/Downloads # or wherever you have them
# 1. Count ghost rows (kg_id present but no label)
grep -c ",,SAMN" auto_terms_by_microbe_with_kg_ids.csv
# 2. Show the ghost rows
grep ",,SAMN" auto_terms_by_microbe_with_kg_ids.csv
# 3. Count empty-name nodes in the merged KG
# (this needs awk -- if it doesn't work in PowerShell,
# open Git Bash instead: Win key -> type "Git Bash" -> Enter)
awk -F'\t' '$3 == ""' merged-kg_nodes.tsv | wc -l
# 4. After applying the fix, verify no ghost rows remain
python auto_terms_table.py
grep -c ",,SAMN" auto_terms_by_microbe_off_claude_20251031.csv
# Should print 0
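If grep and awk are not available, roughly the same checks can be done in Python using only the standard library. This is a sketch under assumptions: the function names are made up, and the CSV is assumed to have a header row with columns named label and kg_id.

```python
import csv


def count_ghost_rows(csv_path):
    """Count rows with a kg_id but an empty label (assumes 'label' and 'kg_id' headers)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return sum(
            1
            for row in csv.DictReader(handle)
            if not (row.get("label") or "").strip() and (row.get("kg_id") or "").strip()
        )


def count_empty_name_nodes(tsv_path):
    """Count lines whose third tab-separated field is empty, mirroring the awk command."""
    count = 0
    with open(tsv_path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # awk treats a missing third field as empty, so short lines count too.
            if len(fields) < 3 or fields[2] == "":
                count += 1
    return count
```

For example, count_ghost_rows("auto_terms_by_microbe_with_kg_ids.csv") should return 0 once the fix is applied and the table is regenerated. Unlike the grep pattern, this check does not depend on the ghost kg_id starting with SAMN.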
Related
docs/shell-guide.md (#5)