Ghost rows from empty KG node labels matching empty AUTO labels

## Summary

The `label_to_curie` lookup in `auto_terms_table.py` does not filter out empty or blank node names from the merged KG. The result is **ghost rows**: rows with no `id`, no `label`, and no `original_spans`, but a spurious `kg_id` match.

## Example

This row in `auto_terms_by_microbe_with_kg_ids.csv`:

```
UNKNOWN_MICROBE,,,SAMN08731602:HGPBHO_47720,1,,0,0,1,0
```

Notice the empty `id`, empty `label`, and empty `original_spans`, alongside `kg_id=SAMN08731602:HGPBHO_47720` and `in_kg=1`. This is not a real AUTO term. It's a phantom match.

## Root cause

### Where `SAMN08731602:HGPBHO_47720` actually comes from

This is a **Bakta genome annotation gene node** in the kg-microbe merged knowledge graph. Here's the trail:

1. The [Bakta transform](https://github.com/Knowledge-Graph-Hub/kg-microbe/blob/master/kg_microbe/transform_utils/bakta/bakta.py) processes genome annotations from NCBI BioSamples
2. For each gene, it creates a composite ID via [`create_gene_id()`](https://github.com/Knowledge-Graph-Hub/kg-microbe/blob/master/kg_microbe/transform_utils/bakta/utils.py#L121-L129): `f"{samn_id}:{locus_tag}"` -> `SAMN08731602:HGPBHO_47720`
3. This particular gene is a **"hypothetical protein"** -- it has no gene symbol, so its `name` column in `merged-kg_nodes.tsv` is **empty**

You can verify this yourself:

```bash
grep "HGPBHO_47720" merged-kg_nodes.tsv
```

Output:
```
SAMN08731602:HGPBHO_47720	biolink:Gene		hypothetical protein		Graph
```

Notice the empty `name` column (3rd field) -- the description says "hypothetical protein" but the name is blank.
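
If `grep` isn't available, the same check can be sketched in Python. This is a minimal sketch, assuming the file has a header row and that `name` is the 3rd tab-separated field, as in the output above. The helper name `empty_name_node_ids`, the sample header, and the `EXAMPLE:` row are made up for illustration.

```python
import csv


def empty_name_node_ids(lines, name_index=2):
    """Yield the ID (first field) of every node whose name field is blank."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row (assumed to be present)
    for row in reader:
        if len(row) > name_index and row[name_index].strip() == "":
            yield row[0]


# Tiny in-memory stand-in for merged-kg_nodes.tsv (header is an assumption).
sample = [
    "id\tcategory\tname\tdescription",
    "SAMN08731602:HGPBHO_47720\tbiolink:Gene\t\thypothetical protein",
    "EXAMPLE:0000001\tbiolink:ChemicalEntity\tglucose\ta sugar",
]
print(list(empty_name_node_ids(sample)))

# Against the real file, pass an open handle instead of the sample list:
# with open("merged-kg_nodes.tsv", newline="", encoding="utf-8") as fh:
#     print(sum(1 for _ in empty_name_node_ids(fh)))
```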

### How it becomes a ghost row

In `auto_terms_table.py`, this block builds the lookup:

```python
label_to_curie = (
    nodes[[kg_label_col, kg_id_col]]
    .drop_duplicates()
    .assign(_label_norm=lambda x: x[kg_label_col].str.strip().str.lower())
    .set_index("_label_norm")[kg_id_col]
    .to_dict()
)
```

The problem: every node with an **empty `name`** normalizes to the key `""`, and because dict keys are unique, only the last such node indexed keeps the slot, here `"" -> "SAMN08731602:HGPBHO_47720"`. Then, when an OntoGPT entity has a missing or empty label, the lookup `label_to_curie.get(normalize_auto_label(x), "")` matches it to this Bakta gene ID.

There are **thousands** of Bakta gene nodes with empty names in the KG (SAMN08731602 alone has 9,388 genes), so which ID ends up in the empty-string slot depends only on row order.
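
The collision can be reproduced on a toy table. This is a minimal sketch, assuming the label and ID columns are called `name` and `id` and that blank names arrive as empty strings. Every ID except the Bakta one is invented.

```python
import pandas as pd

# Toy stand-in for the merged KG nodes table.
nodes = pd.DataFrame(
    {
        "id": ["EXAMPLE:gene_a", "EXAMPLE:0000001", "SAMN08731602:HGPBHO_47720"],
        "name": ["", "Glucose", ""],
    }
)

# Same chain as in auto_terms_table.py, with the column names hard-coded.
label_to_curie = (
    nodes[["name", "id"]]
    .drop_duplicates()
    .assign(_label_norm=lambda x: x["name"].str.strip().str.lower())
    .set_index("_label_norm")["id"]
    .to_dict()
)

# Both empty-name rows compete for the "" key; the later row overwrites the earlier one.
print(label_to_curie[""])  # SAMN08731602:HGPBHO_47720
```

Reordering the rows changes which ID sits under `""`, which is why the ghost `kg_id` looks arbitrary.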

### Suggested fix

Filter out empty labels before building the dict:

```python
label_to_curie = (
    nodes[[kg_label_col, kg_id_col]]
    .drop_duplicates()
    .assign(_label_norm=lambda x: x[kg_label_col].str.strip().str.lower())
    .query('_label_norm != ""')          # <-- ADD THIS LINE
    .set_index("_label_norm")[kg_id_col]
    .to_dict()
)
```
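
One caveat, offered as an assumption rather than a diagnosis: how `auto_terms_table.py` loads the nodes isn't shown here. If blank cells arrive as `NaN` (the `pandas.read_csv` default) rather than `""`, the `!= ""` comparison alone would let them through. The sketch below normalizes missing values first and uses a boolean mask in place of `.query`. The wrapper `build_label_to_curie`, the default column names, and the `EXAMPLE:` IDs are hypothetical.

```python
import pandas as pd


def build_label_to_curie(nodes, kg_label_col="name", kg_id_col="id"):
    """Map normalized, non-empty labels to node IDs (hypothetical helper)."""
    return (
        nodes[[kg_label_col, kg_id_col]]
        .drop_duplicates()
        .assign(
            # Treat missing values as "" so they are dropped with the blanks.
            _label_norm=lambda x: x[kg_label_col]
            .fillna("")
            .astype(str)
            .str.strip()
            .str.lower()
        )
        .loc[lambda x: x["_label_norm"] != ""]  # drop empty and whitespace-only labels
        .set_index("_label_norm")[kg_id_col]
        .to_dict()
    )


nodes = pd.DataFrame(
    {
        "id": [
            "SAMN08731602:HGPBHO_47720",
            "EXAMPLE:0000001",
            "EXAMPLE:0000002",
            "EXAMPLE:0000003",
        ],
        "name": ["", " Glucose ", None, "   "],
    }
)
label_to_curie = build_label_to_curie(nodes)
print(label_to_curie)  # {'glucose': 'EXAMPLE:0000001'}
```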

You might also want to skip the `kg_id` assignment for entities with empty labels:

```python
df["kg_id"] = df["label"].apply(
    lambda x: label_to_curie.get(normalize_auto_label(x), "") if normalize_auto_label(x) else ""
)
```
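
The guard above can also be written as a small helper that normalizes each label once. This is a sketch only: the `normalize_auto_label` below is a stand-in (the real one in `auto_terms_table.py` may behave differently), and `lookup_kg_id` is a hypothetical name.

```python
def normalize_auto_label(label):
    """Stand-in normalizer; the real function in auto_terms_table.py may differ."""
    if label is None or label != label:  # None, or NaN (NaN is never equal to itself)
        return ""
    return str(label).strip().lower()


def lookup_kg_id(label, label_to_curie):
    """Return the matching KG ID, or "" when the label is empty or unknown."""
    key = normalize_auto_label(label)
    return label_to_curie.get(key, "") if key else ""


# Deliberately includes the bad "" entry to show that the guard ignores it.
label_to_curie = {"": "SAMN08731602:HGPBHO_47720", "glucose": "EXAMPLE:0000001"}

print(repr(lookup_kg_id("", label_to_curie)))           # ''
print(repr(lookup_kg_id(" Glucose ", label_to_curie)))  # 'EXAMPLE:0000001'
```

With a helper like this, the assignment would become `df["kg_id"] = df["label"].apply(lambda x: lookup_kg_id(x, label_to_curie))`. Keeping both the filter and the guard is belt and braces: either one alone should stop the ghost rows.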

## How to verify

> **New to the command line?** See [docs/shell-guide.md](https://github.com/CultureBotAI/auto-term-catalog/blob/main/docs/shell-guide.md) (#5) for how to open a terminal, install prerequisites, and use `grep`.

### Prerequisites check

Before running the commands below, open a terminal and make sure these three tools work:

```bash
git --version      # should print git version 2.x.x
python --version   # should print Python 3.x.x (try python3 if python fails)
grep --version     # should print grep (GNU grep) 3.x or grep (BSD grep)
```

If `grep` is not found on Windows: you probably have [GitHub Desktop](https://desktop.github.com/) but not [Git for Windows](https://gitforwindows.org/). They're different things. GitHub Desktop is a GUI app that bundles its own private copy of Git -- it does NOT put `grep` or other Unix tools on your system PATH. Install [Git for Windows](https://gitforwindows.org/) separately (keep the default options during installation, especially **"Git from the command line and also from 3rd-party software"**), then close and reopen your terminal.

### Verification commands

```bash
# cd to wherever your CSV and merged-kg files are
cd ~/Downloads   # or wherever you have them

# 1. Count ghost rows (kg_id present but no label)
grep -c ",,SAMN" auto_terms_by_microbe_with_kg_ids.csv

# 2. Show the ghost rows
grep ",,SAMN" auto_terms_by_microbe_with_kg_ids.csv

# 3. Count empty-name nodes in the merged KG
#    (this needs awk -- if it doesn't work in PowerShell,
#    open Git Bash instead: Win key -> type "Git Bash" -> Enter)
awk -F'\t' '$3 == ""' merged-kg_nodes.tsv | wc -l

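# 4. After applying the fix, re-run the script and verify no ghost rows remain
#    (grep whichever CSV your run writes; the name below may differ
#    from the file used in steps 1 and 2)
python auto_terms_table.py
grep -c ",,SAMN" auto_terms_by_microbe_off_claude_20251031.csv
# Should print 0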
```
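
The `grep` pattern only catches ghost rows whose `kg_id` starts with `SAMN`. A column-aware check in Python is sketched below, assuming the CSV has a header row with `label` and `kg_id` columns. The sample header and rows are a reduced, made-up stand-in for the real file, and `find_ghost_rows` is a hypothetical helper.

```python
import csv


def find_ghost_rows(lines, label_col="label", kg_id_col="kg_id"):
    """Return rows that carry a kg_id but have a blank label."""
    reader = csv.DictReader(lines)
    return [
        row
        for row in reader
        if (row.get(kg_id_col) or "").strip()
        and not (row.get(label_col) or "").strip()
    ]


# Reduced sample; the real file has more columns.
sample = [
    "microbe,id,label,kg_id,in_kg",
    "UNKNOWN_MICROBE,,,SAMN08731602:HGPBHO_47720,1",
    "EXAMPLE_MICROBE,AUTO:0000001,glucose,EXAMPLE:0000001,1",
    "EXAMPLE_MICROBE,AUTO:0000002,unmatched term,,0",
]
ghosts = find_ghost_rows(sample)
print(len(ghosts))  # 1

# Against the real file:
# with open("auto_terms_by_microbe_with_kg_ids.csv", newline="", encoding="utf-8") as fh:
#     print(len(find_ghost_rows(fh)))
```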

## Related

- #5 -- [Shell guide for the team](https://github.com/CultureBotAI/auto-term-catalog/pull/5) (`docs/shell-guide.md`)
- [CultureBotAI/KG-Microbe-search#8](https://github.com/CultureBotAI/KG-Microbe-search/issues/8) -- Conventions for reproducible, schema-validated pipelines (broader best practices for all CultureBotAI repos)

