Enhancement: Improve handling of small isolated number words in mixed numeric sequences across languages
Description
In several supported languages (e.g., Italian, French, Spanish, English), certain small number words — such as “one”, “un”, “une”, “uno”, “una” — can act either as:
- numeric words (value = 1), or
- indefinite articles or determiners (meaning a/an).
At the moment, `alpha2digit` converts these words to digits only if they are part of a recognized numeric group, or if the global `threshold` is set low enough (e.g., `threshold=0`).
This leads to unintuitive results when such words appear inside mixed numeric sequences (containing both digits and number words).
Example

    from text_to_num import alpha2digit

    text = "The code is 7 one 8 0."
    print(alpha2digit(text, "en"))

Current output:

    The code is 7 one 8 0.

Expected / Desired output:

    The code is 7 1 8 0.

Here, “one” is clearly part of a numeric sequence, but it remains unconverted because it’s treated as an isolated number word below the threshold.
Rationale
In many real-world scenarios (mainly ASR transcripts), numeric sequences mix digits and number words.
When a small number word (value ≤ 3) appears adjacent to digits or other number words, it can reasonably be interpreted as part of that sequence and should be converted, regardless of the `threshold` parameter.
This change would:
- Improve conversion accuracy across languages,
- Better reflect numeric context in mixed sequences,
- Maintain backward compatibility for truly isolated uses (e.g., “I have one apple”).
Proposed enhancement
Extend `alpha2digit`’s logic so that:
- If a numeric word is adjacent to digits or other number words, treat it as part of a numeric sequence and bypass the `threshold` check.
- Otherwise, keep the current behavior (use `threshold` to decide conversion).
Examples
| Input | Language | Current | Desired |
|---|---|---|---|
| 7 one 8 | en | 7 one 8 | 7 1 8 |
| 7 un 8 | fr | 7 un 8 | 7 1 8 |
| 7 uno 8 | it | 7 uno 8 | 7 1 8 |
| I have one apple | en | unchanged (✅ correct) | unchanged (✅ same) |
| J’ai une pomme | fr | unchanged (✅ correct) | unchanged (✅ same) |
Implementation idea
Add a lightweight heuristic to alpha2digit:
“If a small numeric word (value ≤ threshold) is adjacent to a digit or another number word, treat it as part of a numeric group and convert it.”
This would make conversions more robust across all supported languages without breaking existing semantics for isolated words.
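The heuristic could be sketched as a standalone token pass like the one below. This is only an illustration of the idea, not the library's implementation: `SMALL_NUMBER_WORDS` and `convert_adjacent_small_words` are hypothetical names, the word lists are partial, and a real change would reuse the per-language vocabularies already inside the library.

```python
import re

# Hypothetical, partial lookup of ambiguous small number words (assumption:
# not taken from the library; for illustration only).
SMALL_NUMBER_WORDS = {
    "en": {"one": 1, "two": 2, "three": 3},
    "fr": {"un": 1, "une": 1, "deux": 2, "trois": 3},
    "it": {"uno": 1, "una": 1, "due": 2, "tre": 3},
}


def convert_adjacent_small_words(text: str, lang: str) -> str:
    """Convert a small number word only when a direct neighbor is numeric."""
    words = SMALL_NUMBER_WORDS[lang]
    # Split into whitespace runs, word runs and punctuation runs so the
    # original spacing can be restored with a plain join.
    tokens = re.findall(r"\s+|\w+|[^\w\s]+", text)
    # Positions of the non-whitespace tokens, used to find neighbors.
    solid = [i for i, tok in enumerate(tokens) if not tok.isspace()]

    def is_numeric(tok: str) -> bool:
        return tok.isdigit() or tok.lower() in words

    out = list(tokens)
    for pos, i in enumerate(solid):
        key = tokens[i].lower()
        if key not in words:
            continue
        prev_tok = tokens[solid[pos - 1]] if pos > 0 else ""
        next_tok = tokens[solid[pos + 1]] if pos + 1 < len(solid) else ""
        # Bypass any threshold logic when a neighbor is a digit or number word.
        if is_numeric(prev_tok) or is_numeric(next_tok):
            out[i] = str(words[key])
    return "".join(out)


print(convert_adjacent_small_words("The code is 7 one 8 0.", "en"))
print(convert_adjacent_small_words("I have one apple", "en"))
```

Note that this sketch converts a word when a numeric token sits on either side, so a phrase like "7 one of them" would also be converted. Requiring numeric neighbors on both sides, or treating punctuation as a sequence boundary (as done here), are possible ways to reduce such false positives.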
Motivation / Use Case
This improvement would significantly benefit real-world applications such as:
- ASR (Automatic Speech Recognition) post-processing,
- Data cleaning pipelines where numeric tokens are mixed or inconsistent.
Potential test cases

    from text_to_num import alpha2digit


    def test_mixed_sequence_en():
        assert alpha2digit("7 one 8 0", "en") == "7 1 8 0"


    def test_mixed_sequence_fr():
        assert alpha2digit("7 un 8 0", "fr") == "7 1 8 0"


    def test_mixed_sequence_it():
        assert alpha2digit("7 uno 8 0", "it") == "7 1 8 0"


    def test_isolated_article_en():
        assert alpha2digit("I have one apple", "en") == "I have one apple"

Optional: with a mode/flag or heuristic that bypasses `threshold` when adjacent to numeric tokens.
Label: Enhancement