Enhancement: Improve handling of small isolated number words in mixed numeric sequences across languages #128

@riccardoferrara

Description

In several supported languages (e.g., Italian, French, Spanish, English), certain small number words, such as “one”, “un”, “une”, “uno”, “una”, can act as either:

  • numeric words (value = 1), or

  • indefinite articles or determiners (meaning a/an).

At the moment, alpha2digit only converts these words to digits if they are part of a recognized numeric group, or if the global threshold is set low enough (e.g., threshold=0).
This behavior can lead to unintuitive results when such words appear inside mixed numeric sequences (containing both digits and number words).


Example

from text_to_num import alpha2digit

text = "The code is 7 one 8 0."
print(alpha2digit(text, "en"))

Current output:

The code is 7 one 8 0.

Expected / Desired output:

The code is 7 1 8 0.

Here, “one” is clearly part of a numeric sequence, but it remains unconverted because it’s treated as an isolated number word below the threshold.


Rationale

In many real-world scenarios (mainly ASR transcripts), numeric sequences mix digits and number words.
When a small number word (value ≤ 3) appears adjacent to digits or other number words, it can reasonably be interpreted as part of that sequence and should be converted, regardless of the threshold parameter.

This change would:

  • Improve conversion accuracy across languages,

  • Better reflect numeric context in mixed sequences,

  • Maintain backward compatibility for truly isolated uses (e.g., “I have one apple”).


Proposed enhancement

Extend alpha2digit’s logic so that:

  • If a number word is adjacent to digits or other number words, treat it as part of a numeric sequence and bypass the threshold check.

  • Otherwise, keep the current behavior (use threshold to decide conversion).


Examples

| Input | Language | Current | Desired |
|---|---|---|---|
| 7 one 8 | en | 7 one 8 | 7 1 8 |
| 7 un 8 | fr | 7 un 8 | 7 1 8 |
| 7 uno 8 | it | 7 uno 8 | 7 1 8 |
| I have one apple | en | ✅ correct | ✅ same |
| J’ai une pomme | fr | ✅ correct | ✅ same |

Implementation idea

Add a lightweight heuristic to alpha2digit:

“If a small numeric word (value ≤ threshold) is adjacent to a digit or another number word, treat it as part of a numeric group and convert it.”

This would make conversions more robust across all supported languages without breaking existing semantics for isolated words.
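
To make the idea concrete, here is a rough sketch of the adjacency rule written as a standalone pass over text, for example over the output of alpha2digit, where neighbouring number words have already become digits. It is not how text_to_num works internally: the `SMALL_NUMBER_WORDS` table and the `convert_adjacent_small_words` helper are hypothetical names made up for this illustration, and the word lists are deliberately tiny.

```python
import re

# Hypothetical table of small number words that can also be articles.
# Illustrative only; not taken from text_to_num.
SMALL_NUMBER_WORDS = {
    "en": {"one": "1"},
    "fr": {"un": "1", "une": "1"},
    "es": {"un": "1", "uno": "1", "una": "1"},
    "it": {"un": "1", "uno": "1", "una": "1"},
}

_PUNCT = ".,;:!?"
# Previous neighbour: bare digits only, so "7. One ..." (sentence break) is left alone.
_PREV_DIGITS = re.compile(r"\d+")
# Next neighbour: digits, optionally followed by closing punctuation ("0.").
_NEXT_DIGITS = re.compile(r"\d+[.,;:!?]*")


def convert_adjacent_small_words(text: str, lang: str) -> str:
    """Convert a small number word to a digit only when a digit token is next to it."""
    words = SMALL_NUMBER_WORDS.get(lang, {})
    tokens = text.split(" ")
    out = list(tokens)
    for i, tok in enumerate(tokens):
        core = tok.rstrip(_PUNCT)            # word without trailing punctuation
        digit = words.get(core.lower())
        if digit is None:
            continue                         # not a small number word
        prev_ok = i > 0 and _PREV_DIGITS.fullmatch(tokens[i - 1]) is not None
        next_ok = (
            i + 1 < len(tokens)
            and _NEXT_DIGITS.fullmatch(tokens[i + 1]) is not None
        )
        if prev_ok or next_ok:
            out[i] = digit + tok[len(core):]  # keep any trailing punctuation
    return " ".join(out)


print(convert_adjacent_small_words("The code is 7 one 8 0.", "en"))
print(convert_adjacent_small_words("I have one apple", "en"))
```

A real implementation would presumably live inside the tokenizer/grouping logic rather than in a post-pass, and it would still misfire on phrases like "I bought 7 one day", so an opt-in flag may be safer than changing the default.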


Motivation / Use Case

This improvement would significantly benefit real-world applications such as:

  • ASR (Automatic Speech Recognition) post-processing,

  • Data cleaning pipelines where numeric tokens are mixed or inconsistent.


Potential test cases

from text_to_num import alpha2digit

def test_mixed_sequence_en():
    assert alpha2digit("7 one 8 0", "en") == "7 1 8 0"

def test_mixed_sequence_fr():
    assert alpha2digit("7 un 8 0", "fr") == "7 1 8 0"

def test_mixed_sequence_it():
    assert alpha2digit("7 uno 8 0", "it") == "7 1 8 0"

def test_isolated_article_en():
    assert alpha2digit("I have one apple", "en") == "I have one apple"

Optional: these tests could run with a mode/flag or heuristic that bypasses the threshold when a number word is adjacent to numeric tokens.


Label: Enhancement
