Enhancement: Improve handling of small isolated number words in mixed numeric sequences across languages

<html><head></head><body><h1>Enhancement: Improve handling of small isolated number words in mixed numeric sequences across languages</h1>
<h2>Description</h2>
<p>In several supported languages (e.g., Italian, French, Spanish, English), certain small number words — such as <strong>“one”</strong>, <strong>“un”</strong>, <strong>“une”</strong>, <strong>“uno”</strong>, <strong>“una”</strong> — can act either as:</p>
<ul>
<li>
<p><strong>numeric words</strong> (value = 1), or</p>
</li>
<li>
<p><strong>indefinite articles or determiners</strong> (meaning <em>a/an</em>).</p>
</li>
</ul>
<p>At the moment, <code inline="">alpha2digit</code> only converts these words to digits if they are part of a recognized <strong>numeric group</strong>, or if the global <strong><code inline="">threshold</code></strong> is set low enough (e.g., <code inline="">threshold=0</code>).<br>
This behavior can lead to unintuitive results when such words appear inside <strong>mixed numeric sequences</strong> (containing both digits and number words).</p>
<hr>
<h2>Example</h2>
<pre><code class="language-python">from text_to_num import alpha2digit

text = "The code is 7 one 8 0."
print(alpha2digit(text, "en"))
</code></pre>
<p><strong>Current output:</strong></p>
<pre><code>The code is 7 one 8 0.
</code></pre>
<p><strong>Expected / Desired output:</strong></p>
<pre><code>The code is 7 1 8 0.
</code></pre>
<p>Here, <em>“one”</em> is clearly part of a numeric sequence, but it remains unconverted because it’s treated as an isolated number word below the threshold.</p>
<hr>
<h2>Rationale</h2>
<p>In many real-world scenarios (ASR transcripts mainly), numeric sequences often include a <strong>mix of digits and number words</strong>.<br>
When a small number word (value ≤ 3) appears <strong>adjacent to digits or other number words</strong>, it should be reasonably interpreted as part of that sequence and converted — regardless of the <code inline="">threshold</code> parameter.</p>
<p>This change would:</p>
<ul>
<li>
<p>Improve conversion accuracy across languages,</p>
</li>
<li>
<p>Better reflect numeric context in mixed sequences,</p>
</li>
<li>
<p>Maintain backward compatibility for truly isolated uses (e.g., “I have one apple”).</p>
</li>
</ul>
<hr>
<h2>Proposed enhancement</h2>
<p>Extend <code inline="">alpha2digit</code>’s logic so that:</p>
<ul>
<li>
<p>If a numeric word is <strong>adjacent to digits or other number words</strong>, treat it as part of a numeric sequence and <strong>bypass the <code inline="">threshold</code></strong> check.</p>
</li>
<li>
<p>Otherwise, keep the current behavior (use <code inline="">threshold</code> to decide conversion).</p>
</li>
</ul>
<hr>
<h2>Examples</h2>

Input | Language | Current | Desired
-- | -- | -- | --
7 one 8 | en | 7 one 8 | 7 1 8
7 un 8 | fr | 7 un 8 | 7 1 8
7 uno 8 | it | 7 uno 8 | 7 1 8
I have one apple | en | ✅ correct | ✅ same
J’ai une pomme | fr | ✅ correct | ✅ same


<hr>
<h2>Implementation idea</h2>
<p>Add a lightweight heuristic to <code inline="">alpha2digit</code>:</p>
<blockquote>
<p>“If a small numeric word (value ≤ threshold) is adjacent to a digit or another number word, treat it as part of a numeric group and convert it.”</p>
</blockquote>
<p>This would make conversions more robust across all supported languages without breaking existing semantics for isolated words.</p>
<hr>
<h2>Motivation / Use Case</h2>
<p>This improvement would significantly benefit real-world applications such as:</p>
<ul>
<li>
<p><strong>ASR (Automatic Speech Recognition)</strong> post-processing,</p>
</li>
<li>
<p><strong>Data cleaning pipelines</strong> where numeric tokens are mixed or inconsistent.</p>
</li>
</ul>
<hr>
<h2>Potential test cases</h2>
<pre><code class="language-python">from text_to_num import alpha2digit

def test_mixed_sequence_en():
    assert alpha2digit("7 one 8 0", "en") == "7 1 8 0"

def test_mixed_sequence_fr():
    assert alpha2digit("7 un 8 0", "fr") == "7 1 8 0"

def test_mixed_sequence_it():
    assert alpha2digit("7 uno 8 0", "it") == "7 1 8 0"

def test_isolated_article_en():
    assert alpha2digit("I have one apple", "en") == "I have one apple"

# Optional: with a mode/flag or heuristic that bypasses threshold when adjacent to numeric tokens
</code></pre>
<hr>
<p><strong>Label:</strong> <code inline="">Enhancement</code></p></body></html>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhancement: Improve handling of small isolated number words in mixed numeric sequences across languages #128

Enhancement: Improve handling of small isolated number words in mixed numeric sequences across languages

Description

Example

Rationale

Proposed enhancement

Examples

Implementation idea

Motivation / Use Case

Potential test cases

Optional: with a mode/flag or heuristic that bypasses threshold when adjacent to numeric tokens

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Input	Language	Current	Desired
7 one 8	en	7 one 8	7 1 8
7 un 8	fr	7 un 8	7 1 8
7 uno 8	it	7 uno 8	7 1 8
I have one apple	en	✅ correct	✅ same
J’ai une pomme	fr	✅ correct	✅ same

Enhancement: Improve handling of small isolated number words in mixed numeric sequences across languages #128

Description

Enhancement: Improve handling of small isolated number words in mixed numeric sequences across languages

Description

Example

Rationale

Proposed enhancement

Examples

Implementation idea

Motivation / Use Case

Potential test cases

Optional: with a mode/flag or heuristic that bypasses threshold when adjacent to numeric tokens

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions