Tokenization of large(r) digital numbers

### Preliminary Remark
The observations presented here are also relevant for the _polmineR repository._

### Some Background
The _Bundestag Protokolle_ (the plenary protocols of the German Bundestag) often use spaces to group the digits of large numbers for readability. This practice, while _globally standardized_, can cause problems in corpus analysis, particularly during tokenization.

For illustration, consider a speech given by then-Chancellor Angela Merkel during the final session of the 17th legislative period (reference: BT_17_253). In this speech, five instances of large numbers grouped with spaces can be identified:

1. Bereits über _100 000_ Menschen haben ihr Leben verloren;
2. Wir haben als erster EU-Mitgliedstaat _5 000_ syrischen Flüchtlingen Aufnahme angeboten.
3. _700 000_ mehr Menschen im Alter von 60 bis 65 sind noch in Arbeit.
4. _650 000_ Menschen erhalten mehr Leistungen.
5. Wir haben seit 2007 in Deutschland _820 000_ neue Betreuungsplätze für Kinder unter drei Jahren geschaffen.
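To make the effect concrete, here is a minimal base-R sketch of what whitespace-based tokenization does to the third example. The helper `tokenize_ws()` is hypothetical and only illustrates the principle; it is not the tokenizer that was actually used to build the corpus.

```r
# Hypothetical helper: naive whitespace tokenizer (an assumption for
# illustration, not the tokenizer behind GERMAPARL2)
tokenize_ws <- function(x) strsplit(x, "\\s+")[[1]]

sentence <- "700 000 mehr Menschen im Alter von 60 bis 65 sind noch in Arbeit."
tokens <- tokenize_ws(sentence)

# The number "700 000" ends up as two separate tokens: "700" and "000"
print(tokens[1:2])
```

Any tokenizer that treats whitespace as a hard token boundary would behave the same way for numbers written like this.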

### The Issue
Corpus tools such as polmineR (and, similarly, #LancsBox X) do not recognize these _space-grouped numbers_ as single tokens. Consider the following R code snippet:
```r 
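library(polmineR)

# Select Angela Merkel's speech from the session of 2013-09-03
merkel_speech <- corpus("GERMAPARL2") |>
  subset(protocol_date == "2013-09-03") |>
  subset(speaker_name == "Angela Merkel") |>
  subset(p_type == "speech")

# Count occurrences of the token "000"
count(merkel_speech, query = "000")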
```

polmineR finds "000" as a token in its own right, once for each of the five numbers listed above. Each space-separated digit group has therefore been treated as a separate token:

```
   query match count        freq
1:   000   000     5 0.001048218
```
### The Implications 
The implications of this issue are twofold:

1. It inflates the total token count. 
2. It skews statistical measures, for example in collocation (a.k.a. co-occurrence) analysis.
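The first point can be illustrated with a small base-R sketch that compares token counts with and without joining the digit groups first. The function `merge_digit_groups()` is a hypothetical pre-processing step, not part of polmineR, and it assumes that the separator is a plain space; it would also wrongly join unrelated numbers that happen to stand next to each other.

```r
# Hypothetical pre-processing step (assumption, not a polmineR function):
# remove a single space that sits between a digit and a group of exactly
# three digits
merge_digit_groups <- function(x) {
  gsub("(?<=\\d) (?=\\d{3}\\b)", "", x, perl = TRUE)
}

sentence <- "650 000 Menschen erhalten mehr Leistungen."

raw_tokens    <- strsplit(sentence, "\\s+")[[1]]
merged_tokens <- strsplit(merge_digit_groups(sentence), "\\s+")[[1]]

# One token more when the number is split into "650" and "000"
print(length(raw_tokens))
print(length(merged_tokens))
```

Every spaced number adds at least one surplus token, and the surplus "000" tokens then enter co-occurrence statistics as if they were words.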

To gauge the extent of the problem: the regular expression `\b(\d{1,3})(\s)(\d{3})` returns **134,609** hits. This is an approximation, as the precision of the pattern is below 100%.
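The behaviour of this regular expression can be checked on plain strings in base R. The sketch below is only an illustration: the first sample is a shortened fragment of the fifth example above, the second is a constructed string (an assumption, not taken from the corpus) that shows one way the pattern can produce a false positive, since it does not require a word boundary after the three-digit group.

```r
# The pattern from above, with backslashes escaped for an R string
pattern <- "\\b(\\d{1,3})(\\s)(\\d{3})"

samples <- c(
  "seit 2007 in Deutschland 820 000 neue",  # shortened fragment of example 5
  "Nummer 12 3456"                          # constructed false positive
)

# Extract all matches per sample string
hits <- regmatches(samples, gregexpr(pattern, samples, perl = TRUE))
print(hits)
```

In the first string only "820 000" matches; the year "2007" is left alone. In the second string the pattern matches "12 345" even though the second number has four digits.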


