Description
Preliminary Remark
The observations presented here are also relevant for the polmineR repository.
Some Background
The Bundestag Protokolle often use spaces to group the digits of large numbers for readability. This convention, while internationally standardized, can cause problems in corpus analysis, notably during tokenization.
For illustration, consider a speech given by then-Chancellor Angela Merkel during the final session of the 17th legislative period (reference: BT_17_253). In this speech, five instances of large numbers grouped with spaces can be identified:
- Bereits über 100 000 Menschen haben ihr Leben verloren;
- Wir haben als erster EU-Mitgliedstaat 5 000 syrischen Flüchtlingen Aufnahme angeboten.
- 700 000 mehr Menschen im Alter von 60 bis 65 sind noch in Arbeit.
- 650 000 Menschen erhalten mehr Leistungen.
- Wir haben seit 2007 in Deutschland 820 000 neue Betreuungsplätze für Kinder unter drei Jahren geschaffen.
The Issue
Corpus tools such as polmineR (and, similarly, #LancsBox X) fail to recognize these space-grouped numbers as single tokens. Consider the following R code snippet:
library(polmineR)
merkel_speech <- corpus("GERMAPARL2") |>
  subset(protocol_date == "2013-09-03") |>
  subset(speaker_name == "Angela Merkel") |>
  subset(p_type == "speech")
count(merkel_speech, query = "000")
polmineR treats each space-separated digit group as a separate token, so the query matches the token "000" five times:
   query match count        freq
1:   000   000     5 0.001048218
The Implications
The implications of this issue are twofold:
- It inflates the total token count.
- It skews statistical measures, such as collocation (a.k.a. co-occurrence) analysis.
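The first point can be illustrated with a minimal base R sketch. It assumes naive whitespace tokenization, which is only a stand-in and not polmineR's actual tokenizer; the variable names are made up for this illustration.

```r
# One of the sentences quoted above.
text <- "650 000 Menschen erhalten mehr Leistungen."

# Naive whitespace tokenization (an assumption for illustration only).
tokens_raw <- strsplit(text, "\\s+")[[1]]

# Join a group of 1-3 digits and a following group of exactly 3 digits
# when they are separated by a single whitespace character.
text_joined <- gsub("\\b(\\d{1,3})\\s(\\d{3})\\b", "\\1\\2", text, perl = TRUE)
tokens_joined <- strsplit(text_joined, "\\s+")[[1]]

length(tokens_raw)     # 6: "650" and "000" are counted separately
length(tokens_joined)  # 5: "650000" is a single token
```

In this sketch the spaced number adds one extra token to a six-token sentence, which is the kind of inflation that also shifts frequency-based measures.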
To gauge the extent of the issue: the regular expression \b(\d{1,3})(\s)(\d{3}) returns 134,609 hits. Not every hit is a true positive, so the precision is below 100%.
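As a rough sketch of how this pattern behaves, the base R snippet below applies it to fragments of the sentences quoted above. The last string is invented for this illustration (it is not from the corpus) and shows one plausible way a false positive could arise, namely two unrelated numbers standing next to each other. Whether this is the main source of imprecision in the corpus is an assumption.

```r
# Fragments of the sentences quoted above, plus one invented string
# (the last element), which is NOT taken from the corpus.
examples <- c(
  "100 000 Menschen haben ihr Leben verloren",
  "700 000 mehr Menschen im Alter von 60 bis 65 sind noch in Arbeit.",
  "Wir haben seit 2007 in Deutschland 820 000 neue",
  "Zu Punkt 7 250 Wortmeldungen"
)

# The pattern from the text; backslashes are doubled inside an R string.
pattern <- "\\b(\\d{1,3})(\\s)(\\d{3})"

# Extract all matches per string using PCRE.
hits <- regmatches(examples, gregexpr(pattern, examples, perl = TRUE))

# "60 bis 65" and "2007 in" are not matched; "7 250" is matched even
# though, in the invented string, it is meant as two separate numbers.
print(unlist(hits))
```

A stricter variant could require a word boundary after the final digit group, but separating genuine grouped numbers from adjacent unrelated ones would most likely still need manual checks or context rules.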