numeralbank/cosinus

CLDF Dataset presenting Rubehn et al.'s "Compositional Structures in Numeral Systems Across Languages" from 2025

How to cite

If you use these data, please cite

  • the original source

    Rubehn, A., C. Rzymski, L. Ciucci, K. Bocklage, A. Kučerová, D. Snee, A. Stephen, K. P. van Dam, and J.-M. List (2025): Annotating and Inferring Compositional Structures in Numeral Systems Across Languages. In: Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP (SIGTYP 2025). 29-42. https://doi.org/10.18653/v1/2025.sigtyp-1.4

  • the derived dataset using the DOI of the particular released version you were using

Description

CLDF dataset providing annotated numeral systems.

This dataset is licensed under a CC-BY-4.0 license.

Notes

A Note on Browsing the Data with PyCLDF

If you have installed pycldf with pip (ideally in a fresh virtual environment), you can browse the data as follows. The code below loads all forms in the dataset; pycldf automatically converts the values to the datatypes declared in the metadata.

from pycldf import Dataset

ds = Dataset.from_metadata("cldf/cldf-metadata.json")

forms = ds.objects("FormTable")

Columns that are not part of the CLDF standard, which this dataset contains, can be accessed via the data attribute of each form object in the list of forms.

In : forms[2500].data
Out: 
OrderedDict([('ID', 'Lamani-thirtyfive-1'),
             ('Local_ID', None),
             ('Language_ID', 'Lamani'),
             ('Parameter_ID', 'thirtyfive'),
             ('Value', 'tis an paanc'),
             ('Form', 'tis an paanc'),
             ('Segments',
              ['t', 'i', 's', '+', 'a', 'n', '+', 'p', 'aː', 'n', 'ʧ']),
             ('Comment', None),
             ('Source', ['TrailLamani1970']),
             ('Cognacy', None),
             ('Loan', None),
             ('Graphemes', None),
             ('Profile', None),
             ('Morphemes', ['thirty', 'and', 'five']),
             ('Cognates', [26, 1, 6]),
             ('Surface_Form', ['t i s', 'a n', 'p aː n ʧ']),
             ('Underlying_Form', ['t i s', 'a n', 'p aː n ʧ']),
             ('Tokens',
              ['t', 'i', 's', '+', 'a', 'n', '+', 'p', 'aː', 'n', 'ʧ'])])
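
In the record above, Tokens is a flat list of sound segments in which "+" marks a morpheme boundary, while Surface_Form holds one space-separated string per morpheme. The following minimal sketch shows how the two representations relate for this example. It assumes that "+" is the only boundary marker; the helper name split_morphemes is made up for illustration and is not part of pycldf.

```python
def split_morphemes(tokens, boundary="+"):
    """Group a flat segment list into one space-joined string per morpheme."""
    morphemes, current = [], []
    for token in tokens:
        if token == boundary:
            # A boundary closes the morpheme collected so far.
            morphemes.append(" ".join(current))
            current = []
        else:
            current.append(token)
    # The last morpheme is not followed by a boundary marker.
    morphemes.append(" ".join(current))
    return morphemes


# Tokens of the Lamani form for 35, copied from the record above.
tokens = ["t", "i", "s", "+", "a", "n", "+", "p", "aː", "n", "ʧ"]
surface = split_morphemes(tokens)
print(surface)
```

For this record, the result has the same three-part shape as the Surface_Form value shown above.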

Each form links to the language table via Language_ID and to the concept table (ParameterTable in CLDF terminology) via Parameter_ID. To combine this information with more detailed data on the language and the concept, one can load the languages and the concepts from the CLDF dataset as follows.

languages = ds.objects("LanguageTable")
concepts = ds.objects("ParameterTable")

The three objects forms, languages, and concepts are not ordinary Python lists but a combination of a list and a dictionary. One can access an entry either by using its identifier as a key or by passing an integer as an index.
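
To make this dual access concrete without loading the dataset, here is a small stand-in for such a container. The class KeyedTuple is hypothetical and only mimics the behaviour described above; pycldf ships its own implementation, and its entries are objects with a data attribute rather than the plain dictionaries used here.

```python
class KeyedTuple(tuple):
    """Hypothetical stand-in: a tuple whose items can also be fetched by ID."""

    def __init__(self, items):
        # tuple.__new__ has already stored the items; here we only build
        # the lookup from identifier to position.
        self._positions = {item["ID"]: pos for pos, item in enumerate(self)}

    def __getitem__(self, key):
        # Integers and slices behave as in a normal tuple ...
        if isinstance(key, (int, slice)):
            return super().__getitem__(key)
        # ... any other key is treated as an identifier.
        return super().__getitem__(self._positions[key])


concepts = KeyedTuple([
    {"ID": "five", "NumberValue": 5},  # this ID is assumed for illustration
    {"ID": "thirtyfive", "NumberValue": 35},
])

by_position = concepts[1]
by_identifier = concepts["thirtyfive"]
print(by_position is by_identifier)
```

Both lookups return the same entry, which is what makes it possible to follow an identifier such as Parameter_ID from one table to another.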

This allows us to retrieve more specific information on the concept and the language of the form that we accessed above.

this_form = forms[2500].data
this_concept = concepts[this_form["Parameter_ID"]].data
this_language = languages[this_form["Language_ID"]].data

This yields the following data for the language:

In : this_language
Out: 
OrderedDict([('ID', 'Lamani'),
             ('Name', 'Lamani'),
             ('Glottocode', 'lamb1269'),
             ('Glottolog_Name', 'Lambadi'),
             ('ISO639P3code', 'lmn'),
             ('Macroarea', 'Eurasia'),
             ('Latitude', Decimal('16.5734')),
             ('Longitude', Decimal('76.9717')),
             ('Family', 'Indo-European'),
             ('Sources', 'TrailLamani1970'),
             ('FileName', 'lamb1269'),
             ('Base', '10')])

For the concept, we receive the following data.

In : this_concept
Out: 
OrderedDict([('ID', 'thirtyfive'),
             ('Name', 'thirty five'),
             ('Concepticon_ID', '3484'),
             ('Concepticon_Gloss', 'THIRTY FIVE'),
             ('Number', '35'),
             ('NumberValue', 35)])
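
The three records shown above can be merged into one flat record for a single form. The sketch below uses plain dictionaries with values copied from the output above, so it runs without the dataset. The function name merge_records is made up, and pairing Morphemes with Cognates position by position is an assumption read off the example record.

```python
def merge_records(form, language, concept):
    """Combine one form with its language and concept into a flat dict."""
    return {
        "Language": language["Name"],
        "Glottocode": language["Glottocode"],
        "Number": concept["NumberValue"],
        # One string per morpheme, joined with " + " for display.
        "Surface": " + ".join(form["Surface_Form"]),
        # Assumed alignment: the i-th gloss belongs to the i-th cognate ID.
        "Morphemes": list(zip(form["Morphemes"], form["Cognates"])),
    }


# Abbreviated copies of the records printed above.
form = {
    "Language_ID": "Lamani",
    "Parameter_ID": "thirtyfive",
    "Surface_Form": ["t i s", "a n", "p aː n ʧ"],
    "Morphemes": ["thirty", "and", "five"],
    "Cognates": [26, 1, 6],
}
language = {"ID": "Lamani", "Name": "Lamani", "Glottocode": "lamb1269"}
concept = {"ID": "thirtyfive", "NumberValue": 35}

merged = merge_records(form, language, concept)
print(merged)
```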

We can illustrate how the information can be combined by pulling essential information from the three tables into a single table. To make sure that entries stored in Python datatypes such as lists are rendered correctly as strings in the table, we must convert them accordingly (this relates to Surface_Form, Underlying_Form, Cognates, and Morphemes).

table = []
for form in (f.data for f in forms):
    language = languages[form["Language_ID"]].data
    concept = concepts[form["Parameter_ID"]].data
    # Keep only the forms for the numeral 5.
    if concept["NumberValue"] == 5:
        table.append([
            language["Name"],
            language["Glottocode"],
            concept["Number"],
            # List-valued columns are joined into plain strings for display.
            " + ".join(form["Surface_Form"]),
            " + ".join(form["Underlying_Form"]),
            " ".join(map(str, form["Cognates"])),
            " ".join(form["Morphemes"]),
        ])

With the help of the tabulate package, we can print this information as a table.

from tabulate import tabulate

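# The header order must match the column order of the rows built above:
# cognate identifiers come before morpheme glosses.
print(tabulate(table, headers=["Language", "Glott.", "Nr.", "Surface",
                               "Underlying", "Cognates", "Morphemes"],
               tablefmt="pipe"))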

The output is given in the table below.

| Language | Glott. | Nr. | Surface | Underlying | Cognates | Morphemes |
| --- | --- | --- | --- | --- | --- | --- |
| Acehnese | achi1257 | 5 | l i m ʌ ŋ | l i m ʌ ŋ | 2 | five |
| Amharic | amha1245 | 5 | a m m ɨ s t | a m m ɨ s t | 5 | five |
| Assamese | assa1263 | 5 | p ã s | p ɔ n | 5 | five |
| Barwar Neo-Aramaic | assy1241 | 5 | x a m ʃ a | x a m ʃ a | 6 | five |
| Aymara | ayma1253 | 5 | pʰ i s q a | pʰ i s q a | 4 | five |
| Cavineña | cavi1250 | 5 | p i ɕ i k a | p i ɕ i k a | 10 | five |
| Chiquitano/Bésɨro | chiq1248 | 5 | m a | m a | 4 | five |
| Classical Syriac | clas1252 | 5 | ħ a m ʃ + aː | ħ a m e ʃ + aː | 5 11 | five suff_gender1 |
| Classical Syriac | clas1252 | 5 | ħ a m e ʃ | ħ a m e ʃ | 5 | five |
| Czech | czec1258 | 5 | p j ɛ t | p j ɛ t | 5 | five |
| Dhivehi | dhiv1236 | 5 | f a s | f a s | 8 | five |
| Yiddish | east2295 | 5 | f ɪ n f | f ɪ n f | 5 | five |
| French | stan1290 | 5 | s ɛ̃ k | s ɛ̃ k | 5 | five |
| Ge'ez | geez1241 | 5 | x æ m ɨ s + t + u | x æ m s + t + u | 5 12 11 | five suff_gender suff_nom_sg1 |
| Ge'ez | geez1241 | 5 | x æ m s | x æ m s | 5 | five |
| German | stan1295 | 5 | f ʏ n f | f ʏ n f | 4 | five |
| Modern Hebrew | hebr1245 | 5 | χ a m i ʃ + a | χ a m e ʃ + a | 5 13 | five suff_gender2 |
| Modern Hebrew | hebr1245 | 5 | χ a m e ʃ | χ a m e ʃ | 5 | five |
| Hindi | hind1269 | 5 | p ãː tʃ | p a n tʃ | 5 | five |
| Huallaga Quechua | hual1241 | 5 | p i tʃ q a | p i tʃ q a | 6 | five |
| Hungarian | hung1274 | 5 | ø t | ø t | 5 | five |
| Iraqw | iraq1241 | 5 | k o o ʔ á n | k o o ʔ á n | 3 | five |
| Irish | iris1253 | 5 | k uː ɟ | k uː ɟ | 5 | five |
| Italian | ital1282 | 5 | t͡ʃ i n k we | t͡ʃ i n k we | 5 | five |
| Kalasha | kala1372 | 5 | p o n ʤ | p o n ʤ | 3 | five |
| Uipo (Maringic) | khoi1251 | 5 | pʰ ə + ŋ ɑ̃ | pʰ ə + ŋ ɑ̃ | 51 5 | PFX2 five |
| Kumzari | uigh1240 | 5 | p a n ʤ | p a n ʤ | 5 | five |
| Lamani | lamb1269 | 5 | p aː n ʧ | p aː n ʧ | 6 | five |
| Lamjung Yolmo | lamj1247 | 5 | ŋ a | ŋ a | 1 | five |
| Latin | lati1261 | 5 | kʷ iː ŋ kʷ ɛ | kʷ iː ŋ kʷ ɛ | 3 | five |
| Lishana Deni | lish1247 | 5 | χ a m ʃ a | χ a m ʃ a | 5 | five |
| Makyam (Khalai) | maky1236 | 5 | pʰ ə ³¹ + ŋ a ³³ | pʰ ə ³¹ + ŋ a ³³ | 50 5 | PFX1 five |
| Maltese | malt1254 | 5 | h ɐ m s ɐ | h ɐ m s ɐ | 5 | five |
| Mandarin | mand1415 | 5 | u ²¹³ | u ²¹³ | 5 | five |
| Burmese | mand1476 | 5 | ŋ á | ŋ á | 5 | five |
| Mapudungun | mapu1245 | 5 | k e ʧ u | k e ʧ u | 6 | five |
| Georgian | nucl1302 | 5 | χ u tʰ + i | χ u tʰ + i | 4 7 | five nominative |
| Paraguayan Guarani | para1311 | 5 | p o | p o | 3 | five |
| Pashto | pash1269 | 5 | p i n z ə | p i n z ə | 4 | five |
| Russian | russ1263 | 5 | pʲ a tʲ | pʲ a tʲ | 7 | five |
| Sanskrit | sans1269 | 5 | p a ɲ tʃ a n | p a ɲ tʃ a | 5 | five |
| Scottish Gaelic | scot1245 | 5 | k oː gʲ | k oː gʲ | 5 | five |
| Sebat Bet Gurage | seba1251 | 5 | a m m ɨ s t | a m m ɨ s t | 5 | five |
| Siraiki | sera1259 | 5 | p ã ʤ | p ã ʤ | 10 | five4 |
| Shanghainese | shan1293 | 5 | ɦ ŋ̍ ²³ | ɦ ŋ̍ ²³ | 5 | five |
| Spanish | stan1288 | 5 | θ i ŋ k o | θ i ŋ k o | 5 | five |
| Standard Arabic | stan1318 | 5 | x a m s + a t + u n | x a m s + a t + u n | 6 13 12 | five suff_gender1 nom_sg |
| Standard Arabic | stan1318 | 5 | x a m s + u n | x a m s + u n | 6 12 | five nom_sg |
| Telugu | telu1262 | 5 | ʌ j i d u | ʌ j i d u | 5 | five |
| Tigre | tigr1270 | 5 | ħ a m ɨ s | ħ a m ɨ s | 6 | five |
| Tommo So | tomm1242 | 5 | ǹ n ɔ́ | ǹ n ɔ́ | 3 | five |
| Uyghur | uigh1240 | 5 | b ɛ ʃ | b ɛ ʃ | 4 | five |
| Wayuu | wayu1243 | 5 | h a ʔ r a i | h a ʔ r a ɺ i | 9 | five |
| Balochi | west2368 | 5 | p ə n ʧ | p ə n ʧ | 6 | five |
| Wolam | wola1254 | 5 | p ə ŋ u | p ə ŋ u | 5 | five |
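
Since each row is a plain list of strings, the rows can be written to a flat CSV file with Python's standard csv module. The sketch below uses two rows copied from the output above in place of the full table list and writes to an in-memory buffer so that it runs on its own. The helper name write_csv and the file name five.csv in the comment are arbitrary choices.

```python
import csv
import io

headers = ["Language", "Glottocode", "Number", "Surface",
           "Underlying", "Cognates", "Morphemes"]

# Two sample rows copied from the output above; in practice, use `table`.
rows = [
    ["Lamani", "lamb1269", "5", "p aː n ʧ", "p aː n ʧ", "6", "five"],
    ["Georgian", "nucl1302", "5", "χ u tʰ + i", "χ u tʰ + i", "4 7",
     "five nominative"],
]


def write_csv(stream, headers, rows):
    """Write a header line followed by all rows to an open text stream."""
    writer = csv.writer(stream)
    writer.writerow(headers)
    writer.writerows(rows)


buffer = io.StringIO()
write_csv(buffer, headers, rows)
print(buffer.getvalue())

# To write a file instead of a buffer:
# with open("five.csv", "w", encoding="utf-8", newline="") as handle:
#     write_csv(handle, headers, rows)
```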

A table in this form can also be exported to a CSV file with a flat structure (as opposed to the multi-table representation in CLDF). The same can be achieved by converting the data to SQLite and then extracting a single table in CSV format from the SQLite database. For example, the following SQLite code can be written to a file flat.sql and later run with SQLite.

.mode csv
.headers on
select 
    l.cldf_name, 
    l.cldf_glottocode, 
    l.cldf_latitude, 
    l.cldf_longitude, 
    c.numbervalue, 
    c.concepticon_gloss, 
    f.surface_form, 
    f.underlying_form 
from 
  languagetable as l, 
  parametertable as c, 
  formtable as f 
where 
    f.cldf_languagereference = l.cldf_id and 
    f.cldf_parameterreference = c.cldf_id
;

To convert the dataset to SQLite format, the pycldf package offers a command-line command.

$ cldf createdb cldf/cldf-metadata.json cosinus.sqlite

Then, having created the file flat.sql with the code above, one can run the following command to extract the combined information from the tables and write it to a single CSV file.

$ sqlite3 cosinus.sqlite < flat.sql > cosinus.csv
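
The same join can also be run from Python with the standard sqlite3 and csv modules instead of the sqlite3 shell. The sketch below builds a tiny in-memory database as a stand-in for cosinus.sqlite so that it runs on its own. The table layout is assumed from the column names used in flat.sql, the latitude and longitude columns are left out for brevity, and storing list-valued columns as " + "-joined strings is an assumption. The single row is taken from the Lamani example above.

```python
import csv
import io
import sqlite3

# The query from flat.sql, written with explicit joins.
QUERY = """
select
    l.cldf_name,
    l.cldf_glottocode,
    c.numbervalue,
    c.concepticon_gloss,
    f.surface_form,
    f.underlying_form
from formtable as f
join languagetable as l on f.cldf_languagereference = l.cldf_id
join parametertable as c on f.cldf_parameterreference = c.cldf_id
"""


def export_flat(connection, stream):
    """Run the join and write the result, with a header line, as CSV."""
    cursor = connection.execute(QUERY)
    writer = csv.writer(stream)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


# Stand-in database with one row per table (assumed, simplified layout).
db = sqlite3.connect(":memory:")
db.executescript("""
create table languagetable (cldf_id text, cldf_name text, cldf_glottocode text);
create table parametertable (cldf_id text, numbervalue integer,
                             concepticon_gloss text);
create table formtable (cldf_id text, cldf_languagereference text,
                        cldf_parameterreference text,
                        surface_form text, underlying_form text);
insert into languagetable values ('Lamani', 'Lamani', 'lamb1269');
insert into parametertable values ('thirtyfive', 35, 'THIRTY FIVE');
insert into formtable values ('Lamani-thirtyfive-1', 'Lamani', 'thirtyfive',
                              't i s + a n + p aː n ʧ', 't i s + a n + p aː n ʧ');
""")

buffer = io.StringIO()
export_flat(db, buffer)
print(buffer.getvalue())
```

To run this against the converted dataset, one would replace the stand-in with sqlite3.connect("cosinus.sqlite") and pass a file opened with newline="" instead of the buffer.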

Statistics

CLDF validation Glottolog: 100% Concepticon: 100% Source: 100% BIPA: 100% CLTS SoundClass: 100%

  • Varieties: 52 (linked to 52 different Glottocodes)
  • Concepts: 99 (linked to 99 different Concepticon concept sets)
  • Lexemes: 5,751
  • Sources: 81
  • Synonymy: 1.12
  • Invalid lexemes: 0
  • Tokens: 74,616
  • Segments: 249 (1 BIPA errors, 1 CLTS sound class errors, 243 CLTS modified)
  • Inventory size (avg): 24.15

Contributors

| Name | GitHub user | Description | Role |
| --- | --- | --- | --- |
| Arne Rubehn | @arubehn | data annotation, CLDF conversion | Author |
| Christoph Rzymski | @chrzyki | CLDF conversion | Author |
| Luca Ciucci | | data annotation | Author |
| Katja Bocklage | | data annotation | Author |
| Alžběta Kučerová | | data annotation | Author |
| David Snee | | data annotation | Author |
| Kellen Parker van Dam | | data annotation | Author |
| Chundra Cathcart | @chundrac | data annotation | Author |
| Carlo Meloni | @Cymelo92 | data annotation | Author |
| Johann-Mattis List | @lingulist | CLDF conversion, data annotation | Author |

CLDF Datasets

The following CLDF datasets are available in cldf: