Python can't read colexifications.csv (with default settings) #12

@johenglisch

I updated the meta database the other day and the clics4 dataset triggered an error. It seems that Python's built-in csv parser rejects the file because a single cell is too large:

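import csv
import io
import zipfile

# Read the zipped CSV with the csv module's default settings.
with zipfile.ZipFile('cldf/colexifications.csv.zip') as zf:
    with zf.open('colexifications.csv') as raw_f:
        # newline='' lets the csv module handle line endings itself.
        unicode_f = io.TextIOWrapper(raw_f, encoding='utf-8', newline='')
        rdr = csv.reader(unicode_f)
        rows = list(rdr)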

…throws:

_csv.Error: field larger than field limit (131072)

Consuming programs can avoid the problem by raising Python's field size limit, e.g.:

csv.field_size_limit(256 * 1024)
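For consumers that would rather not change the limit globally, here is a minimal sketch of a helper that raises it only while reading and then restores the previous value. The name `read_zipped_csv` and its parameters are made up for illustration; only `csv.field_size_limit`, `zipfile` and `io.TextIOWrapper` are standard library calls.

```python
import csv
import io
import zipfile


def read_zipped_csv(zip_path, member, field_limit=256 * 1024):
    """Read all rows of a CSV file inside a zip archive.

    Hypothetical helper: the csv field size limit is raised to
    `field_limit` for the duration of the read and restored afterwards.
    """
    # csv.field_size_limit(new) sets the limit and returns the old one.
    old_limit = csv.field_size_limit(field_limit)
    try:
        with zipfile.ZipFile(zip_path) as zf:
            with zf.open(member) as raw_f:
                unicode_f = io.TextIOWrapper(
                    raw_f, encoding='utf-8', newline='')
                return list(csv.reader(unicode_f))
    finally:
        # Put the previous limit back, even if reading failed.
        csv.field_size_limit(old_limit)
```

With the paths from the snippet above, this would be called as `read_zipped_csv('cldf/colexifications.csv.zip', 'colexifications.csv')`. Whether 256 KiB stays large enough as the data grows is an open question.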

I did that on my end, but to be completely honest, the fact that an individual data point doesn't fit into 128 KiB is kind of a ‘data smell’. Maybe those Forms, Varieties, and Languages columns should be put into separate tables rather than being inlined as arrays within a table cell.
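To see which columns actually hold the oversized cells, something like the following rough sketch could help. The function `max_field_lengths` is a hypothetical name, and it assumes the first row is the header. It reports the longest cell per column, in characters.

```python
def max_field_lengths(rows):
    """Return {column name: length of the longest cell in that column}.

    Hypothetical helper; `rows` is any iterable of lists of strings
    whose first item is the header row (e.g. a csv.reader).
    """
    rows = iter(rows)
    header = next(rows)
    longest = dict.fromkeys(header, 0)
    for row in rows:
        for name, cell in zip(header, row):
            if len(cell) > longest[name]:
                longest[name] = len(cell)
    return longest
```

It can be fed the output of `csv.reader` directly, as long as the field size limit has been raised beforehand so that the reader gets through the file.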

(As a side note: this also means that you currently can't run cldf validate on the data. At least not from the command line; the dataset can still be validated from within Python after raising the field size limit.)
