New format and error file support #2

hmdne · 2024-05-01T04:53:37Z

See also: #1, interscript/geonames-transliteration-data#12

Do note that GeoTest doesn't depend on interscript/geonames-transliteration-data, so interscript/geonames-transliteration-data#12 is NOT fixed by this pull request.

The first commit makes GeoTest compatible with the new format.

The second commit adds a way to generate an error file: a TSV file described in https://github.com/hmdne/geotest/blob/hmdne/new-format-error-file/errors_documentation.md . In particular, GeoTest can now infer a transliteration map when either the transliteration is incorrect (apart from punctuation, spacing and casing errors, which are only reported as errors) or transl_cd is empty.

I will provide the results of this computation as soon as it completes.

In particular, the following fields are not available in the new format:
- full_name_rg

The following fields have been renamed:
- full_name_ro -> full_name
- lcd -> lang_cd
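
To make the field changes concrete, here is a minimal sketch of mapping an old-format record onto the new field names. It is not GeoTest code: the helper `to_new_format` and the sample record values are hypothetical, and only the field names come from the description above.

```python
# Field changes taken from the description above; everything else is illustrative.
RENAMED_FIELDS = {"full_name_ro": "full_name", "lcd": "lang_cd"}
REMOVED_FIELDS = {"full_name_rg"}


def to_new_format(record: dict) -> dict:
    """Return a copy of an old-format record that uses the new field names."""
    migrated = {}
    for key, value in record.items():
        if key in REMOVED_FIELDS:
            continue  # this field is not available in the new format
        migrated[RENAMED_FIELDS.get(key, key)] = value
    return migrated


# Hypothetical sample record, for illustration only.
old_record = {"full_name_ro": "Kabul", "full_name_rg": "Kabul", "lcd": "fas", "transl_cd": ""}
print(to_new_format(old_record))
```

Fields that are neither renamed nor removed (such as transl_cd here) pass through unchanged.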

hmdne · 2024-05-01T07:46:18Z

I am uploading a partial result, as the computation is still running (transliteration system detection is an expensive operation) and I want you to have the results to look at as soon as possible. I will update it when it finishes.

This dataset contains directories named after countries; each contains two files. The first, "result.txt", contains GeoTest's output summary. The second, "errors.tsv", lists the individual issues (as described in the post above).

The temporary dataset is missing "errors.tsv" for the following countries:

Afghanistan
China
Russia
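
For anyone checking an extracted copy of the dataset, the sketch below lists the country directories that have no "errors.tsv". It assumes only the layout described above (one directory per country, each holding "result.txt" and "errors.tsv"); the function name and the root path passed to it are hypothetical.

```python
from pathlib import Path


def countries_missing_errors(root: str) -> list[str]:
    """List country directories under `root` that contain no errors.tsv."""
    return sorted(
        entry.name
        for entry in Path(root).iterdir()
        if entry.is_dir() and not (entry / "errors.tsv").is_file()
    )
```

Calling `countries_missing_errors(".")` from the directory where the archive was extracted returns the directory names in sorted order.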

_output.tar.gz

hmdne · 2024-05-03T00:44:24Z

The computation has completed. Unfortunately, the output exceeds GitHub's 25MB attachment size limit, and GitHub attachments only support GZ compression, so I have split it into four files:

_output2.tar.aa.gz
_output2.tar.ab.gz
_output2.tar.ac.gz
_output2.tar.ad.gz

To extract these, use the command: ls _output2.tar.*.gz | sort | xargs gunzip -c | tar -xvf -

To create a single tar.gz archive from those, use this command: ls _output2.tar.*.gz | sort | xargs gunzip -c | gzip -c9 > _output2.tar.gz
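
Where a shell with gunzip and tar is not available, the same reassembly can be sketched in Python with the standard library alone. This is an illustration of the approach in the commands above (decompress each part in name order and concatenate the results into one tar), not something shipped with GeoTest; the function name is hypothetical.

```python
import glob
import gzip
import shutil


def join_parts(pattern: str, tar_path: str) -> list[str]:
    """Decompress every gzip part matching `pattern`, in name order, into `tar_path`."""
    parts = sorted(glob.glob(pattern))
    with open(tar_path, "wb") as out:
        for part in parts:
            # Each part is an independent gzip stream holding one slice of the tar.
            with gzip.open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, out)
    return parts
```

For example, `join_parts("_output2.tar.*.gz", "_output2.tar")` writes an uncompressed tar archive that any tar tool can extract; the pattern does not match `_output2.tar.gz`, so a previously combined archive in the same directory is not picked up by mistake.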

Ultimately, this took almost 2 days to compute on a 16-core Ryzen 9 5950X, with countries processed in parallel (peaking at up to 20GiB of RAM for a single process, but usually much lower). I looked for potential optimizations but didn't find anything spectacular. The real bottleneck is the transliteration that has to be run for each transliteration system detection: if no hint is provided, we have to try every transliteration map.
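
The cost described above can be illustrated with a toy sketch of brute-force detection: with no hint, every candidate system is run on the source name and compared against the expected transliteration, so the work grows with the number of systems times the number of records. This is not GeoTest's implementation; the per-character tables, the system names and both functions are made up for illustration.

```python
def transliterate(text: str, table: dict[str, str]) -> str:
    """Toy stand-in for a real transliteration system: per-character lookup."""
    return "".join(table.get(ch, ch) for ch in text)


def detect_systems(source: str, expected: str, systems: dict[str, dict[str, str]]) -> list[str]:
    """Run every candidate system and keep those that reproduce `expected`."""
    return [
        name
        for name, table in systems.items()
        if transliterate(source, table) == expected
    ]


# Two made-up systems that render the same Cyrillic letters differently.
SYSTEMS = {
    "toy-a": {"ш": "sh", "щ": "shch"},
    "toy-b": {"ш": "š", "щ": "šč"},
}

print(detect_systems("шщ", "ššč", SYSTEMS))  # prints ['toy-b']
```

A hint such as a non-empty transl_cd would let the loop check a single system instead of all of them, which is why records without a hint dominate the running time.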

hmdne added 2 commits May 1, 2024 06:17

Update GeoTest to support new format.

9e6d873

In particular, the following fields are not available in new format:
- full_name_rg

The following fields have been renamed:
- full_name_ro -> full_name
- lcd -> lang_cd

Add a possibility to generate an error file.

0c845c5
