
New format and error file support #2


Open: hmdne wants to merge 2 commits into base: master


hmdne commented May 1, 2024

See also: #1, interscript/geonames-transliteration-data#12

Do note that GeoTest doesn't depend on interscript/geonames-transliteration-data, so interscript/geonames-transliteration-data#12 is NOT fixed by this pull request.

The first commit makes GeoTest compatible with the new format.

The second commit adds a way to generate an error file: a TSV file described in https://github.com/hmdne/geotest/blob/hmdne/new-format-error-file/errors_documentation.md . In particular, GeoTest can now infer a transliteration map when either the transliteration is incorrect or transl_cd is empty. Punctuation, spacing, and casing errors are the exception: they are only reported as errors and do not trigger inference.

I will post the results of this computation as soon as it completes.

hmdne added 2 commits May 1, 2024 06:17
In particular, the following fields are not available in the new format:
- full_name_rg

The following fields have been renamed:
- full_name_ro -> full_name
- lcd -> lang_cd
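
The field changes listed above can be illustrated with a small sketch. This is not GeoTest's actual code: the function name `to_new_format` and the dict-based row representation are assumptions made for illustration; only the field names come from the commit message.

```python
# Field changes taken from the commit message; everything else here
# (function name, dict-based rows) is a hypothetical illustration.
RENAMED = {"full_name_ro": "full_name", "lcd": "lang_cd"}
REMOVED = {"full_name_rg"}


def to_new_format(row: dict) -> dict:
    """Return a copy of an old-format row keyed by new-format field names."""
    out = {}
    for key, value in row.items():
        if key in REMOVED:
            continue  # field no longer exists in the new format
        out[RENAMED.get(key, key)] = value  # rename if needed, else keep
    return out
```

Fields not mentioned in the commit message, such as transl_cd, pass through unchanged.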

hmdne commented May 1, 2024

I am uploading a partial result because the computation is still running (transliteration system detection is an expensive operation) and I want you to be able to look at the results as soon as possible. I will update it when the run finishes.

This dataset contains one directory per country, named after the country, and each directory contains two files. The first, "result.txt", contains GeoTest's output summary. The second, "errors.tsv", lists the individual issues (as described in the post above).

In this temporary dataset, "errors.tsv" is missing for the following countries:

  • Afghanistan
  • China
  • Russia

_output.tar.gz
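
For anyone inspecting the partial dataset, the layout described above (one directory per country, each with "result.txt" and "errors.tsv") can be checked with a short sketch like the following. The function name and the `root` argument are assumptions; the name of the top-level directory inside the archive is not stated here, so pass whatever path the archive extracts to.

```python
from pathlib import Path


def countries_missing_errors(root):
    """List country directories under `root` that have no errors.tsv file.

    `root` is the directory the archive was extracted into (an assumption;
    adjust to the actual extraction path).
    """
    root = Path(root)
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and not (d / "errors.tsv").is_file()
    )
```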


hmdne commented May 3, 2024

The computation has completed. Unfortunately, the output exceeds GitHub's 25 MB attachment size limit, and GitHub attachments only support gzip compression, so I have split the archive into four files:

_output2.tar.aa.gz
_output2.tar.ab.gz
_output2.tar.ac.gz
_output2.tar.ad.gz

To extract this, use the command: ls _output2.tar.*.gz | sort | xargs gunzip -c | tar -xv

To create a single tar.gz archive from those, use this command: ls _output2.tar.*.gz | sort | xargs gunzip -c | gzip -c9 > _output2.tar.gz
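
On systems without these shell tools, the same extraction can be sketched in Python. This is a minimal sketch, assuming (as the commands above imply) that each part is a gzip-compressed slice of one tar file and that the parts sort correctly by name; the function name `join_and_extract` is hypothetical.

```python
import gzip
import shutil
import tarfile
import tempfile
from pathlib import Path


def join_and_extract(parts_dir, dest_dir, pattern="_output2.tar.*.gz"):
    """Decompress the parts in name order, join them, and extract the tar."""
    parts = sorted(Path(parts_dir).glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files matching {pattern} in {parts_dir}")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Rebuild the original tar in a temporary file, one slice at a time.
    with tempfile.TemporaryFile() as joined:
        for part in parts:
            with gzip.open(part, "rb") as src:
                shutil.copyfileobj(src, joined)
        joined.seek(0)
        # Only extract archives you trust.
        with tarfile.open(fileobj=joined, mode="r:") as tar:
            tar.extractall(dest)
    return [p.name for p in parts]
```

Calling `join_and_extract(".", "output")` in the directory holding the four downloads would unpack them into `output/`.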

Ultimately, this took almost 2 days to compute on a 16-core Ryzen 9 5950X, with the countries processed in parallel (RAM usage peaked at about 20 GiB for a single process, but was usually much lower). I looked for optimization opportunities but didn't find anything spectacular. The real bottleneck is the transliteration that has to be performed for each transliteration system detection: if no hint is provided, we have to try every transliteration map.
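
The cost of that brute-force search can be illustrated with a toy sketch. This is not GeoTest's implementation: the function `detect_system`, the exact-match criterion, and the stand-in maps are all assumptions, chosen only to show why the work grows with the number of maps when no usable hint exists.

```python
from typing import Callable, Dict, Optional, Tuple

Transliterator = Callable[[str], str]


def detect_system(
    source: str,
    expected: str,
    maps: Dict[str, Transliterator],
    hint: Optional[str] = None,
) -> Tuple[Optional[str], int]:
    """Return (matching map name or None, number of transliterations run).

    Hypothetical logic: try the hinted map first, then every other map.
    """
    names = list(maps)
    if hint in maps:
        names.remove(hint)
        names.insert(0, hint)  # a good hint makes the search stop early
    tried = 0
    for name in names:
        tried += 1
        if maps[name](source) == expected:
            return name, tried
    return None, tried


# Stand-in "maps"; real transliteration maps are far more expensive to apply.
toy_maps = {"upper": str.upper, "lower": str.lower, "title": str.title}
```

With these toy maps, `detect_system("kabul", "Kabul", toy_maps)` runs all three transliterations before finding a match, while passing `hint="title"` runs only one.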
