14 changes: 7 additions & 7 deletions CONTRIBUTORS.md
@@ -1,9 +1,9 @@
# Contributors

Name | GitHub user | Description | Role
--- |----------------|-------------------------------------------------| ---
Carlos Ugarte | @MuffinLinwist | Data collector, CLDF conversion and annotation | Author, Editor
Frederic Blum | @FredericBlum | CLDF conversion and annotation | Author, Editor
Adriano Ingunza | @BadBatched | Data collector and annotation | Author
Rosa Gonzales | @rosalgm | Data collector and annotation | Author
Jaime Peña | @JaimePenat | Data collector and annotation | Author
Name | GitHub user | Description | Role
--- |---------------|-------------------------------------------------| ---
Carlos Ugarte | @CMUgarte | Data collector, CLDF conversion and annotation | Author, Editor
Frederic Blum | @FredericBlum | CLDF conversion and annotation | Author, Editor
Adriano Ingunza | @BadBatched | Data collector and annotation | Author
Rosa Gonzales | @rosalgm | Data collector and annotation | Author
Jaime Peña | @JaimePenat | Data collector and annotation | Author
1 change: 1 addition & 0 deletions FORMS.md
@@ -8,6 +8,7 @@ The value-to-form processing is divided into two steps, implemented as methods:
- `FormSpec.clean`: Normalizes a form chunk.

These methods use the attributes of a `FormSpec` instance to configure their behaviour.

- `brackets`: `{'(': ')'}`
Pairs of strings that should be recognized as brackets, specified as `dict` mapping opening string to closing string
- `separators`: `(';', '/', ',')`
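
As an illustration, here is a minimal sketch of configuring these attributes, assuming a recent pylexibank version where `FormSpec` is importable from the package root; only the attributes documented above are set, using their default values:

```
from pylexibank import FormSpec

# configure how raw values are split into form chunks and cleaned;
# only the two attributes documented above are set explicitly
spec = FormSpec(
    brackets={'(': ')'},          # recognize (...) as bracketed material
    separators=(';', '/', ','),   # split multi-form values on these strings
)
```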
38 changes: 37 additions & 1 deletion NOTES.md
@@ -1 +1,37 @@
work in progress
### Accessing the data
#### Installing dependencies

The first step to access all contents of the dataset is to clone the repository and install the necessary requirements:

```
git clone https://github.com/lexibank/northperulex.git
cd northperulex
pip install -e .
```
This installs all packages used for the conversion to CLDF (Cross-Linguistic Data Formats: [https://cldf.clld.org](https://cldf.clld.org)).
The NorthPeruLex dataset can also be downloaded as a ZIP file directly from this GitHub repository or from Zenodo ([10.5281/zenodo.13269802](https://doi.org/10.5281/zenodo.13269802)).
If the user wishes to perform the CLDF conversion, they can run the following command:

```
cldfbench lexibank.makecldf lexibank_northperulex.py --concepticon-version=v3.4.0 --glottolog-version=v5.2.1 --clts-version=v2.3.0
```
This command uses the cldfbench package ([https://pypi.org/project/cldfbench/](https://pypi.org/project/cldfbench/)) with the pylexibank plug-in ([https://pypi.org/project/pylexibank/](https://pypi.org/project/pylexibank/)) to automatically convert the data to CLDF, using the raw data in the `raw` folder and the latest versions
(at the time of publication of this dataset) of the reference catalogs: Concepticon ([https://concepticon.clld.org/](https://concepticon.clld.org/)) for concept glosses, Glottolog ([https://glottolog.org/](https://glottolog.org/)) for language names, and CLTS ([https://clts.clld.org/](https://clts.clld.org/)) for phonetic transcriptions.
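
After the conversion, the generated CLDF can be checked for conformance, for example with pycldf; this is a minimal sketch, assuming the standard lexibank metadata path `cldf/cldf-metadata.json`, and the repository's CLDF-validation workflow performs the same kind of check:

```
from pycldf import Dataset

# load the dataset via its metadata and check CLDF conformance
ds = Dataset.from_metadata('cldf/cldf-metadata.json')
print(ds.validate())
```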

The converted data is located in the `cldf` folder.
All data in the dataset is stored as tabular (CSV) files, so it can be read and manually inspected on a wide range of platforms and environments.
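
For example, the forms can be read with nothing more than Python's standard library; this sketch assumes the standard lexibank layout of `cldf/forms.csv` with `Language_ID`, `Parameter_ID`, and `Form` columns:

```
import csv

# print language, concept and form for every row of the FormTable
with open('cldf/forms.csv', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        print(row['Language_ID'], row['Parameter_ID'], row['Form'])
```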

#### Creating the wordlist
We provide a `Makefile` in the `analysis` folder that creates a wordlist as a TSV file, which can be used to manually inspect the data with the help
of the EDICTOR web tool ([https://edictor.org/](https://edictor.org/)).
To produce the file, please run the following commands:

```
cd analysis
pip install -r requirements.txt
make wordlist
```

In addition to yielding the wordlist file (`npl_data.tsv`), the script also
performs automatic recognition of sound correspondence patterns,
the results of which are stored in the file `npl_patterns.tsv`.
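
Once created, the wordlist can also be loaded programmatically, e.g. with LingPy; this is a sketch, assuming `lingpy` is among the requirements installed above:

```
from lingpy import Wordlist

# load the EDICTOR-style TSV wordlist produced by `make wordlist`
wl = Wordlist('npl_data.tsv')
print(wl.width, 'languages,', wl.height, 'concepts')
```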
79 changes: 56 additions & 23 deletions README.md
@@ -1,7 +1,5 @@
# CLDF dataset derived from Ugarte et al.'s "NorthPeruLex - A Lexical Dataset of Small Language Families and Isolates from Northern Peru" (forthcoming).

[![CLDF validation](https://github.com/lexibank/northperulex/workflows/CLDF-validation/badge.svg)](https://github.com/lexibank/northperulex/actions?query=workflow%3ACLDF-validation)

## How to cite

If you use these data, please cite
@@ -21,40 +19,75 @@ Conceptlists in Concepticon:
- [Swadesh-1952-200](https://concepticon.clld.org/contributions/Swadesh-1952-200)
## Notes

work in progress
### Accessing the data
#### Installing dependencies

The first step to access all contents of the dataset is to clone the repository and install the necessary requirements:

```
git clone https://github.com/lexibank/northperulex.git
cd northperulex
pip install -e .
```
This installs all packages used for the conversion to CLDF (Cross-Linguistic Data Formats: [https://cldf.clld.org](https://cldf.clld.org)).
The NorthPeruLex dataset can also be downloaded as a ZIP file directly from this GitHub repository or from Zenodo ([10.5281/zenodo.13269802](https://doi.org/10.5281/zenodo.13269802)).
If the user wishes to perform the CLDF conversion, they can run the following command:

```
cldfbench lexibank.makecldf lexibank_northperulex.py --concepticon-version=v3.4.0 --glottolog-version=v5.2.1 --clts-version=v2.3.0
```
This command uses the cldfbench package ([https://pypi.org/project/cldfbench/](https://pypi.org/project/cldfbench/)) with the pylexibank plug-in ([https://pypi.org/project/pylexibank/](https://pypi.org/project/pylexibank/)) to automatically convert the data to CLDF, using the raw data in the `raw` folder and the latest versions
(at the time of publication of this dataset) of the reference catalogs: Concepticon ([https://concepticon.clld.org/](https://concepticon.clld.org/)) for concept glosses, Glottolog ([https://glottolog.org/](https://glottolog.org/)) for language names, and CLTS ([https://clts.clld.org/](https://clts.clld.org/)) for phonetic transcriptions.
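
After the conversion, the generated CLDF can be checked for conformance, for example with pycldf; this is a minimal sketch, assuming the standard lexibank metadata path `cldf/cldf-metadata.json`, and the repository's CLDF-validation workflow performs the same kind of check:

```
from pycldf import Dataset

# load the dataset via its metadata and check CLDF conformance
ds = Dataset.from_metadata('cldf/cldf-metadata.json')
print(ds.validate())
```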

The converted data is located in the `cldf` folder.
All data in the dataset is stored as tabular (CSV) files, so it can be read and manually inspected on a wide range of platforms and environments.
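
For example, the forms can be read with nothing more than Python's standard library; this sketch assumes the standard lexibank layout of `cldf/forms.csv` with `Language_ID`, `Parameter_ID`, and `Form` columns:

```
import csv

# print language, concept and form for every row of the FormTable
with open('cldf/forms.csv', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        print(row['Language_ID'], row['Parameter_ID'], row['Form'])
```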

#### Creating the wordlist
We provide a `Makefile` in the `analysis` folder that creates a wordlist as a TSV file, which can be used to manually inspect the data with the help
of the EDICTOR web tool ([https://edictor.org/](https://edictor.org/)).
To produce the file, please run the following commands:

```
cd analysis
pip install -r requirements.txt
make wordlist
```

In addition to yielding the wordlist file (`npl_data.tsv`), the script also
performs automatic recognition of sound correspondence patterns,
the results of which are stored in the file `npl_patterns.tsv`.
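
Once created, the wordlist can also be loaded programmatically, e.g. with LingPy; this is a sketch, assuming `lingpy` is among the requirements installed above:

```
from lingpy import Wordlist

# load the EDICTOR-style TSV wordlist produced by `make wordlist`
wl = Wordlist('npl_data.tsv')
print(wl.width, 'languages,', wl.height, 'concepts')
```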


## Statistics


[![CLDF validation](https://github.com/lexibank/northperulex/workflows/CLDF-validation/badge.svg)](https://github.com/lexibank/northperulex/actions?query=workflow%3ACLDF-validation)
![Glottolog: 97%](https://img.shields.io/badge/Glottolog-97%25-green.svg "Glottolog: 97%")
![Glottolog: 100%](https://img.shields.io/badge/Glottolog-100%25-brightgreen.svg "Glottolog: 100%")
![Concepticon: 100%](https://img.shields.io/badge/Concepticon-100%25-brightgreen.svg "Concepticon: 100%")
![Source: 100%](https://img.shields.io/badge/Source-100%25-brightgreen.svg "Source: 100%")
![BIPA: 98%](https://img.shields.io/badge/BIPA-98%25-green.svg "BIPA: 98%")
![CLTS SoundClass: 98%](https://img.shields.io/badge/CLTS%20SoundClass-98%25-green.svg "CLTS SoundClass: 98%")
![BIPA: 100%](https://img.shields.io/badge/BIPA-100%25-brightgreen.svg "BIPA: 100%")
![CLTS SoundClass: 100%](https://img.shields.io/badge/CLTS%20SoundClass-100%25-brightgreen.svg "CLTS SoundClass: 100%")

- **Varieties:** 35 (linked to 34 different Glottocodes)
- **Varieties:** 35 (linked to 35 different Glottocodes)
- **Concepts:** 200 (linked to 200 different Concepticon concept sets)
- **Lexemes:** 4,804
- **Sources:** 16
- **Synonymy:** 1.11
- **Cognacy:** 4,804 cognates in 3,282 cognate sets (2,554 singletons)
- **Cognate Diversity:** 0.67
- **Lexemes:** 4,986
- **Sources:** 21
- **Synonymy:** 1.12
- **Cognacy:** 4,986 cognates in 3,660 cognate sets (2,905 singletons)
- **Cognate Diversity:** 0.72
- **Invalid lexemes:** 0
- **Tokens:** 27,431
- **Segments:** 149 (3 BIPA errors, 3 CLTS sound class errors, 146 CLTS modified)
- **Inventory size (avg):** 27.91
- **Tokens:** 29,552
- **Segments:** 178 (0 BIPA errors, 0 CLTS sound class errors, 178 CLTS modified)
- **Inventory size (avg):** 29.83

# Contributors

Name | GitHub user | Description | Role
--- |----------------|-------------------------------------------------| ---
Carlos Ugarte | @MuffinLinwist | Data collector, CLDF conversion and annotation | Author, Editor
Frederic Blum | @FredericBlum | CLDF conversion and annotation | Author, Editor
Adriano Ingunza | @BadBatched | Data collector and annotation | Author
Rosa Gonzales | @rosalgm | Data collector and annotation | Author
Jaime Peña | @JaimePenat | Data collector and annotation | Author
Name | GitHub user | Description | Role
--- |---------------|-------------------------------------------------| ---
Carlos Ugarte | @CMUgarte | Data collector, CLDF conversion and annotation | Author, Editor
Frederic Blum | @FredericBlum | CLDF conversion and annotation | Author, Editor
Adriano Ingunza | @BadBatched | Data collector and annotation | Author
Rosa Gonzales | @rosalgm | Data collector and annotation | Author
Jaime Peña | @JaimePenat | Data collector and annotation | Author


