kathairo is a Python library for Scripture text processing that converts USFM and USX files into structured TSV files. Built on SIL's machine.py, kathairo produces both verse-level and token-level outputs.
pip install kathairo

import kathairo
# Simple fluent API
kathairo.from_usfm_corpus("path/to/usfm") \
.with_project_name("MyBible") \
.with_versification("path/to/versification.vrs") \
.with_language("eng") \
.use_latin_ws_tokenizer() \
.build()
# Output: output/eng/MyBible/token/token_MyBible.tsv
#         output/eng/MyBible/verse/verse_MyBible.tsv

- Python 3.11 or 3.12
- Dependencies (automatically installed):
  - sil-machine==1.8.4
  - spacy>=3.7.5
  - polars>=1.4.0
  - pandas>=3.0.0

pip install kathairo

git clone https://github.com/Clear-Bible/kathairo.py.git
cd kathairo.py
poetry install

kathairo offers three APIs to fit your workflow:
The builder pattern provides method chaining:
import kathairo
# Start with a corpus source
kathairo.from_usfm_corpus("resources/eng/ESV/usfm") \
.with_project_name("ESV") \
.with_versification("resources/eng/versification.vrs") \
.with_language("eng") \
.use_latin_ws_tokenizer() \
.exclude_cross_references() \
.build()

Type Safety: The builder enforces required parameters at development time. Your IDE won't show `.build()` until you've provided:
- A corpus source (`from_usfm_corpus`, `from_usx_corpus`, or `from_tsv`)
- A project name (`.with_project_name()`)
- An output location (`.with_language()` or `.with_output_dir()`)

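One way this kind of staged type safety can be arranged is to give each stage of the chain its own class and define `build()` only on the last one. The sketch below illustrates the idea with hypothetical class names (`NeedsProjectName`, `NeedsOutputLocation`, `ReadyToBuild`) and a toy `build()` that only returns the collected settings. It is an assumption about the general technique, not kathairo's actual implementation.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field


@dataclass
class Settings:
    """Values collected along the chain (hypothetical, for illustration)."""
    corpus_path: str
    project_name: str | None = None
    language: str | None = None
    options: dict[str, bool] = field(default_factory=dict)


class NeedsProjectName:
    """Stage 1: corpus chosen. No build() is defined at this stage."""

    def __init__(self, settings: Settings) -> None:
        self._settings = settings

    def exclude_cross_references(self) -> NeedsProjectName:
        # Optional settings keep the chain in the same stage.
        self._settings.options["excludeCrossReferences"] = True
        return self

    def with_project_name(self, name: str) -> NeedsOutputLocation:
        self._settings.project_name = name
        return NeedsOutputLocation(self._settings)


class NeedsOutputLocation:
    """Stage 2: project named. Still no build()."""

    def __init__(self, settings: Settings) -> None:
        self._settings = settings

    def with_language(self, lang: str) -> ReadyToBuild:
        self._settings.language = lang
        return ReadyToBuild(self._settings)


class ReadyToBuild:
    """Stage 3: every required value is present, so build() exists."""

    def __init__(self, settings: Settings) -> None:
        self._settings = settings

    def build(self) -> dict:
        # A real builder would run the conversion; this toy returns the settings.
        return asdict(self._settings)


def from_corpus(path: str) -> NeedsProjectName:
    """Entry point of the toy chain."""
    return NeedsProjectName(Settings(corpus_path=path))


ready = (
    from_corpus("resources/eng/ESV/usfm")
    .exclude_cross_references()
    .with_project_name("ESV")
    .with_language("eng")
)
print(ready.build())
```

Because the earlier stage classes have no `build` attribute, a type checker or IDE can flag `from_corpus(...).build()` before the code runs.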
import kathairo
kathairo.create_tsv(
targetUsfmCorpusPath="resources/eng/ESV/usfm",
projectName="ESV",
targetVersificationPath="resources/eng/versification.vrs",
language="eng",
latinWhiteSpaceIncludedTokenizer=True,
excludeCrossReferences=True
)

projects.json:
[
{
"projectName": "ESV",
"language": "eng",
"targetUsfmCorpusPath": "resources/eng/ESV/usfm",
"targetVersificationPath": "resources/eng/versification.vrs",
"latinWhiteSpaceIncludedTokenizer": true,
"excludeCrossReferences": true
},
{
"projectName": "RVR1960",
"language": "spa",
"targetUsfmCorpusPath": "resources/spa/RVR1960/usfm",
"targetVersificationPath": "resources/spa/versification.vrs",
"latinWhiteSpaceIncludedTokenizer": true
}
]

Process in parallel:
import kathairo
kathairo.from_config_file("projects.json").build()

output/
└── {language}/
    └── {projectName}/
        ├── token/
        │   └── token_{projectName}.tsv
        └── verse/
            └── verse_{projectName}.tsv

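The default layout above can be computed from the language code and project name, which is useful for checking that a run produced both files. This is a minimal sketch: `expected_outputs` is a hypothetical helper (not part of kathairo), and the root directory name `output` is taken from the tree above.

```python
from pathlib import Path


def expected_outputs(project_name: str, language: str, root: str = "output") -> dict[str, Path]:
    """Return the default token and verse TSV paths for one project."""
    base = Path(root) / language / project_name
    # Each kind lives in its own folder and prefixes the file name.
    return {
        kind: base / kind / f"{kind}_{project_name}.tsv"
        for kind in ("token", "verse")
    }


paths = expected_outputs("MyBible", "eng")
for kind, path in paths.items():
    print(kind, path, "exists" if path.exists() else "missing")
```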
kathairo.from_usfm_corpus("path/to/usfm") \
.with_project_name("MyBible") \
.with_versification("path/to/versification.vrs") \
.with_output_dir("custom/output/path") \
.use_latin_ws_tokenizer() \
.build()
# Output: custom/output/path/token/token_MyBible.tsv
#         custom/output/path/verse/verse_MyBible.tsv

Verse TSV columns:

| Column | Description |
|---|---|
| `id` | Verse identifier (BBCCCVVV format) |
| `source_verse` | Source versification verse ID |
| `text` | Complete verse text |
| `id_range_end` | End verse for verse ranges |
| `source_verse_range_end` | End verse in source versification |

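A BBCCCVVV identifier can be split back into its parts by position. The sketch below assumes BB is a two-digit book number, CCC the chapter, and VVV the verse, which matches the Genesis 1:1 ID `01001001`. `parse_verse_id` and `VerseRef` are hypothetical names, not part of kathairo.

```python
from typing import NamedTuple


class VerseRef(NamedTuple):
    book: int
    chapter: int
    verse: int


def parse_verse_id(verse_id: str) -> VerseRef:
    """Split an 8-digit BBCCCVVV string into numeric book, chapter, verse."""
    if len(verse_id) != 8 or not (verse_id.isascii() and verse_id.isdigit()):
        raise ValueError(f"expected 8 digits (BBCCCVVV), got {verse_id!r}")
    return VerseRef(
        book=int(verse_id[0:2]),
        chapter=int(verse_id[2:5]),
        verse=int(verse_id[5:8]),
    )


print(parse_verse_id("01001001"))
```

IDs should be read as strings rather than integers, since a numeric column type would drop the leading zero of books 1 through 9.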
Token TSV columns:

| Column | Description |
|---|---|
| `id` | Token identifier (BBCCCVVVWWW format) |
| `source_verse` | Source versification verse ID |
| `text` | Token text |
| `skip_space_after` | "y" if no space should follow, empty otherwise |
| `exclude` | "y" if token should be excluded, empty otherwise |
| `id_range_end` | End verse for verse ranges |
| `source_verse_range_end` | End verse in source versification |
| `required` | "y" if token contains non-punctuation |

Example Token Output (Genesis 1:1):

| id | source_verse | text | skip_space_after | exclude | required |
|---|---|---|---|---|---|
| 01001001001 | 01001001 | In | | | y |
| 01001001002 | 01001001 | the | | | y |
| 01001001003 | 01001001 | beginning | | | y |
| 01001001004 | 01001001 | God | | | y |
| 01001001010 | 01001001 | earth | y | | y |
| 01001001011 | 01001001 | . | y | | n |

Required parameters:

| Parameter | Description |
|---|---|
| `projectName` | Project identifier (used in file naming) |
| One corpus source: | |
| `targetUsfmCorpusPath` | Path to USFM files directory |
| `targetUsxCorpusPath` | Path to USX files directory |
| `tsvPath` | Path to existing token TSV (for re-versification) |
| One output location: | |
| `language` | Language code (creates output/{language}/{projectName}/) |
| `output_dir` | Custom output directory |
| One tokenizer: | |
| `latinTokenizer` | Standard Latin word tokenizer |
| `latinWhiteSpaceIncludedTokenizer` | Latin tokenizer with whitespace preservation (recommended) |
| `chineseTokenizer` | Chinese Bible word tokenizer |

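The "one of each group" rule in the table above can be checked on a config dictionary before a long run starts. The sketch below reads "one" as "exactly one", which is an assumption, and `check_required` is a hypothetical helper rather than kathairo's own validation.

```python
CORPUS_SOURCES = ("targetUsfmCorpusPath", "targetUsxCorpusPath", "tsvPath")
OUTPUT_LOCATIONS = ("language", "output_dir")
TOKENIZERS = ("latinTokenizer", "latinWhiteSpaceIncludedTokenizer", "chineseTokenizer")


def check_required(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks complete."""
    problems = []
    if not config.get("projectName"):
        problems.append("projectName is required")
    groups = (
        ("corpus source", CORPUS_SOURCES),
        ("output location", OUTPUT_LOCATIONS),
        ("tokenizer", TOKENIZERS),
    )
    for label, keys in groups:
        # A key counts as chosen when it is present with a truthy value.
        chosen = [key for key in keys if config.get(key)]
        if len(chosen) != 1:
            problems.append(f"expected exactly one {label}, found {len(chosen)}: {chosen}")
    return problems


esv = {
    "projectName": "ESV",
    "language": "eng",
    "targetUsfmCorpusPath": "resources/eng/ESV/usfm",
    "latinWhiteSpaceIncludedTokenizer": True,
}
print(check_required(esv))
```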
Optional parameters:

| Parameter | Type | Description |
|---|---|---|
| `targetVersificationPath` | string | Path to versification file (.vrs). Defaults to English versification if not provided. |
| `treatApostropheAsSingleQuote` | boolean | Handle apostrophes as single quotes |
| `excludeBracketedText` | boolean | Exclude text within square brackets |
| `excludeCrossReferences` | boolean | Exclude cross-reference text |
| `stopWordsPath` | string | Path to TSV file containing stop words |
| `zwRemovalPath` | string | Path to TSV file for zero-width character removal |
| `regexRulesPath` | string | Path to custom regex rules module |
| `psalmSuperscriptionTag` | string | USFM tag for psalm superscriptions (default: "d") |

import kathairo
kathairo.from_usfm_corpus("resources/eng/ESV/usfm") \
.with_project_name("ESV") \
.with_versification("resources/eng/versification.vrs") \
.with_language("eng") \
.use_latin_ws_tokenizer() \
.build()

kathairo.from_usfm_corpus("path/to/usfm") \
.with_project_name("MyBible") \
.with_versification("path/to/versification.vrs") \
.with_output_dir("~/Documents/bible-data") \
.use_latin_ws_tokenizer() \
.build()

kathairo.from_usx_corpus("resources/eng/NIV/usx") \
.with_project_name("NIV") \
.with_versification("resources/eng/versification.vrs") \
.with_language("eng") \
.use_latin_ws_tokenizer() \
.exclude_cross_references() \
.exclude_bracketed_text() \
.build()

kathairo.from_usfm_corpus("resources/zho/CUV/usfm") \
.with_project_name("CUV") \
.with_versification("resources/zho/versification.vrs") \
.with_language("zho") \
.use_chinese_tokenizer() \
.build()

kathairo.from_usfm_corpus("resources/hin/IRV/usfm") \
.with_project_name("IRVHin") \
.with_versification("resources/hin/versification.vrs") \
.with_language("hin") \
.use_latin_ws_tokenizer() \
.with_stop_words("resources/hin/stopwords.tsv") \
.with_zw_removal("resources/hin/zw_removal.tsv") \
.with_regex_rules("resources/hin/custom_regex.py") \
.exclude_cross_references() \
.build()

# Re-versify an existing token TSV with a new versification
kathairo.from_tsv("output/eng/ESV/token/token_ESV.tsv") \
.with_project_name("ESV_NewVersification") \
.with_versification("resources/eng/new_versification.vrs") \
.with_language("eng") \
.use_latin_ws_tokenizer() \
.build()

import kathairo
projects = [
{
"targetUsfmCorpusPath": "resources/eng/ESV/usfm",
"projectName": "ESV",
"targetVersificationPath": "resources/eng/versification.vrs",
"language": "eng",
"latinWhiteSpaceIncludedTokenizer": True
},
{
"targetUsfmCorpusPath": "resources/spa/RVR1960/usfm",
"projectName": "RVR1960",
"targetVersificationPath": "resources/spa/versification.vrs",
"language": "spa",
"latinWhiteSpaceIncludedTokenizer": True
}
]
for project in projects:
    kathairo.create_tsv(config_object=project)

# Run all tests
poetry run pytest
# Run specific test file
poetry run pytest tests/test_builder.py
# Run with verbose output
poetry run pytest -v
# Run with coverage
poetry run pytest --cov=kathairo

- `tests/test_builder.py` - Tests for fluent builder API
- `tests/test_type_safety.py` - Tests for type-safe builder pattern
- `tests/test_output_dir.py` - Tests for output_dir functionality

# Start with a corpus source
kathairo.from_usfm_corpus(path: str) -> CorpusBuilder
kathairo.from_usx_corpus(path: str) -> CorpusBuilder
kathairo.from_tsv(path: str) -> CorpusBuilder
kathairo.from_config_file(path: str) -> ConfigBuilder
kathairo.from_config(obj: dict) -> ConfigBuilder
# CorpusBuilder methods (requires .with_project_name() next)
.with_versification(path: str) -> CorpusBuilder
.use_latin_tokenizer() -> CorpusBuilder
.use_latin_ws_tokenizer() -> CorpusBuilder
.use_chinese_tokenizer() -> CorpusBuilder
.exclude_bracketed_text() -> CorpusBuilder
.exclude_cross_references() -> CorpusBuilder
.treat_apostrophe_as_single_quote() -> CorpusBuilder
.with_psalm_superscription_tag(tag: str) -> CorpusBuilder
.with_regex_rules(path: str) -> CorpusBuilder
.with_stop_words(path: str) -> CorpusBuilder
.with_zw_removal(path: str) -> CorpusBuilder
.with_metadata_source_url(url: str) -> CorpusBuilder
.with_metadata_path(path: str) -> CorpusBuilder
.with_metadata_kind(kind: str) -> CorpusBuilder
.with_project_name(name: str) -> ProjectBuilder
# ProjectBuilder methods (requires .with_language() or .with_output_dir() next)
# ... same configuration methods as CorpusBuilder
.with_language(lang: str) -> CompleteBuilder
.with_output_dir(path: str) -> CompleteBuilder
# CompleteBuilder methods (can call .build())
# ... same configuration methods
.build() -> None

kathairo.create_tsv(
# Config sources (use one)
config_path: str = None,
config_object: dict = None,
# Direct parameters
targetUsfmCorpusPath: str = None,
targetUsxCorpusPath: str = None,
tsvPath: str = None,
targetVersificationPath: str = None,
latinTokenizer: bool = False,
latinWhiteSpaceIncludedTokenizer: bool = False,
chineseTokenizer: bool = False,
excludeBracketedText: bool = False,
excludeCrossReferences: bool = False,
psalmSuperscriptionTag: str = 'd',
treatApostropheAsSingleQuote: bool = False,
regexRulesPath: str = None,
stopWordsPath: str = None,
zwRemovalPath: str = None,
language: str = None,
projectName: str = None,
output_dir: str = None,
metadata_source_url: str = None,
metadata_path: str = None,
metadata_kind: str = None
) -> None

kathairo/
├── src/kathairo/
│   ├── parsing/          # USFM and USX parsers
│   ├── tokenization/     # Tokenizer implementations
│   ├── tsvs/             # TSV building and processing
│   ├── versification/    # Versification utilities
│   ├── helpers/          # Utility functions
│   ├── params.py         # Parameter definitions (source of truth)
│   ├── api.py            # Main API (create_tsv function)
│   └── builder.py        # Fluent builder pattern
├── tests/                # Test suite
│   ├── test_builder.py
│   ├── test_output_dir.py
│   └── test_type_safety.py
└── pyproject.toml

Robertson Brinker - robertson.brinker@biblica.com
- Built on SIL's machine.py - Machine learning and NLP library for Scripture
- Type-safe builder pattern inspired by modern API design principles
- Update version in `pyproject.toml`
- Run tests: `poetry run pytest tests`
- Build and publish:
  poetry config repositories.pypi https://upload.pypi.org/legacy/
  poetry publish --build --username __token__ --password <api-token>