Data for September 2023 Hackathon on Bible Translation and LLMs
In collaboration with SIL (Summer Institute of Linguistics), Microsoft is hosting a hackathon to explore the use of machine learning to improve the quality of Bible translation (specific tasks will be outlined at the outset of the event).
This repository contains data and code for hitting the ground running.
SIL-Microsoft Hackathon: Introduction to Bible Translation and Project 1
Assisted Translation using AI (Ryder Wishart)
Tokenization of low-resource languages (Ryder Wishart, Bethany Moore, Matthew Shannon)
The data is a collection of Bible translations in the data/bibles.csv file.
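To get a feel for the parallel data, a quick look with pandas is enough. This is only a sketch: the column layout of data/bibles.csv is not documented here, so the snippet inspects the file rather than assuming specific column names.

# Peek at the parallel Bible data (sketch; no particular columns assumed).
import pandas as pd

bibles = pd.read_csv("data/bibles.csv")
print(bibles.shape)               # rows x columns
print(bibles.columns.tolist())    # which translations/metadata are included
print(bibles.head())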
notebooks/01_quickstart.ipynb - Quickstart notebook for getting started with the data, including loading the data and prompting OpenAI with English and source-text data using LangChain.
notebooks/99-bibles_provenance.ipynb - Code used to generate the combined and consistent parallel Bible data used in this repository.
# Create a virtual environment (optional) and activate it
python3 -m venv venv
source venv/bin/activate
# Install all the requirements
pip install -r requirements.txt
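With the environment set up, a minimal prompting sketch in the spirit of notebooks/01_quickstart.ipynb might look like the following. This is illustrative only, not the notebook's code: it assumes a 2023-era LangChain installation and an OPENAI_API_KEY in your environment, and the prompt wording and example verse are placeholders.

# Minimal prompting sketch (illustrative; assumes LangChain is installed and
# OPENAI_API_KEY is set in the environment).
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate.from_template(
    "Greek source: {macula}\n"
    "English (BSB): {bsb}\n"
    "Using the aligned examples below, draft a translation of the source verse "
    "into the target language:\n{examples}"
)
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
chain = LLMChain(llm=llm, prompt=prompt)

draft = chain.run(
    macula="Ἠ χάρις τοῦ Κυρίου Ἰησοῦ μετὰ πάντων.",
    bsb="The grace of the Lord Jesus be with all the saints. Amen.",
    examples="(a few aligned target-language/English verse pairs would go here)",
)
print(draft)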
Greek and Hebrew data are sourced from Clear-Bible/macula-greek and Clear-Bible/macula-hebrew, respectively.
Parallel English Bibles are sourced from scrollmapper/bible_databases.
VREF metadata file sourced from BibleNLP/ebible. (Many more languages aligned to the metadata file can be found in this eBible repository.)
MIT License
The following language codes have highly permissive licenses (including derivative and commercial uses):
['aka', 'amo', 'arbnav', 'asmfb', 'ben2017', 'beo', 'bsj', 'cebulb', 'ckb', 'cmnfeb', 'deu1951', 'dji', 'dov', 'eng-t4t', 'engf35', 'engfbv', 'englsv', 'engourb', 'engtcent', 'engULB', 'ewe', 'francl', 'guj2017', 'gux', 'guxg', 'hatbsa', 'hausa', 'hauulb', 'hin2017', 'hun', 'iloulb', 'indags', 'isn', 'jid', 'jni', 'kan2017', 'kbq', 'kik', 'kiz', 'lin', 'lit', 'lug', 'luo', 'mal', 'malc', 'mar', 'ndg', 'npiulb', 'nya', 'ory', 'pan', 'polsz', 'porblt', 'porbr2018', 'portft', 'reg', 'rmyArli', 'rmyChergash', 'rmyGurbet', 'ronBayash', 'ronludari', 'row', 'sanasm', 'sanben', 'sanbur', 'sandev', 'sanguj', 'sanhk', 'sanias', 'saniso', 'sanitr', 'sankhm', 'sanmal', 'sanori', 'sanpun', 'sansin', 'santam', 'santel', 'santha', 'santib', 'sanurd', 'sanvel', 'sbk', 'sbs', 'spabes', 'spapddpt', 'spavbl', 'swhonen', 'swhulb', 'tam2017', 'tczchongthu', 'tel2017', 'tglulb', 'thd', 'twi', 'uigara', 'uigcyr', 'uiglat', 'uigpin', 'urd', 'vieovcb', 'wbi', 'yij', 'yor', 'zgam']
For more detail on each of these, see data/permissive_licensed_translations.csv.
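To check the licensing details for a particular translation programmatically, something like the following works. The column names in the CSV are not documented here, so the lookup line is commented out until you confirm the header.

# Sketch for inspecting the license details CSV; the "code" column used in the
# commented-out lookup is an assumption, so confirm the real header first.
import pandas as pd

licenses = pd.read_csv("data/permissive_licensed_translations.csv")
print(licenses.columns.tolist())                # confirm the actual column names
# print(licenses[licenses["code"] == "amo"])    # example lookup once confirmed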
For the hackathon translation task, we will focus on the Amo New Testament, since we do not have an Old Testament for this language yet.
The Amo language is a Niger-Congo language of Nigeria. More information may be found at Ethnologue and language-archives.org.
The text for this language can be found in the BibleNLP repo here, and a verse-aligned version with the Greek/Hebrew (Macula Greek and Hebrew data) and English (Berean Standard Bible) can be found under data/amo.json.
The verse-aligned file contains a JSON array of triplet objects like this:
{
  "vref": "REV 22:21",
  "bsb": {
    "content": "The grace of the Lord Jesus be with all the saints. Amen."
  },
  "macula": {
    "content": "Ἠ χάρις τοῦ Κυρίου Ἰησοῦ μετὰ πάντων."
  },
  "target": {
    "content": "Na nshew nCikilari Yisa so nin ko ngna mine. Uso nani."
  }
}
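A minimal way to load and inspect these triplets, assuming data/amo.json is a plain JSON array of objects as illustrated above:

# Load the verse-aligned triplets and look at the last record (a sketch).
import json

with open("data/amo.json", encoding="utf-8") as f:
    verses = json.load(f)

print(len(verses))                  # total number of verse records
last = verses[-1]
print(last["vref"], "->", last["target"]["content"])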
Note that the amo.json file will have an empty string in target['content'] for the entire Old Testament. At least, it will until you populate that content!
Because we will not be able to comprehend the output of any generated translations, we will need to rely on techniques such as withholding test data from the New Testament, or other validation methods you might come up with.
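For example, one simple way to set up such a held-out set is to sample from the verses that already have target text; the 90/10 split and the seed below are arbitrary choices for illustration, not hackathon requirements.

# Sketch of the withholding idea: restrict to verses with target text (the New
# Testament) and hold out a random 10% for validation. Split ratio and seed are
# arbitrary illustrative choices.
import json
import random

with open("data/amo.json", encoding="utf-8") as f:
    verses = json.load(f)

nt_verses = [v for v in verses if v["target"]["content"].strip()]

random.seed(42)
random.shuffle(nt_verses)
cutoff = int(0.9 * len(nt_verses))
train_verses, held_out = nt_verses[:cutoff], nt_verses[cutoff:]
print(f"{len(train_verses)} training verses, {len(held_out)} held out for validation")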
For questions about this code or data, please contact Ryder Wishart.
For questions about the hackathon, please contact Jeremy Hodes or Mark Woodward (SIL), or Chris Priebe (Microsoft).