56 changes: 50 additions & 6 deletions README.md
@@ -18,11 +18,36 @@ There are three subtasks, which will be scored separately. Participant teams may

In all three subtasks, the data is randomly split into training (80%), development (10%), and testing (10%) data.

(N.B. Due to quality concerns, the originally planned Devanagari-based subset was removed from the task. We will release this dataset later for community use.)

#### Subtask 1: Multilingual G2P (Shared Orthography):

Participants will be provided with three multilingual training sets. Each set is composed of languages that share an orthography (Roman, Cyrillic, or Arabic). For evaluation, each training set is paired with a test set containing samples from the training languages plus one additional, unseen language that uses the same orthography. Models in this track are evaluated per training-test pairing (i.e. models are trained on one orthography at a time).

The evaluation languages for Task 1 are (surprise language in *italics*):

Abjad:
- Modern Standard Arabic
- Farsi
- Pashto
- Urdu
- *South Levantine Arabic*

Cyrillic:
- Bulgarian
- Macedonian
- Russian
- Ukrainian
- *Serbo-Croatian*

Latin:
- English
- Indonesian
- Spanish
- Tagalog
- *Hungarian*

Data for Subtask 1 is found in the folder `data/eval/task_1`. Each evaluation set is grouped by orthography system. (i.e. `{latin/cyrillic/abjad}_test.tsv`.) We also provide per language evaluation sets.
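
As a minimal sketch of how one of these evaluation files might be read, the snippet below parses each line into a word and a list of phone symbols. It assumes two tab-separated columns (orthographic form, then a space-delimited transcription), which matches the rows visible in `abjad_test.tsv` in this diff. The function name `read_g2p_tsv` is hypothetical and is not part of the repository.

```python
import io
from typing import Iterable, Iterator, List, Tuple


def read_g2p_tsv(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yields (word, phones) pairs from a two-column TSV.

    Assumption: column 1 is the orthographic form and column 2 is a
    space-delimited phone sequence, as in the sample rows of the test sets.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # Skips blank lines.
        word, transcription = line.split("\t", 1)
        yield word, transcription.split()


# Two rows copied from the abjad test set, used here as in-memory input.
sample = "تحت\tt a ħ t\nبنك\tb a n k\n"
pairs = list(read_g2p_tsv(io.StringIO(sample)))
```

For a file on disk, the same function can be given an open handle, e.g. `open(path, encoding="utf-8")`.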

#### Subtask 2: Multilingual G2P (Restricted Orthography):

@@ -34,9 +59,28 @@ Data for Subtask 2 is found in the folder `data/eval/task_2`. Each evaluation se

This task will function similarly to Task 2, but all unseen languages in the test set will be replaced with languages whose orthographies are distinct from every script present in the preceding two tasks. For fairness, script systems will be chosen such that they are functionally similar to the training scripts (e.g. the unseen Arabic-script languages will be replaced with languages written in another abjad).

Abjad:
- Modern Standard Arabic
- Farsi
- Pashto
- Urdu
- *Hebrew*

Cyrillic:
- Bulgarian
- Macedonian
- Russian
- Ukrainian
- *Romanian*

Latin:
- English
- Indonesian
- Spanish
- Tagalog
- *Mongolian*

Data for Subtask 3 is found in the folder `data/eval/task_3`. Each evaluation set is grouped by orthography system. (i.e. `{latin/cyrillic/abjad}_test.tsv`.) We also provide per language evaluation sets.

## Evaluation
The metric used to rank systems is word error rate (WER), the percentage of words for which the hypothesized transcription sequence does not match the gold transcription. Following common practice, this value is reported as the error fraction multiplied by 100 (e.g. 13.53). In the medium- and low-frequency tasks, WER is macro-averaged across all ten languages. We provide two Python scripts for evaluation:
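
Separately from the provided scripts, the WER definition above can be illustrated with a short sketch. It is not the repository's evaluation code: the function names `wer` and `macro_wer` and the toy language labels are hypothetical, and the only assumption is that a word counts as an error when its hypothesized phone sequence differs from the gold sequence in any way.

```python
from typing import Dict, List, Sequence, Tuple

# A (gold, hypothesis) pair of phone sequences for a single word.
Pair = Tuple[Sequence[str], Sequence[str]]


def wer(pairs: List[Pair]) -> float:
    """Returns the percentage of words whose hypothesis differs from gold."""
    if not pairs:
        raise ValueError("WER is undefined for an empty evaluation set")
    errors = sum(1 for gold, hypo in pairs if list(gold) != list(hypo))
    return 100.0 * errors / len(pairs)


def macro_wer(by_language: Dict[str, List[Pair]]) -> float:
    """Averages per-language WER so that each language counts equally."""
    scores = [wer(pairs) for pairs in by_language.values()]
    return sum(scores) / len(scores)


# Toy data: language "a" has 1 error in 4 words, language "b" 1 in 2.
toy = {
    "a": [
        ("t a ħ t".split(), "t a ħ t".split()),
        ("b a n k".split(), "b a n k".split()),
        ("m aː t".split(), "m aː t".split()),
        ("d aː r".split(), "d a r".split()),  # Vowel length is wrong.
    ],
    "b": [
        ("k iː s".split(), "k iː s".split()),
        ("m oː z".split(), "m uː z".split()),  # Vowel quality is wrong.
    ],
}
per_language = {lang: wer(pairs) for lang, pairs in toy.items()}
overall = macro_wer(toy)
```

Macro-averaging weights every language equally regardless of test-set size, so in this toy case the result is the mean of the two per-language scores rather than the pooled error rate.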
@@ -67,8 +111,8 @@ For baseline architectures, we are hosting a fork of the City University of New

Model architectures are largely equivalent and are intended to be run on a conventional consumer-level GPU or CPU. Interested participants are welcome to build on the architectures or use other models available in the codebase (see `yoyodyne/README.md` for further details).

All models may be evaluated using the `yoyodyne/examples/baselines/predict.sh` script, requiring only changes to the `arch` flag.

## Organizers
The task is organized by members of the Computational Linguistics Lab at the [Graduate Center, City University of New York](https://www.gc.cuny.edu/) and the [University of British Columbia]().

200 changes: 200 additions & 0 deletions data/eval/task_1/abjad/abjad_test.tsv
@@ -1,3 +1,203 @@
توفى t w a f f a
غنم ɣ a n a m
صار sˤ aː r
دهن d a h a n
مرا m a r a
باحث b aː ħ e s
بيض b eː dˤ
فترة f a t r a
كفكف k a f k a f
تحمس t ħ a m m a s
نظف n a dˤ dˤ a f
لبس l a b a s
استقال i s t a q aː l
خصوصا x u sˤ uː sˤ a n
عدسة ʕ a d a s a
صابونة sˤ aː b uː n e
استشار i s t a ʃ aː r
قمل ɡ a m l
أخضر ʔ a x dˤ a r
كوربا k oː r b a
بشتغل b i ʃ t ɣ i l
تحت t a ħ t
طلع tˤ a l l a ʕ
رمى r a m a
كبس k a b a s
تقريبا t a ʔ r iː b a n
تنعش t n a ʃ
شبه ʃ a b a h
وحدة w a ħ d a
تخت t a x t
سريع s a r iː ʕ
مكوى m a k w a
عاند ʕ aː n a d
تغسل t ɣ a s s a l
حمي ħ i m i
اختصر i x t a sˤ a r
كئيب k a ʔ iː b
كب k a b b
مريح m r a j j i ħ
تمن t a m a n
تلاتة t a l aː t e
نمر n i m r
أغرى ʔ a ɣ r a
حق ħ a ʔ ʔ
قوي ɡ a w i
ضايق dˤ aː j a ʔ
زنجبيل z a n ʒ a b iː l
شخصية ʃ a x sˤ i j j a
بنك b a n k
جاهد ʒ aː h a d
قشطة ɡ i ʃ tˤ a
غمس ɣ a m m a s
زحمة z a ħ m a
داق d aː ʔ
أكم ʔ a k a m m
أصبع ʔ u sˤ b a ʕ
أعمى ʔ a ʕ m a
كره k i r i h
باخرة b aː x i r a
ريالة r j aː l a
فرن f u r n
أقلية ʔ a q a l l i j j a
تبرع t b a r r a ʕ
زفت z a f f a t
ناصح n aː sˤ i ħ
درج d a r a ʒ
صليب sˤ a l iː b
بنات b a n aː t
فحم f a ħ m
دفاية d a f f aː j
انضباط i n dˤ i b aː tˤ
حمص ħ a m m a sˤ
بقرة b a ʔ a r a
حفيدة ħ a f iː d a
وقت w a ɡ t
زبدية z i b d i j j a
أعجب ʔ a ʕ ʒ a b
وجع w a ʒ a ʕ
قح ʔ a ħ ħ
عصبي ʕ a sˤ a b i
صيف sˤ eː f
غلي ɣ i l i
بنطلون b a n tˤ a l oː n
القدس i l ʔ u d s
بيبي b eː b i
رز r u z
فطن f i tˤ i n
بخبخ b a x b a x
بلتقي b i l t ʔ i
كيس k iː s
فرحان f a r ħ aː n
لطيف l a tˤ iː f
روح r a w w a ħ
فلفل f i l f i l
ألهى ʔ a l h a
الشام ʃ ʃ aː m
راحة r aː ħ a
قلد ɡ a l l a d
حب ħ a b b
تدخل t d a x x a l
وجبة w a ʒ b a
أغلبية ʔ a ɣ l a b i j j a
فستق f u s t o ʔ
الخليل i l x a l iː l
مشغول m a ʃ ɣ uː l
وعد w a ʕ a d
فرنساوي f r a n s aː w i
محل m a ħ a l l
جزيرة ʒ a z iː r a
دور d a w w a r
بوليس b uː l iː s
تحارب t ħ aː r a b
غيمة ɣ eː m a
نص n u sˤ sˤ
جاع ʒ aː ʕ
جوال ʒ a w w aː l
أمل ʔ a m a l
عرق ʕ i r i ʔ
ملى m a l l a
نحيف n ħ iː f
اختار i x t aː r
هاجر h aː ʒ a r
خرافة x u r aː f a
موضوع m a w dˤ uː ʕ
بذر b i z r
شغيل ʃ a ɣ ɣ iː l
سير s eː r
اعتزل i ʕ t a z a l
نازل n aː z i l
بلف b a l a f
جهد ʒ u h d
بحتفل b i ħ t f i l
جعلك ʒ a ʕ l a k
صدر sˤ a d a r
سدس s u d s
كيلومتر k iː l o m i t r
لحمة l a ħ m a
ربح r i b i ħ
شبع ʃ i b i ʕ
رقيق r ʔ iː ʔ
خمس x a m s
تصالح t sˤ aː l a ħ
نقل n a ɡ a l
بعبش b a ʕ b a ʃ
جدول ʒ a d w a l
مات m aː t
بوط b oː tˤ
قلب ɡ a l a b
فوضى f a w dˤ a
قهوة ʔ a h w a
مكورن m k oː r i n
تنتين t i n t eː n
قرا ɡ a r a
زتون z a t uː n
عمتو ʕ a m m t o
ميدان m iː d aː n
زعتر z a ʕ t a r
انصلب n sˤ a l a b
أخفى ʔ a x f a
طير tˤ iː r
مكتبة m a k t a b a
مفصل m a f sˤ a l
قطعة q i tˤ ʕ a
بطن b a tˤ n
مدفع m a d f a ʕ
تقدم t ɡ a d d a m
رتب r a t t a b
ماضي m aː dˤ i
تحدث t ħ a d d a s
خس x a s s
شجع ʃ a ʒ ʒ a ʕ
رحلة r i ħ l a
يعني j a ʕ n i
أدن ʔ a d d a n
فضى f a dˤ dˤ a
رقص r a ɡ a sˤ
أهلا ʔ a h l a
ضمن dˤ a m a n
تبدل t b a d d a l
سوا s a w a
سباتي s b aː t i
راتب r aː t e b
حامي ħ aː m i
دار d aː r
شاب ʃ a b
كاش k aː ʃ
شيكل ʃ eː k i l
كوع k a w w a ʕ
موز m oː z
قلم ɡ a l a m
قفل ʔ a f f a l
كلية k i l j e
لفت l a f a t
شفي ʃ i f i
أصنصير ʔ a sˤ a n sˤ eː r
نرويج n a r w iː ʒ
بدوي b a d a w i
ببتسم b i b t s i m
بد b i d d
تأخر t ʔ a x x a r
اشتاق i ʃ t aː q a
صحصح sˤ a ħ sˤ a ħ
أنموذج ʔ u n m uː ð a d͡ʒ