
SAA: use RTF transcriptions as canonical IPA, gated by IPA validation#16

Open: arunasrivastava wants to merge 1 commit into main from fix/saa-rtf-parsing

arunasrivastava (Collaborator) commented Jun 27, 2026

Speech Accent Archive samples now derive `ipa` from the cleaned RTF transcription files instead of the TextGrid phone tiers when a usable RTF exists. TextGrids still provide timestamps, audio alignment, and text. A pure-Python RTF parser extracts the bracketed IPA, and an allowlist validation gate adopts it only when it parses cleanly and contains only IPA characters; otherwise the sample falls back to the TextGrid-derived IPA and a warning is logged. Parse failures can't crash iteration. RTF is skipped entirely when `max_phonemes` is set, and metadata exposes `ipa_source` and `rtf_transcript_path` for provenance.

Before merging, note two caveats. First, 41 files hit a legacy 8-bit byte artifact that no documented mapping recovers (verified against SIL's own SIL-IPA93-2001.map), so they fall back to the TextGrid IPA. The dataset therefore mixes two IPA conventions; check `ipa_source` to tell them apart. Second, for RTF-sourced samples `ipa` no longer matches `timestamped_phonemes`.

Speech Accent Archive samples now derive `ipa` from the cleaned RTF
transcription files instead of the TextGrid phone tiers when available.
TextGrids are still used for timestamps, audio alignment, and text.

Adds a small pure-Python RTF parser (Unicode \uNNNN escapes, cp1252 /
Mac Roman hex escapes, uc-skip fallback handling, last-bracket IPA
extraction with an unbracketed fallback) plus an IPA allowlist
validation gate: RTF IPA replaces the TextGrid IPA only when it parses
cleanly and contains only IPA characters, otherwise it falls back to the
TextGrid-derived IPA and warns. RTF parsing failures can no longer crash
iteration. RTF IPA is skipped when `max_phonemes` is set (truncating rich
IPA without alignment is unsafe).
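
For reviewers unfamiliar with RTF escapes, the parsing steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `rtf_to_text` and `extract_ipa` are hypothetical, header destinations such as font tables are not stripped, `\uc` is not scoped per group, and the unbracketed fallback is assumed to return the whole decoded text.

```python
import re
from typing import Optional

# Hypothetical tokenizer for illustration; the PR's real parser may differ.
_TOKEN = re.compile(
    r"\\u(-?\d+) ?"               # 1: \uN Unicode escape (signed 16-bit decimal)
    r"|\\'([0-9a-fA-F]{2})"       # 2: \'hh 8-bit hex escape
    r"|\\uc(\d+) ?"               # 3: \ucN, fallback chars that follow each \uN
    r"|\\([a-zA-Z]+)(-?\d+)? ?"   # 4, 5: any other control word and its parameter
    r"|\\([\\{}])"                # 6: escaped literal backslash or brace
    r"|([{}])"                    # 7: group delimiters
    r"|([^\\{}]+)"                # 8: plain text run
    r"|\\."                       # other control symbols, ignored
)


def rtf_to_text(rtf: str, hex_codec: str = "cp1252") -> str:
    """Decode RTF escapes to text. Pass hex_codec="mac_roman" for Mac files."""
    out = []
    uc = 1    # RTF default: one fallback character after each \uN
    skip = 0  # fallback characters still to be dropped
    for m in _TOKEN.finditer(rtf):
        uni, hexbyte, ucn, word, _param, lit, brace, text = m.groups()
        if uni is not None:
            code = int(uni)
            out.append(chr(code + 65536 if code < 0 else code))
            skip = uc
        elif hexbyte is not None:
            if skip:
                skip -= 1  # this hex escape was only a fallback for \uN
            else:
                raw = bytes.fromhex(hexbyte)
                out.append(raw.decode(hex_codec, errors="replace"))
        elif ucn is not None:
            uc = int(ucn)
        elif word is not None:
            if word in ("par", "line"):
                out.append("\n")
        elif lit is not None:
            if skip:
                skip -= 1
            else:
                out.append(lit)
        elif brace is not None:
            skip = 0  # a group boundary ends any pending fallback skip
        elif text is not None:
            text = text.replace("\r", "").replace("\n", "")
            dropped = min(skip, len(text))
            skip -= dropped
            out.append(text[dropped:])
    return "".join(out)


def extract_ipa(text: str) -> Optional[str]:
    """Return the contents of the last [...] span, else the stripped text."""
    spans = re.findall(r"\[([^\[\]]*)\]", text)
    candidate = spans[-1] if spans else text
    return candidate.strip() or None
```

Taking the last bracketed span means stray header text that survives decoding does not leak into the result, which is why this sketch can get away without destination handling. Undecodable hex bytes become U+FFFD, so a downstream allowlist check can reject them instead of the parser raising.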

Metadata exposes `ipa_source` ("rtf"/"textgrid") and `rtf_transcript_path`
for provenance.
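
The gate and provenance fields described above might look something like the sketch below. Everything here is an assumption for illustration: `is_valid_ipa` and `choose_ipa` are hypothetical names, the allowlist is a small stand-in (the 26 lowercase Latin letters, a few extra letters, and the IPA Extensions, Spacing Modifier Letters, and Combining Diacritical Marks blocks), and setting `rtf_transcript_path` on fallback is a guess.

```python
import logging
from typing import Dict, Optional, Tuple

logger = logging.getLogger(__name__)

# Illustrative allowlist only; the PR's actual allowlist is not reproduced here.
_ALLOWED_RANGES = [
    (0x0250, 0x02AF),  # IPA Extensions
    (0x02B0, 0x02FF),  # Spacing Modifier Letters (length and stress marks)
    (0x0300, 0x036F),  # Combining Diacritical Marks
]
_ALLOWED_CHARS = set("abcdefghijklmnopqrstuvwxyz" "æçðøħŋœβθχ" " .")


def is_valid_ipa(ipa: str) -> bool:
    """True only if the string is non-blank and every character is allowed."""
    if not ipa or not ipa.strip():
        return False
    return all(
        ch in _ALLOWED_CHARS
        or any(lo <= ord(ch) <= hi for lo, hi in _ALLOWED_RANGES)
        for ch in ipa
    )


def choose_ipa(
    textgrid_ipa: str,
    rtf_ipa: Optional[str],
    rtf_path: Optional[str],
    max_phonemes: Optional[int] = None,
) -> Tuple[str, Dict[str, Optional[str]]]:
    """Pick the canonical IPA string and report where it came from."""
    meta = {"ipa_source": "textgrid", "rtf_transcript_path": rtf_path}
    # Truncating RTF IPA without alignment is unsafe, so skip it entirely.
    if max_phonemes is not None or rtf_ipa is None:
        return textgrid_ipa, meta
    if is_valid_ipa(rtf_ipa):
        meta["ipa_source"] = "rtf"
        return rtf_ipa, meta
    logger.warning("RTF IPA failed validation for %s; using TextGrid", rtf_path)
    return textgrid_ipa, meta
```

Downstream code that needs a single IPA convention, or needs `ipa` to stay aligned with the timestamped phones, can filter on `ipa_source == "textgrid"` in the returned metadata.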