SAA: use RTF transcriptions as canonical IPA, gated by IPA validation#16
Open
arunasrivastava wants to merge 1 commit into
Speech Accent Archive samples now derive `ipa` from the cleaned RTF
transcription files instead of the TextGrid phone tiers when available.
TextGrids are still used for timestamps, audio alignment, and text.
Adds a small pure-Python RTF parser (Unicode \uNNNN escapes, cp1252 /
Mac Roman hex escapes, uc-skip fallback handling, last-bracket IPA
extraction with an unbracketed fallback) plus an IPA allowlist
validation gate. RTF IPA replaces the TextGrid IPA only when it parses
cleanly and contains only IPA characters; otherwise the sample falls
back to the TextGrid-derived IPA and a warning is emitted. RTF parsing
failures can no longer crash iteration. RTF IPA is skipped when
`max_phonemes` is set, because truncating rich IPA without alignment is
unsafe.
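
As a rough illustration of the parsing steps listed above, here is a minimal sketch of decoding RTF escapes and pulling out the last bracketed span. It is not the parser from this PR: the names `rtf_to_text` and `extract_ipa` are hypothetical, `\uN` is read as a signed decimal code followed by `\ucN` fallback characters, and header text such as font tables is not stripped (the sketch relies on the last-bracket rule to ignore it).

```python
import re

# Hypothetical helpers for illustration; not the PR's implementation.
_CONTROL_WORD = re.compile(r"\\([a-z]+)(-?\d+)? ?")


def rtf_to_text(rtf: str, hex_codec: str = "cp1252") -> str:
    """Decode \\uN and \\'hh escapes; drop braces and other control words."""
    out = []
    uc = 1      # fallback characters that follow each \uN (set by \ucN)
    skip = 0    # fallback characters still to be dropped
    i = 0
    while i < len(rtf):
        ch = rtf[i]
        if ch == "\\":
            if rtf.startswith("\\'", i):
                # 8-bit hex escape, decoded with the chosen legacy codec.
                # Malformed hex raises ValueError; callers should catch it.
                byte = bytes.fromhex(rtf[i + 2:i + 4])
                i += 4
                if skip:
                    skip -= 1
                else:
                    out.append(byte.decode(hex_codec, errors="replace"))
                continue
            m = _CONTROL_WORD.match(rtf, i)
            if m:
                word, arg = m.group(1), m.group(2)
                i = m.end()
                if word == "u" and arg is not None:
                    code = int(arg)
                    if code < 0:          # RTF stores a signed 16-bit value
                        code += 65536
                    out.append(chr(code))
                    skip = uc             # drop the ASCII fallback that follows
                elif word == "uc" and arg is not None:
                    uc = int(arg)
                continue
            # Control symbol such as \{ or \\ : keep the escaped character.
            if i + 1 < len(rtf):
                out.append(rtf[i + 1])
            i += 2
            continue
        if ch in "{}":
            skip = 0
        elif ch not in "\r\n":
            if skip:
                skip -= 1
            else:
                out.append(ch)
        i += 1
    return "".join(out)


def extract_ipa(text: str) -> str:
    """Return the last [bracketed] span, or the whole text if none exists."""
    spans = re.findall(r"\[([^\[\]]*)\]", text)
    return spans[-1].strip() if spans else text.strip()


sample = r"{\rtf1\ansi\uc1 [p\u688?iz k\u596?l st\'e6l\u601?]}"
print(extract_ipa(rtf_to_text(sample)))
```

Passing `hex_codec="mac_roman"` switches the 8-bit decoding to Mac Roman; how the real parser picks between the two code pages is not shown here.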
Metadata exposes `ipa_source` ("rtf"/"textgrid") and `rtf_transcript_path`
for provenance.
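
The gate and the `ipa_source` provenance value described above could look roughly like the following. This is a sketch under assumptions: the allowlist contents, the function names (`is_valid_ipa`, `choose_ipa`), and the warning text are all hypothetical, and the PR's actual allowlist is likely more precise.

```python
import warnings
from typing import Callable, Optional, Tuple


def _build_allowlist() -> frozenset:
    # Assumed allowlist: ASCII lowercase, space, a few Latin/Greek IPA
    # letters, plus three Unicode blocks commonly used for IPA.
    allowed = set("abcdefghijklmnopqrstuvwxyz ")
    allowed.update("æçðøħŋœβθχ")
    for lo, hi in ((0x0250, 0x02AF),   # IPA Extensions
                   (0x02B0, 0x02FF),   # Spacing Modifier Letters
                   (0x0300, 0x036F)):  # Combining Diacritical Marks
        allowed.update(chr(c) for c in range(lo, hi + 1))
    return frozenset(allowed)


IPA_ALLOWED = _build_allowlist()


def is_valid_ipa(s: str) -> bool:
    """True when s is non-empty and every character is on the allowlist."""
    return bool(s.strip()) and all(ch in IPA_ALLOWED for ch in s)


def choose_ipa(
    load_rtf_ipa: Callable[[], str],
    textgrid_ipa: str,
    max_phonemes: Optional[int] = None,
) -> Tuple[str, str]:
    """Return (ipa, ipa_source), preferring RTF only when it is safe."""
    if max_phonemes is not None:
        # Truncating unaligned RTF IPA is unsafe, so RTF is skipped.
        return textgrid_ipa, "textgrid"
    try:
        rtf_ipa = load_rtf_ipa()
    except Exception as exc:  # a parse failure must not stop iteration
        warnings.warn(f"RTF parse failed ({exc!r}); using TextGrid IPA")
        return textgrid_ipa, "textgrid"
    if is_valid_ipa(rtf_ipa):
        return rtf_ipa, "rtf"
    warnings.warn("RTF IPA failed allowlist validation; using TextGrid IPA")
    return textgrid_ipa, "textgrid"
```

Taking a loader callable rather than a parsed string keeps the exception handling in one place, so a malformed RTF file degrades to the TextGrid path instead of propagating.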
Speech Accent Archive samples now derive `ipa` from the cleaned RTF transcription files instead of the TextGrid phone tiers when a usable RTF exists, while TextGrids still provide timestamps, audio alignment, and text. A pure-Python RTF parser extracts the bracketed IPA, and an allowlist validation gate adopts it only when it parses cleanly and contains only IPA characters; otherwise the sample falls back to the TextGrid-derived IPA with a warning, so parse failures can't crash iteration. RTF is skipped entirely when `max_phonemes` is set, and metadata exposes `ipa_source` / `rtf_transcript_path` for provenance.

Before merging, note that 41 files hit a legacy 8-bit byte artifact that no documented mapping recovers (verified against SIL's own `SIL-IPA93-2001.map`), so they fall back to the TextGrid IPA. The dataset therefore mixes two IPA conventions (check `ipa_source`), and for RTF-sourced samples `ipa` no longer matches `timestamped_phonemes`.
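
Because the two conventions coexist, downstream code may want to tally or filter by source. A minimal sketch, assuming each sample is a dict with a `metadata` dict carrying `ipa_source` (the sample layout and the function name `count_ipa_sources` are assumptions, not confirmed by this PR):

```python
from collections import Counter
from typing import Iterable, Mapping


def count_ipa_sources(samples: Iterable[Mapping]) -> Counter:
    """Tally samples by metadata['ipa_source'] ('rtf' or 'textgrid')."""
    return Counter(
        s.get("metadata", {}).get("ipa_source", "unknown") for s in samples
    )


def rtf_only(samples: Iterable[Mapping]) -> list:
    """Keep only samples whose IPA came from the RTF transcription."""
    return [
        s for s in samples
        if s.get("metadata", {}).get("ipa_source") == "rtf"
    ]
```

Filtering to a single source avoids mixing conventions in one evaluation, at the cost of dropping the samples that fell back.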