Add CJK IPA Tokenizer #367

Open
@gkielian

Currently making some modifications to the Japanese-to-IPA tokenizer, and noticing there are still a few hiragana types that we'll need to map:

Image

Also switching from the csv library to the pandas library, since the former appears to reach its limit when parsing longer TSV fields (it might also be that some rows aren't being recognized, possibly because they are not in the right format).
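If the limit being hit is the csv module's per-field size limit (an assumption, the failing rows may simply be malformed), it can be raised before parsing. A minimal sketch; `read_tsv_rows` and the `field_limit` default are hypothetical names and values, not part of the existing scripts:

```python
import csv
import io


def read_tsv_rows(text, field_limit=10_000_000):
    """Parse TSV text with a raised per-field size limit.

    field_limit is an illustrative value; csv.field_size_limit() sets the
    limit globally for the csv module, not just for this reader.
    """
    csv.field_size_limit(field_limit)
    # QUOTE_NONE treats quote characters as ordinary text, so a stray '"'
    # inside a field cannot swallow the rows that follow it.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)
```

This only helps with oversized fields; rows in the wrong format would still need separate handling.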

In any case, the affected rows were fewer than 10% of the file, so as a temporary fix I will add a sed command to delete the lines that weren't successfully processed. Then we can focus on marking the rows whose data is not being converted into IPA by the present scripts.
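The pruning step could be sketched in Python as below. This assumes that a row counts as "not parsed" when it has an unexpected number of tab-separated fields; that criterion, the function name, and `expected_fields` are all hypothetical stand-ins for whatever pandas actually rejects:

```python
def split_parseable_rows(lines, expected_fields):
    """Separate well-formed TSV lines from malformed ones.

    Returns (kept_lines, dropped_line_numbers). Line numbers are 1-based,
    so they could be passed to a sed delete expression if desired.
    """
    kept = []
    dropped = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) == expected_fields:
            kept.append(line)
        else:
            dropped.append(line_no)
    return kept, dropped
```

Keeping the dropped line numbers around makes it easy to check that the pruned portion really stays below the 10% mentioned above.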

So currently working on:

  1. PR with the above changes (pruning rows not parsed by pandas)
  2. Marking (with [[[[[ ]]]]]) those sections of words not parsed
  3. Creating an IPA token list that does not count the above words (will use sed to remove those brackets), and starting to collect IPA symbols for the tokenized dataset.
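Steps 2 and 3 can be sketched roughly as follows. The kana-to-IPA table here is a deliberately tiny, hypothetical stand-in for the real mapping, and the sketch assumes the bracketed spans are dropped entirely before symbols are counted:

```python
import re
from collections import Counter

OPEN, CLOSE = "[[[[[", "]]]]]"

# Hypothetical three-entry mapping, for illustration only.
KANA_TO_IPA = {"か": "ka", "な": "na", "た": "ta"}


def mark_unparsed(text, mapping=KANA_TO_IPA):
    """Convert mapped characters to IPA and wrap unmapped runs in brackets."""
    out = []
    pending = []  # consecutive characters with no IPA mapping

    def flush():
        if pending:
            out.append(OPEN + "".join(pending) + CLOSE)
            pending.clear()

    for ch in text:
        if ch in mapping:
            flush()
            out.append(mapping[ch])
        elif ch.isspace():
            flush()
            out.append(ch)  # whitespace passes through unmarked
        else:
            pending.append(ch)
    flush()
    return "".join(out)


MARKED_SPAN = re.compile(re.escape(OPEN) + r".*?" + re.escape(CLOSE), re.DOTALL)


def ipa_symbol_counts(marked_text):
    """Count IPA symbols, ignoring everything inside marked spans."""
    cleaned = MARKED_SPAN.sub("", marked_text)
    return Counter(ch for ch in cleaned if not ch.isspace())
```

The counts give both the IPA symbol inventory and a quick sense of how much text is still falling into the marked spans.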

Note: if the [[[[[ ]]]]] words turn out to be very common, we can still use the combined IPA tokens and use byte fallback for the remainder.
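A minimal sketch of what byte fallback could look like here, assuming ids 0 to 255 are reserved for raw UTF-8 bytes and IPA symbols are numbered from 256 upward. The id layout and function names are hypothetical, not the repository's tokenizer API:

```python
BYTE_VOCAB_SIZE = 256  # ids 0..255 are reserved for raw UTF-8 bytes


def build_vocab(ipa_symbols):
    """Assign ids starting at 256 to the known IPA symbols."""
    return {sym: BYTE_VOCAB_SIZE + i for i, sym in enumerate(sorted(ipa_symbols))}


def encode(text, vocab):
    """Use IPA token ids where possible, UTF-8 byte ids otherwise."""
    ids = []
    for ch in text:
        if ch in vocab:
            ids.append(vocab[ch])
        else:
            ids.extend(ch.encode("utf-8"))
    return ids


def decode(ids, vocab):
    """Invert encode(): byte ids are buffered and decoded as UTF-8."""
    inverse = {token_id: sym for sym, token_id in vocab.items()}
    out = []
    buf = bytearray()
    for token_id in ids:
        if token_id < BYTE_VOCAB_SIZE:
            buf.append(token_id)
        else:
            if buf:
                out.append(buf.decode("utf-8"))
                buf = bytearray()
            out.append(inverse[token_id])
    if buf:
        out.append(buf.decode("utf-8"))
    return "".join(out)
```

With this layout, unmapped words cost several tokens per character but remain fully recoverable, so the marked words do not have to be discarded.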

Semantic Factorization

I also wanted to mention that I am looking forward to adding a learned semantic/position encoding, and here we might be able to add a parallel dataset with the hiragana and kanji types for each of the phonemes.

So the embeddings will look like:

  1. language embedding
  2. (if ja) hiragana
  3. (if ja) kanji
  4. (if zh) tone (finally a way to incorporate tone!)
  5. (if zh) hanzi character (?) or radical x position embeddings
  6. (if ko) hangul glyph (will likely speed up processing of particles)
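One way to picture the parallel datasets behind these embeddings is as aligned streams, one entry per phoneme, with a placeholder where a factor does not apply. This is only a sketch: the class, field names, and `<none>` placeholder are hypothetical, and the radical/position variant of item 5 is not modeled:

```python
from dataclasses import dataclass
from typing import Optional

NONE = "<none>"  # placeholder for factors that do not apply to a token
FEATURES = ("lang", "hiragana", "kanji", "tone", "hanzi", "hangul")


@dataclass(frozen=True)
class FactoredToken:
    ipa: str
    lang: str
    hiragana: Optional[str] = None  # ja only
    kanji: Optional[str] = None     # ja only
    tone: Optional[int] = None      # zh only
    hanzi: Optional[str] = None     # zh only
    hangul: Optional[str] = None    # ko only


def to_parallel_streams(tokens):
    """Build one aligned stream per factor, all the same length as the IPA stream."""
    streams = {"ipa": [t.ipa for t in tokens]}
    for name in FEATURES:
        values = (getattr(t, name) for t in tokens)
        streams[name] = [NONE if v is None else str(v) for v in values]
    return streams


# Example: Japanese 山 (やま) followed by Mandarin 妈 (first tone).
tokens = [
    FactoredToken("j", "ja", hiragana="や", kanji="山"),
    FactoredToken("a", "ja", hiragana="や", kanji="山"),
    FactoredToken("m", "ja", hiragana="ま", kanji="山"),
    FactoredToken("a", "ja", hiragana="ま", kanji="山"),
    FactoredToken("m", "zh", tone=1, hanzi="妈"),
    FactoredToken("a", "zh", tone=1, hanzi="妈"),
]
streams = to_parallel_streams(tokens)
```

Each stream could then index its own learned embedding table, with the per-factor vectors combined at each position.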

Doing this should, in theory, make the mapping reversible without adding specialized numeric tokens, while still factorizing the targets (reducing overhead for the multi-category shadow).
