Character tables: separating language-dependent characters from Formatting Characters #127

Open
@behnam

This is specifically about Section A.5, Control characters.

Issues

  1. The table contains many characters that are not language-dependent; whether they appear in text at all depends on the text format (plain text, HTML, etc.). IMHO, we should not expect CLDR to handle these characters correctly, and the fact that some of them appear in the CLDR exemplar is not enough to make it a reliable source.

  2. The Bidi Directional Formatting Characters are explicitly defined by the UBA (TR9), which is normatively referenced from ALReq in Section 2.3 Direction; therefore no extra source (like the CLDR exemplar) is needed to demonstrate the need for these characters.

  3. ZWJ and ZWNJ are exceptions in this list: they are Joining Control characters, and their usage in the Arabic script is described in Section 2.4 Joining, based on the Unicode Arabic Cursive Joining, another normative reference of ALReq. These characters are expected to be present in the content, and a higher-level protocol is in no way expected to handle them. (NOTE: Maintaining joining during hyphenation, and similar cases, are not cases of ZWNJ/ZWJ in action.)

  4. U+FEFF should be explicitly excluded, regardless of what the ISIRI spec says about it. It is deeply tied to the UTF encodings of Unicode text and has nothing to do with content or anything script-specific.

  5. U+2060 (WORD JOINER), U+2028 (LINE SEPARATOR), and U+2029 (PARAGRAPH SEPARATOR) are the only non-ASCII left-over characters, all in limbo without good documents or a corpus supporting them. I recommend just blacklisting them explicitly (similar to U+FEFF, if that also needs to be blacklisted).

  6. That leaves only CR/LF, which, again, are talked about in the document; their stats are mixed up, and they only add to the confusion. (For example, why is U+0009 TAB not there?)

  7. If we want to keep CR/LF or U+2028/U+2029, IMHO we need a section about them, explaining their use in line breaking and paragraph separation, AND anything platform-specific. Again, IMO, that's out of scope for ALReq.
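
To make the proposed split concrete, here is a minimal Python sketch that sorts characters into the groups discussed in the points above. The group labels and the `classify`/`describe` helpers are hypothetical names chosen for illustration. The bidi set is my assumption of what "Bidi Directional Formatting Characters" covers (the embedding, override, isolate, and implicit mark characters from TR9); it is not copied from the ALReq table.

```python
import unicodedata

# Hypothetical groupings that mirror the numbered points above.
# These are illustrative assumptions, not the contents of Table A.5.
GROUPS = {
    "bidi formatting (UBA)": {
        0x061C, 0x200E, 0x200F,      # ALM, LRM, RLM
        *range(0x202A, 0x202F),      # LRE, RLE, PDF, LRO, RLO
        *range(0x2066, 0x206A),      # LRI, RLI, FSI, PDI
    },
    "joining control": {0x200C, 0x200D},                     # ZWNJ, ZWJ
    "encoding signature": {0xFEFF},                          # BOM
    "unsupported separator/joiner": {0x2060, 0x2028, 0x2029},
    "ASCII line control": {0x000A, 0x000D},                  # LF, CR
}


def classify(ch: str) -> str:
    """Return the group label for a single character, or 'other'."""
    cp = ord(ch)
    for label, members in GROUPS.items():
        if cp in members:
            return label
    return "other"


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME: group'."""
    # ASCII controls have no Unicode name, so fall back to a placeholder.
    name = unicodedata.name(ch, "<control>")
    return f"U+{ord(ch):04X} {name}: {classify(ch)}"


if __name__ == "__main__":
    for ch in "\u200c\u200d\u200f\ufeff\u2060\u2028\r\n\t":
        print(describe(ch))
```

With a split like this, only the "joining control" group would stay in a language-dependent character table. Note that U+0009 TAB falls into "other" here, which is the same gap raised in point 6.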


ACTION-96: Create a github issue regarding table a.5 issues
