tr: imperfect handling of mixed multi-byte UTF-8 character and 8-bit octal operands #365

Open
@andrewliebenow

❯ coreutils printf 'ᚱ \xE1' | ./target/release/tr -d 'ᚱ \341' | bat --plain --show-all
\x9A\xB1

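Expected output is an empty string. ᚱ (U+16B1) is encoded in UTF-8 as the bytes 0xE1 0x9A 0xB1, so its first byte is 0xE1 (225 in decimal, 341 in octal). tr is being asked to delete the character "ᚱ" and, separately, the byte 0xE1 ("\341"). Instead, only the leading byte of ᚱ is removed and its two continuation bytes are left in the output. There may be more bugs of this kind, where the leading byte of a UTF-8 character operand is also present separately as an octal operand.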
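
To make the expected behavior concrete, here is a minimal sketch (not the code in this repository) of a deletion pass that treats valid UTF-8 sequences as whole characters and only consults the byte operands for bytes that are not part of a valid sequence. The names `delete_mixed` and `push_kept` are hypothetical, and splitting the operands into a character set and a byte set is an assumption made for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical helper: delete whole characters found in `chars`, and
/// delete bytes found in `bytes` only when they are not part of a valid
/// UTF-8 sequence in the input.
fn delete_mixed(input: &[u8], chars: &HashSet<char>, bytes: &HashSet<u8>) -> Vec<u8> {
    let mut out = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        match std::str::from_utf8(rest) {
            Ok(valid) => {
                push_kept(valid, chars, &mut out);
                break;
            }
            Err(err) => {
                let (valid, after) = rest.split_at(err.valid_up_to());
                // Safe to unwrap: this prefix was just reported as valid.
                push_kept(std::str::from_utf8(valid).unwrap(), chars, &mut out);
                // error_len() is None for a truncated sequence at end of input.
                let bad_len = err.error_len().unwrap_or(after.len());
                let (bad, tail) = after.split_at(bad_len);
                out.extend(bad.iter().copied().filter(|b| !bytes.contains(b)));
                rest = tail;
            }
        }
    }
    out
}

/// Append every character of `valid` that is not in the deletion set.
fn push_kept(valid: &str, chars: &HashSet<char>, out: &mut Vec<u8>) {
    let mut buf = [0u8; 4];
    for c in valid.chars().filter(|c| !chars.contains(c)) {
        out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
    }
}

fn main() {
    // 'ᚱ' is three bytes in UTF-8, starting with 0xE1 (octal 341).
    assert_eq!("ᚱ".as_bytes(), &[0xE1u8, 0x9A, 0xB1]);

    // Same input as the report: 'ᚱ', a space, then a lone 0xE1 byte.
    let input = b"\xE1\x9A\xB1 \xE1";
    let chars: HashSet<char> = ['ᚱ', ' '].iter().copied().collect();
    let bytes: HashSet<u8> = [0o341u8].iter().copied().collect();

    let out = delete_mixed(input, &chars, &bytes);
    println!("{:?}", out); // prints []
}
```

With this approach the leading byte of ᚱ is never matched against "\341" on its own, because it is consumed as part of the character before the byte operands are considered.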

These bugs are unlikely to be practically significant, and can easily be worked around by splitting the deletion into two passes, one for the UTF-8 character and one for the octal operand. For instance:

❯ coreutils printf 'ᚱ \xE1' | ./target/release/tr -d 'ᚱ' | ./target/release/tr -d ' \341' | bat --plain --show-all
# No output

Most implementations of tr handle either only binary data or only UTF-8 data, so this is a minor limitation in processing both at the same time.

As such, this could probably be marked as an enhancement rather than a bug.

Labels: bug