Skip to content

text_format parsing does not correctly handle non-ascii chars? #729

@cmyr

Description

@cmyr

It's possible I'm misreading this, but I'm running into issues trying to read text-format protobuf files that contain string literals with non-ascii characters. Reading through the source, the following really does not look correct:

pub fn next_byte_value(&mut self) -> LexerResult<u8> {
match self.next_char()? {
'\\' => {
match self.next_char()? {
'\'' => Ok(b'\''),
'"' => Ok(b'"'),
'\\' => Ok(b'\\'),
'a' => Ok(b'\x07'),
'b' => Ok(b'\x08'),
'f' => Ok(b'\x0c'),
'n' => Ok(b'\n'),
'r' => Ok(b'\r'),
't' => Ok(b'\t'),
'v' => Ok(b'\x0b'),
'x' => {
let d1 = self.next_hex_digit()? as u8;
let d2 = self.next_hex_digit()? as u8;
Ok(((d1 << 4) | d2) as u8)
}
d if d >= '0' && d <= '7' => {
let mut r = d as u8 - b'0';
for _ in 0..2 {
match self.next_octal_digit() {
Err(_) => break,
Ok(d) => r = (r << 3) + d as u8,
}
}
Ok(r)
}
// https://github.com/google/protobuf/issues/4562
// TODO: overflow
c => Ok(c as u8),
}
}
'\n' | '\0' => Err(LexerError::IncorrectInput),
// TODO: check overflow
c => Ok(c as u8),

basically this is consuming chars but only ever returning bytes, and it is converting a char (which represents a unicode scalar value) directly into a u8 which will then be interpreted as utf-8; but outside of ascii the integer value of a char does not correspond to the utf-8 encoding of a char. (for instance the char À has a unicode scalar value of 192, but is encoded as 0xC3, 0x80 in utf-8).

I don't think this is hard to fix; you just need to stay in chars the whole time, and avoid converting to bytes. Given that the text_format input is always valid utf-8 (since you parse from &str, which is always valid utf-8) it should not be possible for a string literal to ever not be valid utf-8.

I'm going to go ahead and write a patch for this and PR it preemptively, since I think it should be relatively trivial; will figure out a test case as well.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions