-
Notifications
You must be signed in to change notification settings - Fork 393
Description
It's possible I'm misreading this, but I'm running into issues trying to read text-format protobuf files that contain string literals with non-ascii characters. Reading through the source, the following really does not look correct:
rust-protobuf/protobuf-support/src/lexer/lexer_impl.rs
Lines 443 to 479 in 16c9dc5
pub fn next_byte_value(&mut self) -> LexerResult<u8> { | |
match self.next_char()? { | |
'\\' => { | |
match self.next_char()? { | |
'\'' => Ok(b'\''), | |
'"' => Ok(b'"'), | |
'\\' => Ok(b'\\'), | |
'a' => Ok(b'\x07'), | |
'b' => Ok(b'\x08'), | |
'f' => Ok(b'\x0c'), | |
'n' => Ok(b'\n'), | |
'r' => Ok(b'\r'), | |
't' => Ok(b'\t'), | |
'v' => Ok(b'\x0b'), | |
'x' => { | |
let d1 = self.next_hex_digit()? as u8; | |
let d2 = self.next_hex_digit()? as u8; | |
Ok(((d1 << 4) | d2) as u8) | |
} | |
d if d >= '0' && d <= '7' => { | |
let mut r = d as u8 - b'0'; | |
for _ in 0..2 { | |
match self.next_octal_digit() { | |
Err(_) => break, | |
Ok(d) => r = (r << 3) + d as u8, | |
} | |
} | |
Ok(r) | |
} | |
// https://github.com/google/protobuf/issues/4562 | |
// TODO: overflow | |
c => Ok(c as u8), | |
} | |
} | |
'\n' | '\0' => Err(LexerError::IncorrectInput), | |
// TODO: check overflow | |
c => Ok(c as u8), |
basically this is consuming chars
but only ever returning bytes, and it is converting a char
(which represents a unicode scalar value) directly into a u8
which will then be interpreted as utf-8
; but outside of ascii the integer value of a char
does not correspond to the utf-8 encoding of a char
. (for instance the char À
has a unicode scalar value of 192, but is encoded as 0xC3, 0x80
in utf-8).
I don't think this is hard to fix; you just need to stay in chars
the whole time, and avoid converting to bytes. Given that the text_format input is always valid utf-8 (since you parse from &str
, which is always valid utf-8) it should not be possible for a string literal to ever not be valid utf-8.
I'm going to go ahead and write a patch for this and PR it preemptively, since I think it should be relatively trivial; will figure out a test case as well.