Problem with String::from_utf8 #54845
Comments
Well, I've spent too much time staring at the UTF-8 Wikipedia article and the bytes in total. In the end, I've minimized the bytes to the character that throws the error and wrote some code to investigate it. Note that I'm using underscores in the literals to separate the UTF-8 encoding bits from the character value bits.

    fn main() {
        let utf8: Vec<u8> = vec![237, 166, 164];
        let utf8_b: Vec<u8> = vec![
            0b1110_1101,
            0b10_100110,
            0b10_100100,
        ];
        {
            let codepoint = 0b1101_100110_100100;
            println!("U+{:X}", codepoint);
        }
        assert_eq!(utf8, utf8_b);
        println!("{:?}", String::from_utf8(utf8));
    }

The codepoint it prints out is U+D9A4, which is "invalid since they are reserved for UTF-16 surrogate halves". Strings in JavaScript are UTF-16, so it makes sense they would have surrogate half codepoints in them.
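The bit arithmetic in that comment can also be done by a small decoder instead of by hand. The following is only a sketch: `decode_three_byte` and `is_surrogate` are hypothetical helper names, not part of the standard library, and the decoder deliberately skips the surrogate check so that the raw code point can be inspected.

```rust
/// Decode a three-byte sequence laid out as 1110xxxx 10xxxxxx 10xxxxxx
/// into its raw code point. Hypothetical helper: it only checks the bit
/// patterns and does NOT reject surrogates or overlong forms.
fn decode_three_byte(bytes: [u8; 3]) -> Option<u32> {
    let [b0, b1, b2] = bytes;
    // Lead byte must start with 1110, continuation bytes with 10.
    if b0 & 0b1111_0000 != 0b1110_0000
        || b1 & 0b1100_0000 != 0b1000_0000
        || b2 & 0b1100_0000 != 0b1000_0000
    {
        return None;
    }
    // Concatenate the 4 + 6 + 6 payload bits.
    Some(((b0 as u32 & 0x0F) << 12) | ((b1 as u32 & 0x3F) << 6) | (b2 as u32 & 0x3F))
}

/// Surrogate code points occupy U+D800..=U+DFFF.
fn is_surrogate(cp: u32) -> bool {
    (0xD800..=0xDFFF).contains(&cp)
}

fn main() {
    let cp = decode_three_byte([237, 166, 164]).unwrap();
    println!("U+{:X}, surrogate: {}", cp, is_surrogate(cp));

    // `char` only holds Unicode scalar values, so a surrogate is rejected.
    assert!(char::from_u32(cp).is_none());
    // The strict standard-library decoder rejects the bytes as well.
    assert!(String::from_utf8(vec![237, 166, 164]).is_err());
}
```

The same helper applied to `[228, 166, 164]`, the first non-ASCII character of the reported string, yields U+49A4, which is outside the surrogate range and therefore decodes fine.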
Thanks @Havvy. So do I understand this correctly that the byte sequence
Ok, I think I understand it now: It is a valid UTF-8 encoding of a Unicode code point, but the code point happens not to be a valid Unicode scalar value (which is what rust
Follow-up question (if you don't mind me derailing this): Wikipedia says:
Does that mean that Node.js decoding a buffer of those UTF-8 bytes into a string without complaining is a Unicode violation?
And for any future travelers: This is what WTF-8 is all about.
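For context, WTF-8 extends UTF-8 so that unpaired surrogates, which can occur in UTF-16 strings such as JavaScript's, get a three-byte encoding using the ordinary UTF-8 bit layout. The sketch below is an illustration under that assumption only: `encode_three_byte` is a hypothetical helper and not a complete WTF-8 encoder, and it makes no claim about how any JavaScript engine produced the bytes in this issue.

```rust
/// Encode a code point in U+0800..=U+FFFF as three bytes using the UTF-8
/// bit layout, WITHOUT rejecting surrogates. Hypothetical helper for
/// illustration; a real WTF-8 encoder covers all lengths and also
/// combines surrogate pairs into one four-byte sequence.
fn encode_three_byte(cp: u32) -> Option<[u8; 3]> {
    if !(0x0800..=0xFFFF).contains(&cp) {
        return None;
    }
    Some([
        0b1110_0000 | (cp >> 12) as u8,
        0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
        0b1000_0000 | (cp & 0x3F) as u8,
    ])
}

fn main() {
    // A lone (unpaired) surrogate code unit, as a UTF-16 string may contain.
    let lone_surrogate: u16 = 0xD9A4;

    // Rust refuses to build a String from UTF-16 with an unpaired surrogate.
    assert!(String::from_utf16(&[lone_surrogate]).is_err());

    // Encoding it anyway gives the three bytes discussed in this thread.
    let bytes = encode_three_byte(lone_surrogate as u32).unwrap();
    println!("{:?}", bytes);

    // Those bytes are not well-formed UTF-8, so strict decoding fails.
    assert!(std::str::from_utf8(&bytes).is_err());
}
```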
To confirm with a quote more authoritative than Wikipedia or what happens to be in an implementation: https://www.unicode.org/versions/Unicode11.0.0/ch03.pdf#G31703
Conformance condition C10:
However, an ill-formed byte sequence does not necessarily make the entire decoding fail. That’s one conforming handling (and what
By the way, � /
Yeah, I got confused regarding
Thank you both so much for taking the time to help clarify this.
The byte sequence [34, 228, 166, 164, 110, 237, 166, 164, 44, 34] ("䦤n���,"; the quotes are part of the string itself) is considered valid UTF-8 by ECMAScript (or at least Node.js and Firefox), but not by the Rust standard library.

Not knowing enough about Unicode and UTF-8, I'm just assuming that Rust is doing this incorrectly, since both V8 and SpiderMonkey accept it as valid UTF-8.

JSON.parse('"䦤n���,"') in JavaScript returns a string, whereas in Rust:

rustc --version --verbose
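The difference described here can be reproduced with the two standard-library decoders. This is a minimal sketch, not the reporter's original code: `strict_error_offset` and `decode_lossy` are hypothetical wrapper names around `std::str::from_utf8` and `String::from_utf8_lossy`.

```rust
/// Strict decoding: if the bytes are not well-formed UTF-8, return the
/// number of leading bytes that were valid. Hypothetical wrapper.
fn strict_error_offset(bytes: &[u8]) -> Option<usize> {
    std::str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}

/// Lossy decoding: ill-formed parts are replaced with U+FFFD instead of
/// failing the whole conversion. Hypothetical wrapper.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let bytes = [34u8, 228, 166, 164, 110, 237, 166, 164, 44, 34];

    // The strict decoder stops at the first ill-formed byte.
    match strict_error_offset(&bytes) {
        Some(n) => println!("ill-formed after {} valid bytes", n),
        None => println!("well-formed UTF-8"),
    }

    // The lossy decoder keeps going and substitutes replacement characters.
    println!("{}", decode_lossy(&bytes));
}
```

The lossy result contains U+FFFD replacement characters where the surrogate bytes were, which is consistent with the � characters shown in the string above.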