
Problem with String::from_utf8 #54845

Closed

AljoschaMeyer opened this issue Oct 5, 2018 · 8 comments

Comments

@AljoschaMeyer
Contributor

The byte sequence [34, 228, 166, 164, 110, 237, 166, 164, 44, 34] ("䦤n���,", quotes are part of the string itself) is considered valid UTF-8 by ECMAScript (or at least Node.js and Firefox), but not by the Rust standard library.

Not knowing enough about Unicode and UTF-8, I'm just assuming that Rust is doing this incorrectly, since both V8 and SpiderMonkey accept it as valid UTF-8.

JSON.parse('"䦤n���,"') in JavaScript returns a string, whereas in Rust:

println!("{:?}", String::from_utf8(vec![34u8, 228, 166, 164, 110, 237, 166, 164, 44, 34]));

> Err(FromUtf8Error { bytes: [34, 228, 166, 164, 110, 237, 166, 164, 44, 34], error: Utf8Error { valid_up_to: 5, error_len: Some(1) } })
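
As a side note, the error value pinpoints where decoding stopped. A minimal sketch (illustrative, not from the original report) that inspects it:

fn main() {
    let bytes = vec![34u8, 228, 166, 164, 110, 237, 166, 164, 44, 34];
    match String::from_utf8(bytes) {
        Ok(s) => println!("valid: {}", s),
        Err(e) => {
            // FromUtf8Error::utf8_error exposes the underlying Utf8Error.
            let err = e.utf8_error();
            // The first five bytes ("\"䦤n") decode fine; the ill-formed
            // sequence starts at byte index 5 and is one byte long here.
            println!("valid_up_to = {}", err.valid_up_to()); // 5
            println!("error_len   = {:?}", err.error_len()); // Some(1)
        }
    }
}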

rustc --version --verbose

binary: rustc
commit-hash: de3d640f59c4fa4a09faf2a8d6b0a812aaa6d6cb
commit-date: 2018-10-01
host: x86_64-unknown-linux-gnu
release: 1.31.0-nightly
LLVM version: 8.0
@Havvy
Contributor

Havvy commented Oct 5, 2018

Well, I've spent too much time staring at the UTF-8 Wikipedia article and at the bytes. In the end, I minimized the bytes to the single character that throws the error and wrote some code to investigate it. Note that I'm using underscores in the literals to separate the UTF-8 framing bits from the character value bits.

fn main() {
    // The offending bytes, written in decimal...
    let utf8: Vec<u8> = vec![237, 166, 164];

    // ...and in binary, with underscores separating the UTF-8 framing bits
    // (1110 for the lead byte, 10 for each continuation) from the value bits.
    let utf8_b: Vec<u8> = vec![
        0b1110_1101,
        0b10_100110,
        0b10_100100,
    ];

    // Concatenating the value bits yields the encoded code point.
    {
        let codepoint = 0b1101_100110_100100;
        println!("U+{:X}", codepoint);
    }

    assert_eq!(utf8, utf8_b);

    println!("{:?}", String::from_utf8(utf8));
}

The code point it prints is U+D9A4, which is invalid because that range is "reserved for UTF-16 surrogate halves".

Strings in JavaScript are UTF-16, so it makes sense that they can contain surrogate-half code points.
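
(An aside to illustrate the asymmetry: Rust rejects the unpaired surrogate even when decoding UTF-16, whereas JavaScript strings can carry it around freely. A minimal sketch of mine, not from this thread:)

fn main() {
    // 0xD9A4 is a high (leading) surrogate with no trailing surrogate after
    // it. String::from_utf16 refuses such unpaired surrogates.
    assert!(String::from_utf16(&[0xD9A4]).is_err());

    // A properly paired surrogate sequence decodes to a scalar value.
    assert_eq!(String::from_utf16(&[0xD83D, 0xDE00]).unwrap(), "😀"); // U+1F600
}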

@AljoschaMeyer
Contributor Author

Thanks @Havvy. So do I understand correctly that the byte sequence [237, 166, 164] is not valid UTF-8, but JS engines parse it anyway into their internal UTF-16 representation without checking for validity? Or is the byte sequence valid UTF-8 but not valid Unicode? Is that even possible?

@AljoschaMeyer
Contributor Author

OK, I think I understand it now: it follows the UTF-8 encoding scheme for a Unicode code point, but that code point happens not to be a valid Unicode scalar value (which is what Rust chars are).
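
This is easy to check directly, since std::char::from_u32 returns None exactly for values that are not Unicode scalar values. A small illustrative sketch:

fn main() {
    // U+49A4 (䦤, the other multi-byte character in the sequence) is a
    // Unicode scalar value, so a char exists for it.
    assert_eq!(std::char::from_u32(0x49A4), Some('䦤'));
    // U+D9A4 is a surrogate code point, not a scalar value: no char.
    assert_eq!(std::char::from_u32(0xD9A4), None);
}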

Follow-up question (if you don't mind me derailing this):

Wikipedia says:

Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF) are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence.

Does that mean that Node.js decoding a buffer of those UTF-8 bytes into a string without complaining is a Unicode violation?

@AljoschaMeyer
Contributor Author

And for any future travelers: this is what WTF-8 is all about.

@SimonSapin
Contributor

To confirm with a quote more authoritative than Wikipedia or what happens to be in an implementation:

https://www.unicode.org/versions/Unicode11.0.0/ch03.pdf#G31703
Unicode Standard, version 11.0.0, section 3.9, definition D92:

UTF-8 encoding form: The Unicode encoding form that assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length, as specified in Table 3-6 and Table 3-7.

Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.

Conformance condition C10:

When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.

@SimonSapin
Contributor

However, an ill-formed byte sequence does not necessarily make the entire decoding fail. That’s one conforming handling (and what String::from_utf8 does), but another is to replace ill-formed sub-sequences with U+FFFD REPLACEMENT CHARACTER. This is what String::from_utf8_lossy does and what Node’s Buffer.toString('utf8') appears to do.
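
To illustrate on the bytes from the original report (a sketch of mine; Rust replaces each maximal ill-formed subpart with one U+FFFD, so the three-byte sequence [237, 166, 164] yields three replacement characters, matching the string quoted at the top):

fn main() {
    let bytes = vec![34u8, 228, 166, 164, 110, 237, 166, 164, 44, 34];
    let lossy = String::from_utf8_lossy(&bytes);
    // [237, 166, 164] decomposes into three maximal ill-formed subparts,
    // so three U+FFFD characters appear in the output.
    assert_eq!(lossy, "\"䦤n\u{FFFD}\u{FFFD}\u{FFFD},\"");
    println!("{}", lossy);
}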

@SimonSapin
Contributor

By the way, � is what U+FFFD looks like.

@AljoschaMeyer
Contributor Author

Yeah, I got confused regarding JSON.parse because I assumed I was copy-pasting ill-formed strings around, when actually I was only pasting U+FFFD.

Thank you both so much for taking the time to help clarify this.
