
Problem with String::from_utf8 #54845

Closed

AljoschaMeyer opened this issue Oct 5, 2018 · 8 comments

Comments

@AljoschaMeyer
Contributor

The byte sequence [34, 228, 166, 164, 110, 237, 166, 164, 44, 34] ("䦤n���,", quotes are part of the string itself) is considered valid UTF-8 by ECMAScript (or at least Node.js and Firefox), but not by the Rust standard library.

Not knowing enough about Unicode and UTF-8, I'm just assuming that Rust is doing this incorrectly, since both V8 and SpiderMonkey accept it as valid UTF-8.

JSON.parse('"䦤n���,"') in JavaScript returns a string, whereas in Rust:

println!("{:?}", String::from_utf8(vec![34u8, 228, 166, 164, 110, 237, 166, 164, 44, 34]));

> Err(FromUtf8Error { bytes: [34, 228, 166, 164, 110, 237, 166, 164, 44, 34], error: Utf8Error { valid_up_to: 5, error_len: Some(1) } })
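
As a side note, the error value pinpoints where decoding stopped. A minimal sketch (illustrative, not from the original report) that inspects it:

fn main() {
    let bytes = vec![34u8, 228, 166, 164, 110, 237, 166, 164, 44, 34];
    match String::from_utf8(bytes) {
        Ok(s) => println!("valid: {}", s),
        Err(e) => {
            // FromUtf8Error::utf8_error exposes the underlying Utf8Error.
            let err = e.utf8_error();
            // The first five bytes ("\"䦤n") decode fine; the ill-formed
            // sequence starts at byte index 5 and is one byte long here.
            println!("valid_up_to = {}", err.valid_up_to()); // 5
            println!("error_len   = {:?}", err.error_len()); // Some(1)
        }
    }
}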

rustc --version --verbose

binary: rustc
commit-hash: de3d640f59c4fa4a09faf2a8d6b0a812aaa6d6cb
commit-date: 2018-10-01
host: x86_64-unknown-linux-gnu
release: 1.31.0-nightly
LLVM version: 8.0
@Havvy
Contributor

Havvy commented Oct 5, 2018

Well, I've spent too much time staring at the UTF-8 Wikipedia article and at the bytes. In the end, I minimized the bytes to the single character that throws the error and wrote some code to investigate it. Note that I'm using underscores in the literals to separate the UTF-8 framing bits from the character value bits.

fn main() {
    // The offending bytes, written in decimal...
    let utf8: Vec<u8> = vec![237, 166, 164];

    // ...and in binary, with underscores separating the UTF-8 framing bits
    // (1110 for the lead byte, 10 for each continuation) from the value bits.
    let utf8_b: Vec<u8> = vec![
        0b1110_1101,
        0b10_100110,
        0b10_100100,
    ];

    // Concatenating the value bits yields the encoded code point.
    {
        let codepoint = 0b1101_100110_100100;
        println!("U+{:X}", codepoint);
    }

    assert_eq!(utf8, utf8_b);

    println!("{:?}", String::from_utf8(utf8));
}

The code point it prints is U+D9A4, which is invalid because that range is "reserved for UTF-16 surrogate halves".

Strings in JavaScript are UTF-16, so it makes sense that they can contain surrogate-half code points.
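
(An aside to illustrate the asymmetry: Rust rejects the unpaired surrogate even when decoding UTF-16, whereas JavaScript strings can carry it around freely. A minimal sketch of mine, not from this thread:)

fn main() {
    // 0xD9A4 is a high (leading) surrogate with no trailing surrogate after
    // it. String::from_utf16 refuses such unpaired surrogates.
    assert!(String::from_utf16(&[0xD9A4]).is_err());

    // A properly paired surrogate sequence decodes to a scalar value.
    assert_eq!(String::from_utf16(&[0xD83D, 0xDE00]).unwrap(), "😀"); // U+1F600
}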

@AljoschaMeyer
Contributor Author

Thanks @Havvy. So do I understand correctly that the byte sequence [237, 166, 164] is not valid UTF-8, but JS engines parse it anyway into their internal UTF-16 representation without checking for validity? Or is the byte sequence valid UTF-8 but not valid Unicode? Is that even possible?

@AljoschaMeyer
Contributor Author

OK, I think I understand it now: it follows the UTF-8 encoding scheme for a Unicode code point, but that code point happens not to be a valid Unicode scalar value (which is what Rust chars are).
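
This is easy to check directly, since std::char::from_u32 returns None exactly for values that are not Unicode scalar values. A small illustrative sketch:

fn main() {
    // U+49A4 (䦤, the other multi-byte character in the sequence) is a
    // Unicode scalar value, so a char exists for it.
    assert_eq!(std::char::from_u32(0x49A4), Some('䦤'));
    // U+D9A4 is a surrogate code point, not a scalar value: no char.
    assert_eq!(std::char::from_u32(0xD9A4), None);
}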

Follow-up question (if you don't mind me derailing this):

Wikipedia says:

Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF) are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence.

Does that mean that Node.js decoding a buffer of those UTF-8 bytes into a string without complaining is a Unicode violation?

@AljoschaMeyer
Contributor Author

And for any future travelers: this is what WTF-8 is all about.

@SimonSapin
Contributor

To confirm with a quote more authoritative than Wikipedia or what happens to be in an implementation:

https://www.unicode.org/versions/Unicode11.0.0/ch03.pdf#G31703
Unicode Standard, version 11.0.0, section 3.9, definition D92:

UTF-8 encoding form: The Unicode encoding form that assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length, as specified in Table 3-6 and Table 3-7.

Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.

Conformance condition C10:

When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.

@SimonSapin
Contributor

However, an ill-formed byte sequence does not necessarily make the entire decoding fail. That’s one conforming handling (and what String::from_utf8 does), but another is to replace ill-formed sub-sequences with U+FFFD REPLACEMENT CHARACTER. This is what String::from_utf8_lossy does and what Node’s Buffer.toString('utf8') appears to do.
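
To illustrate on the bytes from the original report (a sketch of mine; Rust replaces each maximal ill-formed subpart with one U+FFFD, so the three-byte sequence [237, 166, 164] yields three replacement characters, matching the string quoted at the top):

fn main() {
    let bytes = vec![34u8, 228, 166, 164, 110, 237, 166, 164, 44, 34];
    let lossy = String::from_utf8_lossy(&bytes);
    // [237, 166, 164] decomposes into three maximal ill-formed subparts,
    // so three U+FFFD characters appear in the output.
    assert_eq!(lossy, "\"䦤n\u{FFFD}\u{FFFD}\u{FFFD},\"");
    println!("{}", lossy);
}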

@SimonSapin
Contributor

By the way, � is what U+FFFD looks like.

@AljoschaMeyer
Contributor Author

Yeah, I got confused regarding JSON.parse because I assumed I was copy-pasting ill-formed strings around, when actually I was only pasting U+FFFD.

Thank you both so much for taking the time to help clarify this.
