Buffer.from incorrectly interprets hex and unicode escapes above 0x7f #40032

leithouse · 2021-09-07T22:00:23Z

Version

14.17.3

Platform

Linux 5.11.0-31-generic #33-Ubuntu SMP Wed Aug 11 13:19:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

Buffer

What steps will reproduce the bug?

> process.version
'v14.17.3'
> Buffer.from('\x7f')
<Buffer 7f>
> Buffer.from('\x80')
<Buffer c2 80>
> Buffer.from('\u0080')
<Buffer c2 80>
> Buffer.from('\u{80}')
<Buffer c2 80>
> Buffer.from('\xff')
<Buffer c3 bf>

How often does it reproduce? Is there a required condition?

Always on both Debian Buster and Ubuntu Hirsute Hippo

Buster:
Linux 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18) x86_64 GNU/Linux

Hippo:
Linux 5.11.0-31-generic #33-Ubuntu SMP Wed Aug 11 13:19:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

What is the expected behavior?

For Buffer.from('\x80') to create a single-byte buffer containing 0x80.

What do you see instead?

Buffer.from('\x80') creates a two-byte buffer containing 0xc2 0x80.

Additional information

No response


VoltrexKeyva · 2021-09-07T22:43:09Z

This is intended behavior, not a bug. If you don't pass the encoding you want as the second parameter of Buffer.from(), the string is encoded as UTF-8, and in UTF-8 every code point above 0x7F takes more than one byte. Going by the encoding specs, these are the code point ranges and their encoded sizes:

In UTF-8:

1 byte:       0 -     7F     (ASCII)
2 bytes:     80 -    7FF     (all European plus some Middle Eastern)
3 bytes:    800 -   FFFF     (multilingual plane incl. the top 1792 and private-use)
4 bytes:  10000 - 10FFFF

In UTF-16:

2 bytes:      0 -   FFFF     (multilingual plane, excluding the surrogate range D800 - DFFF)
4 bytes:  10000 - 10FFFF     (encoded as a surrogate pair)

In UTF-32:

4 bytes:      0 - 10FFFF
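
As a rough check of the tables above, the sketch below asks Node how many bytes a few boundary code points take in UTF-8 and UTF-16. The `samples` and `byteCounts` names are illustrative only; `Buffer.byteLength` is the standard Node API. UTF-32 is left out because it is not one of Node's built-in Buffer encodings.

```javascript
// Code points sitting on the range boundaries listed in the tables.
const samples = ['\x7f', '\x80', '\u07ff', '\u0800', '\u{10000}'];

// For each sample, record its code point (hex) and its encoded size in bytes.
const byteCounts = samples.map((s) => ({
  codePoint: s.codePointAt(0).toString(16),
  utf8: Buffer.byteLength(s, 'utf8'),
  utf16le: Buffer.byteLength(s, 'utf16le'),
}));

console.log(byteCounts);
```

The UTF-8 size steps from 1 to 2 bytes exactly at 0x80, which is why Buffer.from('\x80') yields c2 80 when no encoding is given.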

aduh95 · 2021-09-07T22:46:29Z

You can use the second argument to specify the encoding, so you could do:

Buffer.from('80', 'hex') // <Buffer 80>
Buffer.from('\x80', 'binary') // <Buffer 80>

// Or, without using a string
Buffer.from([0x80]) // <Buffer 80>

leithouse · 2021-09-07T22:47:57Z

Aha.

> Buffer.from('\x80','ascii')
<Buffer 80>

Thanks @VoltrexMaster

@aduh95 Thanks for the tip, but the use case intersperses strings and bytes; I had just dumbed it down to the simplest form for the example.
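
For the mixed strings-and-bytes case, one possible approach is sketched below. It assumes the text parts should be UTF-8 and the numeric parts are single raw bytes; `buildBuffer` is a hypothetical helper, not part of Node. The second form assumes every character in the string is at or below U+00FF, so that 'latin1' maps each character to exactly one byte.

```javascript
// Hypothetical helper: strings are encoded as UTF-8, numbers become one raw byte each.
function buildBuffer(parts) {
  return Buffer.concat(
    parts.map((part) =>
      typeof part === 'string' ? Buffer.from(part, 'utf8') : Buffer.from([part])
    )
  );
}

const mixed = buildBuffer(['id=', 0x80, 0xff, ';']);
console.log(mixed); // <Buffer 69 64 3d 80 ff 3b>

// Alternative when all characters are <= U+00FF: one escape per byte, encoded as latin1.
const viaLatin1 = Buffer.from('id=\x80\xff;', 'latin1');
console.log(mixed.equals(viaLatin1)); // true
```

The helper keeps text and raw bytes separate, so non-Latin-1 text in the string parts is still encoded correctly, whereas the 'latin1' form only works while every character fits in one byte.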

VoltrexKeyva added the buffer Issues and PRs related to the buffer subsystem. label Sep 7, 2021

leithouse closed this as completed Sep 7, 2021
