Buffer.from incorrectly interprets hex and unicode escapes above 0x7f #40032

Closed
leithouse opened this issue Sep 7, 2021 · 3 comments
Labels
buffer Issues and PRs related to the buffer subsystem.

Comments

@leithouse

Version

14.17.3

Platform

Linux 5.11.0-31-generic #33-Ubuntu SMP Wed Aug 11 13:19:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

Buffer

What steps will reproduce the bug?

> process.version
'v14.17.3'
> Buffer.from('\x7f')
<Buffer 7f>
> Buffer.from('\x80')
<Buffer c2 80>
> Buffer.from('\u0080')
<Buffer c2 80>
> Buffer.from('\u{80}')
<Buffer c2 80>
> Buffer.from('\xff')
<Buffer c3 bf>

How often does it reproduce? Is there a required condition?

Always on both Debian Buster and Ubuntu Hirsute Hippo

Buster:
Linux 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18) x86_64 GNU/Linux

Hippo:
Linux 5.11.0-31-generic #33-Ubuntu SMP Wed Aug 11 13:19:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

What is the expected behavior?

For Buffer.from('\x80') to create a single byte buffer containing 0x80.

What do you see instead?

Buffer.from('\x80') creates a two-byte buffer containing 0xc2 0x80.

Additional information

No response

@VoltrexKeyva VoltrexKeyva added the buffer Issues and PRs related to the buffer subsystem. label Sep 7, 2021
@VoltrexKeyva
Member

VoltrexKeyva commented Sep 7, 2021

This is intended behavior, not a bug. If you omit the second parameter of Buffer.from(), which specifies the encoding, the string is encoded as UTF-8, and in UTF-8 every code point above 0x7F takes at least two bytes. The encoded size of each code point range in the common Unicode encodings is:

In UTF-8:

1 byte:       0 -     7F     (ASCII)
2 bytes:     80 -    7FF     (all European plus some Middle Eastern)
3 bytes:    800 -   FFFF     (multilingual plane incl. the top 1792 and private-use)
4 bytes:  10000 - 10FFFF

In UTF-16:

2 bytes:      0 -   FFFF     (multilingual plane; D800 - DFFF are reserved for surrogates)
4 bytes:  10000 - 10FFFF     (encoded as a surrogate pair)

In UTF-32:

4 bytes:      0 - 10FFFF
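
The ranges above can be checked from Node.js itself. The following is a minimal sketch using `Buffer.byteLength`; the sample code points are chosen here to sit on the range boundaries and are not from the original report. Node.js has no built-in UTF-32 encoding, so only UTF-8 and UTF-16LE are measured.

```javascript
// One sample code point at each boundary of the UTF-8 table.
const samples = ['\x7f', '\x80', '\u07ff', '\u0800', '\uffff', '\u{10000}'];

// Measure how many bytes each sample occupies in UTF-8 and UTF-16LE.
const sizes = samples.map((ch) => ({
  codePoint: ch.codePointAt(0).toString(16),
  utf8: Buffer.byteLength(ch, 'utf8'),
  utf16le: Buffer.byteLength(ch, 'utf16le'),
}));

console.table(sizes);
```

The `utf8` column should step from 1 to 2 at 0x80, which is the jump seen in the report.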

@aduh95
Contributor

aduh95 commented Sep 7, 2021

You can use the second argument to specify the encoding, so you could do:

Buffer.from('80', 'hex') // <Buffer 80>
Buffer.from('\x80', 'binary') // <Buffer 80>

// Or, without using a string
Buffer.from([0x80]) // <Buffer 80>
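
One caveat worth sketching: 'binary' is an alias for 'latin1', which maps each code point from 0x00 to 0xFF to a single byte. Code points above 0xFF do not fit, and only their low byte is kept. The character U+0141 below is picked purely for illustration.

```javascript
// 'binary' and 'latin1' produce the same bytes for code points 0x00-0xFF.
const viaBinary = Buffer.from('\x80\xff', 'binary');
const viaLatin1 = Buffer.from('\x80\xff', 'latin1');
console.log(viaBinary.equals(viaLatin1)); // true

// U+0141 does not fit in one byte, so latin1 keeps only the low byte (0x41).
const truncated = Buffer.from('\u0141', 'latin1');
console.log(truncated); // <Buffer 41>

// The default UTF-8 encoding preserves the code point using two bytes.
const utf8 = Buffer.from('\u0141');
console.log(utf8); // <Buffer c5 81>
```

So the single-byte encodings are only safe when every character in the string is known to be at most 0xFF.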

@leithouse
Author

Aha.

> Buffer.from('\x80','ascii')
<Buffer 80>

Thanks @VoltrexMaster

@aduh95 Thanks for the tip, but the use case intersperses strings and bytes; I had just dumbed it down to the simplest form for the example.
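
For a use case that mixes text and raw bytes, one possible approach is to build each piece separately and join them with `Buffer.concat`. This is only a sketch: `packParts` is a hypothetical helper name, and the convention that strings are UTF-8 text while numbers are raw byte values is an assumption, not something stated in this thread.

```javascript
// Hypothetical helper: strings are encoded as UTF-8 text,
// numbers are written as single raw bytes.
function packParts(parts) {
  return Buffer.concat(
    parts.map((part) =>
      typeof part === 'string' ? Buffer.from(part, 'utf8') : Buffer.from([part])
    )
  );
}

// The raw bytes 0x80 and 0xff stay single bytes; 'é' is still encoded as UTF-8.
const packed = packParts(['id=', 0x80, 0xff, ';name=é']);
console.log(packed.toString('hex')); // 69643d80ff3b6e616d653dc3a9
```

This avoids relying on '\x80' escapes inside a string, so the text portions can keep their full Unicode range.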
