Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Replacing ICU with ztd.text or encoding_rs #45389

Open
anonrig opened this issue Nov 9, 2022 · 20 comments
Open

Proposal: Replacing ICU with ztd.text or encoding_rs #45389

anonrig opened this issue Nov 9, 2022 · 20 comments

Comments

@anonrig
Copy link
Member

anonrig commented Nov 9, 2022

I've been mainly working on the TextDecoder performance gains for the past couple of weeks. It seems that ICU, even though is required for v8 Intl, is slow for UTF-8 encoding & decoding.

I recommend either adding ztd.text or encoding_rs with C++ bindings as a dependency and improving the performance of the TextDecoder & TextEncoder which will improve a lot of applications worldwide.

Deno uses encoding_rs and Bun uses a custom implementation.

Some good references:

@anonrig anonrig added the tsc-agenda Issues and PRs to discuss during the meetings of the TSC. label Nov 9, 2022
@anonrig
Copy link
Member Author

anonrig commented Nov 9, 2022

If this is not worth tsc-agenda, please remove it

@targos
Copy link
Member

targos commented Nov 9, 2022

I don't know (yet) if it's worth, but it's probably too early to bring this to the tsc-agenda.

@anonrig anonrig removed the tsc-agenda Issues and PRs to discuss during the meetings of the TSC. label Nov 9, 2022
@Jarred-Sumner
Copy link

Small correction: in Bun's case, JSC.WebCore.TextDecoder is the struct name in Bun for TextDecoder. The code for encoding/decoding is mostly custom (no library) other than one case where a function from WebKit is used for copying 8-bit integers into a 16-bit integer array faster

@anonrig
Copy link
Member Author

anonrig commented Nov 9, 2022

Thanks @Jarred-Sumner. I just updated the description.

@Trott
Copy link
Member

Trott commented Nov 10, 2022

Possible small positive side effect of doing what is proposed here: There is at least one place in the code where we use ICU for converting between UTF-8 and UTF-16 even though that part of the code doesn't have any internationalization needs. It's blocking #37954, so this could also help with that, I suppose.

@bnoordhuis
Copy link
Member

encoding_rs is arguably way too big for just encoding UTF-8 to UTF-16. I mean, it's a great library but it does 100x more than what's needed.

I don't know ztd.text well enough to comment but a quick look at its source code suggests it's not exactly small either. It also seems to be very young and untested.

@targos
Copy link
Member

targos commented Nov 12, 2022

@bnoordhuis
Copy link
Member

I've worked quite a bit with WTF. It's great but not easy to use outside chromium's source tree. It's supposed to be standalone but it has at least a partial dependency on base/.

WebKit's fork is probably even worse, I could never gather up enough courage to even try: https://github.com/WebKit/WebKit/blob/main/Source/WTF

@anonrig
Copy link
Member Author

anonrig commented Nov 20, 2022

encoding_rs is arguably way too big for just encoding UTF-8 to UTF-16. I mean, it's a great library but it does 100x more than what's needed.

I don't know ztd.text well enough to comment but a quick look at its source code suggests it's not exactly small either. It also seems to be very young and untested.

UTF8 encoding is now fast due to the recent developments. Can we consider encoding_rs for the rest of the encoding types?

@bnoordhuis
Copy link
Member

What encodings are we talking about? Can you enumerate them?

@anonrig
Copy link
Member Author

anonrig commented Nov 21, 2022

What encodings are we talking about? Can you enumerate them?

Any Buffer.from(val, ENC_1).toString(ENC_2) as well as TextDecoder (due to performance issues on initializing string_decoder).

@bnoordhuis
Copy link
Member

I suppose? The tradeoff is performance vs. maintenance. A library like encoding_rs is probably going to be a PITA to build on IBM i or the other tier 2/tier 3 platforms.

Cross-compiling to WASM is an option but means no sharing of code or data. Each thread gets its own copy, and that's probably quite substantial for conversion tables. You'd have to measure it.

@anonrig
Copy link
Member Author

anonrig commented Nov 23, 2022

I'd be happy to give this a shot. Would anybody want to help/guide me throughout the process?

@bnoordhuis
Copy link
Member

Happy to field questions. What direction do you plan on taking?

@tniessen
Copy link
Member

Cross-compiling to WASM is an option

Wouldn't that require copying all inputs and outputs from the JS heap to WebAssembly linear memory? That, plus the performance difference between WebAssembly and native, might diminish any performance benefits that this issue is hoping for.

@bnoordhuis
Copy link
Member

Copy data in/out: yes, but it may be cheap enough relative to the cost of conversion. Only way to know is to measure.

@anonrig
Copy link
Member Author

anonrig commented Nov 30, 2022

Bun started using this package for certain paths: https://github.com/simdutf/simdutf

@bnoordhuis
Copy link
Member

That could work. The fact it has an amalgamation speaks in its favor, otherwise we're on the hook for maintaining a gyp build.

A thing to keep in mind with SIMD is that the numbers can look great in isolation but turn out slower in real-world application. As with all things performance: you have to measure it.

@lemire
Copy link
Member

lemire commented Feb 7, 2023

Note that simdutf is now part of node.js, so this should make such work more relevant?

@ethanresnick
Copy link

Am I understanding correctly that TextDecoder now uses simdutf, but Buffer.prototype.toString() does not? If so, is it correct to say that TextDecoder will be much faster? And, in that case, would it be simple to update Buffer to use simdutf as well, or are there behavioral differences (e.g., around handling of invalid utf-8 bytes) that make switch tricky?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants