Proposal: Replacing ICU with ztd.text or encoding_rs #45389

anonrig · 2022-11-09T15:38:30Z

I've been working mainly on TextDecoder performance for the past couple of weeks. It seems that ICU, even though it is required for V8's Intl support, is slow for UTF-8 encoding and decoding.

I recommend adding either ztd.text or encoding_rs (with C++ bindings) as a dependency to improve the performance of TextDecoder and TextEncoder, which would benefit a lot of applications worldwide.

Deno uses encoding_rs and Bun uses a custom implementation.

Some good references:


anonrig · 2022-11-09T16:05:01Z

If this is not worth the tsc-agenda label, please remove it.

targos · 2022-11-09T16:20:13Z

I don't know (yet) if it's worth it, but it's probably too early to bring this to the tsc-agenda.

Jarred-Sumner · 2022-11-09T20:23:53Z

Small correction: in Bun's case, JSC.WebCore.TextDecoder is the struct name for TextDecoder. The encoding/decoding code is mostly custom (no library), apart from one case where a function from WebKit is used to copy 8-bit integers into a 16-bit integer array faster.

anonrig · 2022-11-09T20:49:13Z

Thanks @Jarred-Sumner. I just updated the description.

Trott · 2022-11-10T05:30:07Z

Possible small positive side effect of doing what is proposed here: There is at least one place in the code where we use ICU for converting between UTF-8 and UTF-16 even though that part of the code doesn't have any internationalization needs. It's blocking #37954, so this could also help with that, I suppose.

bnoordhuis · 2022-11-12T10:22:26Z

encoding_rs is arguably way too big for just encoding UTF-8 to UTF-16. I mean, it's a great library but it does 100x more than what's needed.

I don't know ztd.text well enough to comment but a quick look at its source code suggests it's not exactly small either. It also seems to be very young and untested.

targos · 2022-11-12T10:34:13Z

What about https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/platform/wtf/ ?

bnoordhuis · 2022-11-13T13:47:54Z

I've worked quite a bit with WTF. It's great but not easy to use outside chromium's source tree. It's supposed to be standalone but it has at least a partial dependency on base/.

WebKit's fork is probably even worse, I could never gather up enough courage to even try: https://github.com/WebKit/WebKit/blob/main/Source/WTF

anonrig · 2022-11-20T03:52:33Z

> encoding_rs is arguably way too big for just encoding UTF-8 to UTF-16. I mean, it's a great library but it does 100x more than what's needed.
>
> I don't know ztd.text well enough to comment but a quick look at its source code suggests it's not exactly small either. It also seems to be very young and untested.

UTF-8 encoding is now fast thanks to recent developments. Can we consider encoding_rs for the rest of the encoding types?

bnoordhuis · 2022-11-20T08:45:58Z

What encodings are we talking about? Can you enumerate them?

anonrig · 2022-11-21T22:41:44Z

> What encodings are we talking about? Can you enumerate them?

Any encoding pair used as Buffer.from(val, ENC_1).toString(ENC_2), as well as TextDecoder (because of the performance cost of initializing string_decoder).
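For readers following along, the two code paths named in this comment look roughly like the sketch below. `transcodeViaBuffer` is a hypothetical helper name used only for illustration, and the `TextDecoder('latin1')` call assumes a Node.js build with full ICU data (the default for official binaries).

```javascript
// Hypothetical helper (not a Node.js API): turn `value` into bytes using
// one Buffer encoding, then render those bytes using another.
function transcodeViaBuffer(value, fromEncoding, toEncoding) {
  return Buffer.from(value, fromEncoding).toString(toEncoding);
}

// Path 1: Buffer.from(val, ENC_1).toString(ENC_2)
const asHex = transcodeViaBuffer('h\u00e9llo', 'utf8', 'hex');

// Path 2: the same legacy-encoded bytes decoded by Buffer and by TextDecoder.
// Assumption: this build accepts the 'latin1' label (requires ICU data).
const latin1Bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]); // "caf\u00e9" in Latin-1
const viaBuffer = latin1Bytes.toString('latin1');
const viaDecoder = new TextDecoder('latin1').decode(latin1Bytes);

console.log({ asHex, viaBuffer, viaDecoder });
```

Both paths end in a JavaScript string, so any replacement library would have to cover whatever encodings each path accepts today.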

bnoordhuis · 2022-11-23T09:27:00Z

I suppose? The tradeoff is performance vs. maintenance. A library like encoding_rs is probably going to be a PITA to build on IBM i or the other tier 2/tier 3 platforms.

Cross-compiling to WASM is an option but means no sharing of code or data. Each thread gets its own copy, and that's probably quite substantial for conversion tables. You'd have to measure it.

anonrig · 2022-11-23T14:23:13Z

I'd be happy to give this a shot. Would anybody want to help/guide me throughout the process?

bnoordhuis · 2022-11-23T17:39:21Z

Happy to field questions. What direction do you plan on taking?

tniessen · 2022-11-24T00:41:20Z

> Cross-compiling to WASM is an option

Wouldn't that require copying all inputs and outputs from the JS heap to WebAssembly linear memory? That, plus the performance difference between WebAssembly and native, might diminish any performance benefits that this issue is hoping for.

bnoordhuis · 2022-11-24T00:52:39Z

Copy data in/out: yes, but it may be cheap enough relative to the cost of conversion. Only way to know is to measure.
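A rough way to start measuring the copy cost on its own is sketched below. It only moves bytes into and out of a `WebAssembly.Memory`; no converter module is instantiated, so it says nothing about the conversion itself. The helper names `copyIn` and `copyOut` and the input size are assumptions made for illustration.

```javascript
// 16 pages of 64 KiB = 1 MiB of linear memory. No module is loaded here;
// this isolates the copy step that a WASM-based converter would add.
const memory = new WebAssembly.Memory({ initial: 16 });

// Copy bytes from the JS heap into linear memory at `offset`.
function copyIn(bytes, offset = 0) {
  new Uint8Array(memory.buffer, offset, bytes.length).set(bytes);
}

// Copy `length` bytes out of linear memory into a fresh Uint8Array.
function copyOut(offset, length) {
  return new Uint8Array(memory.buffer, offset, length).slice();
}

const text = 'hello wasm '.repeat(1000);
const input = new TextEncoder().encode(text);

const start = performance.now();
copyIn(input);
const output = copyOut(0, input.length);
const elapsedMs = performance.now() - start;

const roundTripped = new TextDecoder().decode(output);
console.log(`copied ${input.length} bytes in and out in ${elapsedMs.toFixed(3)} ms`);
```

A single timing like this is noisy; a real comparison would repeat it across input sizes and set it against the conversion time of the candidate library.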

anonrig · 2022-11-30T05:00:34Z

Bun started using this package for certain paths: https://github.com/simdutf/simdutf

bnoordhuis · 2022-11-30T09:12:18Z

That could work. The fact it has an amalgamation speaks in its favor, otherwise we're on the hook for maintaining a gyp build.

A thing to keep in mind with SIMD is that the numbers can look great in isolation but turn out slower in real-world application. As with all things performance: you have to measure it.
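As a starting point for that kind of measurement, here is a minimal micro-benchmark sketch comparing the two UTF-8 decode paths discussed in this thread. The `timeIt` helper, the sample text, and the iteration count are all assumptions; numbers from a loop like this are exactly the "in isolation" figures the comment warns about, so treat them as a first signal only.

```javascript
// Mixed ASCII, Latin, and CJK text, encoded once up front.
const sample = Buffer.from(
  'The quick brown fox \u00e9\u00e8 \u4f60\u597d '.repeat(2000),
  'utf8'
);
const decoder = new TextDecoder();

// Hypothetical helper: run `fn` repeatedly and report the total elapsed time.
function timeIt(label, fn, iterations = 200) {
  fn(); // one warm-up call before timing
  const start = performance.now();
  let result;
  for (let i = 0; i < iterations; i++) {
    result = fn();
  }
  return { label, ms: performance.now() - start, result };
}

const viaToString = timeIt('Buffer#toString', () => sample.toString('utf8'));
const viaTextDecoder = timeIt('TextDecoder#decode', () => decoder.decode(sample));

for (const { label, ms } of [viaToString, viaTextDecoder]) {
  console.log(`${label}: ${ms.toFixed(2)} ms for 200 iterations`);
}
```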

lemire · 2023-02-07T22:06:57Z

Note that simdutf is now part of Node.js, so this should make such work more relevant?

ethanresnick · 2023-09-04T02:55:21Z

Am I understanding correctly that TextDecoder now uses simdutf, but Buffer.prototype.toString() does not? If so, is it correct to say that TextDecoder will be much faster? And, in that case, would it be simple to update Buffer to use simdutf as well, or are there behavioral differences (e.g., around handling of invalid UTF-8 bytes) that make the switch tricky?
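The thread does not answer this question, but the observable behavior of the two APIs can be compared directly. The sketch below does not show which implementation backs either API; it only shows two differences that any shared fast path would have to preserve. It assumes a Node.js build with ICU, since the `fatal` option depends on it.

```javascript
// A lone 0xFF byte is never valid in UTF-8.
const invalid = Buffer.from([0x61, 0xff, 0x62]);

// By default both APIs substitute U+FFFD for the invalid byte.
const fromBuffer = invalid.toString('utf8');
const fromDecoder = new TextDecoder().decode(invalid);

// Only TextDecoder offers a strict mode that throws instead of substituting.
let fatalThrew = false;
try {
  new TextDecoder('utf-8', { fatal: true }).decode(invalid);
} catch (err) {
  fatalThrew = true;
}

// A leading byte order mark is stripped by TextDecoder (ignoreBOM defaults
// to false) but kept as U+FEFF by Buffer#toString.
const withBom = Buffer.from([0xef, 0xbb, 0xbf, 0x61]);
const bomBuffer = withBom.toString('utf8');
const bomDecoder = new TextDecoder().decode(withBom);

console.log({ fromBuffer, fromDecoder, fatalThrew, bomBuffer, bomDecoder });
```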

anonrig added the tsc-agenda label (Issues and PRs to discuss during the meetings of the TSC) on Nov 9, 2022

anonrig removed the tsc-agenda label (Issues and PRs to discuss during the meetings of the TSC) on Nov 9, 2022

anonrig mentioned this issue Nov 14, 2022

Node.js Performance Team Meeting 2022-11-14 nodejs/performance#7

Closed
