Commit 9a3fc10

jasnell authored and addaleax committed
util: implement WHATWG Encoding Standard API
Provide an (initially experimental) implementation of the WHATWG Encoding Standard API (`TextDecoder` and `TextEncoder`). This is the same API implemented on the browser side.

By default, with small-icu, only the UTF-8, UTF-16le and UTF-16be decoders are supported. With full-icu enabled, every encoding other than iso-8859-16 is supported.

This provides a basic test, but does not include the full web platform tests. Note: many of the web platform tests for this would fail by default because we ship with small-icu by default.

A process warning will be emitted on first use to indicate that the API is still experimental. No runtime flag is required to use the feature.

Backport-PR-URL: #14585
Backport-Reviewed-By: Anna Henningsen <[email protected]>
Refs: https://encoding.spec.whatwg.org/
PR-URL: #13644
Reviewed-By: Timothy Gu <[email protected]>
Reviewed-By: Matteo Collina <[email protected]>
1 parent f593960 commit 9a3fc10

12 files changed

+1195
-16
lines changed

doc/api/buffer.md

+7-7
@@ -193,11 +193,12 @@ The character encodings currently supported by Node.js include:
193193

194194
* `'hex'` - Encode each byte as two hexadecimal characters.
195195

196-
*Note*: Today's browsers follow the [WHATWG spec] which aliases both 'latin1'
197-
and ISO-8859-1 to win-1252. This means that while doing something like
198-
`http.get()`, if the returned charset is one of those listed in the WHATWG spec
199-
it's possible that the server actually returned win-1252-encoded data, and
200-
using `'latin1'` encoding may incorrectly decode the characters.
196+
*Note*: Today's browsers follow the [WHATWG Encoding Standard][] which aliases
197+
both 'latin1' and ISO-8859-1 to win-1252. This means that while doing something
198+
like `http.get()`, if the returned charset is one of those listed in the WHATWG
199+
specification it is possible that the server actually returned
200+
win-1252-encoded data, and using `'latin1'` encoding may incorrectly decode the
201+
characters.
201202

202203
## Buffers and TypedArray
203204
<!-- YAML
@@ -2662,7 +2663,6 @@ buf.fill(0);
26622663
console.log(buf);
26632664
```
26642665

2665-
26662666
## Buffer Constants
26672667
<!-- YAML
26682668
added: 8.2.0
@@ -2730,5 +2730,5 @@ This value may depend on the JS engine that is being used.
27302730
[`util.inspect()`]: util.html#util_util_inspect_object_options
27312731
[RFC1345]: https://tools.ietf.org/html/rfc1345
27322732
[RFC4648, Section 5]: https://tools.ietf.org/html/rfc4648#section-5
2733-
[WHATWG spec]: https://encoding.spec.whatwg.org/
2733+
[WHATWG Encoding Standard]: https://encoding.spec.whatwg.org/
27342734
[iterator]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Iteration_protocols

doc/api/util.md

+151
@@ -536,6 +536,156 @@ added: v8.0.0
536536
A Symbol that can be used to declare custom promisified variants of functions,
537537
see [Custom promisified functions][].
538538

539+
### Class: util.TextDecoder
540+
<!-- YAML
541+
added: REPLACEME
542+
-->
543+
544+
> Stability: 1 - Experimental
545+
546+
An implementation of the [WHATWG Encoding Standard][] `TextDecoder` API.
547+
548+
```js
549+
const decoder = new TextDecoder('shift_jis');
550+
let string = '';
551+
let buffer;
552+
while (buffer = getNextChunkSomehow()) {
553+
string += decoder.decode(buffer, { stream: true });
554+
}
555+
string += decoder.decode(); // end-of-stream
556+
```
557+
558+
#### WHATWG Supported Encodings
559+
560+
Per the [WHATWG Encoding Standard][], the encodings supported by the
561+
`TextDecoder` API are outlined in the tables below. For each encoding,
562+
one or more aliases may be used. Support for some encodings is enabled
563+
only when Node.js is using the full ICU data.
564+
565+
##### Encodings Supported By Default
566+
567+
| Encoding | Aliases |
568+
| ----------- | --------------------------------- |
569+
| `'utf-8'` | `'unicode-1-1-utf-8'`, `'utf8'` |
570+
| `'utf-16be'`| |
571+
| `'utf-16le'`| `'utf-16'` |
572+
573+
##### Encodings Requiring Full-ICU
574+
575+
| Encoding | Aliases |
576+
| ----------------- | -------------------------------- |
577+
| `'ibm866'` | `'866'`, `'cp866'`, `'csibm866'` |
578+
| `'iso-8859-2'` | `'csisolatin2'`, `'iso-ir-101'`, `'iso8859-2'`, `'iso88592'`, `'iso_8859-2'`, `'iso_8859-2:1987'`, `'l2'`, `'latin2'` |
579+
| `'iso-8859-3'` | `'csisolatin3'`, `'iso-ir-109'`, `'iso8859-3'`, `'iso88593'`, `'iso_8859-3'`, `'iso_8859-3:1988'`, `'l3'`, `'latin3'` |
580+
| `'iso-8859-4'` | `'csisolatin4'`, `'iso-ir-110'`, `'iso8859-4'`, `'iso88594'`, `'iso_8859-4'`, `'iso_8859-4:1988'`, `'l4'`, `'latin4'` |
581+
| `'iso-8859-5'` | `'csisolatincyrillic'`, `'cyrillic'`, `'iso-ir-144'`, `'iso8859-5'`, `'iso88595'`, `'iso_8859-5'`, `'iso_8859-5:1988'`|
582+
| `'iso-8859-6'` | `'arabic'`, `'asmo-708'`, `'csiso88596e'`, `'csiso88596i'`, `'csisolatinarabic'`, `'ecma-114'`, `'iso-8859-6-e'`, `'iso-8859-6-i'`, `'iso-ir-127'`, `'iso8859-6'`, `'iso88596'`, `'iso_8859-6'`, `'iso_8859-6:1987'` |
583+
| `'iso-8859-7'` | `'csisolatingreek'`, `'ecma-118'`, `'elot_928'`, `'greek'`, `'greek8'`, `'iso-ir-126'`, `'iso8859-7'`, `'iso88597'`, `'iso_8859-7'`, `'iso_8859-7:1987'`, `'sun_eu_greek'` |
584+
| `'iso-8859-8'` | `'csiso88598e'`, `'csisolatinhebrew'`, `'hebrew'`, `'iso-8859-8-e'`, `'iso-ir-138'`, `'iso8859-8'`, `'iso88598'`, `'iso_8859-8'`, `'iso_8859-8:1988'`, `'visual'` |
585+
| `'iso-8859-8-i'` | `'csiso88598i'`, `'logical'` |
586+
| `'iso-8859-10'` | `'csisolatin6'`, `'iso-ir-157'`, `'iso8859-10'`, `'iso885910'`, `'l6'`, `'latin6'` |
587+
| `'iso-8859-13'` | `'iso8859-13'`, `'iso885913'` |
588+
| `'iso-8859-14'` | `'iso8859-14'`, `'iso885914'` |
589+
| `'iso-8859-15'` | `'csisolatin9'`, `'iso8859-15'`, `'iso885915'`, `'iso_8859-15'`, `'l9'` |
590+
| `'koi8-r'` | `'cskoi8r'`, `'koi'`, `'koi8'`, `'koi8_r'` |
591+
| `'koi8-u'` | `'koi8-ru'` |
592+
| `'macintosh'` | `'csmacintosh'`, `'mac'`, `'x-mac-roman'` |
593+
| `'windows-874'` | `'dos-874'`, `'iso-8859-11'`, `'iso8859-11'`, `'iso885911'`, `'tis-620'` |
594+
| `'windows-1250'` | `'cp1250'`, `'x-cp1250'` |
595+
| `'windows-1251'` | `'cp1251'`, `'x-cp1251'` |
596+
| `'windows-1252'` | `'ansi_x3.4-1968'`, `'ascii'`, `'cp1252'`, `'cp819'`, `'csisolatin1'`, `'ibm819'`, `'iso-8859-1'`, `'iso-ir-100'`, `'iso8859-1'`, `'iso88591'`, `'iso_8859-1'`, `'iso_8859-1:1987'`, `'l1'`, `'latin1'`, `'us-ascii'`, `'x-cp1252'` |
597+
| `'windows-1253'` | `'cp1253'`, `'x-cp1253'` |
598+
| `'windows-1254'` | `'cp1254'`, `'csisolatin5'`, `'iso-8859-9'`, `'iso-ir-148'`, `'iso8859-9'`, `'iso88599'`, `'iso_8859-9'`, `'iso_8859-9:1989'`, `'l5'`, `'latin5'`, `'x-cp1254'` |
599+
| `'windows-1255'` | `'cp1255'`, `'x-cp1255'` |
600+
| `'windows-1256'` | `'cp1256'`, `'x-cp1256'` |
601+
| `'windows-1257'` | `'cp1257'`, `'x-cp1257'` |
602+
| `'windows-1258'` | `'cp1258'`, `'x-cp1258'` |
603+
| `'x-mac-cyrillic'`| `'x-mac-ukrainian'` |
604+
| `'gbk'` | `'chinese'`, `'csgb2312'`, `'csiso58gb231280'`, `'gb2312'`, `'gb_2312'`, `'gb_2312-80'`, `'iso-ir-58'`, `'x-gbk'` |
605+
| `'gb18030'` | |
606+
| `'big5'` | `'big5-hkscs'`, `'cn-big5'`, `'csbig5'`, `'x-x-big5'` |
607+
| `'euc-jp'` | `'cseucpkdfmtjapanese'`, `'x-euc-jp'` |
608+
| `'iso-2022-jp'` | `'csiso2022jp'` |
609+
| `'shift_jis'` | `'csshiftjis'`, `'ms932'`, `'ms_kanji'`, `'shift-jis'`, `'sjis'`, `'windows-31j'`, `'x-sjis'` |
610+
| `'euc-kr'` | `'cseuckr'`, `'csksc56011987'`, `'iso-ir-149'`, `'korean'`, `'ks_c_5601-1987'`, `'ks_c_5601-1989'`, `'ksc5601'`, `'ksc_5601'`, `'windows-949'` |
611+
612+
*Note*: The `'iso-8859-16'` encoding listed in the [WHATWG Encoding Standard][]
613+
is not supported.
614+
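Because the set of available decoders depends on the ICU data a given build ships with, code that needs a legacy encoding may want to probe for it first. The following is a minimal sketch, not part of this commit: `isEncodingSupported` is a hypothetical helper, and it assumes that constructing a `TextDecoder` with an unavailable label throws a `RangeError`, as the WHATWG Encoding Standard specifies.

```js
const { TextDecoder } = require('util');

// Hypothetical helper (not part of the commit): reports whether the running
// Node.js build can construct a decoder for the given encoding label.
function isEncodingSupported(label) {
  try {
    new TextDecoder(label);
    return true;
  } catch (err) {
    // Assumption: an unknown or unavailable label surfaces as a RangeError.
    if (err instanceof RangeError) return false;
    throw err;
  }
}

console.log(isEncodingSupported('utf-8')); // true
// The result for legacy encodings depends on whether full ICU data is present.
console.log(isEncodingSupported('shift_jis'));
```

A caller could use the result to fall back to another decoding path, or to fail early with a clearer message, when only small-icu is available.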
615+
#### new TextDecoder([encoding[, options]])
616+
617+
* `encoding` {string} Identifies the `encoding` that this `TextDecoder` instance
618+
supports. Defaults to `'utf-8'`.
619+
* `options` {Object}
620+
* `fatal` {boolean} `true` if decoding failures are fatal. Defaults to
621+
`false`.
622+
* `ignoreBOM` {boolean} When `true`, the `TextDecoder` will include the byte
623+
order mark in the decoded result. When `false`, the byte order mark will
624+
be removed from the output. This option is only used when `encoding` is
625+
`'utf-8'`, `'utf-16be'` or `'utf-16le'`. Defaults to `false`.
626+
627+
Creates a new `TextDecoder` instance. The `encoding` may specify one of the
628+
supported encodings or an alias.
629+
630+
#### textDecoder.decode([input[, options]])
631+
632+
* `input` {ArrayBuffer|DataView|TypedArray} An `ArrayBuffer`, `DataView` or
633+
Typed Array instance containing the encoded data.
634+
* `options` {Object}
635+
* `stream` {boolean} `true` if additional chunks of data are expected.
636+
Defaults to `false`.
637+
* Returns: {string}
638+
639+
Decodes the `input` and returns a string. If `options.stream` is `true`, any
640+
incomplete byte sequences occurring at the end of the `input` are buffered
641+
internally and emitted after the next call to `textDecoder.decode()`.
642+
643+
If `textDecoder.fatal` is `true`, decoding errors that occur will result in a
644+
`TypeError` being thrown.
645+
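The buffering behavior and the `fatal` option described above can be sketched as follows. This is an illustrative example rather than part of the commit. It uses UTF-8 so that it does not depend on full ICU data, and the variable names are arbitrary.

```js
const { TextDecoder } = require('util');

// The euro sign is encoded in UTF-8 as the three bytes 0xE2 0x82 0xAC.
// Split it across two chunks to mimic data arriving in pieces.
const firstChunk = new Uint8Array([0xE2, 0x82]);
const secondChunk = new Uint8Array([0xAC]);

const decoder = new TextDecoder('utf-8');
let text = '';
// With { stream: true } the incomplete sequence is held back, not replaced.
text += decoder.decode(firstChunk, { stream: true });
text += decoder.decode(secondChunk, { stream: true });
text += decoder.decode(); // end-of-stream: flush anything still buffered
console.log(text === '\u20ac'); // true

// With fatal: true, malformed input throws instead of yielding U+FFFD.
const strictDecoder = new TextDecoder('utf-8', { fatal: true });
let threw = false;
try {
  strictDecoder.decode(new Uint8Array([0xFF])); // 0xFF is never valid UTF-8
} catch (err) {
  threw = err instanceof TypeError;
}
console.log(threw); // true
```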
646+
#### textDecoder.encoding
647+
648+
* Value: {string}
649+
650+
The encoding supported by the `TextDecoder` instance.
651+
652+
#### textDecoder.fatal
653+
654+
* Value: {boolean}
655+
656+
The value will be `true` if decoding errors result in a `TypeError` being
657+
thrown.
658+
659+
#### textDecoder.ignoreBOM
660+
661+
* Value: {boolean}
662+
663+
The value will be `true` if the decoding result will include the byte order
664+
mark.
665+
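The name `ignoreBOM` is easy to misread: setting it to `true` keeps the byte order mark in the output rather than dropping it. A short sketch of the difference, offered as an illustration and not as part of the commit:

```js
const { TextDecoder } = require('util');

// UTF-8 byte order mark (0xEF 0xBB 0xBF) followed by the bytes for 'hi'.
const bytes = new Uint8Array([0xEF, 0xBB, 0xBF, 0x68, 0x69]);

// Default (ignoreBOM: false): the BOM is removed from the decoded string.
const stripped = new TextDecoder('utf-8').decode(bytes);

// ignoreBOM: true: the BOM is passed through as U+FEFF.
const kept = new TextDecoder('utf-8', { ignoreBOM: true }).decode(bytes);

console.log(stripped.length); // 2
console.log(kept.length); // 3
console.log(kept.charCodeAt(0) === 0xFEFF); // true
```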
666+
### Class: util.TextEncoder
667+
<!-- YAML
668+
added: REPLACEME
669+
-->
670+
671+
> Stability: 1 - Experimental
672+
673+
An implementation of the [WHATWG Encoding Standard][] `TextEncoder` API. All
674+
instances of `TextEncoder` only support `UTF-8` encoding.
675+
676+
```js
677+
const encoder = new TextEncoder();
678+
const uint8array = encoder.encode('this is some data');
679+
```
680+
681+
#### textEncoder.encode([input])
682+
683+
* `input` {string} The text to encode. Defaults to an empty string.
684+
* Returns: {Uint8Array}
685+
686+
UTF-8 encodes the `input` string and returns a `Uint8Array` containing the
687+
encoded bytes.
688+
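As a rough illustration of `textEncoder.encode()` (not part of the commit), the sketch below encodes a string containing one non-ASCII character and decodes it again. The byte count differs from the string length because `'é'` occupies two bytes in UTF-8.

```js
const { TextDecoder, TextEncoder } = require('util');

const original = 'h\u00e9llo'; // 'héllo', written with an escape for clarity

const encoded = new TextEncoder().encode(original);
console.log(encoded instanceof Uint8Array); // true
console.log(original.length); // 5 UTF-16 code units
console.log(encoded.length); // 6 bytes

// A default TextDecoder is UTF-8, so it reverses the encoding step.
const roundTripped = new TextDecoder().decode(encoded);
console.log(roundTripped === original); // true
```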
539689
## Deprecated APIs
540690

541691
The following APIs have been deprecated and should no longer be used. Existing
@@ -1022,3 +1172,4 @@ Deprecated predecessor of `console.log`.
10221172
[Custom promisified functions]: #util_custom_promisified_functions
10231173
[constructor]: https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Object/constructor
10241174
[semantically incompatible]: https://github.com/nodejs/node/issues/4179
1175+
[WHATWG Encoding Standard]: https://encoding.spec.whatwg.org/
