Commit 98cb59e

TimothyGu authored and addaleax committed
src: revise character width calculation
- Categorize all nonspacing marks (Mn) and enclosing marks (Me) as 0-width.
- Categorize all spacing marks (Mc) as non-0-width.
- Treat soft hyphens (a format character Cf) as non-0-width.
- Do not treat all unassigned code points as 0-width; instead, let ICU select the default for that character per UAX #11.
- Avoid getting the General_Category of a character multiple times as it is an intensive operation.

Refs: http://unicode.org/reports/tr11/
PR-URL: #13918
Reviewed-By: James M Snell <[email protected]>
1 parent b4b27b2 commit 98cb59e

2 files changed: +54 -5 lines changed

src/node_i18n.cc

+23 -4
@@ -601,14 +601,33 @@ static void ToASCII(const FunctionCallbackInfo<Value>& args) {
 // newer wide characters. wcwidth, on the other hand, uses a fixed
 // algorithm that does not take things like emoji into proper
 // consideration.
+//
+// TODO(TimothyGu): Investigate Cc (C0/C1 control codes). Both VTE (used by
+// GNOME Terminal) and Konsole don't consider them to be zero-width (see refs
+// below), and when printed in VTE it is Narrow. However GNOME Terminal doesn't
+// allow it to be input. Linux's PTY terminal prints control characters as
+// Narrow rhombi.
+//
+// TODO(TimothyGu): Investigate Hangul jamo characters. Medial vowels and final
+// consonants are 0-width when combined with initial consonants; otherwise they
+// are technically Wide. But many terminals (including Konsole and
+// VTE/GLib-based) implement all medials and finals as 0-width.
+//
+// Refs: https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/#combining-characters-and-character-width
+// Refs: https://github.com/GNOME/glib/blob/79e4d4c6be/glib/guniprop.c#L388-L420
+// Refs: https://github.com/KDE/konsole/blob/8c6a5d13c0/src/konsole_wcwidth.cpp#L101-L223
 static int GetColumnWidth(UChar32 codepoint,
                           bool ambiguous_as_full_width = false) {
-  if (!u_isdefined(codepoint) ||
-      u_iscntrl(codepoint) ||
-      u_getCombiningClass(codepoint) > 0 ||
-      u_hasBinaryProperty(codepoint, UCHAR_EMOJI_MODIFIER)) {
+  const auto zero_width_mask = U_GC_CC_MASK |  // C0/C1 control code
+                               U_GC_CF_MASK |  // Format control character
+                               U_GC_ME_MASK |  // Enclosing mark
+                               U_GC_MN_MASK;   // Nonspacing mark
+  if (codepoint != 0x00AD &&  // SOFT HYPHEN is Cf but not zero-width
+      ((U_MASK(u_charType(codepoint)) & zero_width_mask) ||
+       u_hasBinaryProperty(codepoint, UCHAR_EMOJI_MODIFIER))) {
     return 0;
   }
+
   // UCHAR_EAST_ASIAN_WIDTH is the Unicode property that identifies a
   // codepoint as being full width, wide, ambiguous, neutral, narrow,
   // or halfwidth.
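
For readers without an ICU build at hand, the zero-width test in GetColumnWidth can be approximated in plain JavaScript with Unicode property escapes. This is a minimal sketch, not the code from this commit: the names `isZeroWidth`, `ZERO_WIDTH_CATEGORY`, `EMOJI_MODIFIER`, and `SOFT_HYPHEN` are hypothetical, the results depend on the Unicode version bundled with the JavaScript engine, and East Asian Width (the 1-versus-2 decision) is not modelled at all.

```javascript
// Hypothetical helper that mirrors only the "return 0" branch of GetColumnWidth.
// Assumption: the engine supports \p{...} property escapes (Node.js 10 or later).

// General categories treated as zero-width: Cc, Cf, Me, Mn.
const ZERO_WIDTH_CATEGORY = /^[\p{Cc}\p{Cf}\p{Me}\p{Mn}]$/u;
// Emoji modifiers (skin tones) are also treated as zero-width.
const EMOJI_MODIFIER = /^\p{Emoji_Modifier}$/u;
// SOFT HYPHEN is Cf but is explicitly excluded from the zero-width set.
const SOFT_HYPHEN = 0x00AD;

function isZeroWidth(codePoint) {
  if (codePoint === SOFT_HYPHEN) return false;
  const ch = String.fromCodePoint(codePoint);
  return ZERO_WIDTH_CATEGORY.test(ch) || EMOJI_MODIFIER.test(ch);
}

// Spot checks taken from the categories named in the commit message.
console.log(isZeroWidth(0x0301)); // Mn: COMBINING ACUTE ACCENT
console.log(isZeroWidth(0x1B44)); // Mc: BALINESE ADEG ADEG
console.log(isZeroWidth(0x00AD)); // Cf: SOFT HYPHEN
```

Unlike the C++ version, which fetches the General_Category once via u_charType and tests it against a bitmask, this sketch leaves the category lookup to the regular expression engine, so it illustrates the classification rule rather than the performance change.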

test/parallel/test-icu-stringwidth.js

+31 -1
@@ -11,13 +11,43 @@ const assert = require('assert');
 const readline = require('internal/readline');

 // Test column width
+
+// Ll (Lowercase Letter): LATIN SMALL LETTER A
 assert.strictEqual(readline.getStringWidth('a'), 1);
+assert.strictEqual(readline.getStringWidth(0x0061), 1);
+// Lo (Other Letter)
 assert.strictEqual(readline.getStringWidth('丁'), 2);
+assert.strictEqual(readline.getStringWidth(0x4E01), 2);
+// Surrogate pairs
 assert.strictEqual(readline.getStringWidth('\ud83d\udc78\ud83c\udfff'), 2);
 assert.strictEqual(readline.getStringWidth('👅'), 2);
+// Cs (Surrogate): High Surrogate
+assert.strictEqual(readline.getStringWidth('\ud83d'), 1);
+// Cs (Surrogate): Low Surrogate
+assert.strictEqual(readline.getStringWidth('\udc78'), 1);
+// Cc (Control): NULL
+assert.strictEqual(readline.getStringWidth(0), 0);
+// Cc (Control): BELL
+assert.strictEqual(readline.getStringWidth(0x0007), 0);
+// Cc (Control): LINE FEED
 assert.strictEqual(readline.getStringWidth('\n'), 0);
+// Cf (Format): SOFT HYPHEN
+assert.strictEqual(readline.getStringWidth(0x00AD), 1);
+// Cf (Format): LEFT-TO-RIGHT MARK
+// Cf (Format): RIGHT-TO-LEFT MARK
 assert.strictEqual(readline.getStringWidth('\u200Ef\u200F'), 1);
-assert.strictEqual(readline.getStringWidth(97), 1);
+// Cn (Unassigned): Not a character
+assert.strictEqual(readline.getStringWidth(0x10FFEF), 1);
+// Cn (Unassigned): Not a character (but in a CJK range)
+assert.strictEqual(readline.getStringWidth(0x3FFEF), 2);
+// Mn (Nonspacing Mark): COMBINING ACUTE ACCENT
+assert.strictEqual(readline.getStringWidth(0x0301), 0);
+// Mc (Spacing Mark): BALINESE ADEG ADEG
+// Chosen as its Canonical_Combining_Class is not 0, but is not a 0-width
+// character.
+assert.strictEqual(readline.getStringWidth(0x1B44), 1);
+// Me (Enclosing Mark): COMBINING ENCLOSING CIRCLE
+assert.strictEqual(readline.getStringWidth(0x20DD), 0);

 // The following is an emoji sequence. In some implementations, it is
 // represented as a single glyph, in other implementations as a sequence
