incorrect extended grapheme segmentation #19
Comments
Note that this is relevant to JuliaLang/julia#9261.
In response to an email query, Jan Behrens said that utf8proc's grapheme code was written based on version 4.1.0 of Unicode Standard Annex #29, so he would not be surprised if it needs updating for Unicode 7.0.0.
I've put a test program for UAX#29 in the
Seems like we want to replace the It seems worth reordering some of the
The `UTF8PROC_CHARBOUND` map option is supposed to segment a string into graphemes (by inserting 0xFF before each grapheme), but it doesn't seem to be following the UAX extended grapheme rules. [It might be following the legacy rules? But (a) these aren't recommended and (b) the use of `UTF8PROC_BOUNDCLASS_EXTEND` in the source code seems to indicate that the extended rules are intended?]

According to UAX#29 and the grapheme break tests provided by the Unicode consortium, it is recommended that most applications use the "extended" grapheme break rules. In particular, any codepoint (other than a control character, CR, or LF) followed by a spacing mark is supposed to be treated as a single grapheme.
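To make the rule concrete, here is a minimal Python sketch of the extended break rules that matter for this issue. It is not utf8proc's code. The names `mark_graphemes`, `joins_previous`, and `is_control` are hypothetical, and the sketch approximates the Grapheme_Cluster_Break property from the General_Category (Mn/Me standing in for Extend, Mc for SpacingMark), which differs from the real property table in some cases. The Hangul, Prepend, and regional-indicator rules are left out entirely.

```python
import unicodedata

MARKER = b"\xff"  # 0xFF never occurs in valid UTF-8, so it can mark boundaries


def joins_previous(ch):
    # Assumption: Mn/Me approximate Extend and Mc approximates SpacingMark.
    return unicodedata.category(ch) in ("Mn", "Me", "Mc")


def is_control(ch):
    # Assumption: Cc/Zl/Zp approximate the Control, CR and LF classes.
    return unicodedata.category(ch) in ("Cc", "Zl", "Zp")


def mark_graphemes(text):
    """Return UTF-8 bytes with 0xFF inserted before each grapheme cluster."""
    out = bytearray()
    prev = None
    for ch in text:
        if prev is None:
            boundary = True                    # start of text
        elif prev == "\r" and ch == "\n":
            boundary = False                   # GB3: CR x LF
        elif is_control(prev) or is_control(ch):
            boundary = True                    # GB4, GB5: break around controls
        else:
            boundary = not joins_previous(ch)  # GB9, GB9a, else GB10
        if boundary:
            out += MARKER
        out += ch.encode("utf-8")
        prev = ch
    return bytes(out)


print(mark_graphemes("\u0020\u0903").hex(","))  # ff,20,e0,a4,83
```

Under these assumptions U+0903 attaches to the preceding space, so only one marker is emitted for the whole string.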
For example, `"\u0020\u0903"` is one of the test cases that is supposed to be treated as a single grapheme, because U+0903 is a spacing combining mark (category Mc). But utf8proc breaks this into two graphemes: mapping the string with `UTF8PROC_CHARBOUND` prints
`g[] = [ff,20,ff,e0,a4,83]`
(notice the incorrect `0xff` breakpoint after the first codepoint `0x20`).
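One way to read such marked output is to split it on the 0xFF bytes and count the pieces. The sketch below does this for the bytes reported above and for the output the extended rules would call for. The helper name `split_marked` is hypothetical and not part of utf8proc.

```python
def split_marked(data):
    """Split 0xFF-delimited UTF-8 bytes into the clusters they mark."""
    # 0xFF is never part of valid UTF-8, so splitting on it is unambiguous.
    return [chunk.decode("utf-8") for chunk in data.split(b"\xff") if chunk]


reported = bytes([0xFF, 0x20, 0xFF, 0xE0, 0xA4, 0x83])  # output quoted in this issue
expected = bytes([0xFF, 0x20, 0xE0, 0xA4, 0x83])        # single extended grapheme

print(len(split_marked(reported)))  # 2
print(len(split_marked(expected)))  # 1
```

The reported bytes decode to two clusters (the space and U+0903 on its own), while a single leading marker would yield the one cluster that the UAX#29 test case expects.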