incorrect extended grapheme segmentation #19

stevengj · 2014-12-07T03:05:02Z

The UTF8PROC_CHARBOUND map option is supposed to segment a string into graphemes (by inserting 0xFF before each grapheme), but it doesn't seem to be following the UAX #29 extended grapheme rules. [It might be following the legacy rules? But (a) these aren't recommended and (b) the use of UTF8PROC_BOUNDCLASS_EXTEND in the source code seems to indicate that the extended rules are intended?]

According to UAX#29 and the grapheme break tests provided by the Unicode Consortium, it is recommended that most applications use the "extended" grapheme break rules. In particular, any codepoint (other than a control character, CR, or LF) followed by a spacing mark is supposed to be treated as a single grapheme.
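To make the difference between the legacy and extended rules concrete, here is a minimal sketch of the break decision between two adjacent codepoints. The enum `gb_class`, its members, and the function `gb_break_permitted` are hypothetical names chosen for illustration, not utf8proc's API, and the Hangul and regional-indicator rules are left out for brevity.

```c
#include <stdbool.h>

/* Hypothetical subset of the Grapheme_Cluster_Break classes from UAX #29. */
typedef enum {
    GB_OTHER,
    GB_CR,
    GB_LF,
    GB_CONTROL,
    GB_EXTEND,
    GB_SPACINGMARK,
    GB_PREPEND
} gb_class;

static bool gb_is_control(gb_class c)
{
    return c == GB_CR || c == GB_LF || c == GB_CONTROL;
}

/* Returns true if a grapheme break is permitted between two adjacent
   codepoints whose classes are `before` and `after`.
   With extended == false, the SpacingMark and Prepend rules are skipped,
   which is the only difference from the extended rules in this sketch. */
bool gb_break_permitted(gb_class before, gb_class after, bool extended)
{
    if (before == GB_CR && after == GB_LF)
        return false;                       /* never split CR LF */
    if (gb_is_control(before) || gb_is_control(after))
        return true;                        /* always break around controls */
    if (after == GB_EXTEND)
        return false;                       /* no break before Extend */
    if (extended && after == GB_SPACINGMARK)
        return false;                       /* extended only: no break before SpacingMark */
    if (extended && before == GB_PREPEND)
        return false;                       /* extended only: no break after Prepend */
    return true;                            /* default: break everywhere else */
}
```

With U+0020 classed as Other and U+0903 as SpacingMark, `gb_break_permitted(GB_OTHER, GB_SPACINGMARK, true)` is false (one grapheme), while passing `false` for `extended` permits the break (two graphemes).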

For example "\u0020\u0903" is one of the test cases that is supposed to be treated as a single grapheme, because U+0903 is a spacing-combining mark (category Mc). But utf8proc breaks these into two graphemes:

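#include <stdio.h>
#include <stdlib.h>
#include "mojibake.h"

int main(void)
{
    uint8_t s[4] = {0x20,0xe0,0xa4,0x83}; /* UTF-8 for "\u0020\u0903" */
    uint8_t *g = 0;
    ssize_t len, i;
    /* insert a 0xFF marker before each grapheme */
    len = utf8proc_map(s, 4, &g, UTF8PROC_CHARBOUND);
    if (len < 0) { /* negative return values are error codes */
        fprintf(stderr, "utf8proc_map: %s\n", utf8proc_errmsg(len));
        return 1;
    }
    printf("g[] = [");
    for (i = 0; i < len; ++i) {
        if (i) printf(",");
        printf("%02x", g[i]);
    }
    printf("]\n");
    free(g); /* the result buffer is malloc'ed by utf8proc_map */
    return 0;
}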

which prints g[] = [ff,20,ff,e0,a4,83] (notice the incorrect 0xff breakpoint after the first codepoint 0x20).
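Since 0xFF never occurs in valid UTF-8, the number of graphemes in such an output buffer can be read off by counting the marker bytes. The helper below is a minimal sketch of that check; `count_boundary_markers` is a hypothetical name, not part of utf8proc.

```c
#include <stddef.h>
#include <stdint.h>

/* Counts the 0xFF boundary markers in a buffer produced with
   UTF8PROC_CHARBOUND. One marker is inserted before each grapheme,
   so the count equals the number of graphemes found. */
size_t count_boundary_markers(const uint8_t *buf, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        if (buf[i] == 0xFF)
            ++n;
    }
    return n;
}
```

Applied to the output above, [ff,20,ff,e0,a4,83], this gives 2. Under the extended rules the output should presumably be [ff,20,e0,a4,83], which gives 1.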


stevengj · 2014-12-07T03:09:42Z

Note that this is relevant to JuliaLang/julia#9261

stevengj · 2014-12-08T02:19:37Z

In response to an email query, Jan Behrens said that utf8proc's grapheme code was written based on version 4.1.0 of Unicode Standard Annex #29, so he would not be surprised if it needs updating for Unicode 7.0.0.

stevengj · 2014-12-08T03:57:44Z

I've put a test program for UAX#29 in the graphemes branch.

stevengj · 2014-12-12T19:26:57Z

Seems like we want to replace the extend bit in utf8proc_property_t with a boundclass (3 bits) from the GraphemeBreakProperty.txt data.

It seems worth reordering some of the utf8proc_property_t fields to save space, since modifying the extend field will break backwards compatibility anyway.
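A rough sketch of what that change could look like is below. Both structs and all of their field names are hypothetical stand-ins, not the real `utf8proc_property_t` layout. They only illustrate swapping a 1-bit flag for a small class field and grouping the wide fields first so that less padding is needed. The 3-bit width follows the suggestion above and would need to grow if the data file defines more than 8 classes.

```c
#include <stdint.h>

/* Width of the boundclass bitfield (assumption: taken from the comment above). */
#define BOUNDCLASS_BITS 3

/* Hypothetical "before" layout: narrow and wide fields interleaved,
   with a single-bit extend flag. */
typedef struct {
    uint8_t  flag_a;
    int32_t  mapping_a;
    uint8_t  flag_b;
    int32_t  mapping_b;
    unsigned extend : 1;
} layout_before;

/* Hypothetical "after" layout: wide fields first, narrow fields together,
   and the extend bit replaced by a small boundclass field. */
typedef struct {
    int32_t  mapping_a;
    int32_t  mapping_b;
    uint8_t  flag_a;
    uint8_t  flag_b;
    unsigned boundclass : BOUNDCLASS_BITS;
} layout_after;
```

The exact sizes depend on the compiler and ABI, so comparing `sizeof(layout_before)` with `sizeof(layout_after)` on the target platform is the way to confirm any saving.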

stevengj added the bug label Dec 7, 2014

stevengj changed the title ~~incorrect extended grapheme breaking~~ incorrect extended grapheme breaks Dec 7, 2014

stevengj changed the title ~~incorrect extended grapheme breaks~~ incorrect extended grapheme segmentation Dec 7, 2014

stevengj mentioned this issue Dec 8, 2014

add graphemes(s) function to iterate over string graphemes JuliaLang/julia#9261

Merged

stevengj mentioned this issue Dec 12, 2014

Update graphemes for Unicode 7 #20

Merged

stevengj closed this as completed in #20 Dec 14, 2014
