incorrect extended grapheme segmentation #19

Closed
stevengj opened this issue Dec 7, 2014 · 4 comments · Fixed by #20

stevengj commented Dec 7, 2014

The UTF8PROC_CHARBOUND map option is supposed to segment a string into graphemes (by inserting 0xFF before each grapheme), but it doesn't seem to be following the UAX#29 extended grapheme rules. [It might be following the legacy rules? But (a) these aren't recommended and (b) the use of UTF8PROC_BOUNDCLASS_EXTEND in the source code seems to indicate that the extended rules are intended.]

According to UAX#29 and the grapheme break tests provided by the Unicode Consortium, it is recommended that most applications use the "extended" grapheme break rules. In particular, the extended rules never break before a spacing mark, so a codepoint followed by a spacing mark is supposed to be treated as a single grapheme.

For example "\u0020\u0903" is one of the test cases that is supposed to be treated as a single grapheme, because U+0903 is a spacing-combining mark (category Mc). But utf8proc breaks these into two graphemes:

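#include <stdio.h>
#include <stdlib.h>
#include "mojibake.h"

int main(void)
{
     uint8_t s[4] = {0x20,0xe0,0xa4,0x83}; /* UTF-8 for "\u0020\u0903" */
     uint8_t *g = 0;
     ssize_t len, i;
     /* insert 0xFF before each grapheme; g is allocated by utf8proc_map */
     len = utf8proc_map(s, 4, &g, UTF8PROC_CHARBOUND);
     if (len < 0) {
          fprintf(stderr, "utf8proc_map failed: %s\n", utf8proc_errmsg(len));
          return 1;
     }
     printf("g[] = [");
     for (i = 0; i < len; ++i) {
          if (i) printf(",");
          printf("%02x", g[i]);
     }
     printf("]\n");
     free(g); /* release the buffer returned by utf8proc_map */
     return 0;
}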

which prints g[] = [ff,20,ff,e0,a4,83] (notice the incorrect 0xff breakpoint after the first codepoint 0x20).
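
For comparison, here is a minimal sketch of the break decision the extended rules call for in this case. It is not utf8proc's code: the `boundclass_t` enum, `classify`, and `grapheme_break` are hypothetical names, the classifier only knows the few codepoints used here, and the CR/LF, Hangul, and regional-indicator rules of UAX#29 are left out.

```c
#include <stdbool.h>

/* Hypothetical, simplified boundary classes (not utf8proc's enum). */
typedef enum { BC_OTHER, BC_CONTROL, BC_EXTEND, BC_SPACINGMARK } boundclass_t;

/* Toy classifier: only covers the codepoints used in this example. */
static boundclass_t classify(unsigned cp)
{
     if (cp == 0x0903) return BC_SPACINGMARK; /* DEVANAGARI SIGN VISARGA (Mc) */
     if (cp == 0x0301) return BC_EXTEND;      /* COMBINING ACUTE ACCENT (Mn) */
     if (cp < 0x20) return BC_CONTROL;        /* C0 controls; CR x LF is omitted */
     return BC_OTHER;
}

/* Returns true if there is a grapheme break between codepoints a and b. */
static bool grapheme_break(unsigned a, unsigned b)
{
     boundclass_t ca = classify(a), cb = classify(b);
     if (ca == BC_CONTROL || cb == BC_CONTROL) return true; /* break around controls */
     if (cb == BC_EXTEND) return false;      /* GB9:  no break before Extend */
     if (cb == BC_SPACINGMARK) return false; /* GB9a: no break before SpacingMark */
     return true;                            /* otherwise break everywhere */
}
```

With this sketch, `grapheme_break(0x0020, 0x0903)` returns false, so the expected output for the test case above would contain a single 0xff, at the start.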

stevengj added the bug label Dec 7, 2014

stevengj commented Dec 7, 2014

Note that this is relevant to JuliaLang/julia#9261

stevengj changed the title from "incorrect extended grapheme breaking" to "incorrect extended grapheme breaks" Dec 7, 2014
stevengj changed the title from "incorrect extended grapheme breaks" to "incorrect extended grapheme segmentation" Dec 7, 2014

stevengj commented Dec 8, 2014

In response to an email query, Jan Behrens said that utf8proc's grapheme code was written based on version 4.1.0 of Unicode Standard Annex #29, so he would not be surprised if it needs updating for Unicode 7.0.0.


stevengj commented Dec 8, 2014

I've put a test program for UAX#29 in the graphemes branch.

stevengj commented

Seems like we want to replace the extend bit in utf8proc_property_t with a boundclass (3 bits) from the GraphemeBreakProperty.txt data.

It seems worth reordering some of the utf8proc_property_t fields to save space, since modifying the extend field will break backwards compatibility anyway.
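
As a rough illustration of that change, the sketch below replaces a one-bit `extend` flag with a small `boundclass` bit-field and places it next to another bit-field so they can share storage. Everything here is hypothetical: `example_property_t` is not the real `utf8proc_property_t`, the enum names and numbering are made up, and `BOUNDCLASS_BITS` is an assumption sized to fit the values enumerated in the sketch.

```c
#include <stdint.h>

/* Grapheme_Cluster_Break values as listed in GraphemeBreakProperty.txt
   for Unicode 7.0; names and numbering are hypothetical. */
typedef enum {
     GCB_OTHER, GCB_CR, GCB_LF, GCB_CONTROL, GCB_EXTEND, GCB_PREPEND,
     GCB_REGIONAL_INDICATOR, GCB_SPACINGMARK,
     GCB_L, GCB_V, GCB_T, GCB_LV, GCB_LVT,
     GCB_COUNT /* number of values above */
} gcb_t;

#define BOUNDCLASS_BITS 4 /* assumption: wide enough for GCB_COUNT values */

/* Hypothetical property record: fixed-width fields first, bit-fields
   grouped at the end so the compiler can pack them together. */
typedef struct {
     int16_t category;
     int16_t combining_class;
     unsigned comp_exclusion:1;
     unsigned boundclass:BOUNDCLASS_BITS; /* replaces the old extend bit */
} example_property_t;

/* Fail at compile time if the field cannot hold every class. */
_Static_assert(GCB_COUNT <= (1 << BOUNDCLASS_BITS),
               "boundclass field too narrow");
```

The compile-time check is there so that adding a class to the enum without widening the field is caught immediately rather than silently truncating values.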
