
Commit ad679a7

Update the documentation comment
1 parent b6bc906 commit ad679a7

2 files changed (+73, -39 lines)


src/tools/unicode-table-generator/src/main.rs

+73
@@ -1,3 +1,76 @@
+//! This implements the core logic of the compression scheme used to compactly
+//! encode Unicode properties.
+//!
+//! We have two primary goals with the encoding: we want to be compact, because
+//! these tables often end up in ~every Rust program (especially the
+//! grapheme_extend table, used for str debugging), including those for embedded
+//! targets (where space is important). We also want to be relatively fast,
+//! though this is more of a nice-to-have than a key design constraint.
+//! It is expected that libraries/applications which are performance-sensitive
+//! to Unicode property lookups are extremely rare, and those that care may find
+//! the tradeoff of the raw bitsets worth it. For most applications, a
+//! relatively fast but much smaller (and as such less cache-impacting, etc.)
+//! data set is likely preferable.
+//!
+//! We have two separate encoding schemes: a skiplist-like approach, and a
+//! compressed bitset. The datasets we consider mostly use the skiplist (it's
+//! smaller) but the lowercase and uppercase sets are sufficiently sparse for
+//! the bitset to be worthwhile -- for those sets the bitset is a 2x size win.
+//! Since the bitset is also faster, this seems an obvious choice. (As a
+//! historical note, the bitset was also the prior implementation, so its
+//! relative complexity had already been paid.)
+//!
+//! ## The bitset
+//!
+//! The primary idea is that we 'flatten' the Unicode ranges into an enormous
+//! bitset. To represent any arbitrary codepoint in a raw bitset, we would need
+//! over 17 kilobytes of data per character set -- way too much for our
+//! purposes.
+//!
+//! First, the raw bitset (one bit for every valid `char`, from 0 to 0x10FFFF,
+//! not skipping the small 'gap') is grouped into words (u64) and
+//! deduplicated. On random data, this would be useless; on our data, this is
+//! incredibly beneficial -- our data sets have (far) fewer than 256 unique
+//! words.
+//!
+//! This gives us an array that maps `u8 -> word`; the current algorithm does
+//! not handle the case of more than 256 unique words, but we are relatively far
+//! from that limit.
+//!
+//! With that scheme, we now have a single byte for every 64 codepoints.
+//!
+//! We further chunk these by some constant N (between 1 and 64 per group,
+//! dynamically chosen for smallest size), and again deduplicate and store in an
+//! array (u8 -> [u8; N]).
+//!
+//! The bytes of this array map into the words from the bitset above, but we
+//! apply another trick here: some of these words are similar enough that they
+//! can be represented by some function of another word. The particular
+//! functions chosen are rotation, inversion, and shifting (right).
+//!
+//! ## The skiplist
+//!
+//! The skip list arose out of the desire for an even smaller encoding than the
+//! bitset -- and was the answer to the question "what is the smallest
+//! representation we can imagine?". However, it is not necessarily the
+//! smallest, and if you have a better proposal, please do suggest it!
+//!
+//! This is a relatively straightforward encoding. First, we break up all the
+//! ranges in the input data into offsets from each other, essentially a gap
+//! encoding. In practice, most gaps are small -- less than u8::MAX -- so we
+//! store those directly. We make use of the larger gaps (which are nicely
+//! interspersed throughout the dataset already) to index into this data.
+//!
+//! In particular, each run of small gaps (terminating in a large gap) is
+//! indexed in a separate dataset. That data set stores an index into the
+//! primary offset list and a prefix sum of that offset list. These are packed
+//! into a single u32 (11 bits for the offset, 21 bits for the prefix sum).
+//!
+//! Lookup proceeds via a binary search in the index and then a straightforward
+//! linear scan (adding up the offsets) until we reach the needle, and then the
+//! index of that offset is used as the answer to whether we're in the set
+//! or not.
+
 use std::collections::{BTreeMap, HashMap};
 use std::ops::Range;
 use ucd_parse::Codepoints;
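
To make the bitset indirection concrete, here is a minimal lookup sketch. The table names, the chunk size N = 16, and the byte encoding of the derived-word transform are all hypothetical; the generated tables fix their own layout and constants. The chain of lookups (codepoint, then top-level chunk index, then word index, then a possibly derived word, then a single bit) is the one the comment describes.

```rust
const CHUNK_SIZE: usize = 16; // hypothetical N: 16 words, i.e. 1024 codepoints per chunk

/// Membership test against a hypothetical two-level bitset encoding.
fn bitset_contains(
    c: char,
    chunk_idx: &[u8],            // one byte per 1024 codepoints -> index into `chunks`
    chunks: &[[u8; CHUNK_SIZE]], // deduplicated chunks of word indices
    canonical: &[u64],           // deduplicated words stored verbatim
    derived: &[(u8, u8)],        // (canonical word index, transform) for the rest
) -> bool {
    let cp = c as usize;

    // Keep the sketch short: codepoints past the end of the top-level array
    // are simply treated as "not in the set".
    let chunk = match chunk_idx.get(cp / (64 * CHUNK_SIZE)) {
        Some(&i) => &chunks[i as usize],
        None => return false,
    };
    let word_idx = chunk[(cp / 64) % CHUNK_SIZE] as usize;

    // Word indices beyond the canonical table denote words derived from a
    // canonical word by inversion, right shift, or rotation. The transform
    // byte layout is made up for this sketch: bit 7 = invert first,
    // bit 6 = shift right (otherwise rotate right), low 6 bits = amount.
    let word = if word_idx < canonical.len() {
        canonical[word_idx]
    } else {
        let (src, t) = derived[word_idx - canonical.len()];
        let mut w = canonical[src as usize];
        if t & 0x80 != 0 {
            w = !w;
        }
        let amount = (t & 0x3f) as u32;
        if t & 0x40 != 0 { w >> amount } else { w.rotate_right(amount) }
    };

    (word >> (cp % 64)) & 1 != 0
}
```

As the comment notes, the constant N is chosen dynamically per dataset for the smallest output, so the concrete value above is only an example.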
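
The "some words are a function of another word" trick can be found by brute force on the generator side. The following is only an illustrative sketch of that search, not the generator's actual code; the transform set (inversion, right shift, rotation) is the one named in the comment.

```rust
/// Try to express `candidate` as a transform of `base`: an optional inversion
/// followed by either a right rotation or a right shift. Returns
/// (invert, shift_instead_of_rotate, amount) on success.
fn derive_word(base: u64, candidate: u64) -> Option<(bool, bool, u32)> {
    for invert in [false, true] {
        let src = if invert { !base } else { base };
        for amount in 0..64u32 {
            if src.rotate_right(amount) == candidate {
                return Some((invert, false, amount));
            }
            if src >> amount == candidate {
                return Some((invert, true, amount));
            }
        }
    }
    None
}
```

A generator can then keep only the words with no such relation as "canonical" and store a small (source index, transform) pair for every other word.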
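
Here is a comparable sketch of the skiplist lookup from the last two paragraphs. The packing of each index entry follows the 11/21-bit split given in the comment, but which half sits in the high bits, the exact boundary conventions, and the alternation of "gap" and "range length" offsets are assumptions made for the sake of a runnable example.

```rust
/// Membership test against a hypothetical skiplist encoding. `offsets` is the
/// gap encoding of all range boundaries; `index` holds one packed u32 per run
/// of small gaps: high 11 bits = start position in `offsets`, low 21 bits =
/// prefix sum (the codepoint at which that run begins).
fn skiplist_contains(c: char, index: &[u32], offsets: &[u8]) -> bool {
    const PREFIX_MASK: u32 = (1 << 21) - 1;
    let needle = c as u32;

    // Binary search the prefix sums for the run whose starting codepoint is
    // the largest one not exceeding the needle.
    let run = match index.binary_search_by_key(&needle, |e| e & PREFIX_MASK) {
        Ok(i) => i,
        Err(i) => i.saturating_sub(1),
    };
    let mut offset_idx = (index[run] >> 21) as usize;
    let mut boundary = index[run] & PREFIX_MASK;

    // Linear scan: add up small gaps until we step past the needle. Under the
    // assumption that offsets alternate between "outside a range" and "inside
    // a range" spans, the parity of the offset index we stop at is the
    // membership answer.
    while offset_idx < offsets.len() {
        let next = boundary + offsets[offset_idx] as u32;
        if needle < next {
            return offset_idx % 2 == 1;
        }
        boundary = next;
        offset_idx += 1;
    }
    false
}
```

The index exists to bound the linear scan: each run ends at one of the large gaps, so a lookup only ever adds up the small offsets of a single run.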

src/tools/unicode-table-generator/src/raw_emitter.rs

-39
@@ -1,42 +1,3 @@
-//! This implements the core logic of the compression scheme used to compactly
-//! encode the Unicode character classes.
-//!
-//! The primary idea is that we 'flatten' the Unicode ranges into an enormous
-//! bitset. To represent any arbitrary codepoint in a raw bitset, we would need
-//! over 17 kilobytes of data per character set -- way too much for our
-//! purposes.
-//!
-//! We have two primary goals with the encoding: we want to be compact, because
-//! these tables often end up in ~every Rust program (especially the
-//! grapheme_extend table, used for str debugging), including those for embedded
-//! targets (where space is important). We also want to be relatively fast,
-//! though this is more of a nice to have rather than a key design constraint.
-//! In practice, due to modern processor design these two are closely related.
-//!
-//! The encoding scheme here compresses the bitset by first deduplicating the
-//! "words" (64 bits on all platforms). In practice very few words are present
-//! in most data sets.
-//!
-//! This gives us an array that maps `u8 -> word` (if we ever went beyond 256
-//! words, we could go to u16 -> word or have some dual compression scheme
-//! mapping into two separate sets; currently this is not dealt with).
-//!
-//! With that scheme, we now have a single byte for every 64 codepoints. We
-//! further group these by some constant N (between 1 and 64 per group), and
-//! again deduplicate and store in an array (u8 -> [u8; N]). The constant is
-//! chosen to be optimal in bytes-in-memory for the given dataset.
-//!
-//! The indices into this array represent ranges of 64*16 = 1024 codepoints.
-//!
-//! This already reduces the top-level array to at most 1,086 bytes, but in
-//! practice we usually can encode in far fewer (the first couple Unicode planes
-//! are dense).
-//!
-//! The last byte of this top-level array is pulled out to a separate static
-//! and trailing zeros are dropped; this is simply because grapheme_extend and
-//! case_ignorable have a single entry in the 896th entry, so this shrinks them
-//! down considerably.
-
 use crate::fmt_list;
 use std::collections::{BTreeMap, BTreeSet, HashMap};
 use std::convert::TryFrom;
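
The removed comment's last paragraph mentions dropping trailing zeros from the top-level array (and pulling its last byte into a separate static). A small generator-side sketch of the trimming half of that idea, with illustrative names only:

```rust
/// Drop the all-zero tail of the top-level chunk-index array. No information
/// is lost as long as readers treat out-of-range entries as zero.
fn trim_trailing_zeros(top_level: &[u8]) -> &[u8] {
    let len = top_level.iter().rposition(|&b| b != 0).map_or(0, |i| i + 1);
    &top_level[..len]
}

/// Reader side: blocks past the trimmed length fall in the all-zero tail.
fn chunk_index(trimmed: &[u8], block: usize) -> u8 {
    trimmed.get(block).copied().unwrap_or(0)
}
```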
