+//! This implements the core logic of the compression scheme used to compactly
+//! encode Unicode properties.
+//!
+//! We have two primary goals with the encoding: we want to be compact, because
+//! these tables often end up in ~every Rust program (especially the
+//! grapheme_extend table, used for str debugging), including those for embedded
+//! targets (where space is important). We also want to be relatively fast,
+//! though this is more of a nice-to-have than a key design constraint.
+//! It is expected that libraries/applications which are performance-sensitive
+//! to Unicode property lookups are extremely rare, and those that care may find
+//! the tradeoff of the raw bitsets worth it. For most applications, a
+//! relatively fast but much smaller (and as such less cache-impacting, etc.)
+//! data set is likely preferable.
+//!
+//! We have two separate encoding schemes: a skiplist-like approach, and a
+//! compressed bitset. The datasets we consider mostly use the skiplist (it's
+//! smaller) but the lowercase and uppercase sets are sufficiently sparse for
+//! the bitset to be worthwhile -- for those sets the bitset is a 2x size win.
+//! Since the bitset is also faster, this seems an obvious choice. (As a
+//! historical note, the bitset was also the prior implementation, so its
+//! relative complexity had already been paid.)
+//!
+//! ## The bitset
+//!
+//! The primary idea is that we 'flatten' the Unicode ranges into an enormous
+//! bitset. To represent any arbitrary codepoint in a raw bitset, we would need
+//! roughly 136 kilobytes of data per character set (one bit for each of the
+//! 0x110000 codepoints) -- way too much for our purposes.
+//!
+//! First, the raw bitset (one bit for every valid `char`, from 0 to 0x10FFFF,
+//! not skipping the small 'gap') is chunked into 64-bit words (`u64`) and
+//! deduplicated. On random data, this would be useless; on our data, this is
+//! incredibly beneficial -- our data sets have (far) fewer than 256 unique
+//! words.
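+//!
+//! A rough sketch of this step (illustrative code, not the actual
+//! implementation; `is_in_set` stands in for whatever property predicate is
+//! being encoded):
+//!
+//! ```rust
+//! /// Build the deduplicated word list and the `chunk index -> word index`
+//! /// byte array for a property predicate.
+//! fn dedup_words(is_in_set: impl Fn(u32) -> bool) -> (Vec<u64>, Vec<u8>) {
+//!     let mut unique_words: Vec<u64> = Vec::new();
+//!     let mut chunk_to_word: Vec<u8> = Vec::new();
+//!     // One u64 word covers 64 consecutive codepoints.
+//!     for chunk_start in (0..0x11_0000u32).step_by(64) {
+//!         let mut word = 0u64;
+//!         for bit in 0..64u32 {
+//!             if is_in_set(chunk_start + bit) {
+//!                 word |= 1 << bit;
+//!             }
+//!         }
+//!         // Deduplicate: store each distinct word once, record its index.
+//!         let idx = match unique_words.iter().position(|&w| w == word) {
+//!             Some(i) => i,
+//!             None => {
+//!                 unique_words.push(word);
+//!                 unique_words.len() - 1
+//!             }
+//!         };
+//!         assert!(idx < 256, "the scheme assumes fewer than 256 unique words");
+//!         chunk_to_word.push(idx as u8);
+//!     }
+//!     (unique_words, chunk_to_word)
+//! }
+//! ```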
+//!
+//! This gives us an array that maps `u8 -> word`; the current algorithm does
+//! not handle the case of more than 256 unique words, but our data sets are
+//! nowhere near that limit.
+//!
+//! With that scheme, we now have a single byte for every 64 codepoints.
+//!
+//! We further chunk these by some constant N (between 1 and 64 per group,
+//! dynamically chosen for smallest size), and again deduplicate and store in an
+//! array (`u8 -> [u8; N]`).
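+//!
+//! A sketch of that second level, assuming the `chunk_to_word` byte array from
+//! the previous sketch (choosing N then amounts to running this for every
+//! candidate N and keeping whichever yields the smallest total size):
+//!
+//! ```rust
+//! /// Split the byte array into groups of `n` and deduplicate the groups,
+//! /// producing the distinct groups plus one index byte per group position.
+//! fn dedup_groups(chunk_to_word: &[u8], n: usize) -> (Vec<Vec<u8>>, Vec<u8>) {
+//!     let mut unique_groups: Vec<Vec<u8>> = Vec::new();
+//!     let mut group_index: Vec<u8> = Vec::new();
+//!     for group in chunk_to_word.chunks(n) {
+//!         let idx = match unique_groups.iter().position(|g| g.as_slice() == group) {
+//!             Some(i) => i,
+//!             None => {
+//!                 unique_groups.push(group.to_vec());
+//!                 unique_groups.len() - 1
+//!             }
+//!         };
+//!         group_index.push(idx as u8);
+//!     }
+//!     (unique_groups, group_index)
+//! }
+//! ```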
+//!
+//! The bytes of this array map into the words from the bitset above, but we
+//! apply another trick here: some of these words are similar enough that they
+//! can be represented by some function of another word. The particular
+//! functions chosen are rotation, inversion, and shifting (right).
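+//!
+//! A sketch of the relation being searched for -- the generator can drop a
+//! word from storage whenever some already-stored word can be transformed
+//! into it (this just tests the three permitted transformations):
+//!
+//! ```rust
+//! /// Can `word` be produced from `base` by bit rotation, bitwise inversion,
+//! /// or a right shift?
+//! fn derivable(word: u64, base: u64) -> bool {
+//!     (0..64u32).any(|r| base.rotate_right(r) == word)
+//!         || !base == word
+//!         || (1..64u32).any(|s| base >> s == word)
+//! }
+//! ```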
+//!
+//! ## The skiplist
+//!
+//! The skiplist arose out of the desire for an even smaller encoding than the
+//! bitset -- and was the answer to the question "what is the smallest
+//! representation we can imagine?". However, it is not necessarily the
+//! smallest, and if you have a better proposal, please do suggest it!
+//!
+//! This is a relatively straightforward encoding. First, we break up all the
+//! ranges in the input data into offsets from each other, essentially a gap
+//! encoding. In practice, most gaps are small -- less than `u8::MAX` -- so we
+//! store those directly. We make use of the larger gaps, which are already
+//! nicely interspersed throughout the data, to index the data set.
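+//!
+//! A sketch of the gap encoding (illustrative; the real encoding additionally
+//! has to deal with the gaps that don't fit in a byte):
+//!
+//! ```rust
+//! use std::ops::Range;
+//!
+//! /// Flatten sorted, non-overlapping ranges into the offsets between
+//! /// successive boundaries: the gap before each range, then its length.
+//! fn gap_encode(ranges: &[Range<u32>]) -> Vec<u32> {
+//!     let mut offsets = Vec::new();
+//!     let mut prev = 0u32;
+//!     for r in ranges {
+//!         offsets.push(r.start - prev); // gap before the range
+//!         offsets.push(r.end - r.start); // length of the range
+//!         prev = r.end;
+//!     }
+//!     offsets
+//! }
+//! ```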
+//!
+//! In particular, each run of small gaps (terminating in a large gap) is
+//! indexed in a separate dataset. That data set stores an index into the
+//! primary offset list and a prefix sum of that offset list. These are packed
+//! into a single u32 (11 bits for the index into the offset list, 21 bits for
+//! the prefix sum; 21 bits suffice because the largest codepoint, 0x10FFFF,
+//! fits in 21 bits).
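+//!
+//! One possible layout for that packing (a sketch; the field widths follow
+//! the description above, but the names and which half holds which field are
+//! illustrative):
+//!
+//! ```rust
+//! /// Pack an index into the offset list (11 bits) together with the sum of
+//! /// all offsets before that point (21 bits) into a single u32.
+//! fn pack_header(offset_index: u32, prefix_sum: u32) -> u32 {
+//!     assert!(offset_index < (1 << 11));
+//!     assert!(prefix_sum < (1 << 21));
+//!     (offset_index << 21) | prefix_sum
+//! }
+//!
+//! fn unpack_offset_index(header: u32) -> u32 {
+//!     header >> 21
+//! }
+//!
+//! fn unpack_prefix_sum(header: u32) -> u32 {
+//!     header & ((1 << 21) - 1)
+//! }
+//! ```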
+//!
+//! Lookup proceeds via a binary search in the index and then a straightforward
+//! linear scan (adding up the offsets) until we pass the needle; since the
+//! boundaries alternate between entering and leaving the set, the parity of
+//! the index of the offset we stop at answers whether we're in the set or not.
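+//!
+//! A sketch of that lookup, using the same simplified layout as the sketches
+//! above (the generated code is more involved, e.g. around the long gaps, and
+//! this assumes a non-empty index whose first entry covers codepoint 0):
+//!
+//! ```rust
+//! /// Is `needle` in the set described by the packed `headers` index and the
+//! /// primary `offsets` list?
+//! fn in_set(needle: u32, headers: &[u32], offsets: &[u8]) -> bool {
+//!     const PREFIX_MASK: u32 = (1 << 21) - 1;
+//!     // Binary search: the last run whose starting codepoint (the prefix
+//!     // sum) is not past the needle.
+//!     let run = headers
+//!         .partition_point(|&h| (h & PREFIX_MASK) <= needle)
+//!         .saturating_sub(1);
+//!     let mut pos = (headers[run] >> 21) as usize; // index into `offsets`
+//!     let mut boundary = headers[run] & PREFIX_MASK; // codepoint at run start
+//!     // Linear scan: add up offsets until we pass the needle. Boundaries
+//!     // alternate between entering and leaving the set, so the parity of
+//!     // the offset index we stop at is the answer.
+//!     while pos < offsets.len() {
+//!         boundary += offsets[pos] as u32;
+//!         if needle < boundary {
+//!             return pos % 2 == 1;
+//!         }
+//!         pos += 1;
+//!     }
+//!     false
+//! }
+//! ```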
+
use std::collections::{BTreeMap, HashMap};
use std::ops::Range;
use ucd_parse::Codepoints;