
Compression


This topic and its sub-topics cover the details (updated as needed) of the compression implementation.

History

Our original approach was to implement the BinaryAST file format as a simple, byte-level encoding of a pre-order walk of an AST representation of JavaScript. We found that this approach, while yielding much smaller file sizes after encoding (before any compression), compressed very poorly with gzip and brotli.

Our standard for final (post-compression) file size is size parity or better when compared against brotli-compressed minified JS representing the same code as the BinaryAST file being encoded.

Our results showed that, after compression with gzip/brotli, the naively encoded BinAST produced larger compressed outputs than the corresponding minified JS compressed with the same methods.

We tried various layer-1 "structural" compression approaches as a pre-pass to a serial compressor. While these methods indeed generated significant space savings prior to compression, their poor serial compression performance still put the final compressed size well outside the reasonable range. We also tried splitting the tree data into different streams (by data type or other classifiers) and compressing those individually with a serial compressor. This approach yielded savings, but fell short of the compression we needed.

Our goal with those approaches was to avoid writing our own entropy coder, and instead leverage existing stream compressors (gzip, brotli) to close the "last mile" of file size. If they had worked, they would have been a viable option.

Shift to Direct Entropy Coding

Frustrated with the performance of gzip and brotli on our data, I experimented with a simple entropy coder. A first "proof of concept" stab at a simple prediction model yielded significantly better results than our prior methods. This proof of concept is implemented (in somewhat messy code) in the binast/binjs-ts codebase.

This initial implementation had several deficiencies.

Firstly, it had no access to a reflected schema for the AST, and thus no ability to produce precise bounds for emitted values. This forced the proof-of-concept coder to always reserve a small "unknown symbol" possibility at every step, costing a fraction of a bit for every encoded symbol.
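As a rough illustration of that overhead (not taken from the binjs-ts code): if an escape probability p is reserved at every step, every correctly predicted symbol pays an extra -log2(1 - p) bits, and that tax is paid across every symbol in the encoded tree.

```typescript
// Illustrative only: the per-symbol cost of always reserving an
// "unknown symbol" escape probability.  With escape probability p
// reserved at every step, each correctly predicted symbol still pays
// -log2(1 - p) bits of overhead.
function escapeOverheadBits(escapeProbability: number): number {
  return -Math.log2(1 - escapeProbability);
}

// e.g. reserving a 1/32 escape slot costs ~0.046 bits per encoded symbol,
// which adds up over the many thousands of symbols in a large AST.
console.log(escapeOverheadBits(1 / 32));
```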

Secondly, all predictions used the tree context (the path from the root to the element) to predict the probability of an encoded symbol taking a particular value. While this worked well for tree-type elements and booleans, it worked very poorly for string references.

A second, more sophisticated stab at entropy coding symbols yielded results that were nipping at the heels of the highest brotli compression settings. This implementation addressed the two issues above: it used a precise reflected schema to encode the tree, and a much more precise "local cache" model for predicting string-table references. It also ran in a more sophisticated harness, analyzing a corpus of files to produce amalgamated prediction tables, computing a global string set from the corpus, and so on.

This second implementation is in the binast/binast-node repository.

Most of the design decisions in the following sections are being driven by analysis results coming from this second implementation. A number of the example snippets will be pulled from reports generated by that code.

Overall Architecture

The final architecture will likely have the following major components:

Tree Model

See Compression/TreeModel for a semi-formal description of the tree model.

Coding Algorithm

This is a largely "pluggable" piece: it can be either an arithmetic coder or an ANS coder. Their compression performance is equivalent, so higher-level algorithms can be evaluated on either. The CPU performance of the ANS coder, however, is much better than that of arithmetic coding, so ANS should be the preferred entropy coding approach for the final format.

It has a simple API: encode(Symbol, ProbabilityTable). For every symbol, the higher-level layer computes a symbol probability table for that step and passes it to the encoder along with the symbol to encode.
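To make the shape of that API concrete, here is a rough sketch of what such a pluggable coder interface could look like. The names and types below are assumptions for illustration, not the actual binjs-ts or binast-node API; symbols are represented as small integers for simplicity.

```typescript
// Hypothetical sketch of the pluggable coder interface described above.

interface ProbabilityTable {
  // Frequency (or scaled probability) assigned to a symbol at this step.
  frequency(symbol: number): number;
  // Sum of all frequencies, used for normalization.
  total(): number;
}

interface EntropyCoder {
  // Encode one symbol under the distribution chosen for this step.
  encode(symbol: number, table: ProbabilityTable): void;
  // Flush any buffered state and return the coded bytes.
  finish(): Uint8Array;
}

// The higher-level layer drives the coder one symbol at a time, so the
// distribution can change at every step (different tree contexts map to
// different tables).  Either an arithmetic or an ANS backend can sit
// behind this interface without the caller changing.
function encodeAll(coder: EntropyCoder,
                   steps: Array<[number, ProbabilityTable]>): Uint8Array {
  for (const [symbol, table] of steps) {
    coder.encode(symbol, table);
  }
  return coder.finish();
}
```

Keeping the probability model outside the coder is what makes the backend swappable: the coder only needs a symbol and a distribution, never any knowledge of the tree.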

Prediction Model

Different prediction models are used to produce probability distributions with which to encode values within the tree. These models are listed below; a rough sketch of both follows the list:

  • Compression/ContextualPredictionModel - a set of probability tables built into the encoder which predicts values in the tree by considering each value's ancestors.
  • Compression/StringWindows - a set of probability tables built into the encoder which predicts string-reference values in the tree using a fixed-size cache.
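The following is a minimal sketch of both models, with hypothetical names and structure chosen for illustration; the actual binast-node implementation may differ in its details.

```typescript
// Illustrative only: hypothetical shapes for the two prediction models.

// Contextual prediction: probability tables keyed by the tree context,
// i.e. the path of ancestor node kinds leading down to the value being
// coded.  The table for a context accumulates symbol frequencies seen
// under that context.
class ContextualModel {
  private tables: Map<string, Map<number, number>> = new Map();

  // The exact context depth is a tuning knob; two ancestors is just an
  // example here.
  tableFor(ancestorKinds: string[]): Map<number, number> {
    const key = ancestorKinds.slice(-2).join('/');
    let table = this.tables.get(key);
    if (!table) {
      table = new Map();
      this.tables.set(key, table);
    }
    return table;
  }
}

// String-window prediction: a fixed-size cache of recently referenced
// strings.  A hit can be coded cheaply as a small cache index; a miss
// falls back to a reference into the global string table, and the cache
// is updated either way.
class StringWindow {
  private window: string[] = [];
  constructor(private size: number) {}

  // Returns the cache index on a hit, or -1 on a miss.
  lookup(s: string): number {
    return this.window.indexOf(s);
  }

  // Move (or insert) the string to the front of the window, evicting the
  // oldest entry if the window is full.
  update(s: string): void {
    const idx = this.window.indexOf(s);
    if (idx >= 0) {
      this.window.splice(idx, 1);
    } else if (this.window.length >= this.size) {
      this.window.pop();
    }
    this.window.unshift(s);
  }
}
```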