spVCF compression tests

To test spVCF's potential to bend the super-linear growth of pVCF files with cohort size N, we revisited chromosome 2 pVCF files for nested subsets of 50,000 exomes from the DiscovEHR cohort sequenced at Regeneron Genetics Center. These pVCF files were generated from GATK HaplotypeCaller gVCF inputs using GLnexus, as described in our preprint. As shown in Table 2 of that manuscript, at N=50K, 96% of the ~270K pVCF sites have alternate allele frequency below 0.1%. Each pVCF cell complements the called genotype with a typical array of QC measures (GT:DP:AD:SB:GQ:PL:RNC) which account for a large majority of the file size.

spVCF has a default "Lossless" sparse encoding mode, and a "Squeeze" mode which discards most QC measures in cells reporting only reference-equivalent reads (AD=*,0), otherwise keeping them. We used spVCF v0.2.3 in both modes to encode the pVCF files for the nested cohorts numbering N=1K,5K,10K,25K,50K and compared the resulting file sizes. All file sizes are reported with generic bgzip compression, irrespective of the encoding. Spreadsheet
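The Squeeze rule described above can be sketched on a single pVCF cell. This is a minimal illustration, not spVCF's implementation: the function name `squeeze_cell`, the choice to retain only GT and DP, and the example cell values are all assumptions made here, since the text says only that "most QC measures" are discarded.

```python
def squeeze_cell(format_keys, cell, keep=("GT", "DP")):
    """Drop most QC fields from a cell whose AD reports only reference reads.

    `keep` is an assumption for illustration; the real tool may retain or
    transform fields differently.
    """
    fields = dict(zip(format_keys, cell.split(":")))
    # AD is "ref,alt1,alt2,..."; the cell is reference-only if every alt depth is 0.
    alt_depths = fields.get("AD", "").split(",")[1:]
    ref_only = bool(alt_depths) and all(d == "0" for d in alt_depths)
    if not ref_only:
        return cell  # cells with alternate-allele reads keep all QC measures
    return ":".join(fields[k] for k in format_keys if k in keep and k in fields)


FORMAT = "GT:DP:AD:SB:GQ:PL:RNC".split(":")
# Hypothetical cells, not taken from the DiscovEHR data.
print(squeeze_cell(FORMAT, "0/0:31:31,0:0,0,0,0:50:0,60,900:.."))  # 0/0:31
print(squeeze_cell(FORMAT, "0/1:28:15,13:8,7,6,7:99:300,0,350:.."))  # unchanged
```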

The "Squeeze&Decode" series shows the squeezed spVCF decoded back to dense pVCF/BCF, which lets us separate the effect of discarding QC measures from that of the sparse encoding.

We can also render these results as compression ratios:

TODO: Weissman scores

Analysis

On its own, the lossless sparse encoding offers a fair ~2X compression. This ratio climbs gradually with N, which may become important as cohorts grow.
The QC squeezing offers >5X size reduction by itself, with little loss of useful information. This seems like a no-brainer for future pVCF production, with or without sparse encoding.
The sparse encoding of squeezed pVCF further ~doubles the compression, roughly consistent with the lossless ratio.
There is evidence of synergy between the squeezing and sparse encoding, as the Squeeze compression ratio climbs more steeply with N compared to both Squeeze&Decode and Lossless. Squeezing makes the matrix more run-length encodable as N grows and sites become more closely spaced.
In the end, spVCF delivers a 15X size reduction, from 79 GiB to 5.2 GiB, for N=50K. File size still scales super-linearly, but far more gently, so the ratio is expected to climb further with N.
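To make the point about run-length encodability concrete, here is a simplified sketch of sparse-encoding one pVCF row against the previous one. It is an illustration under stated assumptions, not spVCF's actual file format: the function names, the `"` and `"N` tokens, and the example cells are chosen here for demonstration. A cell identical to the one above it is replaced by a marker, and consecutive markers collapse into a single run.

```python
def sparse_encode_row(prev_row, row):
    """Replace cells identical to the previous row with run-length markers."""
    out, run = [], 0

    def flush():
        # Emit '"' for a single repeat, '"N' for a run of N repeats.
        if run:
            out.append('"' if run == 1 else f'"{run}')

    for prev, cur in zip(prev_row, row):
        if cur == prev:
            run += 1
            continue
        flush()
        run = 0
        out.append(cur)
    flush()
    return out


def sparse_decode_row(prev_row, tokens):
    """Invert sparse_encode_row by copying marked cells from the previous row."""
    row = []
    for tok in tokens:
        if tok.startswith('"'):
            n = int(tok[1:] or 1)
            row.extend(prev_row[len(row):len(row) + n])
        else:
            row.append(tok)
    return row


# Hypothetical squeezed cells for five samples at two adjacent sites.
prev_row = ["0/0:30", "0/0:30", "0/0:30", "0/1:28", "0/0:16"]
row = ["0/0:30", "0/0:30", "0/0:30", "0/0:25", "0/0:16"]
encoded = sparse_encode_row(prev_row, row)
print(encoded)  # ['"3', '0/0:25', '"']
assert sparse_decode_row(prev_row, encoded) == row
```

In this toy setting, cells that carry fewer distinct QC values match the row above more often, so runs get longer; that is the intuition behind the synergy between squeezing and sparse encoding noted above.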

Using our spreadsheet's regression (i.e. not to be taken seriously), the predicted file size for N=1,000,000 is 8 TiB with vcf.gz and 151 GiB with Squeezed spvcf.gz, a >50X reduction.
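Because the two extrapolated sizes are quoted in different units, the ratio is easy to misread. A small check, assuming binary units (1 TiB = 1024 GiB) and using only the two figures quoted above:

```python
GIB_PER_TIB = 1024

vcf_gz_gib = 8 * GIB_PER_TIB   # predicted vcf.gz size at N=1,000,000
squeezed_gib = 151             # predicted Squeezed spvcf.gz size

ratio = vcf_gz_gib / squeezed_gib
print(f"{ratio:.1f}X")  # 54.3X
```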

Test with 23K WGS

We tested spVCF on a pVCF file representing ~960K SNV sites on an 18 Mbp segment from ~23K WGS, joint-called with GLnexus from gVCF files generated by BCM-HGSC using xAtlas.

This 111.5 GiB vcf.gz compressed to 58.8 GiB Lossless spvcf.gz (1.9X), 9.9 GiB Squeeze spvcf.gz (11.2X), and 19.7 GiB Squeeze&Decode vcf.gz (5.7X). These ratios are roughly in line with the DiscovEHR WES trends for similar N, suggesting robustness to the WES/WGS setting and the different gVCF variant callers.
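The ratios can be recomputed from the sizes quoted above. Since those sizes are rounded to one decimal place, the last digit may differ slightly from the reported ratios. The variable names are chosen here for illustration.

```python
baseline_gib = 111.5  # original vcf.gz

sizes_gib = {
    "Lossless spvcf.gz": 58.8,
    "Squeeze spvcf.gz": 9.9,
    "Squeeze&Decode vcf.gz": 19.7,
}

# Compression ratio of each encoding relative to the original vcf.gz.
ratios = {name: baseline_gib / size for name, size in sizes_gib.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.2f}X")

# Extra gain from sparse encoding applied on top of squeezing.
sparse_gain = sizes_gib["Squeeze&Decode vcf.gz"] / sizes_gib["Squeeze spvcf.gz"]
print(f"Sparse encoding on squeezed data: {sparse_gain:.2f}X")
```

The last figure comes out close to 2X, in line with the roughly twofold gain from sparse encoding seen in the exome analysis.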
