LZIP LZMA benchmarks Tar archives

LZIP is a top choice for reliable, robust long-term archiving of data. With any compression algorithm, the defaults are often not the best choice for very large datasets as encountered in radio science or geoscience in general.

Lzip for large datasets: LZIP options we’ve used for large datasets (here with file extension .bin).

Create a file MANIFEST with a list of files to archive. You can do this manually or with a find command, like:

find . -type f -name '*.bin' > MANIFEST

Create a checksum of the files

nice sha256sum $(< MANIFEST) > SHA256SUM
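The `$(< MANIFEST)` expansion splits on whitespace, so file names containing spaces will break it, and a very long list can exceed the shell's argument-length limit. A minimal sketch of a more defensive variant follows. It assumes GNU find, sort, and xargs (for `-print0`, `-z`, `-0`, and `-r`), and the function name `checksum_bins` is made up for illustration:

```shell
# Sketch: checksum every *.bin file under a directory, tolerating spaces in names.
# Assumes GNU find/sort/xargs; checksum_bins is a hypothetical helper name.
checksum_bins() {
  local dir="${1:-.}"
  local out="${2:-SHA256SUM}"
  # NUL-delimited names survive spaces; sort -z keeps the output order stable;
  # xargs batches the names so the argument-length limit is never exceeded.
  find "$dir" -type f -name '*.bin' -print0 | sort -z | xargs -0 -r nice sha256sum > "$out"
}
```

Calling `checksum_bins . SHA256SUM` from the dataset directory should give a file that `sha256sum -c SHA256SUM` can check later from the same directory.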

Zip up the files into filename.tar.lz

tar cvf - $(< MANIFEST) | plzip -0 > filename.tar.lz
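Before deleting any originals, it is worth confirming that the archive round-trips. The sketch below is one way to do that, not a definitive procedure. It assumes GNU tar and coreutils, and relies on lzip's `-t` (test), `-d` (decompress), and `-c` (write to stdout) options, which plzip shares. `verify_archive` and the `LZIP_CMD` override variable are hypothetical names:

```shell
# Sketch: check an lzip-compressed tar archive against a SHA256SUM file.
# verify_archive is a made-up helper; LZIP_CMD is a hypothetical override (default: plzip).
# Pass the checksum file as an ABSOLUTE path, because the check runs in a scratch directory.
verify_archive() {
  local archive="$1"
  local sums="$2"
  local lz="${LZIP_CMD:-plzip}"
  local scratch status
  # 1. Integrity test of the compressed stream itself.
  "$lz" -t "$archive" || return 1
  # 2. Extract into a scratch directory and re-check every checksum there.
  scratch=$(mktemp -d) || return 1
  if "$lz" -d -c "$archive" | tar -x -f - -C "$scratch"; then
    ( cd "$scratch" && sha256sum -c --quiet "$sums" )
    status=$?
  else
    status=1
  fi
  rm -rf "$scratch"
  return "$status"
}
```

For example, `verify_archive filename.tar.lz "$PWD/SHA256SUM"`. Note that the second step needs enough free disk space to hold the extracted dataset; for very large archives the `-t` test alone may be the practical choice.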

Lzip a single large file: create my.bin.lz from my.bin.

plzip -k -0 my.bin
-k
keep the original file (do NOT have lzip delete it)

LZIP options: plzip is the multithreaded version of lzip. It uses all the virtual cores of your CPU and, on large files, can run roughly N times faster than lzip, where N is the number of physical CPU cores.

The -0 option compresses the files down to 30-50% of their original size while being as fast as possible. The benchmarks below show that greatly increased CPU time doesn't compress much more.

tar -I didn’t work for lzip: for some reason, on my PC, the -I 'lzip -0' option of tar doesn’t have any effect; it appears to use the -9 option of lzip regardless. Piping tar into plzip, as above, avoids the problem.

LZIP benchmarks: for a 106.9 MByte 16-bit software defined radio dataset (a short test file) I measured the results in the table below. It’s immediately evident that for large, high-entropy files (noisy natural geoscience data), very low compression settings are appropriate. I saw similar results with LZMA compression options on large datasets of geoscience auroral video.

It may be possible to eke out further improvements by tuning the dictionary size and match length options, if someone has an extremely large noisy dataset compression problem (e.g. CERN).

lzip option   Compression ratio   Time (seconds)
-0            0.471               5.6
-1            0.448               18.7
-2            0.447               30.8
-6            0.407               95.2
-9            0.400               116.2
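A table like the one above can be regenerated for your own data with a short loop. This is a rough sketch rather than the script used for the numbers above. It assumes GNU date (for `%N` nanoseconds) and awk, takes the ratio to be compressed size divided by original size, and uses the hypothetical names `bench_lzip` and `LZIP_CMD`:

```shell
# Sketch: time several lzip levels on one file and print ratio = compressed / original.
# bench_lzip is a made-up helper; LZIP_CMD is a hypothetical override (default: lzip).
bench_lzip() {
  local file="$1"
  local lz="${LZIP_CMD:-lzip}"
  local orig comp start end level
  orig=$(wc -c < "$file")
  printf '%-6s %-8s %s\n' level ratio seconds
  for level in 0 1 2 6 9; do
    start=$(date +%s.%N)
    # -c writes to stdout, so the input file is left untouched
    comp=$("$lz" "-$level" -c "$file" | wc -c)
    end=$(date +%s.%N)
    awk -v l="-$level" -v c="$comp" -v o="$orig" -v s="$start" -v e="$end" \
      'BEGIN { printf "%-6s %-8.3f %.1f\n", l, c / o, e - s }'
  done
}
```

Run it as `bench_lzip my.bin`, or as `LZIP_CMD=plzip bench_lzip my.bin` to time the multithreaded tool. Wall-clock times depend heavily on disk speed and core count, so compare levels on the same machine only.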

Compression of very noisy datasets: why do high compression settings often offer so little advantage on noisy geoscience datasets? At the most basic level, lossless compression is about finding redundancies in the files: self-similarities, autocorrelation, and the like. Nature is an incredibly powerful random number generator, which is the opposite of what compression algorithms need.

In contrast to the high-SNR image and text data used by most of the populace, scientists (and geoscientists in particular) have instruments that combine a very large dynamic range with high sensitivity. For the radio science and scientific camera domains (two areas of my expertise), this typically means 16-bit high-speed ADCs where, most of the time, several of the high bits are uniformly zero and the rest of the bits are highly random, with a slowly changing bias value.

In practical terms, a trivial lossless compression algorithm eliminates those high bits that are so often zero, but even a very advanced lossless algorithm will struggle to gain much more compression for the extra CPU cycles on typical remote sensing datasets.
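To see how much the "trivial" step alone could buy on a given file, one can count how many bits the samples actually occupy. The sketch below is illustrative only. It assumes a headerless file of unsigned 16-bit samples in the host's byte order, and `bits_used` is a hypothetical helper name:

```shell
# Sketch: estimate how many of the 16 bits the samples actually use.
# Assumes raw unsigned 16-bit samples in host byte order, no file header.
# bits_used is a made-up helper name.
bits_used() {
  od -An -v -tu2 "$1" | awk '
    { for (i = 1; i <= NF; i++) if ($i > max) max = $i }   # largest sample seen
    END {
      bits = 0
      while (max > 0) { bits++; max = int(max / 2) }       # bits needed for that value
      printf "%d bits used, packing floor ratio %.3f\n", bits, bits / 16
    }'
}
```

If the largest sample needs only 12 bits, simply dropping the four always-zero high bits gives 12/16 = 0.75 of the original size. Any compression beyond that has to come from structure in the remaining, mostly random bits.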