LZIP LZMA benchmarks Tar archives
LZIP is a top choice for reliable, robust long-term archiving of data. With any compression algorithm, the defaults are often not the best choice for very large datasets as encountered in radio science or geoscience in general.
Lzip for large datasets: LZIP options we’ve used for large datasets (here with file extension .bin).
Create a file MANIFEST with a list of files to archive. You can do this manually or with a find command, like:
find . -name '*.bin' > MANIFEST
Create a checksum of the files
sha256sum $(< MANIFEST) > SHA256SUM
Zip up the files into filename.tar.lz
tar cvf - $(< MANIFEST) | plzip -0 > filename.tar.lz
Lzip single large file: create my.bin.lz from my.bin.
plzip -k -0 my.bin
-k tells plzip NOT to delete the original file.
LZIP options: plzip is the multithreaded version of lzip. It uses all the virtual cores of your CPU, and so can run up to about N times faster, where N is the number of physical CPU cores.
The -0 option compresses the files down to 30-50 % of their original size while being as fast as possible. The benchmark observations below show that greatly increased CPU time doesn’t help compress much more.
tar -I didn’t work for lzip: sometimes the -I 'lzip -0' option of tar doesn’t have any effect; it uses the -9 option of lzip regardless. Piping tar into plzip explicitly, as in the command above, avoids the problem.
LZIP benchmarks: for a 106.9 MByte 16-bit software defined radio dataset (a short test file) I found the results in the table below. It’s immediately evident that for large, high-entropy data (noisy natural geoscience data), very low compression settings are appropriate. Similar results hold for LZMA compression options for large datasets of geoscience auroral video.
It may be possible to squeeze out further improvements by tuning the dictionary size and match length options, if someone has an extremely large noisy dataset compression problem (e.g. CERN).
lzip option (-N) | Compression ratio (compressed / original size) | Time (seconds) |
---|---|---|
0 | 0.471 | 5.6 |
1 | 0.448 | 18.7 |
2 | 0.447 | 30.8 |
6 | 0.407 | 95.2 |
9 | 0.400 | 116.2 |
Compression of very noisy datasets: why is there often little advantage in high compression settings for noisy geoscience datasets? At the most basic level, lossless compression is about finding redundancies in the files: self-similarities, autocorrelation, and the like. Nature is an incredibly powerful random number generator, the opposite of what compression algorithms need. In contrast to the high-SNR image and text data used by most of the populace, scientists (and geoscientists in particular) have instruments that use a very large dynamic range with high sensitivity. For the radio science and scientific camera domains, this typically means 16-bit high speed ADCs where, most of the time, several of the high bits are uniformly zero and the rest of the bits are highly random, with a slowly changing bias value.
In practical terms, a trivial lossless compression algorithm eliminates those high bits that are so often zero, but even a very advanced lossless algorithm will have trouble getting further compression benefit vs. CPU cycles on typical remote sensing datasets.