NumPy can't read .zip files
ZIP, GZ and similar formats can be a quick-and-dirty way to compress individual data files for retrieval from remote sensors.
In particular, the GeoRinex program has extensive capabilities for transparently (without extracting to an uncompressed file) reading .zip, .z, .gz, etc. compressed text files, which benefit greatly from storage space savings.
It was surprising to find that transparently processing similarly compressed binary data is not trivial, particularly with numpy.fromfile.
NumPy has unresolved bugs with numpy.fromfile that preclude easy use with inline reading via zipfile.ZipFile or tarfile. Specifically, the file objects returned by zipfile and tarfile do not provide a usable .fileno(), and numpy.fromfile() relies on .fileno() among other attributes.
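The missing file descriptor can be seen directly. The following is a minimal sketch, not a reproduction of the NumPy bug reports: it builds a small ZIP archive in memory (the member name `data.bin` and the float32 sample values are made up for illustration) and asks the opened member for its file descriptor.

```python
import io
import zipfile

import numpy as np

# Build a small ZIP archive in memory holding raw float32 samples.
# "data.bin" is a hypothetical member name used only for this illustration.
samples = np.arange(5, dtype=np.float32)
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as z:
    z.writestr("data.bin", samples.tobytes())

# Open the compressed member as a file-like object and ask for its descriptor.
with zipfile.ZipFile(archive) as z, z.open("data.bin") as member:
    try:
        member.fileno()
        has_fileno = True
    except io.UnsupportedOperation:
        # The member is decompressed on the fly, so fileno() is unsupported.
        has_fileno = False

print("usable file descriptor:", has_fileno)
```

Because the member object cannot supply a descriptor, passing it to numpy.fromfile() is not expected to work.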
numpy.frombuffer is not generally suitable for this application either, because it does not advance the position of the buffer it reads from, so the caller has to track offsets manually. Workarounds exist, but we chose a more generally beneficial path.
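For completeness, one possible workaround is sketched here. It assumes the wanted part of the member fits in memory. The helper `read_member` and its parameters are hypothetical names for this illustration, not part of NumPy or zipfile. The position is advanced by reading from the ZIP member itself, and numpy.frombuffer only converts the bytes that come back.

```python
import io
import zipfile

import numpy as np


def read_member(zip_source, member_name, dtype, count=-1, skip_items=0):
    """Read `count` items of `dtype` from a ZIP member after skipping `skip_items`.

    Hypothetical helper: a negative `count` reads to the end of the member.
    """
    dtype = np.dtype(dtype)
    with zipfile.ZipFile(zip_source) as z, z.open(member_name) as member:
        if skip_items:
            # Compressed members are read sequentially, so skipped bytes
            # must be read and thrown away.
            member.read(skip_items * dtype.itemsize)
        if count < 0:
            raw = member.read()
        else:
            raw = member.read(count * dtype.itemsize)
    # frombuffer wraps the bytes without copying; the result is read-only.
    return np.frombuffer(raw, dtype=dtype)


# Usage with an in-memory archive holding ten float32 values.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as z:
    z.writestr("data.bin", np.arange(10, dtype=np.float32).tobytes())

subset = read_member(archive, "data.bin", np.float32, count=3, skip_items=2)
print(subset)
```

Skipping still decompresses everything before the requested items, which is one reason this approach scales poorly to large files.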
Use HDF5
When raw data files need to be compressed and then later analyzed, we use HDF5. Even when the original program writing the raw binary data cannot be modified, a simple post-processing Python script using h5py can read the raw data and convert it to losslessly compressed HDF5 on the sensor. Then, when the data is analyzed, out-of-core processing can be used, or at least the whole file doesn't have to be read to retrieve data from an arbitrary location in the HDF5 file. This gives nearly all of the size and speed advantages of HDF5 without modifying the original program.
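A minimal sketch of such a post-processing step follows. It assumes the raw file is a flat array of a single known dtype that fits in memory on the sensor. The function names, the dataset name `raw`, the file names and the gzip-plus-shuffle filter choice are all illustrative assumptions, not the script actually used.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np


def raw_to_hdf5(raw_path, h5_path, dtype, name="raw"):
    """Copy a flat binary file into a chunked, gzip-compressed HDF5 dataset."""
    # The raw file is an ordinary file on disk, so numpy.fromfile works here.
    data = np.fromfile(raw_path, dtype=dtype)
    with h5py.File(h5_path, "w") as h5:
        # gzip and shuffle are lossless; chunks=True lets h5py pick a chunk shape.
        h5.create_dataset(name, data=data, chunks=True,
                          compression="gzip", shuffle=True)


def read_slice(h5_path, start, stop, name="raw"):
    """Read only items start:stop; HDF5 decompresses just the chunks needed."""
    with h5py.File(h5_path, "r") as h5:
        return h5[name][start:stop]


# Usage: write a stand-in "sensor" file, convert it, then read a small slice.
with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp) / "sensor.bin"   # hypothetical file names
    h5_file = Path(tmp) / "sensor.h5"
    np.arange(100_000, dtype=np.int32).tofile(raw_file)
    raw_to_hdf5(raw_file, h5_file, np.int32)
    part = read_slice(h5_file, 5000, 5005)

print(part)
```

For raw files larger than memory, the same idea could be applied block by block into a resizable dataset. That variant is not shown here.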