Abstract
hdf5plugin is a Python package (1) providing a set of HDF5 compression filters (namely: blosc, bitshuffle, lz4, FCIDECOMP, ZFP, Zstandard) and (2) enabling their use from the Python programming language with h5py, a thin, pythonic wrapper around libHDF5.
This presentation illustrates how to use hdf5plugin for reading and writing compressed datasets from Python and gives an overview of the different HDF5 compression filters it provides.
It also illustrates how the provided compression filters can be enabled to read compressed datasets from other (non-Python) applications.
Finally, it discusses how hdf5plugin manages to distribute the HDF5 plugins for reuse with different libHDF5 versions.
License: CC-BY 4.0
hdf5plugin
hdf5plugin packages a set of HDF5 compression filters (namely: blosc, bitshuffle, lz4, FCIDECOMP, ZFP, Zstandard) and makes them usable from the Python programming language through h5py.
h5py is a thin, pythonic wrapper around HDF5.
Presenter: Thomas VINCENT
European HDF Users Group Summer 2021, July 7-8, 2021
[2]:
from h5glance import H5Glance # Browsing HDF5 files
H5Glance("data.h5")
[2]:
[3]:
import h5py # Pythonic HDF5 wrapper: https://docs.h5py.org/
h5file = h5py.File("data.h5", mode="r") # Open HDF5 file in read mode
data = h5file["/data"][()] # Access HDF5 dataset "/data"
plt.imshow(data); plt.colorbar() # Display data
[3]:
<matplotlib.colorbar.Colorbar at 0x119479358>
[4]:
data = h5file["/compressed_data_bitshuffle_lz4"][()] # Access compressed dataset
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-4-4bb532391a0f> in <module>
----> 1 data = h5file["/compressed_data_bitshuffle_lz4"][()] # Access compressed dataset
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
~/venv/py37env/lib/python3.7/site-packages/h5py/_hl/dataset.py in __getitem__(self, args, new_dtype)
760 mspace = h5s.create_simple(selection.mshape)
761 fspace = selection.id
--> 762 self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
763
764 # Patch up the output for NumPy
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5d.pyx in h5py.h5d.DatasetID.read()
h5py/_proxy.pyx in h5py._proxy.dset_rw()
OSError: Can't read data (can't open directory: /usr/local/hdf5/lib/plugin)
hdf5plugin
usage
Reading compressed datasets
To enable reading compressed datasets not supported by libHDF5 and h5py: install hdf5plugin & import it.
[ ]:
%%bash
pip3 install hdf5plugin
Or: conda install -c conda-forge hdf5plugin
[5]:
import hdf5plugin
[6]:
data = h5file["/compressed_data_bitshuffle_lz4"][()] # Access dataset
plt.imshow(data); plt.colorbar() # Display data
[6]:
<matplotlib.colorbar.Colorbar at 0x11bc666d8>
[7]:
h5file.close() # Close the HDF5 file
Writing compressed datasets
When writing datasets with h5py, compression can be specified with h5py.Group.create_dataset:
[8]:
# Create a dataset with h5py without compression
h5file = h5py.File("new_file_uncompressed.h5", mode="w")
h5file.create_dataset("/data", data=data)
h5file.close()
[9]:
# Create a compressed dataset
h5file = h5py.File("new_file_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
"/compressed_data_bitshuffle_lz4",
data=data,
compression=32008, # bitshuffle/lz4 HDF5 filter identifier
compression_opts=(0, 2) # options: default number of elements/block, enable LZ4
)
h5file.close()
hdf5plugin provides some helpers to ease dealing with compression filters and options:
[10]:
h5file = h5py.File("new_file_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
"/compressed_data_bitshuffle_lz4",
data=data,
**hdf5plugin.Bitshuffle() # Or: **hdf5plugin.Bitshuffle(lz4=True)
)
h5file.close()
[ ]:
hdf5plugin.Bitshuffle?
[12]:
H5Glance("new_file_bitshuffle_lz4.h5")
[12]:
- compressed_data_bitshuffle_lz4 [📋]: 1969 × 2961 entries, dtype: uint8
[13]:
h5file = h5py.File("new_file_bitshuffle_lz4.h5", mode="r")
plt.imshow(h5file["/compressed_data_bitshuffle_lz4"][()]); plt.colorbar()
h5file.close()
[14]:
!ls -l new_file*.h5
-rw-r--r-- 1 tvincent staff 4278852 Jul 8 14:25 new_file_bitshuffle_lz4.h5
-rw-r--r-- 1 tvincent staff 5832257 Jul 8 14:24 new_file_uncompressed.h5
HDF5 compression filters
Available through h5py
Compression filters provided by h5py:
Provided by libHDF5: "gzip" and, optionally, "szip"
Bundled with h5py: "lzf"
Pre-compression filter: Byte-Shuffle
[ ]:
h5file = h5py.File("new_file_shuffle_gzip.h5", mode="w")
h5file.create_dataset(
"/compressed_data_shuffle_gzip", data=data, shuffle=True, compression="gzip")
h5file.close()
Provided by hdf5plugin
Additional compression filters provided by hdf5plugin: Bitshuffle, Blosc, FciDecomp, LZ4, ZFP, Zstandard.
6 out of the 25 HDF5 registered filter plugins as of June 2021.
[ ]:
h5file = h5py.File("new_file_blosc.h5", mode="w")
h5file.create_dataset(
"/compressed_data_blosc",
data=data,
**hdf5plugin.Blosc(cname='zlib', clevel=5, shuffle=hdf5plugin.Blosc.SHUFFLE)
)
h5file.close()
General purpose lossless compression
Bitshuffle(nelems=0, lz4=True) (Filter ID 32008): Bit-Shuffle + LZ4
LZ4(nbytes=0) (Filter ID 32004)
Blosc(cname='lz4', clevel=5, shuffle=1) (Filter ID 32001): Based on c-blosc: a blocking, shuffling and lossless compression library.
Pre-compression shuffle: None, Byte-Shuffle, Bit-Shuffle
Compression: blosclz, lz4, lz4hc, snappy (optional, requires C++11), zlib, zstd
Specific compression
FciDecomp() (Filter ID 32018): Based on JPEG-LS
Optional: requires C++11
Data type: (u)int8 or (u)int16
Chunk shape: "image-like"; 2 or 3 dimensions with at least 16 pixels, at most 65535 rows and columns, and at most 4 planes for 3D datasets.
ZFP(rate=None, precision=None, accuracy=None, reversible=False, minbits=None, maxbits=None, maxprec=None, minexp=None) (Filter ID 32013): Lossy
Data type: float32, float64, (u)int32, (u)int64
Chunk shape: must have at most 4 non-unity dimensions
Benchmark
Machine: 6 cores+hyperthreading (Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz)
Filesystem: RAM disk
HDF5 chunk: 1 frame
hdf5plugin built from source
Diffraction tomography dataset: 100 frames from http://www.silx.org/pub/pyFAI/pyFAI_UM_2020/data_ID13/kevlar.h5
Equivalent filters
Blosc includes pre-compression filters and algorithms provided by other HDF5 compression filters:
LZ4() => Blosc("lz4", 9)
Zstd() => Blosc("zstd", 2)
HDF5 shuffle => Blosc with shuffle=hdf5plugin.Blosc.SHUFFLE
Bitshuffle() => Blosc("lz4", 5, hdf5plugin.Blosc.BITSHUFFLE)
…except for OpenMP support with Bitshuffle!
Summary
Having different pre-compression filters and compression algorithms at hand offers different trade-offs between read/write speed and compression rate (and possibly error rate).
Also keep in mind availability/compatibility: "gzip" as included in libHDF5 is the most widely compatible one (as is "lzf" bundled with h5py).
Using hdf5plugin filters with other applications
Note: In a notebook, prefixing a command with ! runs it as a shell command.
[15]:
!h5dump -d /compressed_data_bitshuffle_lz4 -s "0,0" -c "5,10" data.h5
HDF5 "data.h5" {
DATASET "/compressed_data_bitshuffle_lz4" {
DATATYPE H5T_STD_U8LE
DATASPACE SIMPLE { ( 1969, 2961 ) / ( 1969, 2961 ) }
SUBSET {
START ( 0, 0 );
STRIDE ( 1, 1 );
COUNT ( 5, 10 );
BLOCK ( 1, 1 );
DATA {
}
}
}
}
A solution: set the HDF5_PLUGIN_PATH environment variable to hdf5plugin.PLUGINS_PATH
[ ]:
# Directory where HDF5 compression filters are stored
hdf5plugin.PLUGINS_PATH
[ ]:
# Retrieve hdf5plugin.PLUGINS_PATH from the command line
!python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)"
[19]:
!ls `python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)"`
libh5blosc.dylib libh5fcidecomp.dylib libh5zfp.dylib
libh5bshuf.dylib libh5lz4.dylib libh5zstd.dylib
[20]:
# Set HDF5_PLUGIN_PATH environment variable to hdf5plugin.PLUGINS_PATH
!HDF5_PLUGIN_PATH=`python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)"` h5dump -d /compressed_data_bitshuffle_lz4 -s "0,0" -c "5,10" data.h5
HDF5 "data.h5" {
DATASET "/compressed_data_bitshuffle_lz4" {
DATATYPE H5T_STD_U8LE
DATASPACE SIMPLE { ( 1969, 2961 ) / ( 1969, 2961 ) }
SUBSET {
START ( 0, 0 );
STRIDE ( 1, 1 );
COUNT ( 5, 10 );
BLOCK ( 1, 1 );
DATA {
(0,0): 53, 52, 53, 54, 54, 55, 55, 56, 56, 57,
(1,0): 49, 50, 54, 55, 53, 54, 55, 56, 56, 58,
(2,0): 50, 51, 54, 54, 53, 55, 56, 57, 58, 57,
(3,0): 51, 54, 55, 54, 54, 55, 56, 57, 58, 59,
(4,0): 53, 55, 54, 54, 56, 56, 58, 57, 57, 58
}
}
}
}
Note: Only works for reading compressed datasets, not for writing!
Insights
The Problem
For reading compressed datasets, compression filters do NOT need information from libHDF5: they work on the compressed stream.
For writing compressed datasets, some information about the dataset (e.g., data type size) can be needed by the filter (e.g., to shuffle the data). This information is retrieved through the libHDF5 C-API (e.g., H5Tget_size).
Access to the libHDF5 C-API is thus needed, but linking compression filters with libHDF5 is cumbersome in a dynamic environment like Python.
On Windows
Symbols from dynamically loaded Python modules and libraries are accessible to others.
Register compression filters at C-level with H5Zregister (see src/register_win32.c).
On Linux, macOS
In Python, symbols from dynamically loaded modules and libraries are NOT visible to others.
Do not link filters with libHDF5. Instead, provide function wrappers replacing the libHDF5 C-API and link the compression filters against those. The wrappers call the corresponding libHDF5 functions, which are dynamically loaded at runtime.
At runtime, the compression filter must be initialized to load the symbols dynamically from the libHDF5 used by h5py and use them from the function wrappers.
#include <dlfcn.h> /* dlopen, dlsym */

typedef size_t (*DL_func_H5Tget_size)(hid_t type_id);

static struct { /* Structure storing HDF5 function pointers */
    DL_func_H5Tget_size H5Tget_size;
} DL_H5Functions = {NULL};

/* Init wrappers by loading symbols from libHDF5 */
int init_filter(const char* libname) {
    void *handle = dlopen(libname, RTLD_LAZY | RTLD_LOCAL); /* Load libHDF5 */
    if (handle == NULL)
        return -1;
    DL_H5Functions.H5Tget_size =
        (DL_func_H5Tget_size)dlsym(handle, "H5Tget_size");
    return DL_H5Functions.H5Tget_size != NULL ? 0 : -1;
}

/* H5Tget_size libHDF5 C-API wrapper */
size_t H5Tget_size(hid_t type_id) {
    if (DL_H5Functions.H5Tget_size != NULL) {
        return DL_H5Functions.H5Tget_size(type_id);
    } else {
        return 0;
    }
}
Concluding remark
In the event the HDF5 compression filter API evolves, it would be great to take this into account to ease distribution of compression filters.
A word about hdf5plugin
license
The source code of hdf5plugin
itself is licensed under the MIT license…
It also embeds the source code of the provided compression filters and libraries which are licensed under different open-source licenses (Apache, BSD-2, BSD-3, MIT, Zlib…) and copyrights.
Conclusion
hdf5plugin provides additional HDF5 compression filters (namely: Bitshuffle, Blosc, FciDecomp, LZ4, ZFP, Zstandard), mainly, but not only, for use with h5py.
Documentation: http://www.silx.org/doc/hdf5plugin/latest/
Source code repository: https://github.com/silx-kit/hdf5plugin
Credits to the contributors: Thomas Vincent, Armando Sole, @Florian-toll, @fpwg, Jerome Kieffer, @Anthchirp, @mobiusklein, @junyuewang
Partially funded by the PaNOSC EU-project.
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 823852.