Abstract

hdf5plugin is a Python package that (1) provides a set of HDF5 compression filters (namely: blosc, bitshuffle, lz4, FCIDECOMP, ZFP, Zstandard) and (2) enables their use from the Python programming language with h5py, a thin, pythonic wrapper around libHDF5.

This presentation illustrates how to use hdf5plugin for reading and writing compressed datasets from Python and gives an overview of the different HDF5 compression filters it provides.

License: CC-BY 4.0

Restart kernel once the file is created!

[ ]:

import os
os._exit(0)  # Makes the kernel restart

hdf5plugin

hdf5plugin packages a set of HDF5 compression filters (namely: blosc, bitshuffle, lz4, FCIDECOMP, ZFP, Zstandard) and makes them usable from the Python programming language through h5py.

h5py is a thin, pythonic wrapper around HDF5.

Presenter: Thomas VINCENT

European HDF5 User Group Meeting 2022, May 31, 2022

[2]:

from h5glance import H5Glance  # Browsing HDF5 files
H5Glance("data.h5")

[2]:

data.h5
- compressed_data [📋]: 1542 × 2500 entries, dtype: uint8
- copyright [📋]: scalar entries, dtype: UTF-8 string
- data [📋]: 1542 × 2500 entries, dtype: uint8

[3]:

import h5py  # Pythonic HDF5 wrapper: https://docs.h5py.org/
from matplotlib.pyplot import imshow  # Image display (assumed to be matplotlib's imshow)

h5file = h5py.File("data.h5", mode="r")  # Open HDF5 file in read mode
data = h5file["/data"][()]               # Access HDF5 dataset "/data"
imshow(data)                             # Display data

_images/hdf5plugin_EuropeanHUG2022_9_0.png

[4]:

data = h5file["/compressed_data"][()]  # Access compressed dataset

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 data = h5file["/compressed_data"][()]

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File ~/venv/ub20.04/lib/python3.8/site-packages/h5py/_hl/dataset.py:741, in Dataset.__getitem__(self, args, new_dtype)
    739 if self._fast_read_ok and (new_dtype is None):
    740     try:
--> 741         return self._fast_reader.read(args)
    742     except TypeError:
    743         pass  # Fall back to Python read pathway below

File h5py/_selector.pyx:370, in h5py._selector.Reader.read()

OSError: Can't read data (can't open directory: /usr/local/hdf5/lib/plugin)

[5]:

# Check dataset's filters
plist = h5file["/compressed_data"].id.get_create_plist()
plist.get_filter(0)[0::3]

[5]:

(32001, b'blosc')
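
The check above reads only the first filter of the pipeline. A minimal sketch that lists every filter of a dataset is given below. It assumes h5py and numpy are installed; the helper name `list_filters` and the temporary file are illustrative only, and the dataset uses filters shipped with libHDF5 so that no plugin is needed to run it.

```python
import os
import tempfile

import h5py
import numpy


def list_filters(dataset):
    """Return a (filter_id, name) pair for each filter of the dataset."""
    plist = dataset.id.get_create_plist()
    # get_filter(i) returns (filter_id, flags, options, name):
    # [0::3] keeps the identifier and the name, as in the cell above.
    return [plist.get_filter(i)[0::3] for i in range(plist.get_nfilters())]


with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "filters.h5")

    # Write a small dataset with byte-shuffle + gzip
    with h5py.File(filename, mode="w") as h5file:
        h5file.create_dataset(
            "data",
            data=numpy.zeros((16, 16), dtype=numpy.uint16),
            shuffle=True,
            compression="gzip",
        )

    # Re-open the file and inspect the filter pipeline
    with h5py.File(filename, mode="r") as h5file:
        filters = list_filters(h5file["data"])

print(filters)
```

A filter identifier that is not one of those built into libHDF5 (such as 32001 above) indicates that a plugin is required to read the dataset.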

`hdf5plugin` usage

Reading compressed datasets

To enable reading compressed datasets not supported by libHDF5 and h5py: Install hdf5plugin & import it.

[ ]:

%%bash
pip3 install hdf5plugin

Or: conda install -c conda-forge hdf5plugin

[6]:

import hdf5plugin

[7]:

data = h5file["/compressed_data"][()]  # Access dataset
imshow(data)                           # Display data

_images/hdf5plugin_EuropeanHUG2022_17_0.png

[8]:

h5file.close()  # Close the HDF5 file

Writing compressed datasets

When writing datasets with h5py, compression is specified through the compression and compression_opts arguments of h5py.Group.create_dataset:

[9]:

# Create a dataset with h5py without compression
h5file = h5py.File("new_file_uncompressed.h5", mode="w")
h5file.create_dataset("/data", data=data)
h5file.close()

[10]:

# Create a compressed dataset
h5file = h5py.File("new_file_blosc_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
    "/compressed_data",
    data=data,
    compression=32001,  # blosc HDF5 filter identifier
    # options: 0, 0, 0, 0, level, shuffle, compression
    compression_opts=(0, 0, 0, 0, 5, 2, 1)
)
h5file.close()

hdf5plugin provides some helpers to ease dealing with compression filters and their options:

[11]:

h5file = h5py.File("new_file_blosc_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
    "/compressed_data",
    data=data,
    **hdf5plugin.Blosc(
        cname='lz4',
        clevel=5,
        shuffle=hdf5plugin.Blosc.BITSHUFFLE),
)
h5file.close()

[12]:

help(hdf5plugin.Blosc)

Help on class Blosc in module hdf5plugin:

class Blosc(h5py._hl.filters.FilterRefBase)
 |  Blosc(cname='lz4', clevel=5, shuffle=1)
 |
 |  ``h5py.Group.create_dataset``'s compression arguments for using blosc filter.
 |
 |  It can be passed as keyword arguments:
 |
 |  .. code-block:: python
 |
 |      f = h5py.File('test.h5', 'w')
 |      f.create_dataset(
 |          'blosc_byte_shuffle_blosclz',
 |          data=numpy.arange(100),
 |          **hdf5plugin.Blosc(cname='blosclz', clevel=9, shuffle=hdf5plugin.Blosc.SHUFFLE))
 |      f.close()
 |
 |  :param str cname:
 |      `blosclz`, `lz4` (default), `lz4hc`, `zlib`, `zstd`
 |      Optional: `snappy`, depending on compilation (requires C++11).
 |  :param int clevel:
 |      Compression level from 0 (no compression) to 9 (maximum compression).
 |      Default: 5.
 |  :param int shuffle: One of:
 |      - Blosc.NOSHUFFLE (0): No shuffle
 |      - Blosc.SHUFFLE (1): byte-wise shuffle (default)
 |      - Blosc.BITSHUFFLE (2): bit-wise shuffle
 |
 |  Method resolution order:
 |      Blosc
 |      h5py._hl.filters.FilterRefBase
 |      collections.abc.Mapping
 |      collections.abc.Collection
 |      collections.abc.Sized
 |      collections.abc.Iterable
 |      collections.abc.Container
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  __init__(self, cname='lz4', clevel=5, shuffle=1)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |
 |  BITSHUFFLE = 2
 |
 |  NOSHUFFLE = 0
 |
 |  SHUFFLE = 1
 |
 |  __abstractmethods__ = frozenset()
 |
 |  filter_id = 32001
 |
 |  ----------------------------------------------------------------------
 |  Methods inherited from h5py._hl.filters.FilterRefBase:
 |
 |  __eq__(self, other)
 |      Return self==value.
 |
 |  __getitem__(self, item)
 |
 |  __hash__(self)
 |      Return hash(self).
 |
 |  __iter__(self)
 |
 |  __len__(self)
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from h5py._hl.filters.FilterRefBase:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from h5py._hl.filters.FilterRefBase:
 |
 |  filter_options = ()
 |
 |  ----------------------------------------------------------------------
 |  Methods inherited from collections.abc.Mapping:
 |
 |  __contains__(self, key)
 |
 |  get(self, key, default=None)
 |      D.get(k[,d]) -> D[k] if k in D, else d.  d defaults to None.
 |
 |  items(self)
 |      D.items() -> a set-like object providing a view on D's items
 |
 |  keys(self)
 |      D.keys() -> a set-like object providing a view on D's keys
 |
 |  values(self)
 |      D.values() -> an object providing a view on D's values
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from collections.abc.Mapping:
 |
 |  __reversed__ = None
 |
 |  ----------------------------------------------------------------------
 |  Class methods inherited from collections.abc.Collection:
 |
 |  __subclasshook__(C) from abc.ABCMeta
 |      Abstract classes can override this to customize issubclass().
 |
 |      This is invoked early on by abc.ABCMeta.__subclasscheck__().
 |      It should return True, False or NotImplemented.  If it returns
 |      NotImplemented, the normal algorithm is used.  Otherwise, it
 |      overrides the normal algorithm (and the outcome is cached).

[13]:

H5Glance("new_file_blosc_bitshuffle_lz4.h5")

[13]:

new_file_blosc_bitshuffle_lz4.h5
- compressed_data [📋]: 1542 × 2500 entries, dtype: uint8

[14]:

h5file = h5py.File("new_file_blosc_bitshuffle_lz4.h5", mode="r")
imshow(h5file["/compressed_data"][()])
h5file.close()

_images/hdf5plugin_EuropeanHUG2022_26_0.png

[1]:

!ls -sh new_file*.h5

3.4M new_file_blosc_bitshuffle_lz4.h5  3.7M new_file_uncompressed.h5
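
The same comparison can be made from Python. The sketch below is illustrative only: `size_ratio` is a hypothetical helper, and the two files it is demonstrated on are stand-ins written on the fly rather than real HDF5 files. File sizes also include HDF5 metadata, so the result is only an approximation of the compression ratio of the dataset itself.

```python
import os
import tempfile


def size_ratio(reference_path, compressed_path):
    """Return the reference file size divided by the compressed file size."""
    return os.path.getsize(reference_path) / os.path.getsize(compressed_path)


# Self-contained demonstration with two stand-in files
with tempfile.TemporaryDirectory() as tmpdir:
    reference = os.path.join(tmpdir, "reference.bin")
    compressed = os.path.join(tmpdir, "compressed.bin")
    with open(reference, "wb") as f:
        f.write(bytes(3700))
    with open(compressed, "wb") as f:
        f.write(bytes(3400))
    ratio = size_ratio(reference, compressed)

print(f"{ratio:.2f}")
```

Applied to the two files listed above, this would give roughly 3.7 / 3.4 ≈ 1.09.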

HDF5 compression filters

Available through `h5py`

Compression filters provided by h5py:

Provided by libhdf5: “gzip” and possibly “szip” (optional)
Bundled with h5py: “lzf”

Pre-compression filter: Byte-Shuffle

[16]:

h5file = h5py.File("new_file_shuffle_gzip.h5", mode="w")
h5file.create_dataset(
    "/compressed_data_shuffle_gzip", data=data, shuffle=True, compression="gzip")
h5file.close()
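
To illustrate why a byte-shuffle pre-compression filter can help, here is a pure-Python sketch of the principle. It is not the HDF5 implementation: `byte_shuffle` and `byte_unshuffle` are illustrative helpers, zlib stands in for "gzip", and the input is synthetic 16-bit data with a constant high byte and a pseudo-random low byte.

```python
import random
import struct
import zlib


def byte_shuffle(buffer, itemsize):
    """Group the n-th byte of every item together."""
    return b"".join(buffer[i::itemsize] for i in range(itemsize))


def byte_unshuffle(buffer, itemsize):
    """Inverse of byte_shuffle."""
    nitems = len(buffer) // itemsize
    output = bytearray(len(buffer))
    for i in range(itemsize):
        output[i::itemsize] = buffer[i * nitems:(i + 1) * nitems]
    return bytes(output)


# Synthetic uint16 values: constant high byte, pseudo-random low byte
rng = random.Random(0)
values = [0x1200 | rng.randrange(256) for _ in range(10_000)]
raw = struct.pack(f"<{len(values)}H", *values)

shuffled = byte_shuffle(raw, itemsize=2)

plain_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffled))
print(f"raw: {len(raw)} bytes")
print(f"zlib: {plain_size} bytes, shuffle + zlib: {shuffled_size} bytes")
```

After shuffling, all the constant high bytes sit next to each other and compress to almost nothing, which is why the shuffled buffer compresses better on this kind of data. The gain depends entirely on the data: it should not be read as a general benchmark.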

Provided by `hdf5plugin`

Additional compression filters provided by hdf5plugin: Bitshuffle, Blosc, FciDecomp, LZ4, ZFP, Zstandard.

6 out of the 28 HDF5 registered filter plugins as of May 2022.

[17]:

h5file = h5py.File("new_file_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
    "/compressed_data_bitshuffle_lz4",
    data=data,
    **hdf5plugin.Bitshuffle()
)
h5file.close()

General purpose lossless compression

Bitshuffle(nelems=0, lz4=True) (Filter ID 32008): Bit-Shuffle + LZ4
LZ4(nbytes=0) (Filter ID 32004)
Zstd(clevel=3) (Filter ID 32015)
Blosc(cname='lz4', clevel=5, shuffle=1) (Filter ID 32001): Based on c-blosc: A blocking, shuffling and lossless compression library.
- Pre-compression shuffle: None, Byte-Shuffle, Bit-Shuffle
- Compression: blosclz, lz4, lz4hc, snappy (optional, requires C++11), zlib, zstd

Equivalent filters

Blosc includes pre-compression filters and algorithms provided by other HDF5 compression filters:

HDF5 shuffle => Blosc(..., shuffle=Blosc.SHUFFLE)
Bitshuffle() => Blosc("lz4", 5, Blosc.BITSHUFFLE)
LZ4() => Blosc("lz4", 9)
Zstd() => Blosc("zstd", 2)

Specific compression

FciDecomp() (Filter ID 32018): Based on JPEG-LS:
- Optional: requires C++11
- Data type: (u)int8 or (u)int16
- Chunk shape: “Image-like”; 2 or 3 dimensions with at least 16 pixels and at most 65535 rows and columns and at most 4 planes for 3D datasets.
ZFP(rate=None, precision=None, accuracy=None, reversible=False, minbits=None, maxbits=None, maxprec=None, minexp=None) (Filter ID 32013): Lossy
- Data type: float32, float64, (u)int32, (u)int64
- Chunk shape: must have at most 4 non-unity dimensions
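
The ZFP constraints listed above can be checked before creating a dataset. The following sketch only encodes the two rules as stated here; `zfp_compatible` is a hypothetical helper, not part of hdf5plugin, and data types are given by name to keep the example dependency-free.

```python
# Data types listed above as supported by the ZFP filter
ZFP_DTYPES = {"float32", "float64", "int32", "uint32", "int64", "uint64"}


def zfp_compatible(dtype_name, chunk_shape):
    """Check a dtype name and chunk shape against the ZFP rules above."""
    # Dimensions of length 1 do not count towards the limit of 4
    non_unity_dims = sum(1 for length in chunk_shape if length > 1)
    return dtype_name in ZFP_DTYPES and non_unity_dims <= 4


print(zfp_compatible("float32", (1, 64, 64, 64)))    # 3 non-unity dimensions
print(zfp_compatible("float32", (2, 2, 2, 2, 2)))    # 5 non-unity dimensions
print(zfp_compatible("uint8", (64, 64)))             # Unsupported data type
```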

A look at performance on a single use case

Machine: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (40 cores)
Filesystem: /dev/shm
hdf5plugin built from source
Running on a single thread (with OMP_NUM_THREADS=1)
Diffraction tomography dataset: 100 frames from http://www.silx.org/pub/pyFAI/pyFAI_UM_2020/data_ID13/kevlar.h5
Dataset: 100x2167x2070, uint16, chunk: 2167x2070

Benchmark

Multithreaded filter execution

Some filters can use multithreading:

Blosc:
- Using a pool of threads
- Disabled by default
- Configurable with the BLOSC_NTHREADS environment variable
Bitshuffle, ZFP:
- Using OpenMP
- Enabled at compilation time
- If enabled, configurable with OMP_NUM_THREADS environment variable
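
These environment variables can also be set from Python. The sketch below uses a hypothetical helper, `configure_filter_threads`, and assumes that setting the variables before importing h5py and hdf5plugin is the safest option, since the moment at which each library reads them is not covered here.

```python
import os


def configure_filter_threads(nthreads):
    """Set the environment variables used by the multithreaded filters."""
    value = str(int(nthreads))
    os.environ["BLOSC_NTHREADS"] = value   # Blosc thread pool
    os.environ["OMP_NUM_THREADS"] = value  # OpenMP (Bitshuffle, ZFP)


configure_filter_threads(4)

# Import h5py and hdf5plugin only after this point, e.g.:
# import h5py
# import hdf5plugin
```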

Summary

Having different pre-compression filters and compression algorithms at hand offers different trade-offs between read/write speed and compression ratio (and possibly error rate).

Availability and compatibility should also be kept in mind: "gzip", as included in libHDF5, is the most compatible filter (and also "lzf", as included in h5py).

Using `hdf5plugin` filters with other applications

Set the HDF5_PLUGIN_PATH environment variable to: hdf5plugin.PLUGINS_PATH

[18]:

%%bash
export HDF5_PLUGIN_PATH=`python3 -c "
import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)"`
echo "HDF5_PLUGIN_PATH=${HDF5_PLUGIN_PATH}"
ls ${HDF5_PLUGIN_PATH}

HDF5_PLUGIN_PATH=/venv/ub20.04/lib/python3.8/site-packages/hdf5plugin/plugins
libh5blosc.so
libh5bshuf.so
libh5fcidecomp.so
libh5lz4.so
libh5zfp.so
libh5zstd.so

Note: Only works for reading compressed datasets, not for writing!
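
When the other application is launched from Python, the variable can be passed through the environment of the child process. The sketch below is a stdlib-only illustration: `environment_with_plugins` is a hypothetical helper, the path is a placeholder (in practice it would be `hdf5plugin.PLUGINS_PATH`), and a child Python process stands in for the external HDF5 application.

```python
import os
import subprocess
import sys


def environment_with_plugins(plugins_path):
    """Return a copy of the current environment with HDF5_PLUGIN_PATH set."""
    env = dict(os.environ)
    env["HDF5_PLUGIN_PATH"] = plugins_path
    return env


# Placeholder path: replace with hdf5plugin.PLUGINS_PATH in practice
plugins_path = "/path/to/hdf5plugin/plugins"

# A child Python process stands in for the external application
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['HDF5_PLUGIN_PATH'])"],
    env=environment_with_plugins(plugins_path),
    capture_output=True,
    text=True,
    check=True,
)
seen_by_child = child.stdout.strip()
print(seen_by_child)
```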

A word about `hdf5plugin` license

The source code of hdf5plugin itself is licensed under the MIT license…

It also embeds the source code of the provided compression filters and libraries which are licensed under different open-source licenses (Apache, BSD-2, BSD-3, MIT, Zlib…) and copyrights.

Limitations

Some limitations of current HDF5 compression filters:

- Compressed data accessed by “chunks” even if compressor uses smaller blocks
- Multi-threaded access
- When reading compressed data, some memory copy could be spared
- Need to link filters with libhdf5
- Only “gzip” available by default and no central repository for registered filters

Comments

Direct chunk access offers a way to improve performance/flexibility, at the expense of more code on the user side
hdf5plugin relies on a “hack” to ease the installation of HDF5 compression for Python environments
Most of the compression filters provided by hdf5plugin are included in blosc (or blosc-2)
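
As an illustration of the direct chunk access mentioned above, the sketch below reads the raw, still compressed bytes of a chunk with h5py and decompresses them on the user side. It assumes h5py and numpy are installed and an HDF5 library recent enough to support direct chunk reads. It uses "gzip" without shuffle, so that the chunk can be decoded with Python's zlib module; the file and dataset names are illustrative.

```python
import os
import tempfile
import zlib

import h5py
import numpy

data = numpy.arange(64 * 64, dtype=numpy.uint16).reshape(64, 64)

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "direct_chunk.h5")

    # Write the dataset as a single gzip-compressed chunk
    with h5py.File(filename, mode="w") as h5file:
        h5file.create_dataset(
            "data", data=data, chunks=data.shape, compression="gzip")

    with h5py.File(filename, mode="r") as h5file:
        dataset = h5file["data"]
        # Read the chunk starting at offset (0, 0), bypassing the filters
        filter_mask, raw_chunk = dataset.id.read_direct_chunk((0, 0))
        chunk_shape, dtype = dataset.chunks, dataset.dtype

# Decompression and decoding are now the caller's responsibility
decoded = numpy.frombuffer(
    zlib.decompress(raw_chunk), dtype=dtype).reshape(chunk_shape)
print(numpy.array_equal(decoded, data))
```

This is where the extra user-side code comes from: the caller has to know which filters were applied and undo them in the right order, but is then free to decompress chunks outside of libHDF5, for instance in several threads.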

Idea: HDF5+Blosc(2)?

Time for an upgrade of compression support in HDF5?
What about making blosc(2) available by default in libhdf5?

Conclusion

hdf5plugin provides additional HDF5 compression filters (namely: Bitshuffle, Blosc, FciDecomp, LZ4, ZFP, Zstandard) mainly for use with h5py.

Packaged for pip and conda
Documentation: http://www.silx.org/doc/hdf5plugin/latest/
Source code repository: https://github.com/silx-kit/hdf5plugin

Credits to hdf5plugin contributors: Thomas Vincent, Armando Sole, Mark Kittisopikul, @Florian-toll, Jerome Kieffer, @fpwg, @Anthchirp, @mobiusklein, @junyuewang and to all contributors of embedded libraries.

Partially funded by the PaNOSC EU-project.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 823852.