Optimizing Image Data Management in Deep Learning Workflows
HDF5 files for large image datasets
Introduction
A simple search on DuckDuckGo yields a number of tutorials on creating HDF5 files using the Python package h5py. The common approach involves the following steps:

1. Read the image using the PIL package (you can use your favorite package instead of PIL).
2. Convert it to a NumPy array.
3. Store it in an HDF5 file using create_dataset, or do fancier things with groups and subgroups.
import h5py
import numpy as np
import os
from PIL import Image

save_path = './numpy.hdf5'
img_path = '1.jpeg'
print('image size: %d bytes' % os.path.getsize(img_path))

hf = h5py.File(save_path, 'a')  # open an hdf5 file
img_np = np.array(Image.open(img_path))

dset = hf.create_dataset('default', data=img_np)  # write the data to the hdf5 file
hf.close()  # close the hdf5 file

print('hdf5 file size: %d bytes' % os.path.getsize(save_path))
image size: 23986 bytes
hdf5 file size: 434270 bytes
While these steps work well for small datasets, the size of the HDF5 file can grow significantly as the number of images increases. In some cases, I’ve observed the HDF5 file taking up to 100 times more space than the original dataset. This happens because the image files on disk are compressed (JPEG, PNG), while the decoded NumPy arrays store raw, uncompressed pixel values and therefore need far more space. If the server has storage limitations, you may want to consider the alternative steps outlined below.
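As a quick sanity check, you can compare the on-disk size of a compressed image with the in-memory size of its decoded pixel array. This is only a small sketch; '1.jpeg' is a placeholder for any image you have lying around.

import os
import numpy as np
from PIL import Image

img_path = '1.jpeg'  # placeholder: any compressed image on disk
img_np = np.array(Image.open(img_path))  # decoded, uncompressed pixels

print('file size on disk : %d bytes' % os.path.getsize(img_path))
print('decoded array size: %d bytes' % img_np.nbytes)  # height * width * channels * bytes per value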
Modified steps
Read the image file as a Python binary file and store the raw bytes in the HDF5 file; later, read the bytes back and decode them into an image. The following scripts will help you achieve this.
import h5py
import numpy as np
import os

save_path = './test.hdf5'
img_path = '1.jpeg'
print('image size: %d bytes' % os.path.getsize(img_path))

hf = h5py.File(save_path, 'a')  # open an hdf5 file

with open(img_path, 'rb') as img_f:
    binary_data = img_f.read()  # read the image as python binary

binary_data_np = np.asarray(binary_data)
dset = hf.create_dataset('default', data=binary_data_np)  # write the raw bytes to the hdf5 file
hf.close()  # close the hdf5 file

print('hdf5 file size: %d bytes' % os.path.getsize(save_path))
image size: 23986 bytes
hdf5 file size: 26034 bytes
import h5py
import numpy as np
import io
from PIL import Image

hdf5_file = './test.hdf5'
hf = h5py.File(hdf5_file, 'r')  # open the hdf5 file in read mode

key = list(hf.keys())[0]
print("Keys: %s" % key)

data = np.array(hf[key])  # read the stored bytes back from the hdf5 file
img = Image.open(io.BytesIO(data))  # decode the bytes into a PIL image
print('image size:', img.size)

hf.close()  # close the hdf5 file
img.show()
Keys: default
image size: (502, 287)
Now it is clear that the modified steps consume less storage space and also recover the image as required.
A practical example
Consider the following file structure. I have many such 'A' directories in my dataset.
A
*--- a
     *--- b.png
     *--- c.png
I need to determine how to store images in an HDF5 file while preserving the file structure. This approach will allow me to load images from the HDF5 file along with their respective paths. The following steps are specifically designed to meet my requirements but can be customized to suit other use cases.
import h5py
import numpy as np
import os

base_path = './'  # dataset path
save_path = './test.hdf5'  # path to save the hdf5 file

hf = h5py.File(save_path, 'a')  # open the file in append mode

for i in os.listdir(base_path):  # read all the 'A' directories
    vid_name = os.path.join(base_path, i)
    grp = hf.create_group(vid_name)  # create a hdf5 group; each group is one 'A'

    for j in os.listdir(vid_name):  # read all the 'a' directories inside A
        track = os.path.join(vid_name, j)
        subgrp = grp.create_group(j)  # create a subgroup of the group above; each small 'a' is one subgroup

        for k in os.listdir(track):  # find all images inside a
            img_path = os.path.join(track, k)
            with open(img_path, 'rb') as img_f:  # open images as python binary
                binary_data = img_f.read()

            binary_data_np = np.asarray(binary_data)
            dset = subgrp.create_dataset(k, data=binary_data_np)  # each subgroup contains all the images

hf.close()
The question is: how can we retrieve the names of all the groups and subgroups from an HDF5 file? The h5py package provides useful features, such as "visititems," to help access the stored image files. Let’s explore the following steps as a continuation of the previous ones.
import io
from PIL import Image  # needed in addition to the imports above

data = []   # full paths of all image files, e.g. 'A/a/b.png'; these are the keys to access our image data
group = []  # all groups and subgroups in the hdf5 file

def func(name, obj):  # function to recursively collect all the keys
    if isinstance(obj, h5py.Dataset):
        data.append(name)
    elif isinstance(obj, h5py.Group):
        group.append(name)

hf = h5py.File(save_path, 'r')
hf.visititems(func)  # this is the operation we are talking about

# Now let's read the image files in their proper format to use them for training.
for j in data:
    kk = np.array(hf[j])  # raw bytes stored earlier
    img = Image.open(io.BytesIO(kk))  # our image file
    print('image size:', img.size)
Some issues and the fix
In PyTorch, I’ve noticed that parallel reading (using num_workers > 1 in the DataLoader) doesn’t work seamlessly when dealing with HDF5 files. However, this issue is straightforward to address with the latest versions of h5py. Although I haven’t tried it personally, the SWMR documentation provides helpful guidance.

Initially, combining the PyTorch DataLoader and h5py was challenging, but I found a workaround. There might be better solutions that I’m not aware of, but here’s what worked for me.

In a typical PyTorch DataLoader, the HDF5 file is opened in the __init__() function and read from in __getitem__(). However, when num_workers > 1, this approach fails. The fix is to open the HDF5 file inside the __getitem__() method instead of __init__(). This resolves the issue, allowing the DataLoader to work with multiple workers.
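A minimal sketch of that workaround is shown below, assuming the images were stored as raw bytes as in the scripts above. The class name HDF5Dataset, the file path, and the flat list of dataset keys (e.g. the 'data' list collected with visititems) are placeholders for illustration, not part of the original scripts.

import io
import h5py
import numpy as np
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class HDF5Dataset(Dataset):  # hypothetical name, for illustration only
    def __init__(self, hdf5_path, keys):
        self.hdf5_path = hdf5_path  # only remember the path; do NOT open the file here
        self.keys = keys            # flat list of dataset keys, e.g. 'A/a/b.png'
        self.hf = None

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        if self.hf is None:  # open the file lazily, once per worker process
            self.hf = h5py.File(self.hdf5_path, 'r')
        raw = np.array(self.hf[self.keys[idx]])  # raw image bytes stored earlier
        img = Image.open(io.BytesIO(raw)).convert('RGB')
        return np.array(img)

# Example usage (assuming all images have the same size so batching works):
# loader = DataLoader(HDF5Dataset('./test.hdf5', data), batch_size=4, num_workers=4)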