Optimizing Image Data Management in Deep Learning Workflows
HDF5 files for large image datasets
Introduction
A simple search on DuckDuckGo yields a number of tutorials on creating HDF5 files using the Python package h5py. The common approach involves the following steps:

1. Read the image using the PIL package (you can use your favorite package instead of PIL).
2. Convert it to a NumPy array.
3. Store it in an HDF5 file using create_dataset, or do fancier things with groups and subgroups.
import h5py
import numpy as np
import os
from PIL import Image

save_path = './numpy.hdf5'
img_path = '1.jpeg'
print('image size: %d bytes' % os.path.getsize(img_path))

hf = h5py.File(save_path, 'a')  # open an hdf5 file
img_np = np.array(Image.open(img_path))

dset = hf.create_dataset('default', data=img_np)  # write the data to the hdf5 file
hf.close()  # close the hdf5 file

print('hdf5 file size: %d bytes' % os.path.getsize(save_path))
image size: 23986 bytes
hdf5 file size: 434270 bytes
While these steps work well for small datasets, the size of the HDF5 file can grow significantly as the number of images increases. In some cases, I’ve observed the HDF5 file taking up to 100 times more space than the original dataset. This happens because the image files on disk are compressed (JPEG, PNG), while the decoded NumPy arrays store raw, uncompressed pixel values and therefore need far more space. If the server has storage limitations, you may want to consider the alternative steps outlined below.
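As a quick sanity check, you can compare the on-disk size of a compressed image with the in-memory size of its decoded pixel array. This is only a small sketch; '1.jpeg' is a placeholder for any image you have lying around.

import os
import numpy as np
from PIL import Image

img_path = '1.jpeg'  # placeholder: any compressed image on disk
img_np = np.array(Image.open(img_path))  # decoded, uncompressed pixels

print('file size on disk : %d bytes' % os.path.getsize(img_path))
print('decoded array size: %d bytes' % img_np.nbytes)  # height * width * channels * bytes per value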
Modified steps
Read the image file as a Python binary file and store the raw bytes in the HDF5 file; later, read the bytes back and decode them into an image. The following scripts will help you achieve this.
import h5py
import numpy as np
import os

save_path = './test.hdf5'
img_path = '1.jpeg'
print('image size: %d bytes' % os.path.getsize(img_path))

hf = h5py.File(save_path, 'a')  # open an hdf5 file

with open(img_path, 'rb') as img_f:
    binary_data = img_f.read()  # read the image as python binary

binary_data_np = np.asarray(binary_data)
dset = hf.create_dataset('default', data=binary_data_np)  # write the raw bytes to the hdf5 file
hf.close()  # close the hdf5 file

print('hdf5 file size: %d bytes' % os.path.getsize(save_path))
image size: 23986 bytes
hdf5 file size: 26034 bytes
import h5py
import numpy as np
import io
from PIL import Image

hdf5_file = './test.hdf5'
hf = h5py.File(hdf5_file, 'r')  # open the hdf5 file in read mode

key = list(hf.keys())[0]
print("Keys: %s" % key)

data = np.array(hf[key])  # read the stored bytes back from the hdf5 file
img = Image.open(io.BytesIO(data))  # decode the bytes into a PIL image
print('image size:', img.size)

hf.close()  # close the hdf5 file
img.show()
Keys: default
image size: (502, 287)
Now it is clear that the modified steps consume less storage space and also recover the image as required.
A practical example
Consider the following file structure. I have many such 'A' directories in my dataset.
A
*--- a
     *--- b.png
     *--- c.png
I need to determine how to store images in an HDF5 file while preserving the file structure. This approach will allow me to load images from the HDF5 file along with their respective paths. The following steps are specifically designed to meet my requirements but can be customized to suit other use cases.
import h5py
import numpy as np
import os

base_path = './'  # dataset path
save_path = './test.hdf5'  # path to save the hdf5 file

hf = h5py.File(save_path, 'a')  # open the file in append mode

for i in os.listdir(base_path):  # read all the 'A' directories
    vid_name = os.path.join(base_path, i)
    grp = hf.create_group(vid_name)  # create a hdf5 group; each group is one 'A'

    for j in os.listdir(vid_name):  # read all the 'a' directories inside A
        track = os.path.join(vid_name, j)
        subgrp = grp.create_group(j)  # create a subgroup of the group above; each small 'a' is one subgroup

        for k in os.listdir(track):  # find all images inside a
            img_path = os.path.join(track, k)
            with open(img_path, 'rb') as img_f:  # open images as python binary
                binary_data = img_f.read()

            binary_data_np = np.asarray(binary_data)
            dset = subgrp.create_dataset(k, data=binary_data_np)  # each subgroup contains all the images

hf.close()
The question is: how can we retrieve the names of all the groups and subgroups from an HDF5 file? The h5py package provides useful features, such as "visititems," to help access the stored image files. Let’s explore the following steps as a continuation of the previous ones.
import io
from PIL import Image  # needed in addition to the imports above

data = []   # full paths of all image files, e.g. 'A/a/b.png'; these are the keys to access our image data
group = []  # all groups and subgroups in the hdf5 file

def func(name, obj):  # function to recursively collect all the keys
    if isinstance(obj, h5py.Dataset):
        data.append(name)
    elif isinstance(obj, h5py.Group):
        group.append(name)

hf = h5py.File(save_path, 'r')
hf.visititems(func)  # this is the operation we are talking about

# Now let's read the image files in their proper format to use them for training.
for j in data:
    kk = np.array(hf[j])  # raw bytes stored earlier
    img = Image.open(io.BytesIO(kk))  # our image file
    print('image size:', img.size)
Some issues and the fix
In PyTorch, I’ve noticed that parallel reading (using num_workers > 1 in the DataLoader) doesn’t work seamlessly when dealing with HDF5 files. However, this issue is straightforward to address with the latest versions of h5py. Although I haven’t tried it personally, the SWMR documentation provides helpful guidance.

Initially, combining the PyTorch DataLoader and h5py was challenging, but I found a workaround. There might be better solutions that I’m not aware of, but here’s what worked for me.

In a typical PyTorch DataLoader, the HDF5 file is opened in the __init__() function and read from in __getitem__(). However, when num_workers > 1, this approach fails. The fix is to open the HDF5 file inside the __getitem__() method instead of __init__(). This resolves the issue, allowing the DataLoader to work with multiple workers.
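A minimal sketch of that workaround is shown below, assuming the images were stored as raw bytes as in the scripts above. The class name HDF5Dataset, the file path, and the flat list of dataset keys (e.g. the 'data' list collected with visititems) are placeholders for illustration, not part of the original scripts.

import io
import h5py
import numpy as np
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class HDF5Dataset(Dataset):  # hypothetical name, for illustration only
    def __init__(self, hdf5_path, keys):
        self.hdf5_path = hdf5_path  # only remember the path; do NOT open the file here
        self.keys = keys            # flat list of dataset keys, e.g. 'A/a/b.png'
        self.hf = None

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        if self.hf is None:  # open the file lazily, once per worker process
            self.hf = h5py.File(self.hdf5_path, 'r')
        raw = np.array(self.hf[self.keys[idx]])  # raw image bytes stored earlier
        img = Image.open(io.BytesIO(raw)).convert('RGB')
        return np.array(img)

# Example usage (assuming all images have the same size so batching works):
# loader = DataLoader(HDF5Dataset('./test.hdf5', data), batch_size=4, num_workers=4)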