15. Python third-party libraries: an introduction to HDF5

man_world · published 2019-04-12 15:04:37

I. Introduction

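Hierarchical Data Format Version 5 (HDF5):
- A mechanism for storing large arrays of values of the same type. It suits data models that can be organized hierarchically and whose datasets need to be tagged with metadata.
- The commonly used Python interface module is h5py.
The three core elements of HDF5:
- HDF5 files: containers that can hold two kinds of data objects, datasets and groups. They are used much like standard Python file objects. A File instance is itself a group, named /, and is the entry point for traversing the file.
- dataset (array-like): comparable to a NumPy array. Every dataset has a name, a shape and a dtype, and supports slicing.
- group (folder-like): comparable to a dictionary; a folder-like container. A group can hold datasets or other groups. The keys are the names of its members, and the values are the member objects themselves (groups or datasets).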
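The three elements above can be sketched in a few lines. This is a minimal illustration, assuming h5py and NumPy are installed; the file name, the group name `experiment`, the dataset name `samples` and the `units` attribute are all made up for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "elements_demo.h5")

with h5py.File(path, "w") as f:
    # The File object is itself the root group, named "/".
    root_name = f.name

    # A group is a folder-like container for datasets or other groups.
    grp = f.create_group("experiment")

    # A dataset is array-like: it has a name, a shape and a dtype.
    values = np.arange(12, dtype=np.int32).reshape(3, 4)
    dset = grp.create_dataset("samples", data=values)

    # Metadata is attached through the attrs mapping (illustrative key/value).
    dset.attrs["units"] = "counts"

with h5py.File(path, "r") as f:
    # Dictionary-style access to the members of a group.
    member_names = list(f["experiment"].keys())

    samples = f["experiment/samples"]
    info = (samples.name, samples.shape, samples.dtype)

    # Slicing a dataset returns a NumPy array.
    first_row = samples[0, :]
    units = samples.attrs["units"]
```

Paths such as `experiment/samples` are resolved relative to the group they are looked up in, which is why the file object works as the entry point for the whole hierarchy.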
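HDFView, a visualization tool for HDF5 data:
- Available on all major platforms; lets you inspect the data in detail
- Note: the path of the file you open should not contain Chinese characters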

II. HDF5 Files

HDF5 files generally work like standard Python file objects. The File instance we create is itself a group (in this case the root group, named /), so a File instance behaves like a Python dictionary.

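1. Attributes and methods of the file object f:

Attributes: f.name, f.filename, f.mode
Methods:
- f.keys(): the names of the members. On Python 3 this is a view that cannot be indexed, so use list(f.keys())[i] (f.keys()[i] only worked on Python 2). Read a member's full contents with f[key][()]; the older f[key].value was removed in h5py 3.0.
- f.values():
  - What you store is a NumPy array, but what you get back is an instance of h5py.Dataset. This is a proxy object that reads and writes the HDF5 data on disk on your behalf; slicing a Dataset object returns a NumPy array, e.g. list(f.values())[0][:] or f[list(f.keys())[0]][:]
  - Attributes available when the member is a dataset:
    - f[key].name: the path starting from the root, e.g. '/data', '/label'
    - f[key].shape
    - f[key].dtype
    - f[key][()]: the entire contents as a NumPy array (replaces the removed f[key].value)
- f.items()
- f.create_dataset()
- f.create_group()
- Notes:
  - On Python 2, iterate lazily with f.iterkeys(), f.itervalues() and f.iteritems(); on Python 3 the plain keys(), values() and items() already behave this way.
  - Names of all objects in the file are text strings (unicode on py2, str on py3)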
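The attributes and dictionary-style methods listed above can be tried on a small throwaway file. This is only a sketch for Python 3 with a recent h5py; the file name and the member names `data`, `label` and `extras` are illustrative.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "file_api_demo.h5")

with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.zeros((2, 3), dtype=np.float32))
    f.create_dataset("label", data=np.array([1.0, 0.0], dtype=np.float32))
    f.create_group("extras")

with h5py.File(path, "r") as f:
    # Basic attributes of the file object.
    file_attrs = (f.name, os.path.basename(f.filename), f.mode)

    # keys() returns a view, so convert it to a list before indexing.
    keys = sorted(f.keys())

    # items() yields (name, object) pairs; objects are Dataset or Group proxies.
    is_dataset = {name: isinstance(obj, h5py.Dataset) for name, obj in f.items()}

    # Dataset attributes, addressed by key.
    data_info = (f["data"].name, f["data"].shape, f["data"].dtype)

    # [()] reads the whole dataset into a NumPy array.
    label_values = f["label"][()]
```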

2. Reading and writing files

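import h5py
import numpy as np

# imgs, labels, imgsMean, start and end are assumed to be defined earlier by the caller.

# Use gzip compression; level 1 is the fastest but gives the lowest compression ratio
comp_kwargs = {'compression': 'gzip', 'compression_opts': 1}

# Write data to an h5 file
with h5py.File('xxx.h5', 'w') as f:
    f.create_dataset('data', data=np.array((imgs[start: end] - imgsMean) / 255.0).astype(np.float32), **comp_kwargs)
    f.create_dataset('label', data=np.array(labels[start: end]).astype(np.float32), **comp_kwargs)

# Read the data back from the h5 file
with h5py.File('xxx.h5', 'r') as f:
    for key in f.keys():
        print(f[key].name)
        print(f[key][()])  # f[key].value was removed in h5py 3.0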

III. Datasets

Datasets work like NumPy arrays!
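As a rough illustration of that claim, the sketch below writes and reads parts of a dataset with ordinary NumPy-style slicing. The file name and the dataset name `grid` are invented for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "slicing_demo.h5")

with h5py.File(path, "w") as f:
    # A 4x5 float32 dataset; unwritten elements read back as 0.
    dset = f.create_dataset("grid", shape=(4, 5), dtype=np.float32)

    # Writing through slices updates the data stored in the file.
    dset[1, :] = 7.0
    dset[:, 0] = np.arange(4)

    # Reading through a slice pulls just that region into a NumPy array.
    block = dset[1:3, 0:2]

    # [()] reads the entire dataset.
    whole = dset[()]
```

The Dataset object itself is a proxy rather than an array, so NumPy methods are applied to the array obtained by slicing, for example `dset[1, :].mean()`.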

1. Creating a dataset: `f.create_dataset()` or dictionary-style assignment

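- Empty dataset: you only need to specify the dataset's name and shape. The dtype defaults to np.float32 and the default fill value is 0, which can be changed with the fillvalue keyword argument.
- Non-empty dataset: you only need to specify the name and the actual data; shape and dtype are taken from data automatically. You can also specify the storage type explicitly to save space (single-precision floats take half the space of double-precision floats).
- Reshaping: when creating the dataset you may specify a shape different from that of the input array, as long as both shapes have the same number of elements.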
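The creation rules above can be checked with a short script. This is a sketch under the assumption of a recent h5py on Python 3; all dataset names are illustrative.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "create_demo.h5")
data64 = np.linspace(0.0, 1.0, 6)  # float64 input array with 6 elements

with h5py.File(path, "w") as f:
    # Empty datasets: only name and shape are given.
    empty = f.create_dataset("empty", shape=(2, 3))
    filled = f.create_dataset("filled", shape=(2, 3), fillvalue=-1.0)

    # Non-empty datasets: shape and dtype are inferred from the data...
    auto = f.create_dataset("auto", data=data64)
    # ...unless a storage dtype is given explicitly to save space.
    compact = f.create_dataset("compact", data=data64, dtype=np.float32)

    # Reshaping on creation: 6 elements stored as a 2x3 dataset.
    reshaped = f.create_dataset("reshaped", shape=(2, 3), data=data64)

    # Dictionary-style assignment also creates a dataset.
    f["assigned"] = np.array([1, 2, 3])

    # Collect results while the file is still open.
    empty_info = (empty.dtype, empty[()].tolist())
    filled_values = filled[()].tolist()
    auto_info = (auto.shape, auto.dtype)
    compact_dtype = compact.dtype
    reshaped_values = reshaped[()]
    assigned_shape = f["assigned"].shape
```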

2. Usage of the h5py.File.create_dataset function

h5py.File.create_dataset(self, name, shape=None, dtype=None, data=None, **kwds):
    name:
        Name of the dataset (absolute or relative).  Provide None to make an anonymous dataset.
    shape:
        Dataset shape.  Use "()" for scalar datasets.  Required if "data" isn't provided.
    dtype:
        Numpy dtype or string.  If omitted, dtype('f') will be used.
        Required if "data" isn't provided; otherwise, overrides data
        array's dtype.
    data:
        Provide data to initialize the dataset.  If used, you can omit
        shape and dtype arguments.

    Keyword-only arguments:
    chunks
        (Tuple) Chunk shape, or True to enable auto-chunking.
    maxshape
        (Tuple) Make the dataset resizable up to this shape.  Use None for
        axes you want to be unlimited.
    compression
        (String or int) Compression strategy.  Legal values are 'gzip',
        'szip', 'lzf'.  If an integer in range(10), this indicates gzip
        compression level. Otherwise, an integer indicates the number of a
        dynamically loaded compression filter.
    compression_opts
        Compression settings.  This is an integer for gzip, 2-tuple for
        szip, etc. If specifying a dynamically loaded compression filter
        number, this must be a tuple of values.
    shuffle
        (T/F) Enable shuffle filter.
    fillvalue
        (Scalar) Use this value for uninitialized parts of the dataset.
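To see how the keyword arguments combine, here is a small sketch that creates a chunked, compressed, resizable dataset and then grows it. The file name, the dataset name `log` and the specific shapes are assumptions chosen for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "kwargs_demo.h5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset(
        "log",
        shape=(4, 8),
        dtype=np.float32,
        maxshape=(None, 8),    # first axis can grow without limit
        chunks=(2, 8),         # stored in blocks of 2 rows
        compression="gzip",
        compression_opts=1,    # gzip level 1
        shuffle=True,          # byte-shuffle filter applied before compression
        fillvalue=-1.0,        # value returned for parts never written
    )

    # Fill the initial 4 rows, then extend the dataset to 6 rows.
    dset[:4] = np.ones((4, 8), dtype=np.float32)
    dset.resize((6, 8))

    # Inspect the settings and the newly added, still unwritten rows.
    settings = (dset.chunks, dset.maxshape, dset.compression,
                dset.compression_opts, dset.shuffle)
    new_shape = dset.shape
    tail = dset[4:, :]
```

The rows added by `resize` have not been written yet, so reading them returns the fill value.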