Locating the cause of TypeError: cannot pickle 'dict_keys' object, and fixing it when the NuScenes dataset is used for multi-process training or testing
While recently experimenting with the autonomous-driving model UniAD, I found that running its test with the default configuration in my own environment (where CUDA, PyTorch, and the other supporting software are newer than the versions the authors list on GitHub) always failed with TypeError: cannot pickle 'dict_keys' object:
File "./tools/test.py", line 261, in <module>
main()
File "./tools/test.py", line 231, in main
outputs = custom_multi_gpu_test(model, data_loader, args.tmpdir,
File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/uniad/apis/test.py", line 88, in custom_multi_gpu_test
for i, data in enumerate(data_loader):
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 438, in __iter__
return self._get_iterator()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 384, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1048, in __init__
w.start()
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/opt/conda/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'dict_keys' object
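Before digging into the frameworks, note that the failing type is trivial to reproduce in isolation; the limitation lives in plain Python, not in CUDA or PyTorch. A minimal sketch:

import pickle

d = {'car': 50, 'truck': 50}
pickle.dumps(list(d.keys()))  # fine: a plain list pickles
pickle.dumps(d.keys())        # TypeError: cannot pickle 'dict_keys' object

The question is only where, in a large codebase, such a view object sneaks into the data being serialized.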
At this point I was getting a bit frustrated. I knew that when Python runs multiple processes it has to pass objects between them, and that it does so by serializing and deserializing them with the pickle module, but I had no idea which piece of code was breaking the rules or which object could not be serialized. On the surface the error is triggered the moment the dataloader starts iterating, but the dataloader is PyTorch's standard DataLoader, so the dataset seemed the more likely suspect. Reading the NuScenesE2EDataset class in UniAD's projects/mmdet3d_plugin/datasets/nuscenes_e2e_dataset.py turned up nothing unusual, though; at least nothing there stores dict_keys in the dataset. NuScenesE2EDataset inherits from NuScenesDataset in mmdetection3d/mmdet3d/datasets/nuscenes_dataset.py, but mmdetection3d has been in wide use for years, and I could not find anyone else reporting this problem online. NuScenesDataset in turn calls into the base code of the nuscenes package, so with this many layers involved it was impossible to see at a glance which code put a dict_keys into the object being serialized. So I went back to the code at the error site, this part of python3.8/multiprocessing/reduction.py:
class ForkingPickler(pickle.Pickler):
    '''Pickler subclass used by multiprocessing.'''
    _extra_reducers = {}
    _copyreg_dispatch_table = copyreg.dispatch_table

    def __init__(self, *args):
        super().__init__(*args)
        self.dispatch_table = self._copyreg_dispatch_table.copy()
        self.dispatch_table.update(self._extra_reducers)

    @classmethod
    def register(cls, type, reduce):
        '''Register a reduce function for a type.'''
        cls._extra_reducers[type] = reduce

    @classmethod
    def dumps(cls, obj, protocol=None):
        buf = io.BytesIO()
        cls(buf, protocol).dump(obj)
        return buf.getbuffer()

    loads = pickle.loads

register = ForkingPickler.register

def dump(obj, file, protocol=None):
    '''Replacement for pickle.dump() using ForkingPickler.'''
    ForkingPickler(file, protocol).dump(obj)
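As an aside, the register() classmethod above hints at another possible workaround, one I only noticed in hindsight: register a reducer that teaches ForkingPickler to serialize dict views as plain lists. A hedged sketch, not something the stdlib or UniAD provides; it must run in the main process before any worker starts, and the child then receives a list instead of a view:

from multiprocessing.reduction import ForkingPickler

def _reduce_dict_keys(obj):
    # Reconstruct the view as a plain list in the child process.
    return (list, (list(obj),))

ForkingPickler.register(type({}.keys()), _reduce_dict_keys)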
The exception is raised by ForkingPickler(file, protocol).dump(obj) inside dump(), and the stack goes no deeper; my guess was that from here on the call descends into a library implemented in C. I went looking for the implementation of pickle.Pickler and found that in python3.8/pickle.py, Pickler is in the public API but has no class definition of its own; there is only an internal _Pickler class:
__all__ = ["PickleError", "PicklingError", "UnpicklingError", "Pickler",
"Unpickler", "dump", "dumps", "load", "loads"]
...
# Pickling machinery
class _Pickler:
def __init__(self, file, protocol=None, *, fix_imports=True,
buffer_callback=None):
"""This takes a binary file for writing a pickle data stream.
...
A more careful look turned up this code:
# Use the faster _pickle if possible
try:
    from _pickle import (
        PickleError,
        PicklingError,
        UnpicklingError,
        Pickler,
        Unpickler,
        dump,
        dumps,
        load,
        loads
    )
except ImportError:
    Pickler, Unpickler = _Pickler, _Unpickler
    dump, dumps, load, loads = _dump, _dumps, _load, _loads
Now everything made sense: the Pickler class is imported from a shared library named _pickle (which turns out to be python3.8/lib-dynload/_pickle.cpython-38-x86_64-linux-gnu.so), and only if that import fails does the internal _Pickler class stand in for Pickler. In other words, Pickler is the fast version implemented in C, while the pure-Python internal class _Pickler is the slow fallback implementation.
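Which implementation is active is easy to confirm interactively; on a standard CPython 3.8 build I would expect:

import pickle

print(pickle.Pickler)     # <class '_pickle.Pickler'>: the C accelerator loaded
print(pickle._Pickler)    # <class 'pickle._Pickler'>: the pure-Python fallback
print(pickle.Pickler is pickle._Pickler)  # False: the fast C version is in use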
The error message above does not tell us which object the dict_keys lives in; if we could print that object, it would give us a real lead. But downloading the source of the C Pickler, adding prints, compiling it, and replacing the stock python3.8/lib-dynload/_pickle.cpython-38-x86_64-linux-gnu.so is not only a lot of work, it also risks new version-mismatch errors. Then it occurred to me: pickle.py ships a pure-Python slow _Pickler class whose behavior should be correct, just slower, and being pure Python it would report the complete exception stack, which is exactly what an investigation needs. So I edited python3.8/multiprocessing/reduction.py, changing

class ForkingPickler(pickle.Pickler):

to
class ForkingPickler(pickle._Pickler):
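Editing a file inside the conda installation works, but the same swap can also be done from the entry script without touching the stdlib on disk. A sketch under the assumption that it runs before any DataLoader worker is created; it only affects the spawn path, which calls reduction.dump:

import pickle
import multiprocessing.reduction as reduction

class _DebugForkingPickler(pickle._Pickler):
    """Pure-Python stand-in for ForkingPickler: slower, but failures
    surface with a complete Python-level traceback."""
    def __init__(self, *args):
        super().__init__(*args)
        # Reuse the reducers already registered on the real ForkingPickler.
        self.dispatch_table = reduction.ForkingPickler._copyreg_dispatch_table.copy()
        self.dispatch_table.update(reduction.ForkingPickler._extra_reducers)

def _debug_dump(obj, file, protocol=None):
    _DebugForkingPickler(file, protocol).dump(obj)

reduction.dump = _debug_dump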
After rerunning and waiting for the exception, the stack trace now tells us exactly where to add useful prints:
Traceback (most recent call last):
File "./tools/test.py", line 261, in <module>
main()
File "./tools/test.py", line 231, in main
outputs = custom_multi_gpu_test(model, data_loader, args.tmpdir,
File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/uniad/apis/test.py", line 88, in custom_multi_gpu_test
for i, data in enumerate(data_loader):
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 438, in __iter__
return self._get_iterator()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 384, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1048, in __init__
w.start()
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/opt/conda/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 61, in dump
ForkingPickler(file, protocol).dump(obj)
File "/opt/conda/lib/python3.8/pickle.py", line 489, in dump
self.save(obj)
File "/opt/conda/lib/python3.8/pickle.py", line 605, in save
self.save_reduce(obj=obj, *rv)
File "/opt/conda/lib/python3.8/pickle.py", line 719, in save_reduce
save(state)
File "/opt/conda/lib/python3.8/pickle.py", line 562, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/conda/lib/python3.8/pickle.py", line 973, in save_dict
self._batch_setitems(obj.items())
File "/opt/conda/lib/python3.8/pickle.py", line 999, in _batch_setitems
save(v)
File "/opt/conda/lib/python3.8/pickle.py", line 562, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/conda/lib/python3.8/pickle.py", line 903, in save_tuple
save(element)
File "/opt/conda/lib/python3.8/pickle.py", line 605, in save
self.save_reduce(obj=obj, *rv)
File "/opt/conda/lib/python3.8/pickle.py", line 719, in save_reduce
save(state)
File "/opt/conda/lib/python3.8/pickle.py", line 562, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/conda/lib/python3.8/pickle.py", line 973, in save_dict
self._batch_setitems(obj.items())
File "/opt/conda/lib/python3.8/pickle.py", line 999, in _batch_setitems
save(v)
File "/opt/conda/lib/python3.8/pickle.py", line 605, in save
self.save_reduce(obj=obj, *rv)
File "/opt/conda/lib/python3.8/pickle.py", line 719, in save_reduce
save(state)
File "/opt/conda/lib/python3.8/pickle.py", line 562, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/conda/lib/python3.8/pickle.py", line 973, in save_dict
self._batch_setitems(obj.items())
File "/opt/conda/lib/python3.8/pickle.py", line 999, in _batch_setitems
save(v)
File "/opt/conda/lib/python3.8/pickle.py", line 580, in save
rv = reduce(self.proto)
TypeError: cannot pickle 'dict_keys' object
Matching the stack above against the code, and bearing in mind that objects can be nested inside other objects (a dict inside a dict, say), the save_dict() method of the _Pickler class looked like the right place to add a print:
    def save_dict(self, obj):
        if self.bin:
            self.write(EMPTY_DICT)
        else:   # proto 0 -- can't use EMPTY_DICT
            self.write(MARK + DICT)

        self.memoize(obj)
        strobj = str(obj)                                         # added for debugging
        if "dict_keys" in strobj:                                 # added for debugging
            print("@@@@@@@@@@@@@@@obj:", obj, "@@@@@@@##### \n")  # added for debugging
        self._batch_setitems(obj.items())
I also added prints just before the failing line in the save() method, printing the failing object itself plus the type of every object saved before it (do not print every object in full, otherwise the output is enormous, reproducing the problem becomes slow, and new concurrency problems may appear):
        # Check for a __reduce_ex__ method, fall back to __reduce__
        reduce = getattr(obj, "__reduce_ex__", None)
        print("!!!!!!######self.proto:", self.proto, type(obj), "!!!!%%")  # added for debugging
        if reduce is not None:
            if 'dict_keys' in str(type(obj)):  # Arnold: added for debugging
                print("obj is dict_keys, obj:", obj)
            rv = reduce(self.proto)
Running the program again until it threw the exception, the prints delivered exactly the information I was after!
@@@@@@@@@@@@@@@obj: {'class_range': {'car': 50, 'truck': 50, 'bus': 50, 'trailer': 50, 'construction_vehicle': 50, 'pedestrian': 40, 'motorcycle': 40, 'bicycle': 40, 'traffic_cone': 30, 'barrier': 30}, 'dist_fcn': 'center_distance', 'dist_ths': [0.5, 1.0, 2.0, 4.0], 'dist_th_tp': 2.0, 'min_recall': 0.1, 'min_precision': 0.1, 'max_boxes_per_sample': 500, 'mean_ap_weight': 5, 'class_names': dict_keys(['car', 'truck', 'bus', 'trailer', 'construction_vehicle', 'pedestrian', 'motorcycle', 'bicycle', 'traffic_cone', 'barrier'])} @@@@@@@#####
...
!!!!!!######self.proto: 4 <class 'mmdet3d.datasets.pipelines.compose.Compose'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.loading.LoadMultiViewImageFromFilesInCeph'> !!!!%%
!!!!!!######self.proto: 4 <class 'mmcv.utils.config.ConfigDict'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.transform_3d.NormalizeMultiviewImage'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.dtype[float32]'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.transform_3d.PadMultiViewImage'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.loading.LoadAnnotations3D_E2E'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.occflow_label.GenerateOccFlowLabels'> !!!!%%
!!!!!!######self.proto: 4 <class 'mmcv.utils.config.ConfigDict'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.dtype[int64]'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'mmdet3d.datasets.pipelines.test_time_aug.MultiScaleFlipAug3D'> !!!!%%
!!!!!!######self.proto: 4 <class 'mmdet3d.datasets.pipelines.compose.Compose'> !!!!%%
!!!!!!######self.proto: 4 <class 'mmdet3d.datasets.pipelines.formating.DefaultFormatBundle3D'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.transform_3d.CustomCollect3D'> !!!!%%
!!!!!!######self.proto: 4 <class 'nuscenes.eval.detection.data_classes.DetectionConfig'> !!!!%%
!!!!!!######self.proto: 4 <class 'dict_keys'> !!!!%%
obj is dict_keys, obj: dict_keys(['car', 'truck', 'bus', 'trailer', 'construction_vehicle', 'pedestrian', 'motorcycle', 'bicycle', 'traffic_cone', 'barrier'])
From the output above it is clear that the data that cannot be serialized, and that throws the exception, is:
'class_names': dict_keys(['car', 'truck', 'bus', 'trailer', 'construction_vehicle', 'pedestrian', 'motorcycle', 'bicycle', 'traffic_cone', 'barrier'])
It is part of a larger dict. We do not know yet who owns that dict, but class_names and the other entries in it (class_range and dist_fcn, for example) are good search keys for finding where it is created. Searching the UniAD and nuscenes code shows that this data is produced by the nuscenes.eval.detection.data_classes.DetectionConfig class in python3.8/site-packages/nuscenes/eval/detection/data_classes.py:
class DetectionConfig:
    """ Data class that specifies the detection evaluation settings. """

    def __init__(self,
                 class_range: Dict[str, int],
                 dist_fcn: str,
                 dist_ths: List[float],
                 dist_th_tp: float,
                 min_recall: float,
                 min_precision: float,
                 max_boxes_per_sample: int,
                 mean_ap_weight: int):

        assert set(class_range.keys()) == set(DETECTION_NAMES), "Class count mismatch."  ### DETECTION_NAMES
        assert dist_th_tp in dist_ths, "dist_th_tp must be in set of dist_ths."

        self.class_range = class_range
        self.dist_fcn = dist_fcn
        self.dist_ths = dist_ths
        self.dist_th_tp = dist_th_tp
        self.min_recall = min_recall
        self.min_precision = min_precision
        self.max_boxes_per_sample = max_boxes_per_sample
        self.mean_ap_weight = mean_ap_weight

        self.class_names = self.class_range.keys()
    ...
The class_range values come from python3.8/site-packages/nuscenes/eval/detection/configs/detection_cvpr_2019.json:
"class_range": {
"car": 50,
"truck": 50,
"bus": 50,
"trailer": 50,
"construction_vehicle": 50,
"pedestrian": 40,
"motorcycle": 40,
"bicycle": 40,
"traffic_cone": 30,
"barrier": 30
},
They are passed into the constructor when the DetectionConfig instance is created, and class_names is then derived from the keys of class_range:
self.class_names = self.class_range.keys()
That line is the root cause! In Python 3, dict.keys() returns a dict_keys view object, not a list, which is why class_names cannot be serialized by the Pickler. All it takes is changing the line to:
self.class_names = list(self.class_range.keys())
and the program runs without error.
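A quick way to verify the fix without launching a full test run is to pickle the config object directly; a minimal sketch, assuming the nuscenes devkit is importable:

import pickle
from nuscenes.eval.detection.config import config_factory

cfg = config_factory('detection_cvpr_2019')
pickle.dumps(cfg)  # raises TypeError before the one-line fix, succeeds after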
Looking further into who creates the DetectionConfig instance, it really is tied to the Dataset: the constructor of NuScenesDataset in mmdetection3d/mmdet3d/datasets/nuscenes_dataset.py looks like this:
class NuScenesDataset(Custom3DDataset):
    ...
    def __init__(self,
                 ann_file,
                 pipeline=None,
                 data_root=None,
                 classes=None,
                 load_interval=1,
                 with_velocity=True,
                 modality=None,
                 box_type_3d='LiDAR',
                 filter_empty_gt=True,
                 test_mode=False,
                 eval_version='detection_cvpr_2019',
                 use_valid_flag=False):
        self.load_interval = load_interval
        self.use_valid_flag = use_valid_flag
        super().__init__(
            data_root=data_root,
            ann_file=ann_file,
            pipeline=pipeline,
            classes=classes,
            modality=modality,
            box_type_3d=box_type_3d,
            filter_empty_gt=filter_empty_gt,
            test_mode=test_mode)

        self.with_velocity = with_velocity
        self.eval_version = eval_version
        from nuscenes.eval.detection.config import config_factory
        self.eval_detection_configs = config_factory(self.eval_version)
The final statement, self.eval_detection_configs = config_factory(self.eval_version), is what reads the python3.8/site-packages/nuscenes/eval/detection/configs/detection_cvpr_2019.json configuration file and constructs the DetectionConfig instance, supplying class_range and the other configuration values as well as class_names.
UniAD's NuScenesE2EDataset inherits from NuScenesDataset, and that is where the eval_detection_configs data inside its instances comes from: its class_names value is produced by dict.keys() by default and is never converted to a type the Pickler supports, which is what causes TypeError: cannot pickle 'dict_keys' object.
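If patching the installed nuscenes package under site-packages is undesirable (the change is lost whenever the package is reinstalled), the attribute can instead be repaired on the dataset side. A minimal sketch with a hypothetical subclass name; in an mmdet-based project it would additionally need to be registered with the DATASETS registry and referenced from the config file:

from projects.mmdet3d_plugin.datasets.nuscenes_e2e_dataset import NuScenesE2EDataset

class NuScenesE2EDatasetPicklable(NuScenesE2EDataset):  # hypothetical name
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        cfg = self.eval_detection_configs
        # dict_keys -> list, so worker processes can pickle the dataset
        cfg.class_names = list(cfg.class_names)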
As an aside, adding prints at this spot in python3.8/multiprocessing/popen_spawn_posix.py shows that the error is tied to serializing the process object: every call to reduction.dump(prep_data, fp) succeeds, and it is reduction.dump(process_obj, fp) that throws the exception above:
class Popen(popen_fork.Popen):
    method = 'spawn'
    DupFd = _DupFd

    def __init__(self, process_obj):
        self._fds = []
        super().__init__(process_obj)

    def duplicate_for_child(self, fd):
        self._fds.append(fd)
        return fd

    def _launch(self, process_obj):
        from . import resource_tracker
        tracker_fd = resource_tracker.getfd()
        self._fds.append(tracker_fd)
        prep_data = spawn.get_preparation_data(process_obj._name)
        fp = io.BytesIO()
        set_spawning_popen(self)
        try:
            reduction.dump(prep_data, fp)
            reduction.dump(process_obj, fp)
        finally:
            set_spawning_popen(None)
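Since process_obj carries everything a worker needs, including the dataset, a cheap pre-flight check in the main process can catch this whole class of error before any worker is spawned. A sketch; the helper name is mine, not part of any library:

import pickle

def assert_picklable(obj, name="object"):
    """Fail fast if obj cannot be pickled the way multiprocessing will pickle it."""
    try:
        pickle.dumps(obj)
    except TypeError as e:
        raise TypeError(f"{name} is not picklable: {e}") from e

# e.g. assert_picklable(dataset, "dataset") before constructing the DataLoader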
Other people on the web have run into this error too. Their cases are also dataset-related but differ from mine, for example an opened lmdb file handle that cannot be serialized, or the dataset's own data keys() being stored as dict_keys; they are worth consulting if you hit a similar situation.
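For the dict-view flavor of the problem, a generic workaround is to recursively convert any dict views inside a nested structure to lists before the object reaches the worker processes. A hedged sketch, not a library function:

def sanitize_views(obj):
    """Recursively replace dict_keys/dict_values/dict_items with lists."""
    views = (type({}.keys()), type({}.values()), type({}.items()))
    if isinstance(obj, views):
        return [sanitize_views(v) for v in obj]
    if isinstance(obj, dict):
        return {k: sanitize_views(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return type(obj)(sanitize_views(v) for v in obj)
    return obj

Sanitizing your own dataset's attributes up front like this avoids depending on patched third-party sources.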