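While experimenting recently with UniAD, the end-to-end autonomous driving model, I found that running the test with the default configuration in my own environment (where CUDA, PyTorch, and the other supporting software are newer than the versions the authors list on GitHub) always failed with TypeError: cannot pickle 'dict_keys' object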

Traceback (most recent call last):
  File "./tools/test.py", line 261, in <module>
    main()
  File "./tools/test.py", line 231, in main
    outputs = custom_multi_gpu_test(model, data_loader, args.tmpdir,
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/uniad/apis/test.py", line 88, in custom_multi_gpu_test
    for i, data in enumerate(data_loader):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 438, in __iter__
    return self._get_iterator()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 384, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1048, in __init__
    w.start()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'dict_keys' object

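At first this was maddening. I knew that Python multiprocessing passes objects between processes by serializing and deserializing them with the pickle module, but I could not tell which code was producing an object that cannot be serialized. On the surface the error is triggered when the dataloader starts iterating, but the dataloader is PyTorch's standard DataLoader, so I guessed the dataset was the likely culprit. Yet the NuScenesE2EDataset class in UniAD's projects/mmdet3d_plugin/datasets/nuscenes_e2e_dataset.py showed nothing unusual; at least I saw nothing that put dict_keys data into the dataset.

NuScenesE2EDataset inherits from NuScenesDataset in mmdetection3d/mmdet3d/datasets/nuscenes_dataset.py, but mmdetection3d has been in use for a long time, so why could I find no reports of this problem online? Moreover, mmdetection3d's NuScenesDataset in turn calls base code in the nuscenes package, and with that much code involved I could not see at a glance which part caused the serialized object to contain dict_keys. So I went back and read the code at the point of failure, in python3.8/multiprocessing/reduction.py: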

class ForkingPickler(pickle.Pickler):
    '''Pickler subclass used by multiprocessing.'''
    _extra_reducers = {}
    _copyreg_dispatch_table = copyreg.dispatch_table

    def __init__(self, *args):
        super().__init__(*args)
        self.dispatch_table = self._copyreg_dispatch_table.copy()
        self.dispatch_table.update(self._extra_reducers)

    @classmethod
    def register(cls, type, reduce):
        '''Register a reduce function for a type.'''
        cls._extra_reducers[type] = reduce

    @classmethod
    def dumps(cls, obj, protocol=None):
        buf = io.BytesIO()
        cls(buf, protocol).dump(obj)
        return buf.getbuffer()

    loads = pickle.loads

register = ForkingPickler.register

def dump(obj, file, protocol=None):
    '''Replacement for pickle.dump() using ForkingPickler.'''
    ForkingPickler(file, protocol).dump(obj)

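The exception is raised at ForkingPickler(file, protocol).dump(obj) inside dump(), and the stack goes no deeper, so I guessed that whatever lies below is a compiled C extension. I went looking for the implementation of pickle.Pickler and found that python3.8/pickle.py exports Pickler as a public class but contains no definition of it, only an internal class _Pickler: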

__all__ = ["PickleError", "PicklingError", "UnpicklingError", "Pickler",
           "Unpickler", "dump", "dumps", "load", "loads"]

...

# Pickling machinery

class _Pickler:

    def __init__(self, file, protocol=None, *, fix_imports=True,
                 buffer_callback=None):
        """This takes a binary file for writing a pickle data stream.
...

Looking more carefully, I found this code:

# Use the faster _pickle if possible
try:
    from _pickle import (
        PickleError,
        PicklingError,
        UnpicklingError,
        Pickler,
        Unpickler,
        dump,
        dumps,
        load,
        loads
    )
except ImportError:
    Pickler, Unpickler = _Pickler, _Unpickler
    dump, dumps, load, loads = _dump, _dumps, _load, _loads

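That explains it: Pickler is imported from a shared library named _pickle (I looked it up, it is python3.8/lib-dynload/_pickle.cpython-38-x86_64-linux-gnu.so). If the import fails, the internal class _Pickler stands in for Pickler. Judging from the comment, Pickler is the fast C version and the pure-Python internal class _Pickler is the slow version.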
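A quick way to confirm which implementation an interpreter is actually using is to look at where pickle.Pickler is defined. This is only a small sketch; the helper name pickler_backend is my own and not part of the standard library.

```python
import pickle


def pickler_backend() -> str:
    """Return the name of the module that defines pickle.Pickler."""
    return pickle.Pickler.__module__


if __name__ == "__main__":
    # "_pickle" means the compiled accelerator is in use,
    # "pickle" means the pure-Python fallback was selected.
    print("pickle.Pickler comes from:", pickler_backend())
    print("pure-Python fallback in use:", pickle.Pickler is pickle._Pickler)
```

On a standard CPython build I would expect the accelerator to be reported, which is why the original traceback stops at reduction.py.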

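The error message above does not say which object contains the dict_keys. Printing that object would give some clues, but downloading the C source of Pickler, adding prints, compiling it, and replacing the stock python3.8/lib-dynload/_pickle.cpython-38-x86_64-linux-gnu.so is a lot of work and risks new errors from version mismatches. Since pickle.py provides the pure-Python _Pickler class, its behavior should be correct, just slower, and because it is pure Python it also reports the full exception stack. So I edited python3.8/multiprocessing/reduction.py and changed

class ForkingPickler(pickle.Pickler):

to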

class ForkingPickler(pickle._Pickler):

Running again and waiting for the exception, the stack trace now shows where a print can usefully be added:

Traceback (most recent call last):
  File "./tools/test.py", line 261, in <module>
    main()
  File "./tools/test.py", line 231, in main
    outputs = custom_multi_gpu_test(model, data_loader, args.tmpdir,
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/uniad/apis/test.py", line 88, in custom_multi_gpu_test
    for i, data in enumerate(data_loader):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 438, in __iter__
    return self._get_iterator()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 384, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1048, in __init__
    w.start()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 61, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/opt/conda/lib/python3.8/pickle.py", line 489, in dump
    self.save(obj)
  File "/opt/conda/lib/python3.8/pickle.py", line 605, in save
    self.save_reduce(obj=obj, *rv)
  File "/opt/conda/lib/python3.8/pickle.py", line 719, in save_reduce
    save(state)
  File "/opt/conda/lib/python3.8/pickle.py", line 562, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/opt/conda/lib/python3.8/pickle.py", line 973, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/conda/lib/python3.8/pickle.py", line 999, in _batch_setitems
    save(v)
  File "/opt/conda/lib/python3.8/pickle.py", line 562, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/opt/conda/lib/python3.8/pickle.py", line 903, in save_tuple
    save(element)
  File "/opt/conda/lib/python3.8/pickle.py", line 605, in save
    self.save_reduce(obj=obj, *rv)
  File "/opt/conda/lib/python3.8/pickle.py", line 719, in save_reduce
    save(state)
  File "/opt/conda/lib/python3.8/pickle.py", line 562, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/opt/conda/lib/python3.8/pickle.py", line 973, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/conda/lib/python3.8/pickle.py", line 999, in _batch_setitems
    save(v)
  File "/opt/conda/lib/python3.8/pickle.py", line 605, in save
    self.save_reduce(obj=obj, *rv)
  File "/opt/conda/lib/python3.8/pickle.py", line 719, in save_reduce
    save(state)
  File "/opt/conda/lib/python3.8/pickle.py", line 562, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/opt/conda/lib/python3.8/pickle.py", line 973, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/conda/lib/python3.8/pickle.py", line 999, in _batch_setitems
    save(v)
  File "/opt/conda/lib/python3.8/pickle.py", line 580, in save
    rv = reduce(self.proto)
TypeError: cannot pickle 'dict_keys' object
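The same kind of full Python-level traceback can also be obtained without touching the standard library, by calling pickle._Pickler directly on a suspect object in a small script. The sketch below assumes a nested dict as a stand-in for the real dataset state; dump_with_python_pickler and the sample data are hypothetical names for illustration, and _Pickler is a private class whose details may change between Python versions.

```python
import io
import pickle
import traceback


def dump_with_python_pickler(obj, protocol=pickle.DEFAULT_PROTOCOL):
    """Pickle obj with the pure-Python pickler.

    Returns the formatted traceback if pickling fails, otherwise None.
    """
    try:
        pickle._Pickler(io.BytesIO(), protocol).dump(obj)
    except Exception:
        return traceback.format_exc()
    return None


if __name__ == "__main__":
    class_range = {"car": 50, "pedestrian": 40}
    # Stand-in for the dataset state: a dict nested inside a dict,
    # with a dict_keys view hidden in the inner one.
    suspect = {"eval_detection_configs": {"class_range": class_range,
                                          "class_names": class_range.keys()}}
    print(dump_with_python_pickler(suspect))
```

Because every frame is Python code, the printed traceback includes the save_dict and save frames, just like the one above.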

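Comparing the stack trace above with the code, and keeping in mind that objects can nest (for example a dict inside a dict), I found the save_dict() method of the _Pickler class a suitable place to add a print: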

    def save_dict(self, obj):
        if self.bin:
            self.write(EMPTY_DICT)
        else:   # proto 0 -- can't use EMPTY_DICT
            self.write(MARK + DICT)

        self.memoize(obj)

        # Debug print added for this investigation: show any dict whose repr mentions dict_keys
        strobj = str(obj)
        if "dict_keys" in strobj:
            print("@@@@@@@@@@@@@@@obj:", obj, "@@@@@@@##### \n")

        self._batch_setitems(obj.items())

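I also added prints in the save() method just before the failing line, printing the failing object itself and the type of each object saved before it (do not print every object in full, or the output becomes huge, reproducing the problem becomes slow, and other problems may appear under concurrency):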

                # Check for a __reduce_ex__ method, fall back to __reduce__
                reduce = getattr(obj, "__reduce_ex__", None)

                # Debug prints added for this investigation
                print("!!!!!!######self.proto:", self.proto, type(obj), "!!!!%%")
                if reduce is not None:
                    if 'dict_keys' in str(type(obj)):
                        print("obj is dict_keys, obj:",obj)

                    rv = reduce(self.proto)

Then I ran the program until the exception was raised, and the key information I wanted was printed:

@@@@@@@@@@@@@@@obj: {'class_range': {'car': 50, 'truck': 50, 'bus': 50, 'trailer': 50, 'construction_vehicle': 50, 'pedestrian': 40, 'motorcycle': 40, 'bicycle': 40, 'traffic_cone': 30, 'barrier': 30}, 'dist_fcn': 'center_distance', 'dist_ths': [0.5, 1.0, 2.0, 4.0], 'dist_th_tp': 2.0, 'min_recall': 0.1, 'min_precision': 0.1, 'max_boxes_per_sample': 500, 'mean_ap_weight': 5, 'class_names': dict_keys(['car', 'truck', 'bus', 'trailer', 'construction_vehicle', 'pedestrian', 'motorcycle', 'bicycle', 'traffic_cone', 'barrier'])} @@@@@@@#####

...

!!!!!!######self.proto: 4 <class 'mmdet3d.datasets.pipelines.compose.Compose'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.loading.LoadMultiViewImageFromFilesInCeph'> !!!!%%
!!!!!!######self.proto: 4 <class 'mmcv.utils.config.ConfigDict'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.transform_3d.NormalizeMultiviewImage'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.dtype[float32]'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.transform_3d.PadMultiViewImage'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.loading.LoadAnnotations3D_E2E'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.occflow_label.GenerateOccFlowLabels'> !!!!%%
!!!!!!######self.proto: 4 <class 'mmcv.utils.config.ConfigDict'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.dtype[int64]'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'mmdet3d.datasets.pipelines.test_time_aug.MultiScaleFlipAug3D'> !!!!%%
!!!!!!######self.proto: 4 <class 'mmdet3d.datasets.pipelines.compose.Compose'> !!!!%%
!!!!!!######self.proto: 4 <class 'mmdet3d.datasets.pipelines.formating.DefaultFormatBundle3D'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.transform_3d.CustomCollect3D'> !!!!%%
!!!!!!######self.proto: 4 <class 'nuscenes.eval.detection.data_classes.DetectionConfig'> !!!!%%
!!!!!!######self.proto: 4 <class 'dict_keys'> !!!!%%
obj is dict_keys, obj: dict_keys(['car', 'truck', 'bus', 'trailer', 'construction_vehicle', 'pedestrian', 'motorcycle', 'bicycle', 'traffic_cone', 'barrier'])

The output above clearly shows that the data that cannot be serialized and triggers the exception is:

'class_names': dict_keys(['car', 'truck', 'bus', 'trailer', 'construction_vehicle', 'pedestrian', 'motorcycle', 'bicycle', 'traffic_cone', 'barrier'])
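For cases where patching pickle.py is not convenient, a small recursive helper can report the path to the offending value instead. This is a rough sketch under my own assumptions: find_unpicklable is a hypothetical helper, it only descends into dicts, lists, tuples, and objects with a __dict__, and the SimpleNamespace objects merely stand in for the real dataset and config.

```python
import pickle
from types import SimpleNamespace


def find_unpicklable(obj, path="obj", _seen=None):
    """Return (path, type name) pairs for the innermost values that fail to pickle."""
    _seen = set() if _seen is None else _seen
    if id(obj) in _seen:          # avoid infinite recursion on cycles
        return []
    _seen.add(id(obj))
    try:
        pickle.dumps(obj)
        return []                 # this whole subtree pickles fine
    except Exception:
        pass
    # Enumerate the children we know how to descend into.
    if isinstance(obj, dict):
        children = [(f"{path}[{k!r}]", v) for k, v in obj.items()]
    elif isinstance(obj, (list, tuple)):
        children = [(f"{path}[{i}]", v) for i, v in enumerate(obj)]
    elif hasattr(obj, "__dict__"):
        children = [(f"{path}.{k}", v) for k, v in vars(obj).items()]
    else:
        children = []
    found = []
    for child_path, child in children:
        found.extend(find_unpicklable(child, child_path, _seen))
    # If no child is to blame, this object itself is the culprit.
    return found or [(path, type(obj).__name__)]


if __name__ == "__main__":
    class_range = {"car": 50, "pedestrian": 40}
    config = SimpleNamespace(class_range=class_range,
                             class_names=class_range.keys())
    dataset = SimpleNamespace(eval_detection_configs=config, pipeline=[1, 2])
    print(find_unpicklable(dataset, "dataset"))
```

It re-pickles each subtree, so it is slow on large objects and only meant for debugging.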

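It is one entry of a dict. The dict itself carries no name, but class_names and the other keys in it (such as class_range and dist_fcn) give something to search for. Searching the UniAD and nuscenes code showed that this data is produced by the nuscenes.eval.detection.data_classes.DetectionConfig class, which is also the last type printed before dict_keys in the output above. It lives in python3.8/site-packages/nuscenes/eval/detection/data_classes.py: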

class DetectionConfig:
    """ Data class that specifies the detection evaluation settings. """

    def __init__(self,
                 class_range: Dict[str, int],
                 dist_fcn: str,
                 dist_ths: List[float],
                 dist_th_tp: float,
                 min_recall: float,
                 min_precision: float,
                 max_boxes_per_sample: int,
                 mean_ap_weight: int):

        assert set(class_range.keys()) == set(DETECTION_NAMES), "Class count mismatch."   ### DETECTION_NAMES
        assert dist_th_tp in dist_ths, "dist_th_tp must be in set of dist_ths."

        self.class_range = class_range
        self.dist_fcn = dist_fcn
        self.dist_ths = dist_ths
        self.dist_th_tp = dist_th_tp
        self.min_recall = min_recall
        self.min_precision = min_precision
        self.max_boxes_per_sample = max_boxes_per_sample
        self.mean_ap_weight = mean_ap_weight

        self.class_names = self.class_range.keys()
        ...

The class_range values are defined in python3.8/site-packages/nuscenes/eval/detection/configs/detection_cvpr_2019.json:

 "class_range": {
    "car": 50,
    "truck": 50,
    "bus": 50,
    "trailer": 50,
    "construction_vehicle": 50,
    "pedestrian": 40,
    "motorcycle": 40,
    "bicycle": 40,
    "traffic_cone": 30,
    "barrier": 30
  },

They are passed in through the constructor when a DetectionConfig instance is created, and class_names is derived from the keys of class_range:

   self.class_names = self.class_range.keys()

This line is the root cause! In Python 3, dict.keys() returns a dict_keys view rather than a list, so class_names cannot be serialized by the Pickler. Just change the line to:

  self.class_names = list(self.class_range.keys())

and the program runs without the error!
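The effect of that one-line change is easy to reproduce in isolation. The sketch below uses a SimpleNamespace as a hypothetical stand-in for DetectionConfig, and make_picklable is my own helper name. Applying the same conversion to an already constructed config object, rather than editing site-packages, is an untested assumption on my part.

```python
import pickle
from types import SimpleNamespace


def make_picklable(config):
    """Replace a dict_keys class_names attribute with a plain list, in place."""
    config.class_names = list(config.class_names)
    return config


# Stand-in for DetectionConfig: class_names is a dict_keys view.
cfg = SimpleNamespace(class_range={"car": 50, "pedestrian": 40})
cfg.class_names = cfg.class_range.keys()

try:
    pickle.dumps(cfg)
    failed = False
except TypeError as exc:
    failed = True
    print("before the fix:", exc)

# After converting to a list, the object survives a pickle round trip.
restored = pickle.loads(pickle.dumps(make_picklable(cfg)))
print("after the fix:", restored.class_names)
```

A list keeps the same order as the keys, so code that only iterates over class_names or tests membership should behave as before.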

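I then looked at who creates the DetectionConfig instance, and it does have something to do with the Dataset after all. The constructor of NuScenesDataset in mmdetection3d/mmdet3d/datasets/nuscenes_dataset.py looks like this: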

class NuScenesDataset(Custom3DDataset):
   ...
   def __init__(self,
                 ann_file,
                 pipeline=None,
                 data_root=None,
                 classes=None,
                 load_interval=1,
                 with_velocity=True,
                 modality=None,
                 box_type_3d='LiDAR',
                 filter_empty_gt=True,
                 test_mode=False,
                 eval_version='detection_cvpr_2019',
                 use_valid_flag=False):
        self.load_interval = load_interval
        self.use_valid_flag = use_valid_flag
        super().__init__(
            data_root=data_root,
            ann_file=ann_file,
            pipeline=pipeline,
            classes=classes,
            modality=modality,
            box_type_3d=box_type_3d,
            filter_empty_gt=filter_empty_gt,
            test_mode=test_mode)

        self.with_velocity = with_velocity
        self.eval_version = eval_version
        from nuscenes.eval.detection.config import config_factory
        self.eval_detection_configs = config_factory(self.eval_version)

The last line here, self.eval_detection_configs = config_factory(self.eval_version), reads the configuration file python3.8/site-packages/nuscenes/eval/detection/configs/detection_cvpr_2019.json and creates a DetectionConfig instance, which provides class_range and the other settings as well as the class_names value.

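UniAD's NuScenesE2EDataset inherits from NuScenesDataset, and that is where the eval_detection_configs data in its instances comes from. Its class_names value is obtained through dict.keys() and is never converted to a type the Pickler supports, which is what causes TypeError: cannot pickle 'dict_keys' object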

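Also, adding prints at the following place in python3.8/multiprocessing/popen_spawn_posix.py shows that the error is tied to serializing the data for the worker process. The traceback passes through this file, so the workers are being started with the spawn method, which pickles the process object (and with it the dataset it references) to hand it to the child. Each call to reduction.dump(prep_data, fp) succeeds, while reduction.dump(process_obj, fp) raises the exception above: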

class Popen(popen_fork.Popen):
    method = 'spawn'
    DupFd = _DupFd

    def __init__(self, process_obj):
        self._fds = []
        super().__init__(process_obj)

    def duplicate_for_child(self, fd):
        self._fds.append(fd)
        return fd

    def _launch(self, process_obj):
        from . import resource_tracker
        tracker_fd = resource_tracker.getfd()
        self._fds.append(tracker_fd)
        prep_data = spawn.get_preparation_data(process_obj._name)
        fp = io.BytesIO()
        set_spawning_popen(self)
        try:
            reduction.dump(prep_data, fp)
            reduction.dump(process_obj, fp)
        finally:
            set_spawning_popen(None)
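The behaviour of _launch() can be observed with a few lines of standalone code. This is a minimal sketch assuming a POSIX system. try_start is a hypothetical helper, and the built-in print is used as the process target only so that the target itself is trivially picklable.

```python
import multiprocessing as mp


def try_start(start_method, payload):
    """Start a child that prints payload; return the error text if start() fails."""
    ctx = mp.get_context(start_method)
    proc = ctx.Process(target=print, args=(payload,))
    try:
        # With "spawn", the parent pickles the Process object (target and args) here.
        proc.start()
    except TypeError as exc:
        return str(exc)
    proc.join()
    return None


if __name__ == "__main__":
    class_range = {"car": 50, "pedestrian": 40}
    print("spawn with dict_keys:", try_start("spawn", class_range.keys()))
    print("spawn with list:", try_start("spawn", list(class_range)))
```

With the fork start method the child inherits the parent's memory and the process object does not need to be pickled, which may be part of why the same dataset code runs cleanly in other setups.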

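Other people have reported similar errors online. They are also dataset-related but differ from my case, for example an opened lmdb file that cannot be serialized, or dict_keys obtained from keys() on the dataset's own data. They may be a useful reference if you run into those situations: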

DataLoader Multiprocessing error: can't pickle odict_keys objects when num_workers > 0 - PyTorch Forums

 TypeError: can't pickle Environment objects when num_workers > 0 for LSUN · Issue #689 · pytorch/vision · GitHub
