1. Error message

Starting the image with docker-compose fails with the following error:

ERROR: for huangs_devel_env  
Cannot start service huangs_devel_env: 
failed to create shim: 
OCI runtime create failed: 
container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook 
#1:: error running hook: exit status 1, 
stdout: , 
stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown
ERROR: Encountered errors while bringing up the project.

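The error message makes it clear that this is an NVIDIA problem: nvidia-container-cli fails to initialize NVML because of a driver/library version mismatch.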

2. Cause of the error

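According to GitHub issue NVIDIA/nvidia-docker #1451, "Unable to start an interactive container with nvidia-docker due to driver/library version mismatch":

  • The issue notes that the image is probably not at fault,
  • because running nvidia-smi directly on the server reports a similar error.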
    user@CNBM:$ nvidia-smi
    Failed to initialize NVML: Driver/library version mismatch
    

According to the post "GPU服务器docker启动失败问题解决" (resolving a Docker startup failure on a GPU server):

  • Run nvidia-container-cli -k -d /dev/tty info to see the error details
    user@C4P:/$ nvidia-container-cli -k -d /dev/tty info
    
    -- WARNING, the following logs are for debugging purposes only --
    
    I0422 15:02:53.662348 3372214 nvc.c:376] initializing library context (version=1.9.0, build=5e135c17d6dbae861ec343e9a8d3a0d2af758a4f)
    I0422 15:02:53.662441 3372214 nvc.c:350] using root /
    I0422 15:02:53.662454 3372214 nvc.c:351] using ldcache /etc/ld.so.cache
    I0422 15:02:53.662470 3372214 nvc.c:352] using unprivileged user 1000:1000
    I0422 15:02:53.662556 3372214 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
    I0422 15:02:53.662941 3372214 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
    W0422 15:02:53.754123 3372215 nvc.c:273] failed to set inheritable capabilities
    W0422 15:02:53.754230 3372215 nvc.c:274] skipping kernel modules load due to failure
    I0422 15:02:53.754782 3372216 rpc.c:71] starting driver rpc service
    I0422 15:02:53.770292 3372214 rpc.c:135] driver rpc service terminated with signal 15
    nvidia-container-cli: initialization error: nvml error: driver/library version mismatch
    I0422 15:02:53.770340 3372214 nvc.c:430] shutting down library context
    
    
  • So the errors are consistent everywhere: the GPU driver versions do not match.
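
One way to confirm the mismatch on the host is to compare the version of the kernel module that is currently loaded with the version of the module file on disk. The sketch below is a hedged illustration, not part of the original post: the helper names loaded_version and compare_versions are made up here, and it assumes that the on-disk module was replaced together with the user-space libraries during the update.

```shell
#!/bin/sh
# Version of the NVIDIA kernel module that is currently loaded, read from procfs.
# $1: optional path to a file in the format of /proc/driver/nvidia/version.
loaded_version() {
    grep '^NVRM version' "${1:-/proc/driver/nvidia/version}" |
        grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Compare two version strings and report the result (hypothetical helper).
compare_versions() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: loaded=$1 installed=$2"
        return 1
    fi
}

# Only run the live check on a host that actually has the driver loaded.
if [ -r /proc/driver/nvidia/version ] && command -v modinfo >/dev/null 2>&1; then
    # modinfo reads the module file on disk, which a package upgrade may have replaced.
    compare_versions "$(loaded_version)" "$(modinfo -F version nvidia)" || true
fi
```

If the two versions differ, the running kernel is still using the old module while newer files have been installed, which is consistent with the NVML error above.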


3. Fixing the error

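The NVIDIA driver appears to be updated automatically, and apparently this automatic update cannot be turned off. After such an update, the newly installed user-space libraries no longer match the kernel module that is still loaded, so the version mismatch persists until the server is rebooted.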
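
To check whether an automatic update really happened, the package manager log can be searched for NVIDIA packages. This is only a sketch under assumptions the original post does not state: it assumes a Debian/Ubuntu host whose dpkg log lives at /var/log/dpkg.log, and the function name nvidia_package_changes is made up for illustration.

```shell
#!/bin/sh
# Print dpkg log entries that record an NVIDIA package being installed or upgraded.
# $1: optional log path; defaults to the usual Debian/Ubuntu location (an assumption).
nvidia_package_changes() {
    log="${1:-/var/log/dpkg.log}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    # Matches lines such as "<date> <time> upgrade libnvidia-...:amd64 <old> <new>".
    grep -E ' (install|upgrade) [^ ]*nvidia' "$log"
}

# Compare the timestamps printed here with the last boot time (uptime -s).
nvidia_package_changes || true
```

An upgrade entry dated after the last boot would support the explanation above: the packages changed while the old kernel module stayed loaded.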

Reboot the server:

sudo reboot now
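
After the reboot, it is worth confirming that both tools that failed earlier now initialize before starting the containers again. The following is a minimal sketch; the function name verify_gpu_stack is hypothetical, and it only relies on nvidia-smi and nvidia-container-cli info returning a non-zero exit status when NVML cannot be initialized.

```shell
#!/bin/sh
# Return 0 when both the driver tools and the container CLI initialize cleanly.
verify_gpu_stack() {
    # nvidia-smi exits non-zero if NVML cannot be initialized.
    versions=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader) || return 1
    # The same command the post used for debugging, without the debug flags.
    nvidia-container-cli info >/dev/null || return 1
    # One line per GPU is returned; report the first one.
    echo "driver $(printf '%s\n' "$versions" | head -n 1) initialized"
}

# Only run the check when both tools are installed on this host.
if command -v nvidia-smi >/dev/null 2>&1 && command -v nvidia-container-cli >/dev/null 2>&1; then
    verify_gpu_stack || echo "GPU stack still fails to initialize" >&2
fi
```

If the check passes, starting the service with docker-compose again should no longer hit the hook error.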
