【MindSpore】Runs normally on CPU, but reports an error on GPU
迪丽瓦拉
2024-04-02 04:38:04

DEVICE error

【Steps & symptom】

1. With context.set_context(device_target="CPU"), the script runs normally.

2. With context.set_context(device_target="GPU"), it reports an error.

【Screenshots】

【Log】(optional; paste log content or attach a file)

[ERROR] DEVICE(3664,7f3ee8ff0700,python):2022-07-01-07:39:30.407.528 [mindspore/ccsrc/runtime/device/gpu/blocking_queue.cc:50] Push] Invalid Input: ptr: 0x7f3d702400a0, len: 262144
[ERROR] MD(3664,7f3ee8ff0700,python):2022-07-01-07:39:30.407.697 [mindspore/ccsrc/minddata/dataset/util/task_manager.cc:217] InterruptMaster] Task is terminated with err msg(more detail in info level log):Unexpected error. Invalid data, the types or shapes of current row is different with previous row(i.e. do batch operation but drop_reminder is False, or without resize image into the same size, these will cause shapes differs).
Line of code : 529
File         : /home/jenkins/agent-working-dir/workspace/Compile_GPU_X86_CentOS_Cuda10/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/device_queue_op.cc
[CRITICAL] KERNEL(3664,7f404a7de700,python):2022-07-01-07:44:31.326.304 [mindspore/ccsrc/backend/kernel_compiler/gpu/data/dataset_iterator_kernel.cc:119] ReadDevice] For 'GetNext', get data timeout
[ERROR] RUNTIME_FRAMEWORK(3664,7f4055fe9700,python):2022-07-01-07:44:31.327.103 [mindspore/ccsrc/runtime/framework/actor/abstract_actor.cc:92] EraseInput] Erase input controls failed: Default/network-TrainOneStepCell/optimizer-Adam/Mul-op6918, sequential_num: 0
Traceback (most recent call last):
  File "SD_net_2_train_on_mindspore_GPU.py", line 205, in <module>
    reduce_lr]
  File "/opt/conda/lib/python3.7/site-packages/mindspore/train/model.py", line 774, in train
    sink_size=sink_size)
  File "/opt/conda/lib/python3.7/site-packages/mindspore/train/model.py", line 87, in wrapper
    func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mindspore/train/model.py", line 540, in _train
    self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
  File "/opt/conda/lib/python3.7/site-packages/mindspore/train/model.py", line 608, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/opt/conda/lib/python3.7/site-packages/mindspore/nn/cell.py", line 479, in __call__
    out = self.compile_and_run(*args)
  File "/opt/conda/lib/python3.7/site-packages/mindspore/nn/cell.py", line 828, in compile_and_run
    return _cell_graph_executor(self, *new_inputs, phase=self.phase)
  File "/opt/conda/lib/python3.7/site-packages/mindspore/common/api.py", line 711, in __call__
    return self.run(obj, *args, phase=phase)
  File "/opt/conda/lib/python3.7/site-packages/mindspore/common/api.py", line 735, in run
    return self._exec_pip(obj, *args, phase=phase_real)
  File "/opt/conda/lib/python3.7/site-packages/mindspore/common/api.py", line 61, in wrapper
    results = fn(*arg, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mindspore/common/api.py", line 718, in _exec_pip
    return self._graph_executor(args, phase)
RuntimeError: mindspore/ccsrc/backend/kernel_compiler/gpu/data/dataset_iterator_kernel.cc:119 ReadDevice] For 'GetNext', get data timeout
[WARNING] MD(3664,7f40cc3bc740,python):2022-07-01-07:44:31.391.343 [mindspore/ccsrc/minddata/dataset/engine/datasetops/device_queue_op.cc:73] ~DeviceQueueOp] preprocess_batch: 3; batch_queue: 0, 0, 5, 5, 5, 4, 9, 8; push_start_time: 2022-07-01-07:39:21.366.263, 2022-07-01-07:39:21.810.354, 2022-07-01-07:39:21.810.375, 2022-07-01-07:39:29.889.068; push_end_time: 2022-07-01-07:39:21.809.469, 2022-07-01-07:39:21.810.373, 2022-07-01-07:39:29.889.040.
[ERROR] DEVICE(3664,7f40cc3bc740,python):2022-07-01-07:45:31.966.256 [mindspore/ccsrc/runtime/device/gpu/gpu_buffer_mgr.cc:180] CloseNotify] time out of receiving signals
[ERROR] DEVICE(3664,7f40cc3bc740,python):2022-07-01-07:45:31.966.306 [mindspore/ccsrc/runtime/hardware/gpu/gpu_device_context.cc:165] Destroy] Could not close gpu data queue.
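Note that the log itself names a likely cause: "the types or shapes of current row is different with previous row". A minimal plain-Python sketch (hypothetical helper names; nested lists standing in for dataset rows) of locating such a mismatched row before it reaches the device queue:

```python
# Hypothetical helpers to diagnose the condition the device-queue error
# describes: one dataset row whose shape differs from the rows before it.

def shape_of(x):
    """Nested-list 'shape', e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    if not isinstance(x, list):
        return ()
    inner = shape_of(x[0]) if x else ()
    return (len(x),) + inner

def first_shape_mismatch(rows):
    """Index of the first row whose shape differs from row 0, else None."""
    ref = shape_of(rows[0])
    for i, row in enumerate(rows[1:], start=1):
        if shape_of(row) != ref:
            return i
    return None
```

If a mismatch shows up, the fixes the error message hints at are resizing every image to one common size before batching, or batching with drop_remainder=True so a short final batch cannot change the shape.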

This may be caused by a data type that is not supported in the GPU environment, or by the loss becoming NaN on GPU. Try running on GPU in PyNative (dynamic graph) mode; you may get a more detailed error message.
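A minimal sketch of the suggested switch (context API as in MindSpore 1.x, the version the log was produced with):

```python
from mindspore import context

# Run on GPU in PyNative (dynamic graph) mode: operators execute eagerly,
# so a failing op usually raises at its call site with a clearer message
# than the graph-mode "GetNext, get data timeout".
context.set_context(mode=context.PYNATIVE_MODE, device_target="GPU")
```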

GPU and CPU have different characteristics, and every hardware backend differs, so the same code may run fine on one backend and produce NaN or similar problems on another. In PyNative mode you can use print(): print the concrete values at the places you suspect are going wrong and see whether they tell you anything. For example, since your error suggests an invalid loss, try calling print(loss) where the loss is computed.
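The print-debugging idea above can be sketched framework-agnostically (hypothetical helper name; in MindSpore's PyNative mode, print() inside the training step executes eagerly like this):

```python
import math

# Hypothetical wrapper: log every loss value and flag NaN, so a bad step
# is visible the moment it happens instead of as a downstream timeout.
def with_loss_logging(loss_fn):
    def wrapped(*args, **kwargs):
        loss = loss_fn(*args, **kwargs)
        print("loss:", loss)
        if isinstance(loss, float) and math.isnan(loss):
            print("loss is NaN -- inspect this step's inputs")
        return loss
    return wrapped
```

Usage: wrap the loss function before training, e.g. `loss_fn = with_loss_logging(loss_fn)`, and watch the printed values for the first NaN.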
