Pitfalls in Ascend CL Application Development (C/C++) on Ascend CANN

aclrtResetDevice returns error 507007

aclrtResetDevice() returns error 507007. The log file ~/ascend/log/debug/plog/plog-PID_yyyyMMddhhmmssxxx.log shows the following:

[ERROR] RUNTIME(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [api_impl.cc:1777]9277 DeviceReset:DeviceReset context release failed, userDevId0, retCode0x7070003
[ERROR] RUNTIME(PID,main):yyyy-mm-dd-hh:mm:ss.xxx.yyy [api_impl.cc:1787]9277 DeviceReset:report error module_type0, module_nameEE9999
[ERROR] RUNTIME(PID,main):yyyy-mm-dd-hh:mm:ss.xxx.yyy [api_impl.cc:1787]9277 DeviceReset:DeviceReset failed, deviceId0, retCode0x7070003
[ERROR] RUNTIME(PID,main):yyyy-mm-dd-hh:mm:ss.xxx.yyy [logger.cc:692]9277 DeviceReset:Device reset failed, device_id0.
[ERROR] RUNTIME(PID,main):yyyy-mm-dd-hh:mm:ss.xxx.yyy [api_c.cc:1567]9277 rtDeviceReset:ErrCode507007, desc[context release error], InnerCode0x7070003
[ERROR] RUNTIME(PID,main):yyyy-mm-dd-hh:mm:ss.xxx.yyy [error_message_manage.cc:49]9277 FuncErrorReason:report error module_type3, module_nameEE8888
[ERROR] RUNTIME(PID,main):yyyy-mm-dd-hh:mm:ss.xxx.yyy [error_message_manage.cc:49]9277 FuncErrorReason:rtDeviceReset execute failed, reason[context release error]
[ERROR] ASCENDCL(PID,main):yyyy-mm-dd-hh:mm:ss.xxx.yyy [device.cpp:115]9277 aclrtResetDevice: reset device 0 failed, runtime result 507007.

Cause: according to the flowchart in the Ascend documentation, if you did not call aclrtSetDevice() explicitly and instead created the Context and Stream manually, you should not call aclrtResetDevice(). The documentation does not state what happens when aclrtResetDevice() is called without a prior aclrtSetDevice(), but the consequences are serious, which shows that Ascend CL does not handle this case internally.

Calling a C library from Python returns error 107002

Error 107002 means the context is null. The .so library was loaded with Python's ctypes. When Python entered an interactive console session, the feedback read from the console was handled on a separate thread, so when control switched from Python back to the .so library, the call was running on a different thread than before. Therefore, every time control returns to the C library, you need to set the device again with aclrtSetDevice() and then bind the context with aclrtSetCurrentContext() before the previously created stream can be reused.

aclrtMemcpy returns error 507899

aclrtMemcpy() returns error 507899. The log file ~/ascend/log/debug/plog/plog-PID_yyyyMMddhhmmssxxx.log shows the following:

[ERROR] DRV(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [drv_log_user.c:621][ascend][curpid:PID,PID][drv][devmm][share_log_read_in_single_module]Pcie fill bar2dma fail. (src_dev_id6; dst_dev_id7; ret-22)
Make dma node-size check fail, please check addr size.
(total_len0; count8192; idx_dma0; did6; dst_did7; src0x12c080013000; dst0x12c180016000; idx_src0; from_num1; idx_dst0; to_num1)
Cp make dmanode list fail. (num1; ret-22; src0x12c080013000; dst0x12c180016000; count8192)
Memcpy error. (ret-22; src0x12c080013000; dst0x12c180016000; count8192; direction3)
Check alloced va. (hostpid161983; va0x12c080013000; start_va 0x12c080000000; end_va 0x12c080013fff)
Check alloced va. (hostpid161983; va0x12c180016000; start_va 0x12c180000000; end_va 0x12c180016fff)
[ERROR] DRV(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [devmm_svm.c:406][ascend][curpid:PID,PID][drv][devmm][devmm_copy_ioctl]errno:22, 8 Ioctl error. (cmd-1051177723; ret8; dst0x12c180016000; src0x12c080013000; size8192)
[ERROR] DRV(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [devmm_virt_com_heap.c:1525][ascend][curpid:PID,PID][drv][devmm][devmm_print_svm_va_info]errno:22, 8 Va info. (va0x12c080013000; start0x12c080013000; end0x12c080015fff; module_nameAPP; devid0)
[ERROR] DRV(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [devmm_virt_com_heap.c:1525][ascend][curpid:PID,PID][drv][devmm][devmm_print_svm_va_info]errno:22, 8 Va info. (va0x12c180016000; start0x12c180016000; end0x12c180018fff; module_nameAPP; devid1)
[ERROR] RUNTIME(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [npu_driver.cc:2305]PID MemCopySync:[drv api] drvMemcpy failed: destMax8192, size8192(Byte), kind3, devId4294967295, drvRetCode8!
[ERROR] RUNTIME(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [api_error.cc:1199]PID MemCopySync:Memcopy sync failed, count8192, kind3.
[ERROR] RUNTIME(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [api_c.cc:1193]PID rtMemcpy:ErrCode507899, desc[driver error:internal error], InnerCode0x7020010
[ERROR] RUNTIME(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [error_message_manage.cc:53]PID FuncErrorReason:report error module_type3, module_nameEE8888
[ERROR] RUNTIME(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [error_message_manage.cc:53]PID FuncErrorReason:rtMemcpy execute failed, reason[driver error:internal error]
[ERROR] ASCENDCL(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [memory.cpp:303]11110 aclrtMemcpy: synchronized memcpy failed, kind 3, runtime result 507899

If the two devices support copying to each other, that is, aclrtDeviceCanAccessPeer() reports true, then aclrtDeviceEnablePeerAccess() must be called on both devices; otherwise the error above occurs. In the CANN 8.0 RC1 documentation, aclrtDeviceEnablePeerAccess() is called only once; this was corrected in the later 8.0 RC3 Ascend CL development documentation. In my tests, it was enough to switch only the Context before calling aclrtDeviceEnablePeerAccess().
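
The requirement to enable peer access in both directions can be sketched roughly as follows. This is a minimal C++ sketch, not the code used in the original test: PeerApi and enablePeerAccessBothWays are hypothetical names introduced here, and the three callbacks stand in for aclrtSetDevice(), aclrtDeviceCanAccessPeer() and aclrtDeviceEnablePeerAccess() so that the example compiles without the CANN headers. The return value -1 for "peer access not supported" is likewise an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical thin wrapper so the sketch builds without CANN installed.
// In a real build each member would forward to the ACL call named in its
// comment and return its aclError (0 means success).
struct PeerApi {
    // aclrtSetDevice(deviceId)
    std::function<int(int32_t)> setDevice;
    // aclrtDeviceCanAccessPeer(&canAccessPeer, deviceId, peerDeviceId)
    std::function<int(int32_t*, int32_t, int32_t)> canAccessPeer;
    // aclrtDeviceEnablePeerAccess(peerDeviceId, flags), flags is 0
    std::function<int(int32_t, uint32_t)> enablePeerAccess;
};

// Enables peer access devA -> devB and devB -> devA.
// Returns 0 on success, the first non-zero error code from the API,
// or -1 (chosen for this sketch) if either direction is unsupported.
int enablePeerAccessBothWays(const PeerApi& api, int32_t devA, int32_t devB) {
    assert(devA != devB);  // peer access only makes sense between two devices
    const int32_t pairs[2][2] = {{devA, devB}, {devB, devA}};

    // First confirm that both directions are supported.
    for (const auto& p : pairs) {
        int32_t canAccess = 0;
        int rc = api.canAccessPeer(&canAccess, p[0], p[1]);
        if (rc != 0) return rc;
        if (canAccess != 1) return -1;
    }

    // Then enable access once from each side: make p[0] the current
    // device and enable access to its peer p[1].
    for (const auto& p : pairs) {
        int rc = api.setDevice(p[0]);
        if (rc != 0) return rc;
        rc = api.enablePeerAccess(p[1], 0);
        if (rc != 0) return rc;
    }
    return 0;
}
```

With CANN available, the members could be bound to the real calls, for example `api.setDevice = [](int32_t d) { return static_cast<int>(aclrtSetDevice(d)); };`. Since switching only the Context was reported to be sufficient, setDevice could instead wrap aclrtSetCurrentContext() with the context created for each device; that variant is untested here.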