CUDA error #18

Open
landscapelc opened this issue Sep 14, 2021 · 3 comments

@landscapelc
landscapelc commented Sep 14, 2021

Problem description

Running python3.7 run_net.py --config-file=configs/retinanet_gaofen.py --task=train fails with a CUDA error.

Full log

XXX@DESKTOP-8B01LP5:/mnt/e/cpt/JDet-master/projects/retinanet$ python3.7 run_net.py --config-file=configs/retinanet_gaofen.py --task=train

[i 0914 20:35:15.018271 64 compiler.py:869] Jittor(1.2.3.101) src: /home/llc/.local/lib/python3.7/site-packages/jittor
[i 0914 20:35:15.024461 64 compiler.py:870] g++ at /usr/bin/g++(7.5.0)
[i 0914 20:35:15.024553 64 compiler.py:871] cache_path: /home/llc/.cache/jittor/default/g++
[i 0914 20:35:15.319920 64 install_cuda.py:37] cuda_driver_version: [11, 6]
[i 0914 20:35:15.337710 64 __init__.py:286] Found /home/llc/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/bin/nvcc(11.2.152) at /home/llc/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/bin/nvcc.
[i 0914 20:35:15.403338 64 __init__.py:286] Found addr2line(2.30) at /usr/bin/addr2line.
[i 0914 20:35:15.491815 64 compiler.py:959] py_include: -I/usr/include/python3.7m -I/usr/include/python3.7m
[i 0914 20:35:15.579729 64 compiler.py:961] extension_suffix: .cpython-37m-x86_64-linux-gnu.so
[i 0914 20:35:15.719783 64 __init__.py:178] Total mem: 7.75GB, using 2 procs for compiling.
[i 0914 20:35:16.493494 64 jit_compiler.cc:22] Load cc_path: /usr/bin/g++
[i 0914 20:35:16.493646 64 init.cc:57] Found cuda archs: [75,]
[i 0914 20:35:16.641731 64 compile_extern.py:451] mpicc not found, distribution disabled.
[i 0914 20:35:16.717446 64 compile_extern.py:20] found /home/llc/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/include/cublas.h
[i 0914 20:35:16.739669 64 compile_extern.py:20] found /home/llc/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/lib64/libcublas.so
[i 0914 20:35:16.739794 64 compile_extern.py:20] found /home/llc/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/lib64/libcublasLt.so.11
[i 0914 20:35:17.317255 64 compile_extern.py:20] found /home/llc/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/include/cudnn.h
[i 0914 20:35:17.341903 64 compile_extern.py:20] found /home/llc/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/lib64/libcudnn.so.8
[i 0914 20:35:17.341998 64 compile_extern.py:20] found /home/llc/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/lib64/libcudnn_ops_infer.so.8
[i 0914 20:35:17.349224 64 compile_extern.py:20] found /home/llc/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/lib64/libcudnn_ops_train.so.8
[i 0914 20:35:17.350055 64 compile_extern.py:20] found /home/llc/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/lib64/libcudnn_cnn_infer.so.8
[i 0914 20:35:17.395974 64 compile_extern.py:20] found /home/llc/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/lib64/libcudnn_cnn_train.so.8
[i 0914 20:35:17.411565 64 compiler.py:667] handle pyjt_include/home/llc/.local/lib/python3.7/site-packages/jittor/extern/cuda/cudnn/inc/cudnn_warper.h
[i 0914 20:35:17.923592 64 compile_extern.py:20] found /home/llc/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/include/curand.h
[i 0914 20:35:17.950855 64 compile_extern.py:20] found /home/llc/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/lib64/libcurand.so
[i 0914 20:35:18.847675 64 cuda_flags.cc:26] CUDA enabled.
Loading config from: configs/retinanet_gaofen.py
[e 0914 20:35:22.246316 64 __init__.py:996] load parameter rpn_net.retina_cls.weight failed: expect the shape of rpn_net.retina_cls.weight to be [777,256,3,3,], but got [315,256,3,3,]
[e 0914 20:35:22.246449 64 __init__.py:996] load parameter rpn_net.retina_cls.bias failed: expect the shape of rpn_net.retina_cls.bias to be [777,], but got [315,]
[w 0914 20:35:22.246808 64 __init__.py:998] load total 311 params, 2 failed
Tue Sep 14 20:35:22 2021 Loading model parameters from weights/yx_init_pretrained.pk_jt.pk
Tue Sep 14 20:35:22 2021 Loading model parameters from work_dirs/retinanet_gaofen/checkpoints/ckpt_30.pkl
Tue Sep 14 20:35:22 2021 Start running
Tue Sep 14 20:35:22 2021 Testing...
0%| | 0/1126 [00:00<?, ?it/s]
[e 0914 20:35:28.524608 64 executor.cc:527]
=== display_memory_info ===
total_cpu_ram: 7.75GB total_cuda_ram: 24GB
hold_vars: 587 lived_vars: 3579 lived_ops: 3546
update queue: 311/311
name: sfrl is_cuda: 1 used: 210.1MB(94.6%) unused: 11.94MB(5.38%) total: 222MB
name: sfrl is_cuda: 1 used: 367.1MB(92%) unused: 31.85MB(7.98%) total: 399MB
name: sfrl is_cuda: 0 used: 367.1MB(92%) unused: 31.85MB(7.98%) total: 399MB
name: sfrl is_cuda: 0 used: 180.5KB(17.6%) unused: 843.5KB(82.4%) total: 1MB
name: temp is_cuda: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_cuda: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 1021MB gpu: 621MB cpu: 400MB
free: cpu(922.4MB) gpu(22.09GB)

[e 0914 20:35:28.525250 64 executor.cc:531] [Error] source file location: /home/llc/.cache/jittor/default/g++/jit/_opkey0:broadcast_to_Tx:float32__DIM=7__BCAST=19__JIT:1__JIT_cuda:1__index_t:int32___opkey...hash:7e74aa6468b00eb_op.cc
0%| | 0/1126 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "run_net.py", line 54, in <module>
    main()
  File "run_net.py", line 45, in main
    runner.run()
  File "/usr/local/lib/python3.7/dist-packages/jdet-0.1.0.0-py3.7.egg/jdet/runner/runner.py", line 89, in run
    self.test()
  File "/home/llc/.local/lib/python3.7/site-packages/jittor/__init__.py", line 89, in inner
    ret = func(*args, **kw)
  File "/home/llc/.local/lib/python3.7/site-packages/jittor/__init__.py", line 257, in inner
    ret = func(*args, **kw)
  File "/usr/local/lib/python3.7/dist-packages/jdet-0.1.0.0-py3.7.egg/jdet/runner/runner.py", line 197, in test
    result = self.model(images,targets)
  File "/home/llc/.local/lib/python3.7/site-packages/jittor/__init__.py", line 737, in __call__
    return self.execute(*args, **kw)
  File "/usr/local/lib/python3.7/dist-packages/jdet-0.1.0.0-py3.7.egg/jdet/models/networks/retinanet.py", line 64, in execute
    results,losses = self.rpn_net(features, targets)
  File "/home/llc/.local/lib/python3.7/site-packages/jittor/__init__.py", line 737, in __call__
    return self.execute(*args, **kw)
  File "/usr/local/lib/python3.7/dist-packages/jdet-0.1.0.0-py3.7.egg/jdet/models/roi_heads/retina_head.py", line 351, in execute
    results = self.get_bboxes(all_proposals,all_bbox_pred,all_cls_score,targets)
  File "/usr/local/lib/python3.7/dist-packages/jdet-0.1.0.0-py3.7.egg/jdet/models/roi_heads/retina_head.py", line 231, in get_bboxes
    jt.sync([bbox_j, score_j])
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.sync)).

Types of your inputs are:
self = module,
args = (list, ),

The function declarations are:
void sync(const vector<VarHolder*>& vh=vector<VarHolder*>(), bool device_sync=false)

Failed reason:[f 0914 20:35:28.525344 64 executor.cc:533] Execute fused operator(116/574) failed: [Op(0x2d46dcb0:0:0:1:i1:o1:s0,broadcast_to->0x2de7ec90),Op(0x2d32f690:0:0:1:i1:o1:s0,reindex->0x2d433bc0),Op(0x2e5f0e30:0:0:1:i2:o1:s0,binary.multiply->0x2de764c0),Op(0x2e5f3e30:0:0:1:i1:o1:s0,reduce.add->0x2de7a8c0),]

Reason: [f 0914 20:35:28.524532 64 helper_cuda.h:126] CUDA error at /home/llc/.local/lib/python3.7/site-packages/jittor/src/mem/allocator/cuda_managed_allocator.cc:23 code=2( cudaErrorMemoryAllocation ) cudaMallocManaged(&ptr, size)
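
Before the CUDA failure, the log above also reports two "load parameter ... failed" lines for rpn_net.retina_cls. The following is a minimal sketch for reading those lines mechanically; it is not part of JDet or Jittor, and the names PATTERN, parse_shape and parse_mismatches are hypothetical helpers introduced only for illustration. It extracts the expected and loaded shapes and compares the output-channel counts. If the classification head's output channels equal anchors times classes (an assumption about this head, not something the log states), the common factor hints at how many classes the model and the checkpoint each assume. Whether this mismatch is related to the later cudaErrorMemoryAllocation is not established by the log.

```python
import re
from math import gcd

# The two failing parameter messages, copied from the log above
# (timestamp prefix omitted).
LOG_LINES = [
    "load parameter rpn_net.retina_cls.weight failed: expect the shape of "
    "rpn_net.retina_cls.weight to be [777,256,3,3,], but got [315,256,3,3,]",
    "load parameter rpn_net.retina_cls.bias failed: expect the shape of "
    "rpn_net.retina_cls.bias to be [777,], but got [315,]",
]

# Hypothetical pattern for this message format; adjust if the wording differs.
PATTERN = re.compile(
    r"load parameter (\S+) failed: .* to be \[([\d,]*)\], but got \[([\d,]*)\]"
)


def parse_shape(text):
    """Turn '777,256,3,3,' into (777, 256, 3, 3)."""
    return tuple(int(part) for part in text.split(",") if part)


def parse_mismatches(lines):
    """Return (parameter name, expected shape, loaded shape) per matching line."""
    found = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            found.append(
                (match.group(1), parse_shape(match.group(2)), parse_shape(match.group(3)))
            )
    return found


mismatches = parse_mismatches(LOG_LINES)
for name, expected, got in mismatches:
    # Assumption: out_channels = anchors * classes, so the gcd is a candidate
    # anchor count and the quotients are candidate class counts.
    common = gcd(expected[0], got[0])
    print(
        f"{name}: model expects {expected}, checkpoint has {got}; "
        f"common factor {common} -> {expected[0] // common} vs {got[0] // common}"
    )
```

For the values in this log the common factor is 21, giving 37 on the model side and 15 on the checkpoint side. Under the stated assumption, that would mean the loaded weights were trained for a different number of classes than the current config defines, so the classification head was not actually loaded from the checkpoint.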

@liliwannian
Has this been resolved?

@YbugY
YbugY commented Sep 14, 2022

How was this resolved in the end?

@SnowNation101
How did you resolve this?
