Running issue with Nvidia NIM for NCCL. #1567

Open

innat-asj opened this issue Jan 7, 2025 · 4 comments

Comments

innat-asj commented Jan 7, 2025

I'm trying to run NVIDIA NIM locally on Windows using WSL2. The machine has multiple GPUs, each with around 48 GB of VRAM. While running the Docker container for NVIDIA NIM, I got the following NCCL issue.

# check wsl2 version
lsb_release -a

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.1 LTS
Release:        24.04
Codename:       noble
export NGC_API_KEY=nvapi-...
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8080:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

Error logs

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

2025-01-07 18:58:17,892 [INFO] PyTorch version 2.2.2 available.
2025-01-07 18:58:18,460 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2025-01-07 18:58:18,460 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2025-01-07 18:58:18,637 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 01-07 18:58:19.318 api_server.py:489] NIM LLM API version 1.0.0
INFO 01-07 18:58:19.320 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 01-07 18:58:19.321 ngc_profile.py:219] Detected 2 compatible profile(s).
INFO 01-07 18:58:19.321 ngc_injector.py:106] Valid profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2) on GPUs [0, 1, 2]
INFO 01-07 18:58:19.321 ngc_injector.py:106] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0, 1, 2]
INFO 01-07 18:58:19.321 ngc_injector.py:141] Selected profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
INFO 01-07 18:58:21.325 ngc_injector.py:146] Profile metadata: feat_lora: false
INFO 01-07 18:58:21.325 ngc_injector.py:146] Profile metadata: precision: fp16
INFO 01-07 18:58:21.325 ngc_injector.py:146] Profile metadata: llm_engine: vllm
INFO 01-07 18:58:21.325 ngc_injector.py:146] Profile metadata: tp: 2
INFO 01-07 18:58:21.325 ngc_injector.py:166] Preparing model workspace. This step might download additional files to run the model.
INFO 01-07 18:58:27.500 ngc_injector.py:172] Model workspace is now ready. It took 6.175 seconds
2025-01-07 18:58:29,450 WARNING services.py:2009 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-01-07 18:58:29,577 INFO worker.py:1749 -- Started a local Ray instance.
INFO 01-07 18:58:30.763 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-c7inq7wy', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-c7inq7wy', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 01-07 18:58:30.949 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-07 18:58:34.233 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
(RayWorkerWrapper pid=3970) INFO 01-07 18:58:34 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
WARNING 01-07 18:58:35.850 utils.py:414] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 01-07 18:58:35 selector.py:28] Using FlashAttention backend.
(RayWorkerWrapper pid=3970) WARNING 01-07 18:58:35 utils.py:414] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(RayWorkerWrapper pid=3970) INFO 01-07 18:58:35 selector.py:28] Using FlashAttention backend.
INFO 01-07 18:58:36 pynccl_utils.py:43] vLLM is using nccl==2.19.3
(RayWorkerWrapper pid=3970) INFO 01-07 18:58:36 pynccl_utils.py:43] vLLM is using nccl==2.19.3
ERROR 01-07 18:58:36.315 worker_base.py:157] Error executing method init_device. This might cause deadlock in distributed execution.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
    return executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 300, in init_worker_distributed_environment
    pynccl_utils.init_process_group()
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
    comm = NCCLCommunicator(group=group)
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
    NCCL_CHECK(
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 498, in <module>
    engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 412, in from_engine_args
    engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 365, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 323, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 148, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 382, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 45, in _init_executor
    self._init_workers_ray(placement_group)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 181, in _init_workers_ray
    self._run_workers("init_device")
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 318, in _run_workers
    driver_worker_output = self.driver_worker.execute_method(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 158, in execute_method
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
    return executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 300, in init_worker_distributed_environment
    pynccl_utils.init_process_group()
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
    comm = NCCLCommunicator(group=group)
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
    NCCL_CHECK(
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

In the log message it says the engine is vLLM, which uses nccl==2.19.3. So, based on CUDA compatibility, I installed the matching CUDA compiler, release 12.3 (a quick check of the NCCL version bundled with PyTorch is sketched after the nvidia-smi output below).

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
nvidia-smi

Wed Jan  8 04:06:49 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.02              Driver Version: 566.03         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:01:00.0 Off |                  Off |
| 30%   22C    P8             12W /  300W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:C1:00.0  On |                  Off |
| 30%   22C    P8             11W /  300W |     838MiB /  49140MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:C2:00.0 Off |                  Off |
| 30%   22C    P8              9W /  300W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        24      G   /Xwayland                                   N/A      |
|    0   N/A  N/A        40      G   /Xwayland                                   N/A      |
|    1   N/A  N/A        24      G   /Xwayland                                   N/A      |
|    1   N/A  N/A        40      G   /Xwayland                                   N/A      |
|    2   N/A  N/A        24      G   /Xwayland                                   N/A      |
|    2   N/A  N/A        40      G   /Xwayland                                   N/A      |
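
A quick way to double-check which NCCL version the PyTorch install bundles (a minimal sketch, run in the WSL Python environment rather than inside NIM; torch.cuda.nccl.version() is standard PyTorch, nothing NIM-specific):

# Report the CUDA and NCCL versions bundled with this PyTorch build, to compare
# against the nccl==2.19.3 that the NIM log reports for vLLM.
import torch
print(torch.version.cuda)          # CUDA version PyTorch was built against
print(torch.cuda.nccl.version())   # e.g. (2, 19, 3)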

sjeaugey (Member) commented Jan 8, 2025

As the error log says, NCCL hit a CUDA error it could not recover from. It would be good to run again with NCCL_DEBUG=INFO set to get an explanation of which CUDA call failed, and with which error. That would probably tell us what the problem is.

Also, I noticed Xwayland is running on the GPUs; that could severely impact CUDA performance (since CUDA would have to time-share with Xwayland).

innat-asj (Author) commented Jan 8, 2025

Please note that, with the same setup (CUDA, NCCL), I can successfully run data-parallel and model-parallel computation through the Hugging Face API without encountering such an issue (an NCCL conflict with CUDA). Using the torch API, I can also confirm the following:

Python 3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.distributed.is_nccl_available()
True
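
Going one step further, a standalone NCCL all-reduce test can be run outside NIM (a minimal sketch, assuming PyTorch with CUDA and at least two GPUs visible inside WSL; the rendezvous port is arbitrary):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # One process per GPU, rendezvous over localhost, NCCL backend.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)  # sum across ranks; every rank should end up with world_size
    print(f"rank {rank}: all_reduce -> {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = min(2, torch.cuda.device_count())  # mirror the tp=2 NIM profile
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

If this completes, NCCL and CUDA are working together under WSL, which points the problem at the NIM container environment rather than the host setup.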

But when running the NVIDIA NIM Docker container in the same environment, I get this unhandled CUDA NCCL error. I'm a bit puzzled here: if it were all about my environment settings and configuration, torch should also complain, but it doesn't.

As you suggested, I ran with NCCL_DEBUG=INFO. Here is the full traceback.

docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NCCL_DEBUG=INFO \
    -e NGC_API_KEY=$NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8080:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:latest

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.3
Model: meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

2025-01-08 18:35:25,541 [INFO] PyTorch version 2.2.2 available.
2025-01-08 18:35:26,080 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2025-01-08 18:35:26,080 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2025-01-08 18:35:26,207 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 01-08 18:35:26.807 api_server.py:489] NIM LLM API version 1.0.0
INFO 01-08 18:35:26.809 ngc_profile.py:218] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 01-08 18:35:26.809 ngc_profile.py:220] Detected 2 compatible profile(s).
INFO 01-08 18:35:26.810 ngc_injector.py:107] Valid profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2) on GPUs [0, 1, 2]
INFO 01-08 18:35:26.810 ngc_injector.py:107] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0, 1, 2]
INFO 01-08 18:35:26.810 ngc_injector.py:142] Selected profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
INFO 01-08 18:35:26.811 ngc_injector.py:147] Profile metadata: feat_lora: false
INFO 01-08 18:35:26.811 ngc_injector.py:147] Profile metadata: llm_engine: vllm
INFO 01-08 18:35:26.811 ngc_injector.py:147] Profile metadata: precision: fp16
INFO 01-08 18:35:26.811 ngc_injector.py:147] Profile metadata: tp: 2
INFO 01-08 18:35:26.811 ngc_injector.py:167] Preparing model workspace. This step might download additional files to run the model.
INFO 01-08 18:35:26.813 ngc_injector.py:173] Model workspace is now ready. It took 0.002 seconds
2025-01-08 18:35:28,883 INFO worker.py:1749 -- Started a local Ray instance.
INFO 01-08 18:35:30.197 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-kve8c9q8', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-kve8c9q8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 01-08 18:35:30.397 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-08 18:35:33.867 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
(RayWorkerWrapper pid=3960) INFO 01-08 18:35:33 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
WARNING 01-08 18:35:34.744 utils.py:414] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 01-08 18:35:34 selector.py:28] Using FlashAttention backend.
(RayWorkerWrapper pid=3960) WARNING 01-08 18:35:34 utils.py:414] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(RayWorkerWrapper pid=3960) INFO 01-08 18:35:34 selector.py:28] Using FlashAttention backend.
INFO 01-08 18:35:35 pynccl_utils.py:43] vLLM is using nccl==2.19.3
(RayWorkerWrapper pid=3960) INFO 01-08 18:35:35 pynccl_utils.py:43] vLLM is using nccl==2.19.3
ERROR 01-08 18:35:35.375 worker_base.py:157] Error executing method init_device. This might cause deadlock in distributed execution.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
    return executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 300, in init_worker_distributed_environment
    pynccl_utils.init_process_group()
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
    comm = NCCLCommunicator(group=group)
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
    NCCL_CHECK(
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 498, in <module>
    engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 412, in from_engine_args
    engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 365, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 323, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 148, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 382, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 45, in _init_executor
    self._init_workers_ray(placement_group)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 181, in _init_workers_ray
    self._run_workers("init_device")
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 318, in _run_workers
    driver_worker_output = self.driver_worker.execute_method(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 158, in execute_method
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
    return executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 300, in init_worker_distributed_environment
    pynccl_utils.init_process_group()
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
    comm = NCCLCommunicator(group=group)
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
    NCCL_CHECK(
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] Traceback (most recent call last):
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in init_device
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]     init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 300, in init_worker_distributed_environment
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]     pynccl_utils.init_process_group()
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]     comm = NCCLCommunicator(group=group)
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]     NCCL_CHECK(
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]     raise RuntimeError(f"NCCL error: {error_str}")
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
*** SIGSEGV received at time=1736361335 on cpu 40 ***
PC: @     0x7f1ca0a69ab8  (unknown)  ncclProxyService()
    @     0x7f1d49518520  (unknown)  (unknown)
[2025-01-08 18:35:35,877 E 31 4136] logging.cc:365: *** SIGSEGV received at time=1736361335 on cpu 40 ***
[2025-01-08 18:35:35,877 E 31 4136] logging.cc:365: PC: @     0x7f1ca0a69ab8  (unknown)  ncclProxyService()
[2025-01-08 18:35:35,878 E 31 4136] logging.cc:365:     @     0x7f1d49518520  (unknown)  (unknown)
Fatal Python error: Segmentation fault


Extension modules: ujson, charset_normalizer.md, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, cuda._lib.utils, cuda._cuda.ccuda, cuda.ccuda, cuda.cuda, cuda._lib.ccudart.utils, cuda._lib.ccudart.ccudart, cuda.ccudart, cuda.cudart, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, mpi4py.MPI, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, markupsafe._speedups (total: 131)
[d21b9b0b3aed:00031] *** Process received signal ***
[d21b9b0b3aed:00031] Signal: Segmentation fault (11)
[d21b9b0b3aed:00031] Signal code:  (-6)
[d21b9b0b3aed:00031] Failing at address: 0x3e80000001f
[d21b9b0b3aed:00031] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f1d49518520]
[d21b9b0b3aed:00031] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f1d4956c9fc]
[d21b9b0b3aed:00031] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f1d49518476]
[d21b9b0b3aed:00031] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f1d49518520]
[d21b9b0b3aed:00031] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f1d4956c9fc]
[d21b9b0b3aed:00031] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f1d49518476]
[d21b9b0b3aed:00031] [ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f1d49518520]
[d21b9b0b3aed:00031] [ 7] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2(+0x69ab8)[0x7f1ca0a69ab8]
[d21b9b0b3aed:00031] [ 8] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f1d4956aac3]
[d21b9b0b3aed:00031] [ 9] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7f1d495fc850]
/opt/nim/start-server.sh: line 61:    31 Segmentation fault      (core dumped) python3 -m vllm_nvext.entrypoints.openai.api_server

P.S. In WSL, although I did run export NCCL_DEBUG=INFO and then ran the above docker command, the log message still says 'unhandled cuda error (run with NCCL_DEBUG=INFO for details)' and no NCCL INFO output appears.
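
One quick, hypothetical way to check whether the variable actually reaches the Python processes inside the container (NCCL only prints INFO lines when NCCL_DEBUG is set in the environment of the process that initializes it):

# Hypothetical check, run from a Python shell inside the container:
import os
print("NCCL_DEBUG =", os.environ.get("NCCL_DEBUG"))  # expect "INFO" if -e NCCL_DEBUG=INFO took effect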

sjeaugey (Member) commented

Indeed, it doesn't look like the environment variable is taken into account; I don't see any NCCL INFO log.

I'm not sure what's wrong in your script though.

innat-asj (Author) commented

I was able to run NIM on a native Ubuntu setup. Is there any possibility that NIM might not work on WSL?
