Running issue with Nvidia NIM for NCCL. #1567

Open

innat-asj opened this issue Jan 7, 2025 · 4 comments

Comments

innat-asj commented Jan 7, 2025

I'm trying to run NVIDIA NIM locally on Windows using WSL2. The machine has multiple GPUs, each with around 48 GB of VRAM. While running the Docker container for NVIDIA NIM, I got the following NCCL issue.

# check wsl2 version
lsb_release -a

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.1 LTS
Release:        24.04
Codename:       noble
export NGC_API_KEY=nvapi-...
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8080:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

Error logs

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

2025-01-07 18:58:17,892 [INFO] PyTorch version 2.2.2 available.
2025-01-07 18:58:18,460 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2025-01-07 18:58:18,460 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2025-01-07 18:58:18,637 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 01-07 18:58:19.318 api_server.py:489] NIM LLM API version 1.0.0
INFO 01-07 18:58:19.320 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 01-07 18:58:19.321 ngc_profile.py:219] Detected 2 compatible profile(s).
INFO 01-07 18:58:19.321 ngc_injector.py:106] Valid profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2) on GPUs [0, 1, 2]
INFO 01-07 18:58:19.321 ngc_injector.py:106] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0, 1, 2]
INFO 01-07 18:58:19.321 ngc_injector.py:141] Selected profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
INFO 01-07 18:58:21.325 ngc_injector.py:146] Profile metadata: feat_lora: false
INFO 01-07 18:58:21.325 ngc_injector.py:146] Profile metadata: precision: fp16
INFO 01-07 18:58:21.325 ngc_injector.py:146] Profile metadata: llm_engine: vllm
INFO 01-07 18:58:21.325 ngc_injector.py:146] Profile metadata: tp: 2
INFO 01-07 18:58:21.325 ngc_injector.py:166] Preparing model workspace. This step might download additional files to run the model.
INFO 01-07 18:58:27.500 ngc_injector.py:172] Model workspace is now ready. It took 6.175 seconds
2025-01-07 18:58:29,450 WARNING services.py:2009 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-01-07 18:58:29,577 INFO worker.py:1749 -- Started a local Ray instance.
INFO 01-07 18:58:30.763 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-c7inq7wy', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-c7inq7wy', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 01-07 18:58:30.949 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-07 18:58:34.233 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
(RayWorkerWrapper pid=3970) INFO 01-07 18:58:34 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
WARNING 01-07 18:58:35.850 utils.py:414] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 01-07 18:58:35 selector.py:28] Using FlashAttention backend.
(RayWorkerWrapper pid=3970) WARNING 01-07 18:58:35 utils.py:414] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(RayWorkerWrapper pid=3970) INFO 01-07 18:58:35 selector.py:28] Using FlashAttention backend.
INFO 01-07 18:58:36 pynccl_utils.py:43] vLLM is using nccl==2.19.3
(RayWorkerWrapper pid=3970) INFO 01-07 18:58:36 pynccl_utils.py:43] vLLM is using nccl==2.19.3
ERROR 01-07 18:58:36.315 worker_base.py:157] Error executing method init_device. This might cause deadlock in distributed execution.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
    return executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 300, in init_worker_distributed_environment
    pynccl_utils.init_process_group()
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
    comm = NCCLCommunicator(group=group)
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
    NCCL_CHECK(
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 498, in <module>
    engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 412, in from_engine_args
    engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 365, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 323, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 148, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 382, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 45, in _init_executor
    self._init_workers_ray(placement_group)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 181, in _init_workers_ray
    self._run_workers("init_device")
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 318, in _run_workers
    driver_worker_output = self.driver_worker.execute_method(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 158, in execute_method
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
    return executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 300, in init_worker_distributed_environment
    pynccl_utils.init_process_group()
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
    comm = NCCLCommunicator(group=group)
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
    NCCL_CHECK(
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

In the log message it says the engine is vLLM, which uses nccl==2.19.3. So, based on CUDA compatibility, I installed the matching CUDA compiler, release 12.3 (a quick check of the NCCL version bundled with PyTorch is sketched after the nvidia-smi output below).

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
nvidia-smi

Wed Jan  8 04:06:49 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.02              Driver Version: 566.03         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:01:00.0 Off |                  Off |
| 30%   22C    P8             12W /  300W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:C1:00.0  On |                  Off |
| 30%   22C    P8             11W /  300W |     838MiB /  49140MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:C2:00.0 Off |                  Off |
| 30%   22C    P8              9W /  300W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        24      G   /Xwayland                                   N/A      |
|    0   N/A  N/A        40      G   /Xwayland                                   N/A      |
|    1   N/A  N/A        24      G   /Xwayland                                   N/A      |
|    1   N/A  N/A        40      G   /Xwayland                                   N/A      |
|    2   N/A  N/A        24      G   /Xwayland                                   N/A      |
|    2   N/A  N/A        40      G   /Xwayland                                   N/A      |
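
A quick way to double-check which NCCL version the PyTorch install bundles (a minimal sketch, run in the WSL Python environment rather than inside NIM; torch.cuda.nccl.version() is standard PyTorch, nothing NIM-specific):

# Report the CUDA and NCCL versions bundled with this PyTorch build, to compare
# against the nccl==2.19.3 that the NIM log reports for vLLM.
import torch
print(torch.version.cuda)          # CUDA version PyTorch was built against
print(torch.cuda.nccl.version())   # e.g. (2, 19, 3)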

sjeaugey (Member) commented Jan 8, 2025

As the error log says, NCCL hit a CUDA error it could not recover from. It would be good to run again with NCCL_DEBUG=INFO set to get an explanation of which CUDA call failed, and with which error. That would probably tell us what the problem is.

Also, I noticed Xwayland is running on the GPUs; that could severely impact CUDA performance (since CUDA would have to time-share with Xwayland).

innat-asj (Author) commented Jan 8, 2025

Please note that, with the same setup (CUDA, NCCL), I can successfully run data-parallel and model-parallel computation through the Hugging Face API without encountering such an issue (an NCCL conflict with CUDA). Using the torch API, I can also confirm the following:

Python 3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.distributed.is_nccl_available()
True
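
Going one step further, a standalone NCCL all-reduce test can be run outside NIM (a minimal sketch, assuming PyTorch with CUDA and at least two GPUs visible inside WSL; the rendezvous port is arbitrary):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # One process per GPU, rendezvous over localhost, NCCL backend.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)  # sum across ranks; every rank should end up with world_size
    print(f"rank {rank}: all_reduce -> {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = min(2, torch.cuda.device_count())  # mirror the tp=2 NIM profile
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

If this completes, NCCL and CUDA are working together under WSL, which points the problem at the NIM container environment rather than the host setup.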

But when running the NVIDIA NIM Docker container in the same environment, I get this unhandled CUDA NCCL error. I'm a bit puzzled here: if it were all about my environment settings and configuration, torch should also complain, but it doesn't.

As you suggested, I ran with NCCL_DEBUG=INFO. Here is the full traceback.

docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NCCL_DEBUG=INFO \
    -e NGC_API_KEY=$NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8080:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:latest

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.3
Model: meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

2025-01-08 18:35:25,541 [INFO] PyTorch version 2.2.2 available.
2025-01-08 18:35:26,080 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2025-01-08 18:35:26,080 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2025-01-08 18:35:26,207 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 01-08 18:35:26.807 api_server.py:489] NIM LLM API version 1.0.0
INFO 01-08 18:35:26.809 ngc_profile.py:218] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 01-08 18:35:26.809 ngc_profile.py:220] Detected 2 compatible profile(s).
INFO 01-08 18:35:26.810 ngc_injector.py:107] Valid profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2) on GPUs [0, 1, 2]
INFO 01-08 18:35:26.810 ngc_injector.py:107] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0, 1, 2]
INFO 01-08 18:35:26.810 ngc_injector.py:142] Selected profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
INFO 01-08 18:35:26.811 ngc_injector.py:147] Profile metadata: feat_lora: false
INFO 01-08 18:35:26.811 ngc_injector.py:147] Profile metadata: llm_engine: vllm
INFO 01-08 18:35:26.811 ngc_injector.py:147] Profile metadata: precision: fp16
INFO 01-08 18:35:26.811 ngc_injector.py:147] Profile metadata: tp: 2
INFO 01-08 18:35:26.811 ngc_injector.py:167] Preparing model workspace. This step might download additional files to run the model.
INFO 01-08 18:35:26.813 ngc_injector.py:173] Model workspace is now ready. It took 0.002 seconds
2025-01-08 18:35:28,883 INFO worker.py:1749 -- Started a local Ray instance.
INFO 01-08 18:35:30.197 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-kve8c9q8', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-kve8c9q8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 01-08 18:35:30.397 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-08 18:35:33.867 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
(RayWorkerWrapper pid=3960) INFO 01-08 18:35:33 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
WARNING 01-08 18:35:34.744 utils.py:414] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 01-08 18:35:34 selector.py:28] Using FlashAttention backend.
(RayWorkerWrapper pid=3960) WARNING 01-08 18:35:34 utils.py:414] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(RayWorkerWrapper pid=3960) INFO 01-08 18:35:34 selector.py:28] Using FlashAttention backend.
INFO 01-08 18:35:35 pynccl_utils.py:43] vLLM is using nccl==2.19.3
(RayWorkerWrapper pid=3960) INFO 01-08 18:35:35 pynccl_utils.py:43] vLLM is using nccl==2.19.3
ERROR 01-08 18:35:35.375 worker_base.py:157] Error executing method init_device. This might cause deadlock in distributed execution.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
    return executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 300, in init_worker_distributed_environment
    pynccl_utils.init_process_group()
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
    comm = NCCLCommunicator(group=group)
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
    NCCL_CHECK(
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 498, in <module>
    engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 412, in from_engine_args
    engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 365, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 323, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 148, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 382, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 45, in _init_executor
    self._init_workers_ray(placement_group)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 181, in _init_workers_ray
    self._run_workers("init_device")
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 318, in _run_workers
    driver_worker_output = self.driver_worker.execute_method(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 158, in execute_method
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
    return executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 300, in init_worker_distributed_environment
    pynccl_utils.init_process_group()
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
    comm = NCCLCommunicator(group=group)
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
    NCCL_CHECK(
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] Traceback (most recent call last):
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in init_device
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]     init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 300, in init_worker_distributed_environment
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]     pynccl_utils.init_process_group()
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]     comm = NCCLCommunicator(group=group)
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]     NCCL_CHECK(
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157]     raise RuntimeError(f"NCCL error: {error_str}")
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
*** SIGSEGV received at time=1736361335 on cpu 40 ***
PC: @     0x7f1ca0a69ab8  (unknown)  ncclProxyService()
    @     0x7f1d49518520  (unknown)  (unknown)
[2025-01-08 18:35:35,877 E 31 4136] logging.cc:365: *** SIGSEGV received at time=1736361335 on cpu 40 ***
[2025-01-08 18:35:35,877 E 31 4136] logging.cc:365: PC: @     0x7f1ca0a69ab8  (unknown)  ncclProxyService()
[2025-01-08 18:35:35,878 E 31 4136] logging.cc:365:     @     0x7f1d49518520  (unknown)  (unknown)
Fatal Python error: Segmentation fault


Extension modules: ujson, charset_normalizer.md, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, cuda._lib.utils, cuda._cuda.ccuda, cuda.ccuda, cuda.cuda, cuda._lib.ccudart.utils, cuda._lib.ccudart.ccudart, cuda.ccudart, cuda.cudart, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, mpi4py.MPI, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, markupsafe._speedups (total: 131)
[d21b9b0b3aed:00031] *** Process received signal ***
[d21b9b0b3aed:00031] Signal: Segmentation fault (11)
[d21b9b0b3aed:00031] Signal code:  (-6)
[d21b9b0b3aed:00031] Failing at address: 0x3e80000001f
[d21b9b0b3aed:00031] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f1d49518520]
[d21b9b0b3aed:00031] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f1d4956c9fc]
[d21b9b0b3aed:00031] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f1d49518476]
[d21b9b0b3aed:00031] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f1d49518520]
[d21b9b0b3aed:00031] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f1d4956c9fc]
[d21b9b0b3aed:00031] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f1d49518476]
[d21b9b0b3aed:00031] [ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f1d49518520]
[d21b9b0b3aed:00031] [ 7] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2(+0x69ab8)[0x7f1ca0a69ab8]
[d21b9b0b3aed:00031] [ 8] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f1d4956aac3]
[d21b9b0b3aed:00031] [ 9] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7f1d495fc850]
/opt/nim/start-server.sh: line 61:    31 Segmentation fault      (core dumped) python3 -m vllm_nvext.entrypoints.openai.api_server

P.S. In WSL, although I did run export NCCL_DEBUG=INFO and then ran the above docker command, the log message still says 'unhandled cuda error (run with NCCL_DEBUG=INFO for details)' and no NCCL INFO output appears.
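
One quick, hypothetical way to check whether the variable actually reaches the Python processes inside the container (NCCL only prints INFO lines when NCCL_DEBUG is set in the environment of the process that initializes it):

# Hypothetical check, run from a Python shell inside the container:
import os
print("NCCL_DEBUG =", os.environ.get("NCCL_DEBUG"))  # expect "INFO" if -e NCCL_DEBUG=INFO took effect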

sjeaugey (Member) commented

Indeed, it doesn't look like the environment variable is taken into account; I don't see any NCCL INFO log.

I'm not sure what's wrong in your script though.

innat-asj (Author) commented

I was able to run NIM on a native Ubuntu setup. Is there any possibility that NIM might not work on WSL?
