Running issue with Nvidia NIM for NCCL. #1567
Comments
As the error log says, NCCL got a CUDA error it could not recover from; it would be good to run again with NCCL_DEBUG=INFO set to get an explanation of which CUDA call failed, and with which error. That would probably tell us what the problem is. Also, I noticed Wayland is running on the GPU; that could severely impact CUDA performance (since it would have to time-share with Xwayland).
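A minimal sketch of the checks suggested above, assuming a standard Linux/WSL2 shell with nvidia-smi on the PATH; the process names to look for (e.g. Xwayland) will vary by setup:

# See which processes are currently using the GPUs; a display server such as
# Xwayland sharing the device would show up in the process list.
nvidia-smi

# Re-run the failing workload with NCCL debug output enabled so the failing
# CUDA call and its error code get printed.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET   # optional: narrow the debug output to the relevant subsystems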
Please note that, with the same setup (CUDA, NCCL), using the Hugging Face API I can successfully run data-parallel and model-parallel computation without encountering such an issue (NCCL conflicting with CUDA). Using the torch API, I can also confirm the following:
Python 3.12.3 (main, Nov 6 2024, 18:32:19) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.distributed.is_nccl_available()
True
But when running the NVIDIA NIM Docker container in the same environment, I get this unhandled CUDA NCCL error. I'm a bit puzzled here: if it were all about my environment settings and configuration, torch should also complain, but it doesn't. As you suggested, I ran with NCCL_DEBUG=INFO. Here is the full traceback (a standalone NCCL check outside NIM is sketched after the log below).
docker run -it --rm \
--gpus all \
--shm-size=16GB \
-e NCCL_DEBUG=INFO \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8080:8000 \
nvcr.io/nim/meta/llama3-8b-instruct:latest
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.3
Model: meta/llama3-8b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.
2025-01-08 18:35:25,541 [INFO] PyTorch version 2.2.2 available.
2025-01-08 18:35:26,080 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2025-01-08 18:35:26,080 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2025-01-08 18:35:26,207 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 01-08 18:35:26.807 api_server.py:489] NIM LLM API version 1.0.0
INFO 01-08 18:35:26.809 ngc_profile.py:218] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 01-08 18:35:26.809 ngc_profile.py:220] Detected 2 compatible profile(s).
INFO 01-08 18:35:26.810 ngc_injector.py:107] Valid profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2) on GPUs [0, 1, 2]
INFO 01-08 18:35:26.810 ngc_injector.py:107] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0, 1, 2]
INFO 01-08 18:35:26.810 ngc_injector.py:142] Selected profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
INFO 01-08 18:35:26.811 ngc_injector.py:147] Profile metadata: feat_lora: false
INFO 01-08 18:35:26.811 ngc_injector.py:147] Profile metadata: llm_engine: vllm
INFO 01-08 18:35:26.811 ngc_injector.py:147] Profile metadata: precision: fp16
INFO 01-08 18:35:26.811 ngc_injector.py:147] Profile metadata: tp: 2
INFO 01-08 18:35:26.811 ngc_injector.py:167] Preparing model workspace. This step might download additional files to run the model.
INFO 01-08 18:35:26.813 ngc_injector.py:173] Model workspace is now ready. It took 0.002 seconds
2025-01-08 18:35:28,883 INFO worker.py:1749 -- Started a local Ray instance.
INFO 01-08 18:35:30.197 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-kve8c9q8', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-kve8c9q8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 01-08 18:35:30.397 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-08 18:35:33.867 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
(RayWorkerWrapper pid=3960) INFO 01-08 18:35:33 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
WARNING 01-08 18:35:34.744 utils.py:414] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 01-08 18:35:34 selector.py:28] Using FlashAttention backend.
(RayWorkerWrapper pid=3960) WARNING 01-08 18:35:34 utils.py:414] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(RayWorkerWrapper pid=3960) INFO 01-08 18:35:34 selector.py:28] Using FlashAttention backend.
INFO 01-08 18:35:35 pynccl_utils.py:43] vLLM is using nccl==2.19.3
(RayWorkerWrapper pid=3960) INFO 01-08 18:35:35 pynccl_utils.py:43] vLLM is using nccl==2.19.3
ERROR 01-08 18:35:35.375 worker_base.py:157] Error executing method init_device. This might cause deadlock in distributed execution.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
return executor(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in init_device
init_worker_distributed_environment(self.parallel_config, self.rank,
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 300, in init_worker_distributed_environment
pynccl_utils.init_process_group()
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
comm = NCCLCommunicator(group=group)
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
NCCL_CHECK(
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 498, in <module>
engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 412, in from_engine_args
engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 365, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 323, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 148, in __init__
self.model_executor = executor_class(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 382, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 45, in _init_executor
self._init_workers_ray(placement_group)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 181, in _init_workers_ray
self._run_workers("init_device")
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 318, in _run_workers
driver_worker_output = self.driver_worker.execute_method(
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 158, in execute_method
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
return executor(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in init_device
init_worker_distributed_environment(self.parallel_config, self.rank,
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 300, in init_worker_distributed_environment
pynccl_utils.init_process_group()
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
comm = NCCLCommunicator(group=group)
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
NCCL_CHECK(
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] Traceback (most recent call last):
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] return executor(*args, **kwargs)
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in init_device
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 300, in init_worker_distributed_environment
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] pynccl_utils.init_process_group()
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] comm = NCCLCommunicator(group=group)
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] NCCL_CHECK(
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] raise RuntimeError(f"NCCL error: {error_str}")
(RayWorkerWrapper pid=3960) ERROR 01-08 18:35:35 worker_base.py:157] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
*** SIGSEGV received at time=1736361335 on cpu 40 ***
PC: @ 0x7f1ca0a69ab8 (unknown) ncclProxyService()
@ 0x7f1d49518520 (unknown) (unknown)
[2025-01-08 18:35:35,877 E 31 4136] logging.cc:365: *** SIGSEGV received at time=1736361335 on cpu 40 ***
[2025-01-08 18:35:35,877 E 31 4136] logging.cc:365: PC: @ 0x7f1ca0a69ab8 (unknown) ncclProxyService()
[2025-01-08 18:35:35,878 E 31 4136] logging.cc:365: @ 0x7f1d49518520 (unknown) (unknown)
Fatal Python error: Segmentation fault
Extension modules: ujson, charset_normalizer.md, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, cuda._lib.utils, cuda._cuda.ccuda, cuda.ccuda, cuda.cuda, cuda._lib.ccudart.utils, cuda._lib.ccudart.ccudart, cuda.ccudart, cuda.cudart, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, mpi4py.MPI, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, markupsafe._speedups (total: 131)
[d21b9b0b3aed:00031] *** Process received signal ***
[d21b9b0b3aed:00031] Signal: Segmentation fault (11)
[d21b9b0b3aed:00031] Signal code: (-6)
[d21b9b0b3aed:00031] Failing at address: 0x3e80000001f
[d21b9b0b3aed:00031] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f1d49518520]
[d21b9b0b3aed:00031] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f1d4956c9fc]
[d21b9b0b3aed:00031] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f1d49518476]
[d21b9b0b3aed:00031] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f1d49518520]
[d21b9b0b3aed:00031] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f1d4956c9fc]
[d21b9b0b3aed:00031] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f1d49518476]
[d21b9b0b3aed:00031] [ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f1d49518520]
[d21b9b0b3aed:00031] [ 7] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2(+0x69ab8)[0x7f1ca0a69ab8]
[d21b9b0b3aed:00031] [ 8] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f1d4956aac3]
[d21b9b0b3aed:00031] [ 9] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7f1d495fc850]
/opt/nim/start-server.sh: line 61: 31 Segmentation fault (core dumped) python3 -m vllm_nvext.entrypoints.openai.api_server
P.S. In WSL, though I did run with NCCL_DEBUG=INFO set, I don't see any additional NCCL debug output in the log.
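To isolate whether NCCL itself works under WSL2 outside the NIM container, a minimal two-GPU all_reduce test can be run with torchrun. This is only a sketch, assuming PyTorch with CUDA support is installed on the host; the file name nccl_check.py is illustrative.

cat > nccl_check.py <<'EOF'
# Minimal NCCL sanity check: each rank places a tensor on its GPU and all-reduces it.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # torchrun provides the rendezvous env vars
rank = dist.get_rank()
torch.cuda.set_device(rank)
x = torch.ones(1, device="cuda") * rank
dist.all_reduce(x)                        # sum across ranks over NCCL
print(f"rank {rank}: all_reduce result = {x.item()}")
dist.destroy_process_group()
EOF

# One process per GPU; NCCL_DEBUG=INFO prints which CUDA call fails, if any.
NCCL_DEBUG=INFO torchrun --nproc_per_node=2 nccl_check.py

If this fails with the same unhandled CUDA error, the problem is with NCCL on WSL2 rather than with NIM specifically.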
Indeed, it doesn't look like the environment variable is taken into account; I don't see any NCCL INFO lines in the output. I'm not sure what's wrong in your script, though.
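One way to double-check whether the variable actually reaches the container is to dump the container environment; this is only a sketch, and it overrides the image entrypoint purely to print the environment:

# Environment as the container sees it; NCCL_DEBUG should appear in the output.
docker run --rm --entrypoint env \
  -e NCCL_DEBUG=INFO \
  nvcr.io/nim/meta/llama3-8b-instruct:latest | grep NCCL

# Or, for a container that is already running (<container_id> is a placeholder):
docker exec <container_id> printenv NCCL_DEBUG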
I was able to run NIM on a native Ubuntu setup. Is there any possibility that NIM might not work on WSL?
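Not something confirmed in this thread, but a commonly suggested experiment for multi-GPU NCCL failures under WSL2 is to disable NCCL's peer-to-peer and shared-memory transports and see whether initialization gets further; treat these variables as a diagnostic, not a fix:

docker run -it --rm \
  --gpus all \
  --shm-size=16GB \
  -e NCCL_DEBUG=INFO \
  -e NCCL_P2P_DISABLE=1 \
  -e NCCL_SHM_DISABLE=1 \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8080:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest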
I'm trying to run NVIDIA NIM locally on Windows using WSL2. The machine contains multiple GPUs, each with around 48 GB of VRAM. While running the Docker container for NVIDIA NIM, I got the following NCCL issue.
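Before starting NIM, it may help to confirm that all GPUs are visible both to WSL2 and to Docker's NVIDIA runtime; the CUDA base image tag below is only an example and may need adjusting to one available locally:

# GPUs as seen by WSL2 itself.
nvidia-smi

# GPUs as seen from inside a container (exercises the NVIDIA container toolkit).
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi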
# check wsl2 version
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.1 LTS
Release:        24.04
Codename:       noble
Error logs
In the log message, it says it is using vLLM, which uses nccl==2.19.3. So, based on the CUDA compatibility, I installed the appropriate CUDA toolkit, 12.3.
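A sketch of how the installed versions could be cross-checked on the WSL2 host, assuming PyTorch is installed in the active Python environment:

# CUDA toolkit (compiler) version installed on the host.
nvcc --version

# Driver-side CUDA version reported by the driver.
nvidia-smi | head -n 5

# NCCL version bundled with the installed PyTorch.
python3 -c "import torch; print(torch.cuda.nccl.version())"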