Does Torch Support NPU Architectures like Ascend MDC910B and Multi-GPU Quantization for Large Models? #1405
Is Torch limited to running on CUDA, or does it also support NPU architectures like the Ascend MDC910B? Additionally, is it possible to use multiple GPUs for large-model quantization when a single 80 GB GPU is insufficient for a model whose runtime memory exceeds 80 GB?

Comments
For the NPU backend: not right now, but we do want to expand the backends we support; see #1082. For using multiple GPUs for quantization, can you be a bit more specific? Are you talking about quantizing a large model that does not fit into the memory of a single GPU? I remember @kwen2501 mentioned that we can use pipeline parallelism for that.
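As a rough illustration of the pipeline-parallelism suggestion above, here is a minimal sketch using torch.distributed.pipelining (available in PyTorch 2.4+). The toy model, the "layers.2" split point, and the two-rank torchrun launch are assumptions for the example, not details from this thread.

```python
# Minimal pipeline-parallel sketch with torch.distributed.pipelining
# (PyTorch >= 2.4), assumed launched as: torchrun --nproc-per-node=2 pp.py
# The toy model and the "layers.2" split point are illustrative assumptions.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.pipelining import ScheduleGPipe, SplitPoint, pipeline

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(4)])

    def forward(self, x):
        return self.layers(x)

rank = int(os.environ["RANK"])
device = torch.device(f"cuda:{rank}")
torch.cuda.set_device(device)
dist.init_process_group("nccl")

# Cut the model in two before "layers.2"; each rank keeps one half, so each
# GPU holds only half the parameters. For a model too big to even build in
# host memory, construct it on the meta device before calling pipeline().
microbatch = torch.randn(2, 1024)
pipe = pipeline(
    ToyModel(),
    mb_args=(microbatch,),
    split_spec={"layers.2": SplitPoint.BEGINNING},
)
stage = pipe.build_stage(rank, device)
schedule = ScheduleGPipe(stage, n_microbatches=4)

full_batch = torch.randn(8, 1024, device=device)  # 4 microbatches of 2
out = schedule.step(full_batch) if rank == 0 else schedule.step()
dist.destroy_process_group()
```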
I only have an A100 with 40 GB of memory at present. However, when I run a model that requires 85 GB of runtime memory, the GPU memory is not enough. What should I do? Are there any ways to reduce the memory usage?
You can use either the Tensor Parallel or the Pipeline Parallel subpackages from torch to shard the model onto different devices.
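For the Tensor Parallel route, a corresponding minimal sketch using torch.distributed's tensor-parallel API might look like the following; the two-GPU mesh, the toy MLP, and the w1/w2 layer names are illustrative assumptions.

```python
# Minimal Tensor Parallel sketch (assumed 2-GPU run launched via torchrun).
# The toy MLP and the "w1"/"w2" layer names are illustrative assumptions.
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class ToyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.w1 = nn.Linear(1024, 4096)
        self.w2 = nn.Linear(4096, 1024)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

# One mesh dimension across 2 GPUs; w1 is sharded column-wise and w2
# row-wise, so each rank stores roughly half of each weight matrix.
mesh = init_device_mesh("cuda", (2,))
model = parallelize_module(
    ToyMLP().cuda(),
    mesh,
    {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
)
out = model(torch.randn(8, 1024, device="cuda"))
```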
One thing to check: ao/torchao/quantization/quant_api.py, line 464 (at commit 19b3bb5).
cc @gau-nernst: it might be good to add a bit more docs for this feature (e.g. attaching the benchmarking numbers you posted earlier in some issues).
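The permalink above is not quoted in the thread, so exactly what sits at that line is unconfirmed here; torchao's top-level quantization entry point in quant_api.py is quantize_, and a minimal weight-only int8 example of it might look like this (the toy model is an assumption):

```python
# Hedged sketch of torchao's quantize_ entry point in quant_api.py.
# Whether line 464 at commit 19b3bb5 points exactly at this function is
# not verified here; the toy model below is an illustrative assumption.
import torch
from torchao.quantization import int8_weight_only, quantize_

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)
# Swaps each Linear's weight for an int8 weight-only quantized tensor in
# place, cutting weight memory roughly 4x vs fp32 (about 2x vs bf16).
quantize_(model, int8_weight_only())
```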