
Does Torch Support NPU Architectures like Ascend MDC910B and Multi-GPU Quantization for Large Models? #1405

Open
Lenan22 opened this issue Dec 12, 2024 · 4 comments

Comments

@Lenan22

Lenan22 commented Dec 12, 2024

Is Torch limited to running on CUDA, or does it also support NPU architectures like Ascend MDC910B? Additionally, is it possible to use multi-GPUs for large model quantization when a single 80GB GPU is insufficient to handle a model with runtime memory exceeding 80GB?

@jerryzh168
Contributor

for NPU backend: not right now, but we do want to expand the backends we support: #1082

for using multi-GPU for quantization, can you be a bit more specific? are you talking about quantizing a large model that does not fit into the memory of a single GPU? I remember @kwen2501 mentioned that we can use Pipeline parallelism for that.

@Lenan22
Author

Lenan22 commented Dec 12, 2024


I only have an A100 with 40 GB of memory at present. However, when I run a model that requires 85 GB of runtime memory, the GPU runs out of memory. What should I do? Are there any ways to reduce the memory usage?

@kwen2501
Contributor

You can use either the Tensor Parallel or the Pipeline Parallel subpackages from torch to shard the model onto different devices.
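
To make the Tensor Parallel route concrete, here is a minimal sketch following the pattern of the PyTorch TP tutorial; the toy `ToyMLP` model, layer names, and sizes are illustrative assumptions (not from this thread), and the script is meant to be launched with `torchrun`.

```python
# A minimal Tensor Parallel sketch (toy model, layer names, and sizes are
# illustrative assumptions, not from this thread).
# Launch with: torchrun --nproc-per-node=<num_gpus> tp_sketch.py
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class ToyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Linear(4096, 16384, bias=False)
        self.down = nn.Linear(16384, 4096, bias=False)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))


# 1-D device mesh over all local GPUs; each rank runs this same script.
mesh = init_device_mesh("cuda", (torch.cuda.device_count(),))

model = ToyMLP().to("cuda")

# Shard `up` column-wise and `down` row-wise; after this each rank only keeps
# its own shard of the weights. For a model that is too large for one GPU you
# would initialize it on the "meta" device or CPU instead of calling
# .to("cuda") on the full model first.
model = parallelize_module(
    model,
    mesh,
    {"up": ColwiseParallel(), "down": RowwiseParallel()},
)

x = torch.randn(8, 4096, device="cuda")
out = model(x)  # forward issues the necessary collectives across the mesh
```

The same plan-based approach extends to real transformer blocks by mapping the attention and MLP projections to `ColwiseParallel`/`RowwiseParallel` entries in the parallelize plan.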

@jerryzh168
Contributor


one thing to check is the `device` argument:

device (device, optional): Device to move module to before applying `filter_fn`. This can be set to `"cuda"` to speed up quantization. The final model will be on the specified `device`.

cc @gau-nernst it might be good to add a bit more docs for this feature (e.g. attaching the benchmarking numbers you posted before in some issues)
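
As a rough illustration of that `device` argument, here is a minimal sketch; it assumes a torchao version whose `quantize_` accepts `device` as the docstring above describes, and the toy two-layer model and the `int8_weight_only` config are only illustrative choices.

```python
# Minimal sketch of the `device` argument described above. Assumes a torchao
# version whose quantize_ accepts `device`; the toy model and the
# int8_weight_only config are illustrative choices.
import torch
from torchao.quantization import int8_weight_only, quantize_

# Build (or load) the model on CPU so the full bf16 weights never have to sit
# on the GPU all at once.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.Linear(4096, 4096),
).to(torch.bfloat16)

# Modules are moved to "cuda" as they are quantized; per the docstring above,
# the final (already quantized, hence smaller) model ends up on that device.
quantize_(model, int8_weight_only(), device="cuda")
```

If that behavior holds, the full-precision weights only need to fit in CPU RAM, while the GPU holds the quantized copy plus whichever module is currently being converted.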
