
Does Torch Support NPU Architectures like Ascend MDC910B and Multi-GPU Quantization for Large Models? #1405

Open
Lenan22 opened this issue Dec 12, 2024 · 4 comments

Comments

@Lenan22

Lenan22 commented Dec 12, 2024

Is Torch limited to running on CUDA, or does it also support NPU architectures like Ascend MDC910B? Additionally, is it possible to use multi-GPUs for large model quantization when a single 80GB GPU is insufficient to handle a model with runtime memory exceeding 80GB?

@jerryzh168
Contributor

for NPU backend: not right now, but we do want to expand the backends we support: #1082

for using multi-GPU for quantization, can you be a bit more specific? are you talking about quantizing a large model that does not fit into the memory of a single GPU? I remember @kwen2501 mentioned that we can use Pipeline parallelism for that.

@Lenan22
Author

Lenan22 commented Dec 12, 2024


I only have an A100 with 40 GB of memory at present. However, when I run a model that requires 85 GB of runtime memory, the GPU runs out of memory. What should I do? Are there any ways to reduce the memory usage?

@kwen2501
Contributor

You can use either the Tensor Parallel or the Pipeline Parallel subpackages from torch to shard the model onto different devices.
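
To make the Tensor Parallel route concrete, here is a minimal sketch following the pattern of the PyTorch TP tutorial; the toy `ToyMLP` model, layer names, and sizes are illustrative assumptions (not from this thread), and the script is meant to be launched with `torchrun`.

```python
# A minimal Tensor Parallel sketch (toy model, layer names, and sizes are
# illustrative assumptions, not from this thread).
# Launch with: torchrun --nproc-per-node=<num_gpus> tp_sketch.py
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class ToyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Linear(4096, 16384, bias=False)
        self.down = nn.Linear(16384, 4096, bias=False)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))


# 1-D device mesh over all local GPUs; each rank runs this same script.
mesh = init_device_mesh("cuda", (torch.cuda.device_count(),))

model = ToyMLP().to("cuda")

# Shard `up` column-wise and `down` row-wise; after this each rank only keeps
# its own shard of the weights. For a model that is too large for one GPU you
# would initialize it on the "meta" device or CPU instead of calling
# .to("cuda") on the full model first.
model = parallelize_module(
    model,
    mesh,
    {"up": ColwiseParallel(), "down": RowwiseParallel()},
)

x = torch.randn(8, 4096, device="cuda")
out = model(x)  # forward issues the necessary collectives across the mesh
```

The same plan-based approach extends to real transformer blocks by mapping the attention and MLP projections to `ColwiseParallel`/`RowwiseParallel` entries in the parallelize plan.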

@jerryzh168
Contributor


one thing to check is the `device` argument:

device (device, optional): Device to move module to before applying `filter_fn`. This can be set to `"cuda"` to speed up quantization. The final model will be on the specified `device`.

cc @gau-nernst it might be good to add a bit more docs for this feature (e.g. attaching the benchmarking numbers you posted before in some issues)
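
As a rough illustration of that `device` argument, here is a minimal sketch; it assumes a torchao version whose `quantize_` accepts `device` as the docstring above describes, and the toy two-layer model and the `int8_weight_only` config are only illustrative choices.

```python
# Minimal sketch of the `device` argument described above. Assumes a torchao
# version whose quantize_ accepts `device`; the toy model and the
# int8_weight_only config are illustrative choices.
import torch
from torchao.quantization import int8_weight_only, quantize_

# Build (or load) the model on CPU so the full bf16 weights never have to sit
# on the GPU all at once.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.Linear(4096, 4096),
).to(torch.bfloat16)

# Modules are moved to "cuda" as they are quantized; per the docstring above,
# the final (already quantized, hence smaller) model ends up on that device.
quantize_(model, int8_weight_only(), device="cuda")
```

If that behavior holds, the full-precision weights only need to fit in CPU RAM, while the GPU holds the quantized copy plus whichever module is currently being converted.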
