GPUDirect RDMA #18
Comments
Actually I've discovered that Comet's OFED stack might not support GDR.
Actually, scratch that, this is on me. Small data is (rightly) sent inline, meaning a memcpy, but that obviously doesn't work if the memory is on a CUDA device. I can add a check in my code to ensure the pointer isn't on a CUDA device. To be honest, does it make sense to get rid of this check in
I don't completely understand the issue. Can you answer these questions:
As long as I make sure to send a buffer larger than the
Yes. My thought is that if someone chooses the direct communication type, they are intentionally opting into RDMA, so we shouldn't hide this inline decision from them. The workaround for GPUDirect RDMA is to a) detect whether the buffer is in GPU memory and b) if so, either skip the inline check or first copy the buffer to host memory.
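To make the workaround concrete, here is a minimal sketch of the decision in plain C. Everything named here (`send_path`, `choose_send_path`, the `is_device_memory` flag) is hypothetical and not taken from this project. It assumes the caller has already classified the pointer, for example with something built on `cudaPointerGetAttributes`, and that `max_inline` mirrors the queue pair's inline limit. In libibverbs terms, the inline path would correspond to setting `IBV_SEND_INLINE` on the work request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical send paths; the names are illustrative only. */
enum send_path {
    SEND_INLINE,     /* CPU copies the payload into the work request (memcpy) */
    SEND_REGISTERED  /* NIC reads the payload through a registered memory region */
};

/*
 * Pick a send path for a buffer of `len` bytes.
 *
 * max_inline:       inline limit of the queue pair (assumed to be known).
 * is_device_memory: true if the buffer lives in GPU memory; how this is
 *                   detected is left to the caller (assumption).
 *
 * A device buffer never takes the inline path, because the inline copy is
 * done by the CPU and cannot dereference a device pointer.
 */
static enum send_path choose_send_path(size_t len, size_t max_inline,
                                       bool is_device_memory)
{
    if (is_device_memory)
        return SEND_REGISTERED;
    return (len <= max_inline) ? SEND_INLINE : SEND_REGISTERED;
}
```

With this shape, a small host buffer still goes inline, while a small GPU buffer is routed through the registered path. The alternative in b), staging through host memory first, would keep the inline path but add a device-to-host copy.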
I get your point, though this is purely an implementation choice, since the interface does not (yet) say whether a registration is to be performed. The fact is that the buffer is small, so you don't need to register it before you send; you may, but it wastes cycles. Maybe a better choice would be to require the user to register the buffer with the runtime first; then we just get the lkey from the user or from the registration table. The fix can be as simple as adding a condition: if it is a GPU buffer, register it anyway.
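For illustration, a registration table of the kind mentioned above could look roughly like the sketch below. This is an assumption about the shape, not the runtime's actual data structure: `reg_table`, `reg_table_add`, and `reg_table_lookup` are hypothetical names, and the lkey values stand in for whatever the real memory registration would return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REG_TABLE_CAPACITY 16

/* One registered region: [base, base + len) and its local key. */
struct reg_entry {
    uintptr_t base;
    size_t    len;
    uint32_t  lkey;
};

/* Hypothetical fixed-size table of registered regions. */
struct reg_table {
    struct reg_entry entries[REG_TABLE_CAPACITY];
    size_t           count;
};

/* Record a region that the user has already registered. */
static bool reg_table_add(struct reg_table *t, const void *base, size_t len,
                          uint32_t lkey)
{
    if (t->count == REG_TABLE_CAPACITY)
        return false;
    t->entries[t->count++] = (struct reg_entry){ (uintptr_t)base, len, lkey };
    return true;
}

/*
 * Find a registered region that fully contains [buf, buf + len).
 * On success, store its lkey in *lkey_out and return true.
 */
static bool reg_table_lookup(const struct reg_table *t, const void *buf,
                             size_t len, uint32_t *lkey_out)
{
    uintptr_t start = (uintptr_t)buf;
    for (size_t i = 0; i < t->count; i++) {
        const struct reg_entry *e = &t->entries[i];
        /* Containment check written to avoid overflow in start + len. */
        if (start >= e->base && len <= e->len &&
            start - e->base <= e->len - len) {
            *lkey_out = e->lkey;
            return true;
        }
    }
    return false;
}
```

On a send, the runtime would look the buffer up first. A hit supplies the lkey directly; a miss on a GPU buffer would trigger registration regardless of size, which is the extra condition suggested above.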
I'm having trouble getting GDR working. My current understanding is that the way GDR works is that the InfiniBand driver has a plugin that interacts with the CUDA driver and runtime and the IB Verbs memory registration/deregistration functions are extended to be aware of GPU memory. From the application perspective, it doesn't really need to change anything; it can just pass a CUDA device pointer into the memory registration function and then use it for RDMA.
I'm having trouble getting this to work though, with a segfault resulting from ibv_post_send after the sendd/recvd rendezvous. Specifically, the segfault is in the libc memcpy implementation; this indicates to me that I'm either missing something or GDR isn't set up right.
Vu, do you have any idea what could be going on? I unfortunately only have access to Comet's GPU nodes, so I can't try another platform.