Seg fault with *** Process received signal ***
#196
Comments
So this is odd. After an initial compilation, some of the tests that had previously failed are now passing using cached kernels. I still think this has to do with compiler issues, but we will see.
Ok, so I think I was running into similar issues with the roc port, and the solution was a specific version of
@braxtoncuneo can you comment on whether this is the same issue you are seeing on Lassen?
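To check whether stale cached kernels explain the changed test results, one option is to list (and optionally delete) the on-disk cache before re-running. This is only a rough sketch: it assumes the cached kernels are Numba's file cache (`*.nbi` index files and `*.nbc` data files, normally under `__pycache__` or `NUMBA_CACHE_DIR`). The helper names are hypothetical, and any separate cache kept by the GPU backend is not covered.

```python
from pathlib import Path


def find_numba_cache_files(root="."):
    """Return Numba on-disk cache files (*.nbi index, *.nbc data) under root."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in {".nbi", ".nbc"}
    )


def clear_numba_cache(root=".", dry_run=True):
    """List cache files under root; delete them only when dry_run is False."""
    files = find_numba_cache_files(root)
    if not dry_run:
        for path in files:
            path.unlink()
    return files


if __name__ == "__main__":
    # Dry run by default: only report what would be removed
    for path in clear_numba_cache("."):
        print(path)
```

If the previously failing tests fail again after clearing the cache, that would point at the fresh compilation rather than at the tests themselves.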
GPU Interop. Two planned PRs: (1) GPU regression test #196 and (2) GPU-related installation. @jpmorgan98 @clemekay @braxtoncuneo
Reproduced my segfault. This is what I got:
I am also getting this error on OSU's DGX system when running in MPI+GPU mode. It doesn't happen when running on a single GPU (non-MPI job), which makes me think it's an MPI issue.
So on the OSU CI machine, certain Numba problems would compile but fail to run. This also happened on a number of the regression tests that were passing in the GitHub Actions runner. The full error is here:
Whenever I see errors like `lib64/libc.so.6` my mind immediately goes to incompatible compiler issues. The first thing I tried fixed it for some problems but still resulted in a seg fault for others, specifically in the regression tests. I am running this in a manual terminal right now, but eventually this will be the env that we run GitHub Actions on for GPU regression testing. I am going to try other modules that have g++ and maybe look at LLVM versions.
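To compare the manual terminal against the GitHub Actions runner, it may help to dump the versions that usually matter for a libc/compiler mismatch. A minimal sketch, assuming a Linux machine; `toolchain_report` is a hypothetical helper, and the Numba/llvmlite entries are filled in only if those packages import cleanly.

```python
import ctypes.util
import platform
import shutil
import subprocess
import sys


def toolchain_report():
    """Collect the versions most likely to matter for a libc/compiler mismatch."""
    report = {
        "python": sys.version.split()[0],
        # (name, version) of the C library the Python executable is linked against
        "libc": platform.libc_ver(),
        # Name of the C++ runtime the loader would find, or None
        "libstdc++": ctypes.util.find_library("stdc++"),
    }
    # Version banner of whichever g++ is first on PATH (None if absent)
    gxx = shutil.which("g++")
    report["g++"] = None
    if gxx:
        out = subprocess.run([gxx, "--version"], capture_output=True, text=True)
        if out.stdout:
            report["g++"] = out.stdout.splitlines()[0]
    # Numba and the LLVM version llvmlite was built against, if available
    try:
        import numba
        import llvmlite.binding as llvm
        report["numba"] = numba.__version__
        report["llvm"] = llvm.llvm_version_info
    except (ImportError, OSError):
        report["numba"] = report["llvm"] = None
    return report


if __name__ == "__main__":
    for key, value in toolchain_report().items():
        print(f"{key:10s} {value}")
```

Running this once per loaded module set and diffing the output should show whether the g++ or LLVM versions actually change between the working and failing environments.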
One thing to emphasize is that this does seem like a runtime issue, not a compilation failure.
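Since the crash happens at runtime, Python's standard `faulthandler` module can at least show which Python frame was active when the signal arrived. A rough sketch, not a fix: `enable_crash_traces` and the log file name are hypothetical, and the rank lookup assumes the launcher sets one of the listed environment variables (Open MPI, PMI-based launchers, or Slurm).

```python
import faulthandler
import os


def enable_crash_traces(log_dir="."):
    """Write Python tracebacks of all threads to a per-rank file on SIGSEGV etc."""
    # Rank variables set by common launchers; fall back to "0" for serial runs
    rank = "0"
    for var in ("OMPI_COMM_WORLD_RANK", "PMI_RANK", "SLURM_PROCID"):
        if var in os.environ:
            rank = os.environ[var]
            break
    path = os.path.join(log_dir, f"segfault_rank{rank}.log")
    # The file must stay open for as long as faulthandler is enabled
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    handle = enable_crash_traces()
    print("fault handler enabled:", faulthandler.is_enabled())
```

Calling this at the top of the failing input script (before any kernels run) keeps one log per MPI rank, which may help tell whether every rank crashes in the same place.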