Seg fault with *** Process received signal *** #196

Open
jpmorgan98 opened this issue May 7, 2024 · 5 comments
Assignees
Labels
hpc Issues relating to HPC deployments

Comments

@jpmorgan98
Collaborator

On the OSU CI machine, certain Numba problems compile but fail to run. This also happened on a number of the regression tests that were passing in the GitHub Actions runner. The full error is here:

(mcdc_dev) cement ~/workspace/MCDC/examples/fixed_source/slab_absorbium 1026$ python input.py --mode=numba
  __  __  ____  __ ____   ____ 
 |  \/  |/ ___|/ /_  _ \ / ___|
 | |\/| | |   /_  / | | | |    
 | |  | | |___ / /| |_| | |___ 
 |_|  |_|\____|// |____/ \____|

           Mode | Numba
      Algorithm | History-based
  MPI Processes | 1
 OpenMP Threads | 1
 Now running TNT...
[cement:17804] *** Process received signal ***
[cement:17804] Signal: Segmentation fault (11)
[cement:17804] Signal code: Address not mapped (1)
[cement:17804] Failing at address: 0x256990c7fa14
[cement:17804] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7fb3607a9630]
[cement:17804] [ 1] [0x7fb2ac5d2160]
[cement:17804] [ 2] [0x7fb2abf790a6]
[cement:17804] [ 3] [0x7fb2ac93d32b]
[cement:17804] [ 4] [0x7fb2ac6ac375]
[cement:17804] [ 5] [0x7fb2a6c13443]
[cement:17804] [ 6] [0x7fb2a6c1381e]
[cement:17804] [ 7] /nfs/stak/users/morgajoa/miniconda3/envs/mcdc_dev/lib/python3.11/site-packages/numba/_dispatcher.cpython-311-x86_64-linux-gnu.so(+0x53f4)[0x7fb3555cc3f4]
[cement:17804] [ 8] /nfs/stak/users/morgajoa/miniconda3/envs/mcdc_dev/lib/python3.11/site-packages/numba/_dispatcher.cpython-311-x86_64-linux-gnu.so(+0x5712)[0x7fb3555cc712]
[cement:17804] [ 9] python(_PyObject_MakeTpCall+0x26c)[0x5041ac]
[cement:17804] [10] python(_PyEval_EvalFrameDefault+0x6a7)[0x5116e7]
[cement:17804] [11] python[0x5cbeda]
[cement:17804] [12] python(PyEval_EvalCode+0x9f)[0x5cb5af]
[cement:17804] [13] python[0x5ec6a7]
[cement:17804] [14] python[0x5e8240]
[cement:17804] [15] python[0x5fd192]
[cement:17804] [16] python(_PyRun_SimpleFileObject+0x19f)[0x5fc55f]
[cement:17804] [17] python(_PyRun_AnyFileObject+0x43)[0x5fc283]
[cement:17804] [18] python(Py_RunMain+0x2ee)[0x5f6efe]
[cement:17804] [19] python(Py_BytesMain+0x39)[0x5bbc79]
[cement:17804] [20] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb35fce5555]
[cement:17804] [21] python[0x5bbac3]
[cement:17804] *** End of error message ***
Segmentation fault (core dumped)

Whenever I see errors like lib64/libc.so.6, my mind immediately goes to incompatible compiler issues. The first thing I tried was

conda install -c conda-forge gxx

and that fixed it for some problems but still resulted in a seg fault for others, specifically in the regression tests. I am running this in a manual terminal right now, but eventually this will be the environment we run GitHub Actions on for GPU regression testing. I am going to try other modules that provide g++ and maybe look at LLVM versions.

One thing to emphasize: this does seem to be a runtime issue, not a compilation failure.
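When comparing the OSU machine against the GitHub Actions runner, it may help to record the same version information in both places. The snippet below is only a sketch using the Python standard library; the function names `package_versions` and `environment_report` are made up for illustration, and the package list is an assumption about what matters here. Numba also ships a `numba -s` command that prints a much fuller system report.

```python
import platform
from importlib import metadata


def package_versions(names=("numba", "llvmlite", "numpy", "mpi4py")):
    """Return {package: version or None} for the active environment."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


def environment_report():
    """Collect details that are useful when suspecting a toolchain mismatch."""
    report = {
        "python": platform.python_version(),
        "machine": platform.machine(),
        # (name, version) of the C library the Python executable links against
        "libc": platform.libc_ver(),
    }
    report.update(package_versions())
    return report


for key, value in environment_report().items():
    print(f"{key:>10} | {value}")
```

Running this in the failing environment and in the passing one gives two short tables that can be diffed by eye.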

@jpmorgan98 jpmorgan98 added the hpc Issues relating to HPC deployments label May 7, 2024
@jpmorgan98 jpmorgan98 self-assigned this May 7, 2024
@jpmorgan98
Collaborator Author

So this is odd. After an initial compilation, some of the tests that had previously failed are now passing using cached kernels. I still think this has to do with compiler issues, but we will see.
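To check whether the cached kernels are what makes the difference, one option is to remove Numba's on-disk cache and rerun the failing tests from a cold start. The sketch below assumes the kernels are compiled with caching enabled and that the cache sits in the default location (the `__pycache__` directories next to the source files, or wherever `NUMBA_CACHE_DIR` points). Numba stores cache entries as `*.nbi` index files and `*.nbc` data files; the helper names are hypothetical.

```python
from pathlib import Path


def find_numba_cache_files(root):
    """Return Numba cache files (*.nbi index, *.nbc data) found under root."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.suffix in {".nbi", ".nbc"} and p.is_file()
    )


def clear_numba_cache(root, dry_run=True):
    """List the cache files under root and delete them unless dry_run is set."""
    files = find_numba_cache_files(root)
    if not dry_run:
        for path in files:
            path.unlink()
    return files


# Example (dry run): show what would be removed under the current directory.
for cache_file in clear_numba_cache(".", dry_run=True):
    print(cache_file)
```

If a test fails on the first run after clearing the cache and passes on the second, that points at the compile-then-run path rather than at the cached machine code itself.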

@jpmorgan98
Collaborator Author

OK, so I think I was running into similar issues with the roc port, and the solution was a specific version of libgcc-ng, which gets installed when running conda install gxx.
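One way to see which copy of the GCC runtime the process actually picks up (the conda environment's or the system one under /lib64) is to look at the shared objects mapped into the running interpreter. This is a Linux-only sketch that reads `/proc/self/maps`; the function name and the keyword list are assumptions chosen for this issue, not part of any library.

```python
from pathlib import Path


def loaded_libraries(
    keywords=("libstdc++", "libgcc_s", "libgomp", "libLLVM", "libmpi"),
    maps_path="/proc/self/maps",
):
    """Return sorted paths of mapped shared objects whose file name contains a keyword.

    Linux only. Returns an empty list if maps_path does not exist.
    """
    maps = Path(maps_path)
    if not maps.exists():
        return []
    found = set()
    for line in maps.read_text().splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (optional)
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5]
        if any(keyword in Path(path).name for keyword in keywords):
            found.add(path)
    return sorted(found)


# import numba  # uncomment so Numba's native libraries are loaded first
for library in loaded_libraries():
    print(library)
```

Calling it after importing Numba (and after a first JIT call) should show whether the libraries come from the conda prefix or from the system directories.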

@jpmorgan98 jpmorgan98 mentioned this issue May 8, 2024
@jpmorgan98
Collaborator Author

@braxtoncuneo can you comment on if this is the same issue you are seeing on Lassen?

ilhamv added a commit that referenced this issue May 8, 2024
GPU Interop. Two planned PRs: (1) GPU regression test #196 and (2) GPU-related installation. @jpmorgan98 @clemekay @braxtoncuneo
@braxtoncuneo
Collaborator

> @braxtoncuneo can you comment on if this is the same issue you are seeing on Lassen?

Reproduced my segfault. This is what I got:


  __  __  ____  __ ____   ____ 
 |  \/  |/ ___|/ /_  _ \ / ___|
 | |\/| | |   /_  / | | | |    
 | |  | | |___ / /| |_| | |___ 
 |_|  |_|\____|// |____/ \____|

           Mode | Numba
      Algorithm | History-based
  MPI Processes | 1
 OpenMP Threads | 1
 Now running TNT...
Segmentation fault
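Since this run prints no backtrace at all, Python's standard `faulthandler` module may help narrow things down: it dumps the Python-level traceback when the process receives a fatal signal such as SIGSEGV. The sketch below is a suggestion, not something already in the code base; `enable_fault_traceback` and the log file name are hypothetical. The frames inside JIT-compiled code will still be anonymous, but the dump should show which Python call entered the compiled kernel.

```python
import faulthandler


def enable_fault_traceback(log_path="mcdc_fault.log"):
    """Write Python tracebacks of all threads to log_path on a fatal signal.

    Returns the open file object. The caller must keep it open for as long
    as the handler is enabled, because faulthandler writes to its descriptor.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Usage near the top of input.py (assumed placement):
#     fault_log = enable_fault_traceback("mcdc_fault.log")
```

The same handler can be switched on without touching the script by running `python -X faulthandler input.py --mode=numba`, which writes the traceback to stderr.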

@jpmorgan98
Collaborator Author

I am also getting this error on OSU's DGX system when running in MPI+GPU mode. It doesn't happen when running on a single GPU (non-MPI job), which makes me think it's an MPI issue.
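If the failure only shows up under MPI, it can be useful to know which rank dies. A sketch of one way to tag per-rank debug output without importing any MPI library is below: it reads the rank from environment variables that common launchers set (`OMPI_COMM_WORLD_RANK` for Open MPI, `PMIX_RANK`, `PMI_RANK`, `SLURM_PROCID`). Which of these is present on the DGX system is an assumption, and both helper names are made up for illustration.

```python
import os

# Environment variables set by common MPI launchers and schedulers.
_RANK_VARS = ("OMPI_COMM_WORLD_RANK", "PMIX_RANK", "PMI_RANK", "SLURM_PROCID")


def launcher_rank(environ=None):
    """Return (variable_name, rank), or (None, 0) when no launcher is detected."""
    environ = os.environ if environ is None else environ
    for name in _RANK_VARS:
        value = environ.get(name)
        if value is not None and value.isdigit():
            return name, int(value)
    return None, 0


def per_rank_log_name(prefix="mcdc_debug", environ=None):
    """Build a log file name that is unique per MPI rank."""
    _, rank = launcher_rank(environ)
    return f"{prefix}.rank{rank}.log"


print(launcher_rank(), per_rank_log_name())
```

Writing each rank's diagnostics to its own file avoids interleaved output and makes it easier to tell whether every rank fails or only some of them.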
