
Threading + MPI #974

Open
antoine-levitt opened this issue May 23, 2024 · 7 comments

Comments

@antoine-levitt
Member

I've had this happen when running DFTK from within threads. I'm not too clear on what we should do here.

ERROR: LoadError: TaskFailedException

    nested task error: UndefRefError: access to undefined reference
    Stacktrace:
      [1] getindex
        @ ./essentials.jl:892 [inlined]
      [2] popfirst!
        @ ./array.jl:1706 [inlined]
      [3] run_init_hooks()
        @ MPI ~/.julia/packages/MPI/rwDDn/src/environment.jl:65
      [4] Init(; threadlevel::Symbol, finalize_atexit::Bool, errors_return::Bool)
        @ MPI ~/.julia/packages/MPI/rwDDn/src/environment.jl:155
      [5] Init
        @ ~/.julia/packages/MPI/rwDDn/src/environment.jl:114 [inlined]
      [6] PlaneWaveBasis(model::Model{…}, Ecut::Float64, fft_size::Tuple{…}, variational::Bool, kgrid::MonkhorstPack, symmetries_respect_rgrid::Bool, use_symmetries_for_kpoint_reduction::Bool, comm_kpts::MPI.Comm, architecture::DFTK.CPU)
        @ DFTK ~/.julia/dev/DFTK/src/PlaneWaveBasis.jl:247
      [7] #PlaneWaveBasis#141
        @ ~/.julia/dev/DFTK/src/PlaneWaveBasis.jl:399 [inlined]
      [8] setup_calculation(s::Int64, n_electrons::Int64, b::Int64, α::Int64; scaling::Symbol, α_q::Int64, α_r::Int64)
        @ Main ~/Dropbox/recherche/2020-11-anyons/new/functions.jl:239
      [9] setup_calculation
        @ ~/Dropbox/recherche/2020-11-anyons/new/functions.jl:207 [inlined]
     [10] 
        @ Main ~/Dropbox/recherche/2020-11-anyons/new/functions.jl:244
     [11] macro expansion
        @ ~/Dropbox/recherche/2020-11-anyons/new/compute.jl:25 [inlined]
     [12] (::var"#33#threadsfor_fun#23"{Int64, Int64, String, Channel{Int64}})(tid::Int64)
        @ Main ./threadingconstructs.jl:209
     [13] (::Base.Threads.var"#1#2"{var"#33#threadsfor_fun#23"{Int64, Int64, String, Channel{Int64}}, Int64})()
        @ Base.Threads ./threadingconstructs.jl:154
    Some type information was truncated. Use `show(err)` to see complete types.

...and 5 more exceptions.

Stacktrace:
 [1] threading_run(fun::var"#33#threadsfor_fun#23"{Int64, Int64, String, Channel{Int64}}, static::Bool)
   @ Base.Threads ./threadingconstructs.jl:172
 [2] macro expansion
   @ ./threadingconstructs.jl:189 [inlined]
 [3] top-level scope
   @ ~/Dropbox/recherche/2020-11-anyons/new/compute.jl:21
@epolack
Collaborator

epolack commented May 23, 2024

I remember being able to launch it in a quick and dirty way, but I am not so sure anymore…

On a local branch, I added the ability to switch off the three parts where Threads is used.

@antoine-levitt
Member Author

It works most of the time, but I just had this happen once. By switching off, do you mean this? #972

@epolack
Collaborator

epolack commented May 23, 2024

Right now, for me it does not work at all on some other stuff I am doing…

Yes, I was indeed looking at #972, and it looks a lot like what I am using for parallel phonons.

(I think I gave up looking at how to do threads within threads because of the @timing stuff.)

@antoine-levitt
Member Author

(I think I gave up looking at how to do threads within threads because of the @timing stuff.)

Yeah, should we just disable this by default?

@epolack
Copy link
Collaborator

epolack commented May 23, 2024

I have never used the fact that it's enabled by default. I've always found this surprising.

@mfherbst
Member

I've had this happen when running DFTK from within threads.

I think this is because MPI is initialised twice. We should put the initialisation call behind a semaphore, or signal MPI, when we initialise it, that it could be called from multiple threads (I think it has a flag for that).
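For reference, MPI.jl's `MPI.Init` already exposes such a flag as the `threadlevel` keyword (visible in the stack trace above). A sketch of requesting full multi-thread support might look like the following; whether `:multiple` is actually needed here is an open question, and the granted level depends on the underlying MPI library:

```julia
using MPI

# Request MPI_THREAD_MULTIPLE so MPI calls may come from any thread.
# Accepted levels in MPI.jl are :single, :funneled, :serialized, :multiple.
MPI.Init(threadlevel = :multiple)

# The level actually granted by the MPI library can be queried afterwards:
provided = MPI.Query_thread()
```

Note that this only addresses the thread level MPI runs at; it does not by itself make the *call* to `MPI.Init` safe to issue concurrently from several threads.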

@Technici4n
Contributor

Technici4n commented Jan 20, 2025

I ran into this during my master's thesis, specifically in situation 3 below.

I see three typical situations here:

  1. DFTK is running using a hybrid parallelization strategy (i.e. parallelization over multiple threads per MPI rank).
  2. DFTK is using thread parallelism, but there is a single MPI rank.
  3. Multiple computations are running in different threads, but each computation uses a single thread, and there is a single MPI rank.

For 1 and 2, MPI_THREAD_FUNNELED should be sufficient. The default is MPI_THREAD_SERIALIZED, which is a higher level, so that is fine.

For 3, it is not clear to me whether we need to initialize with MPI_THREAD_MULTIPLE. We do perform MPI calls from multiple threads concurrently, but they are all trivial. Assuming that the default level is fine, we still run into the issue that the way MPI.jl guards against multiple initializations is not thread-safe. The pseudocode looks like this:

if (!initialized) initialize()

which can obviously lead to multiple threads trying to initialize MPI at the same time.

A solution would be to wrap all our calls to MPI.Init in a thread-safe helper that ensures MPI is initialized only once, e.g. similar to Suppliers.memoize() in Guava.
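A minimal sketch of such a helper, assuming nothing about DFTK's internals (the names `init_once`, `_init_lock` and `_init_done` are made up for illustration): a lock plus a flag guarantees the wrapped initializer runs exactly once, even when several threads race to call it.

```julia
# Hypothetical thread-safe "run once" helper; in DFTK, f would be
# something like () -> MPI.Init(threadlevel = :serialized).
const _init_lock = ReentrantLock()
const _init_done = Ref(false)

function init_once(f)
    lock(_init_lock) do
        if !_init_done[]
            f()                 # runs at most once, under the lock
            _init_done[] = true
        end
    end
    return nothing
end
```

With this in place, every call site that currently does `MPI.Init(...)` would go through `init_once`, so concurrent first calls from `Threads.@threads` loops no longer race on MPI.jl's internal hook queue.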
