-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error at 256GPU/64Nodes on Lassen #777
Comments
Do you have the full exception message? (This is just the traceback leading up to the |
I could not find a raise in the batch job output. This appeared to be the only message out-of-the-ordinary when the job failed. Here's a gist with the entire output: |
What's happening there is that the transform code is running into a pattern of loop nests that it's not familiar with, see this code in @kaushikcfd's branch: The message I was looking for is
which seems very weird; it's a loop over only intra-element DOFs without a loop over elements surrounding it. As a near-term workaround, you can just hack that code to fall back to the more general single-grid work-balancing strategy also in the compiled-function case, in the code I linked. Beyond that, I would like to understand how we wound up with a lonely DOF loop. |
Thanks for the report.
It's safe to call this an error. However, I would expect
Am I missing the kernel to blame here?
I agree. @MTCam: Does the driver work correctly while running with lower #ranks? |
This compile exits abnormally after about 10 minutes. The actual, full compile should take more than an hour, as evidenced by the compile times for the lower rank runs:
Yes the driver runs successfully for a number of different meshes up to 6M elements. The difference between this run and the previous working run is that it is a factor of 2 more ranks and a factor of 2 more elements. |
Not sure I'm convinced that should be the case. Synchronizations/filesystem issues come to mind. (@inducer any guesses why this might be happening?) |
This timing looks suspiciously exponential in the number of ranks. I do not understand why that would be the case.
|
Are you suggesting this might be because of the increased comm pytato nodes? If so, the lines in the log file of the form:
would be helpful to validate that hypothesis for the 3 mesh sizes with increasing compile times. |
Wow, not sure I'd seen that. 136,095 nodes? That's kinda nuts. There's no reason for there to be that many of them, IMO. |
The timer starts at the first call to
Not offhand - but I do know it isn't going crazy. I will add a snippet to give us the break-down.
FWIW, the compile times are independent of mesh size. For a given RHS, the compile times increase with increasing numbers of ranks. This is not a new "feature" of distributed lazy. It has always been this way - although now it may be a little worse than before. I am not sure because we were never able to run our cases out this far before now. |
It's safe to call it a bug that ought to be fixed, since, purely based on principles we don't expect it. If you have access to the log files for the 3 mesh sizes that would be helpful to debug this behavior. |
@inducer : I am going to post a running report on the number of rank neighbors for runs. As the data comes in, I'll edit this msg. I hope it can help us suss out whether the number of partition boundaries is playing a significant role.
|
1.5M elems, 32 ranks: https://gist.github.com/db1139f6db451f0cd921fab6a6d69223 Hope this helps! Thanks a ton for looking into this! |
Thanks, so here is the ranks vs number of pytato nodes:
So, at least it's not growing linearly and starts tapering off once our number of neighboring ranks starts to taper off. So @inducer's hunch was correct, but I'm very surprised how come communication nodes outgrow the total numberof nodes on a single rank. Either something in Because of these many nodes, we materialize more arrays and the generated loopy kernel has an increasing number of statements.
|
I completely agree! FWIW, I think that's also consistent with where we left off in terms of the mysteriously-large sum of flux expressions in July. I strongly suspect that something there is causing us to do way more work than necessary. Indirectly, I think this is also what caused the unreasonably-large argument counts. I tried to figure out for a while what that might be, but I ran out of time before my trip. To allow us to run at all, we decided to pursue SVM, but I think the underlying issue remains. |
@inducer, @kaushikcfd You guys are amazing. Fixing the number of rank neighbors to 2, gives a ~fixed compile time regardless of nrank, and excellent weak scaling.
Potentially we can do this with our production/prediction cases by slicing them in the flow-normal direction. |
Heh. I'm glad that works, but it's not exactly a practical constraint. :) I.e., we shouldn't have to do that. But if it serves as a stopgap for a bit, I guess I could live with that. |
Looking at this with @matthiasdiener, we were able to reproduce an allegedly pathological scaling for the DAG using the For the wave example in mirgecom:
A DAG viz can be found here, where we're tracking the ongoing effort. We plan on creating an even simpler example to get more insight about why the DAG is scaling this way. |
You're right, that's a healthy amount of growth even just for the wave operator. I find that the DAG pictures have low information density, i.e. it's hard to "see the forest", because the nodes are so big and so unevenly sized. I think we can help ourselves by making it so we can see more of the connectivity in one screenful. |
Even our super simple grad-only DG operator produced a pretty big graph with 58 nodes. So we rolled a very simple non-DG example that I think illustrates what's going on. Please have a look at the bottom of the running notes for a summary: |
To be clear, some level of duplication in the DAG is expected because we do not carry a global "registry" of constant arrays (or data wrappers in pytato-speak). So every time pytato sees a constant (like an array used for indirect indexing, or an operator matrix), it views it as new and makes a separate graph initially. As close to the first step in the transform pipeline, this redundancy is removed. To see what this will do, simply feed the DAG through |
The number of graph nodes reported, and created by the examples that reproduce and demonstrate the issue are unchanged by the I've been using this def my_ctc(prg_id, stage, prg):
from pytato.transform import deduplicate_data_wrappers
if stage == "pre_find_distributed_partition":
print(f"NUM DAG NODES BEFORE DeDupe: {pt.analysis.get_num_nodes(prg)}")
prg_dedupe = deduplicate_data_wrappers(prg)
print(f"NUM DAG NODES AFTER DeDupe: {pt.analysis.get_num_nodes(prg_dedupe)}")
pt.show_dot_graph(prg_dedupe)
1/0 I think that the example at the bottom of the notes demonstrates that the observed DAG scaling with BCs and with partition boundaries is actually expected. The "sub-dag" that represents the flux calculations is duplicated with every unique BC or partition boundary. Since communication/recv dag nodes are unique for each mpi neighbor, the flux calculation dag nodes are repeated for each mpi neighbor. We are struggling to come up with a viable plan for what to do about it. |
I agree with you; we certainly splat a copy of the facial flux into the DAG once for each parallel boundary. I think there are probably other less-than-ideal things we're doing, but that one seems sufficiently straightforward to fix. The duplication of DAG nodes can be avoided by making an array that concatenate (via I can see two possible gotchas, but maybe @kaushikcfd can help with those:
|
This is great. My attempts so far have failed.
After Friday's discussion, we came up with a short list of things to try. I had thought that projecting all of the received data to a surface that was all of the remote partition boundaries union'd together would resolve this (duplicating not the flux calc, but instead just the projection), but @majosm suggested maybe just projecting to "all_faces" would be easier, quicker to try, so I tried this but it did not seem to resolve the issue. I think lazy was too smart and saw right through this attempt to hide where the data came from. I hope this concatenation strategy works better. Based on the failed attempt to resolve by projecting to "all_faces", do you still expect concatenating the arrays will work around the issue? What about "evaluating" each received array as a part of the "communication dag node"? Would that help?
Got it. The normals I think seem straightforward, but do we explicitly handle index arrays? I assume you are talking about index arrays down in the DOFArray or other discretization data structure beyond those that would be updated automatically by |
I was going through MikeC's driver with a single One odd characteristic of the generated code is the way the communicated values are handled, it is currently handled as--
The problem is in step (2) where the flux computation for each neighboring rank is "duplicated", however, if one were to implement this by hand, one would concatenate all the flux values contributing to the interior faces and perform the numerical flux computation using this single concatenated array per field (instead for each of the The current expression structure is not only adding too many nodes in the graph but also forcing the computation of smaller-sized arrays (per boundary facial values) which is a poor workload for wide SIMD devices. I'm guessing this demands some changes in (MikeC's |
One possible caveat: concatenating all of the arrays and computing the fluxes together would (I think) remove the ability to overlap communication/computation here. Not sure if this is something distributed lazy currently does/will do? Also, if I understand correctly this is close to what @MTCam tried by projecting all of the trace pairs to Do we have a sense of what a reasonable number of nodes per flux computation should be? >5000 per trace pair (rough guess) seems like a lot? |
The restructuring I was pointing to comes after the facial values are received, so I think it shouldn't hurt it.
I think that with every new neighbor there should be <10 extra nodes in the DAG per flux component, another |
I share @majosm's concern here. The duplication of the flux computation is what allows the fluxes to be computed from some boundaries before communication has completed on all boundaries. Requiring all communication to complete before computing fluxes seems to work counter to an important advantage/feature of lazy.
I have not looked for the prediction driver yet, but for the example grad(u) only operator, the number of dag nodes for the flux computation is ~40. For a CNS fluid-only case it was ~1500. IIRC mixture pushed it up around 4000 per flux computation. |
This is a weak scaling (periodic 3D box, p=3) case to see how far we can push running real simulations on Lassen GPUs. It has 48k elements/rank and exercises most of the components used in our prediction. After running successfully for all mesh sizes (nelem, ngpus)=[(48k, 1), (96k, 2), (192k, 4), (384k, 8), (768k, 16), (1.5M, 32), (3M, 64) , (6M, 128)], it fails at (nelem, ngpus) = (12M, 256).
*Note: To reproduce, make sure to request at least 5 hours on 64 Lassen batch nodes.
Instructions for reproducing on Lassen:
Branch: mirgecom@production
Driver:
examples/combozzle-mpi.py
Setup:
setup.yaml
)The case does have an unexplained hang at array context creation time which grows with the number of ranks. For this case, it will hang for about 3 hours, and then compile for about 10 minutes before finally ending with this error:
Gist with the entire batch output file:
gist.github.com/MTCam/f48991f584755b6a8530dd9345dc2de4
The text was updated successfully, but these errors were encountered: