I’m currently experiencing a significant slowdown when launching Julia with MPI on my cluster. I suspect that the issue might be due to the long HDD access times.
To mitigate this, I am considering copying the Julia program to the SSD-based scratch space shared among the nodes. The scratch space is wiped after every job, so I would like to avoid the overhead of re-precompiling packages each time. Therefore, I want to copy the precompiled environment as well.
Here are my specific questions:
Which Julia files or directories should I copy to the scratch space to ensure smooth and fast startup?
Which environment variables need to be set or adjusted when running Julia from the scratch space?
Are there any best practices or recommendations for managing precompiled packages in this context?
I would greatly appreciate any advice or insights from those who have faced similar challenges.
Go to a node with access to local scratch, e.g., a single compute node, and then (a sketch of these steps follows below):
1. Set JULIA_DEPOT_PATH to a folder on the local scratch (note: if you’re using Julia v1.10, it needs to be a stable path, i.e., one that exists on all nodes).
2. Set up your depot and precompile all packages.
3. tar the entire depot folder and save the archive to regular (non-temporary) storage.
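A minimal sketch of that preparation step, assuming a node-local scratch mount at /scratch/local and a project at $HOME/myproject (all paths are placeholders for whatever your cluster actually provides):

```bash
# Run on a compute node that sees the node-local scratch (paths are examples)
export JULIA_DEPOT_PATH=/scratch/local/$USER/julia-depot   # for v1.10 this path must exist on all nodes

# Instantiate the project and precompile all packages into the new depot
julia --project=$HOME/myproject -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'

# Archive the fully precompiled depot to permanent (non-scratch) storage
tar -czf $HOME/julia-depot.tar.gz -C /scratch/local/$USER julia-depot
```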
Then, when running a job (see the jobscript sketch below):
1. In your jobscript, run a once-per-node step that unpacks the prepared depot folder into local scratch again. Make sure to keep the paths the same, otherwise you will need to precompile again (this was fixed in v1.11 or v1.12, though I do not recommend those versions for production runs for various reasons).
2. Set JULIA_DEPOT_PATH to the unpacked depot.
3. Execute your regular, parallel job, benefiting from local, ultra-fast package loading without precompilation.
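And a hedged sketch of the corresponding Slurm jobscript (node counts, paths, and srun flags are assumptions; adapt them to your scheduler and MPI launcher):

```bash
#!/bin/bash
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=128

# Unpack the prepared depot once per node, into the same path used when precompiling
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 bash -c \
    "mkdir -p /scratch/local/$USER && tar -xzf $HOME/julia-depot.tar.gz -C /scratch/local/$USER"

# Point Julia at the node-local depot
export JULIA_DEPOT_PATH=/scratch/local/$USER/julia-depot

# Run the actual parallel job without any precompilation overhead
srun julia --project=$HOME/myproject my_mpi_program.jl
```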
Note: We are currently in the process of writing this up properly for publication.
I’m also just running into what I think is this issue. Without some workaround, it looks like Julia is almost unusable above a few hundred MPI ranks. Very glad to hear someone is working on it already! I’d vote for creating an issue on MPI.jl to document the problem and your progress on solutions. What do you think, @sloede?
In that case I think your parallel file system is not properly set up. Even the worst-case setup we’ve found so far did not see any measurable slowdowns below 2000 MPI ranks. What are you using as the parallel FS: something HPC-worthy such as Lustre or GPFS, or rather something like NFS? The latter is known to scale rather poorly if not set up properly; in that case it might make sense to move your Julia depot from your home to a scratch directory with a better I/O system.
Sorry, I didn’t check carefully enough: I was trying to do strong/weak scaling scans and saw lots of failures. Looking more carefully, 2048 processes is OK, and I have problems* at 4096, which sounds like it fits with what you’re saying (the file system is Lustre).
Unfortunately I don’t see a per-node scratch on this system. I might have to contact the cluster admins.
* by ‘problems’ I mean that the job runs for 30 minutes (which was the timeout I’d set in the submission script) without completing when it should have finished in about 10. After about 10 minutes, my code does start running (I get the output from a print statement), but then I get a few warnings like
┌ Warning: failed to remove pidfile on close
│ path = ".../.julia/logs/manifest_usage.toml.pid"
│ removed = false
└ @ FileWatching.Pidfile .../julia-1.11.5/share/julia/stdlib/v1.11/FileWatching/src/pidfile.jl:347
(where I’ve truncated the file paths for privacy), then nothing until the job times out.
@johnh Some info about the cluster I’m working on:
Processor: 2× AMD EPYC 7742, 2.25 GHz, 64-core
Cores per node: 128 (2× 64-core processors)
NUMA structure: 8 NUMA regions per node (16 cores per NUMA region)
Memory per node: 256 GiB
Interconnect: HPE Cray Slingshot, 2× 100 Gbps bi-directional per node
Work file systems: 14.5 PB HPE Cray ClusterStor
Operating system: HPE Cray Linux Environment (based on SLES 15)
Scheduler: Slurm configured to be node exclusive (smallest unit of resource is a full node)
The interesting issue I’m having while trying to implement a workaround like the one @sloede suggested is that on this cluster there is no local storage on the compute nodes. I’m trying to use /dev/shm instead. At the moment I’m struggling to make sure I clean up my temporary files 100% reliably, but that’s a bash/SLURM issue, not a Julia one!
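In case it helps anyone else, here is a rough sketch of the trap-based cleanup I’m experimenting with (the /dev/shm path and variable names are made up; an EXIT trap also won’t run if the scheduler sends a hard SIGKILL after the grace period, which is part of why it’s not 100% reliable):

```bash
# Hypothetical path; the depot is assumed to be unpacked here once per node
DEPOT_PARENT=/dev/shm/$USER
export JULIA_DEPOT_PATH=$DEPOT_PARENT/julia-depot

# Remove the depot from every node when the jobscript finishes or is terminated
cleanup() {
    trap - EXIT   # avoid running the cleanup twice if we got here via a signal
    srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 rm -rf "$DEPOT_PARENT"
}
trap cleanup EXIT INT TERM
```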
Can confirm @sloede’s solution does work for me as well. The cluster I’m working on doesn’t have any node-local disk, but it does have a /tmp/ that is stored in RAM. Copying the depot into that /tmp/ fixes the startup problems. I’ve been able to run on up to 8192 cores - I didn’t try going higher because my test case is already scaling badly by that point, but this is already 4x the cores I could run on before (with some uncertainty in the exact values - I’ve only been trying 2048, 4096, 8192 cores!).