Notes on computations distribution
The computations can be distributed in several ways. This section gives a more in-depth understanding of the integrate-slurm and integrate-mp commands.
1. How integrate-slurm works
From the configuration file (section [dataset]), the user provides a list of datasets. Each dataset consists of a collection of frames (hundreds to tens of thousands). The integrate-slurm command distributes the integration of the datasets over n_workers workers, each using cores_per_worker threads (these parameters are tuned in the [computations distribution] section of the configuration file).
The datasets are integrated one after the other. Each dataset is split among the workers: each worker processes a part of the current dataset.
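As a reference, here is a minimal sketch of the [computations distribution] section for this mode. The key names come from this page; the values are purely illustrative and should be adapted to your resources.

    [computations distribution]
    # number of SLURM workers sharing each dataset
    n_workers = 10
    # threads used by each worker
    cores_per_worker = 4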
Advantages
Each dataset is integrated quickly, which means you get feedback on a given dataset fast.
Scales well with the number of workers.
Drawbacks
Needs resources that are often occupied
Brittle when using many workers (although it works fine with 40 workers)
Only runs on SLURM (not on the local machine): it needs to connect to a SLURM front-end
2. How integrate-mp works
The integrate-mp program was created to address the shortcomings of integrate-slurm:
Possibility to run on the local machine
Fully utilize the GPU processing power by assigning several workers to each GPU (not possible natively on SLURM)
Use SLURM partitions that have more availability
The drawback is that, at a given time, each dataset is processed by exactly one worker. This means that integrating one dataset is not extremely fast (hence this mode is not suited for near-real-time processing). This program is instead intended for bulk processing of hundreds of datasets.
2.1 - Using integrate-mp on the local machine (partition = local)
To distribute the azimuthal integration on the local machine, choose partition = local in [computations distribution]. A total of n_workers workers will be launched, each with cores_per_worker CPUs.
The GPUs are used if possible. If a machine has multiple GPUs, they will be shared evenly among the workers. There can (and should) be multiple workers per GPU: the rationale is that one azimuthal integrator is not enough to exploit the full GPU computing power.
How to tune the number of workers/cores for maximum performance?
As a rule of thumb:
4 workers per GPU (8 for powerful GPUs like V100, A40, A100)
4 threads per worker for fast LZ4 decompression (2 are enough for a fast processor without hyperthreading, e.g. AMD EPYC 7543)
workers * threads should not exceed the machine's total number of (virtual) CPUs, and ideally stay around half of it
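As an illustration of these rules of thumb, here is a hypothetical [computations distribution] section for a machine with 2 V100 GPUs and 128 virtual CPUs (the hardware and values are assumptions chosen for the example, not a recommendation for your setup):

    [computations distribution]
    partition = local
    # 8 workers per V100 x 2 GPUs = 16 workers
    n_workers = 16
    # 16 workers x 4 threads = 64 threads, i.e. half of the 128 virtual CPUs
    cores_per_worker = 4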
2.2 - Using integrate-mp on multiple machines
The integrate-mp command can also be used to distribute computations on several machines, using SLURM. In this case, partition = gpu or partition = p9gpu should be used in [computations distribution] (not local).
When using this mode, n_workers has a different meaning: it refers to the number of SLURM jobs, i.e. roughly how many nodes (or half-nodes) we want to distribute computations on. You can think of these jobs as "big workers": each of them spawns multiple workers. By default:
8 workers per SLURM job ("big worker")
4 cores per worker
Each "big worker" (SLURM job) uses one GPU
For example, using partition = p9gpu and n_workers = 4 means that 4 SLURM jobs are submitted (asking for one GPU each), each spawning 8 workers, so there will be 32 workers in total.
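The corresponding configuration would look like the sketch below (same values as the example above; only the keys mentioned on this page are shown):

    [computations distribution]
    # gpu or p9gpu: run through SLURM instead of the local machine
    partition = p9gpu
    # number of SLURM jobs ("big workers"), one GPU each
    n_workers = 4
    # with the default of 8 workers per job, this yields 4 x 8 = 32 workers in total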
Warning
Remember: when using this mode, n_workers is the number of SLURM jobs, i.e. roughly the number of half-nodes requested. n_workers = 4 means 32 workers in total.