Feedback on distributed azimuthal integration and the use of dask.distributed
Context
Example
ACL9011001d/ACL9011001d_0195/scan0001/<files>
- sample: 1-10
- dataset: 10-200 (200)
- scan: 1-20 (1)
- per scan: 125 files - 25 k frames
In this presentation: AI = Azimuthal Integration (and not Artificial Intelligence, for once!)
Overview of the involved steps:
How to accelerate this?
pyFAI is already optimized (a multi-year effort).
Some figures with the ID15A parameters
ID15A parameters:

GPU:

CPU:
- Intel Xeon Gold 6248 (hpc5): optimum at 10 threads (one socket); see the FPS vs. threads figure
- AMD EPYC 7543: 55 FPS (optimum at 4 threads only!)
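For context, such FPS figures can be measured with a simple timing loop around pyFAI. The following is only a sketch: the PONI file name, frame shape, bin count and integration method are placeholders, not the actual ID15A settings.

    import time
    import numpy
    import pyFAI

    ai = pyFAI.load("geometry.poni")   # placeholder PONI file
    frames = numpy.random.random((20, 2048, 2048)).astype("float32")  # fake stack

    t0 = time.perf_counter()
    for frame in frames:
        # method is indicative: e.g. ("bbox", "csr", "opencl") on GPU, "csr" on CPU
        ai.integrate1d(frame, 2000, method="csr")
    dt = time.perf_counter() - t0
    print(f"{len(frames) / dt:.1f} FPS")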
Loading 200 frames (one file) typically takes 0.5-2 s (0.99-4 GB/s, i.e. 100-400 FPS).
Under the hood, reading relies on hdf5plugin. hdf5plugin benefits from multiple threads, up to a limit (detailed figures on the next slide).
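A minimal reading sketch that reproduces this kind of timing; the file name and dataset path below are hypothetical.

    import time

    import hdf5plugin  # noqa: F401  (importing it registers the HDF5 decompression filters)
    import h5py

    with h5py.File("data_0000.h5", "r") as f:                  # hypothetical file name
        dset = f["/entry_0000/measurement/data"]                # hypothetical dataset path
        t0 = time.perf_counter()
        frames = dset[:200]                                     # decompression happens here
        dt = time.perf_counter() - t0

    print(f"read {frames.nbytes / 1e9:.2f} GB in {dt:.2f} s "
          f"({200 / dt:.0f} FPS, {frames.nbytes / dt / 1e9:.2f} GB/s)")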
Loading vs processing: where is the bottleneck?

Timing to read 200 frames on Power9, AMD EPYC 7543 and Intel Xeon Gold 6248 (see figure).
distributed
Importantly, spawning an azimuthal integrator takes several seconds (up to tens of seconds). The idea is therefore to spawn n_workers integrators once and feed them with stacks of images/datasets (sketch below).
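A minimal sketch of this idea, assuming pyFAI and a per-process cache keyed by the PONI file; the function names and bin count are illustrative.

    import pyFAI

    _integrators = {}                       # one cache per worker process

    def get_integrator(poni_path):
        # The expensive setup (geometry, lookup table) is paid only on the first call
        if poni_path not in _integrators:
            _integrators[poni_path] = pyFAI.load(poni_path)
        return _integrators[poni_path]

    def integrate_stack(frames, poni_path, npt=2000):
        ai = get_integrator(poni_path)
        return [ai.integrate1d(frame, npt).intensity for frame in frames]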
Options to distribute the work:
- Python native multiprocessing (multiprocessing.Queue)
- MPI
- Distributed (OARCluster, SLURMCluster, ...)

distributed is part of the dask ecosystem: the work is described as a task graph and executed by a pool of workers.

    import os
    import socket

    from distributed import Client, LocalCluster, wait
    from dask_jobqueue import SLURMCluster

    def say_hello():
        print("Hello, I'm", os.getpid(), "from", socket.gethostname())

    cluster = SLURMCluster(**cluster_specs)  # or LocalCluster(); cluster_specs: dict of job options
    cluster.scale(n_workers)                 # n_workers: number of workers to spawn
    client = Client(cluster)

    future = client.submit(say_hello)        # returns a concurrent.futures-like Future
    print(future)
    wait([future], timeout=timeout)          # blocks until the task is completed/failed
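For reference, on a SLURM-based cluster the cluster_specs dictionary used above could look like the following sketch; all values are placeholders.

    cluster_specs = dict(
        queue="nice",           # SLURM partition (placeholder; the one used in the benchmarks below)
        cores=10,               # threads per worker
        processes=1,            # one Python process per SLURM job
        memory="32GB",          # placeholder
        walltime="01:00:00",    # placeholder
    )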
With distributed, each task is "process the dataset at this DataUrl" (sketch after the benchmark list below).
Benchmarks with distributed:
- Multi-node, CPU-only (nice partition, 10 threads per worker)
- Local machine: p9-04 (Power9 + 2 V100)
- Local machine: gpid16axni (AMD EPYC 7543 + 2 A100)
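A sketch of the per-DataUrl task submission mentioned above, reusing the integrate_stack helper from the earlier sketch; process_dataset, the dataset path, the PONI file and the `files` list are illustrative.

    from silx.io import get_data
    from silx.io.url import DataUrl

    def process_dataset(url, poni_path="geometry.poni"):
        frames = get_data(url)                      # load the stack on the worker
        return integrate_stack(frames, poni_path)   # reuses the per-worker integrator

    urls = [DataUrl(file_path=f, data_path="/entry_0000/measurement/data")
            for f in files]                          # `files`: the HDF5 files of the scans
    futures = client.map(process_dataset, urls)
    results = client.gather(futures)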
With multiprocessing: 1 dataset per "worker", so the granularity is one scan! (sketch after the list)
- p9-04 (Power9 + 2 V100)
- gpid16axni (AMD EPYC 7543 + 2 A100)
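A sketch of this multiprocessing variant, reusing get_data and integrate_stack from the sketches above; the worker count and URL list are placeholders.

    import multiprocessing

    def process_scan(args):
        scan_url, poni_path = args
        frames = get_data(scan_url)                  # the whole scan is handled by one worker
        return integrate_stack(frames, poni_path)    # per-process cached integrator

    if __name__ == "__main__":
        tasks = [(url, "geometry.poni") for url in urls]      # urls as in the sketch above
        with multiprocessing.Pool(processes=4) as pool:        # placeholder worker count
            results = pool.map(process_scan, tasks)            # granularity = one scan per task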
Notes:
- dask.distributed has a surprisingly low overhead.
The good
- The SLURMCluster / LocalCluster abstraction: the same code runs locally (multiprocessing-style) or distributed on the cluster.
The bad
- Thread-based workers will fight for the semaphore (same event loop?).
- The nice partition is often crowded with hundreds of jobs.

The ugly
- When distributing computations on the cluster, many things can (and will) go wrong.

Using the RPC approach with remote classes (distributed "actors"; sketch below):
Advantages
Drawbacks
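For illustration, a sketch of what such an actor could look like with dask.distributed; the class, PONI file and bin count are illustrative, not the actual implementation.

    import pyFAI

    class Integrator:
        """Long-lived object hosted on a worker: the setup cost is paid once."""

        def __init__(self, poni_path):
            self.ai = pyFAI.load(poni_path)

        def integrate(self, frames, npt=2000):
            return [self.ai.integrate1d(f, npt).intensity for f in frames]

    # `client` is the distributed Client created earlier
    actor = client.submit(Integrator, "geometry.poni", actor=True).result()
    result = actor.integrate(frames).result()        # actor method calls return ActorFutures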