Tangos is a system for building and querying databases summarising the results of numerical galaxy simulations.
If you want to speed up time-consuming tangos operations from your command line, you can run many of them in parallel. For this you can use either python's built-in `multiprocessing` module or MPI; unless you need to spread work across multiple nodes, `multiprocessing` is probably your best option.

The eligible tangos commands for parallel execution are:
- `tangos link`
- `tangos write`
- `tangos crosslink`
- `tangos add` (since 1.8.0)

To use the multiprocessing backend (available since 1.8.0), from your command line or job script use:
```
tangos [normal tangos command here] --backend multiprocessing-<N>
```

where `<N>` should be replaced by the number of processes to use. See the notes below on how to choose the number of processes (it is not necessarily just the number of cores available).
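For example, to run one of the tutorial property calculations over six processes (the property and simulation names here are taken from the worked example later on this page and are placeholders for your own), you might use:

```
tangos write dm_density_profile --for tutorial_changa --backend multiprocessing-6
```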
To use MPI instead, you first need to install MPI and `mpi4py` on your machine. This is straightforward with, for example, anaconda python distributions: just type `conda install mpi4py`. With regular python distributions, you need to install MPI on your machine and then `pip install mpi4py` (which will compile the python bindings).

Once this has installed successfully, you can run in parallel using the following from your command line:

```
mpirun -np <N> [normal tangos command here] --backend mpi4py
```
Here, `<N>` should be replaced by the number of processes you wish to launch; e.g. for 10 processes, start with `mpirun -np 10 tangos link ...` or `mpirun -np 10 tangos write ...`. The `--backend mpi4py` argument crucially instructs tangos to parallelise using the mpi4py library. Alternatively, you can use the pypar library to interface with MPI.
If you specify no backend, tangos will default to running in single-processor mode, which means MPI will launch N processes that are not aware of each other's presence. This is very much not what you want. Limitations in the MPI library mean it is not possible for tangos to reliably auto-detect that it has been MPI-launched, so you must remember to specify the backend yourself.
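As a concrete sketch, an MPI launch inside a batch job script might look like the following (the scheduler directives and module name are illustrative assumptions; adapt them to your own cluster):

```
#!/bin/bash
#SBATCH --ntasks=10
#SBATCH --time=04:00:00

# load your system's MPI environment (module name is an assumption)
module load openmpi

mpirun -np 10 tangos write [normal tangos write arguments here] --backend mpi4py
```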
For `tangos write`, there are multiple parallelisation modes. For best results, it's important to understand which is appropriate for your use case: they balance file IO, memory usage and throughput differently.

The default mode parallelises at the snapshot level. Each worker loads an entire snapshot at once, then works through the halos within that snapshot. This is highly efficient in terms of communication requirements. However, for large simulations it can cause memory usage to be excessive, and it can also mean that shared disks are asked for a lot of IO all at the same time.
If either the memory or the IO requirements of this approach are not feasible, `tangos write` accepts a `--load-mode` argument:
- `--load-mode=server`: This strategy is completely different. The processes now all work on a single timestep at a time. A single server process loads that entire snapshot and passes only the bits of data needed along to all the other ranks. This vastly lowers the memory requirement (only one full snapshot is held in memory at any given time). However, it increases the communication between processes, which can cause significant performance degradation.
- `--load-mode=server-shared-mem`: Available from version 1.9.0 onwards, this is the most powerful option, but it only works if all your processes are on the same physical machine. A server process handles loading data as above, making the memory and IO requirements the same as `--load-mode=server`. But in `server-shared-mem` mode the server makes the data available to the other processes through shared memory, which is extremely efficient.
The modes below are still available, but they are less flexible and are rarely the best option.
- `--load-mode=partial`: This strategy is similar to the default described above, but only the data for a single halo at a time is loaded. Each time the writer moves on to the next halo, the corresponding particles are loaded. Partial loading is fairly efficient, but be aware that calculations needing particle data from outside the halo (for example, the virial radius example in the custom properties tutorial) will fail.
- `--load-mode=server-partial`: A hybrid approach where rank 0 loads only what is required to help the other ranks figure out what they need to load. For example, if a property requests a sphere surrounding the halo, the entire snapshot's position arrays will be loaded on rank 0, but no other data. The data on the individual ranks is then loaded via partial loading (see `--load-mode=partial` above).
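To make the contrast concrete, here is an illustrative sketch of the default and server modes side by side (the property and simulation names follow the tutorial example below and are placeholders for your own):

```
# default mode: each worker loads whole snapshots (high memory, minimal communication)
mpirun -np 4 tangos write dm_density_profile --for tutorial_changa --backend mpi4py

# server mode: one rank loads each snapshot and serves data to the others (low memory, more communication)
mpirun -np 4 tangos write dm_density_profile --for tutorial_changa --backend mpi4py --load-mode=server
```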
Let's consider the longest process in the tutorials, which involves writing images and more for the changa+AHF tutorial simulation. Some of the underlying pynbody manipulations are already parallelised. One can therefore experiment with different configurations, but experience suggests the best option is to switch off all pynbody parallelisation (i.e. set the number of threads to 1) and allow tangos complete control. This is because only some pynbody routines are parallelised, whereas tangos is close to embarrassingly parallel. Once pynbody threading is disabled, the most efficient version of the command is:
```
mpirun -np 5 tangos write dm_density_profile gas_density_profile uvi_image --with-prerequisites --include-only="NDM()>5000" --include-only="contamination_fraction<0.01" --for tutorial_changa --backend mpi4py --load-mode server-shared-mem
```
for a machine with 4 processors.

How many processes? Note that the number of processes was actually one more than the number of processors. This is because the server process generally only uses the CPU while the other processes are blocked; using only 4 processes would therefore leave one processor idle most of the time.
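The same rule of thumb generalises. For example, on a hypothetical 16-core node using a server-based load mode, you might launch 17 processes so that all 16 cores stay busy with workers:

```
# hypothetical 16-core node: 16 workers + 1 mostly idle server process
mpirun -np 17 tangos write [normal tangos write arguments here] --backend mpi4py --load-mode server-shared-mem
```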
What load mode? Here we chose `--load-mode=server-shared-mem` by considering the possibilities:

- With `--load-mode=partial`, the calculation will fail because the density profiles ask for particles that may be outside the region identified by the halo finder.
- With `--load-mode=server-partial`, everything will be fine and relatively efficient, but we will keep reading data off the disk rather than keeping it in RAM. This might be useful for large simulations, but isn't needed here.
- That leaves `--load-mode=server` (for multi-node calculations) or `--load-mode=server-shared-mem` (for single-node calculations). Either is efficient in terms of computation and memory usage, but the latter also keeps communication overheads between processes to a minimum.

For `tangos link` and `tangos crosslink`, parallelisation is currently implemented only at the snapshot level. Suppose you have a simulation with M particles: each worker needs to store at least 2M integers at any one time (possibly more, depending on the underlying formats) in order to work out the merger tree. Consequently, for large simulations you may need to use a machine with lots of memory and/or use fewer processes than you have cores available.
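As a rough illustrative calculation (the particle count and integer size are assumptions, not figures from the tangos documentation): a simulation with about 1.3 billion particles implies at least 2 × 1.3 billion 8-byte integers, i.e. roughly 20 GB of integer data per worker, so on a node with, say, 128 GB of RAM you might deliberately launch fewer ranks than you have cores:

```
# hypothetical: limit tangos link to 4 ranks so that ~4 x 20 GB of integer arrays fits comfortably in node memory
mpirun -np 4 tangos link [normal tangos link arguments here] --backend mpi4py
```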
The `tangos add` tool is also parallelised only at the snapshot level, but it does not need to load much data per snapshot, so this does not normally pose a problem.
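For example (the simulation name is a placeholder), adding a simulation to the database in parallel with the multiprocessing backend might look like:

```
tangos add [simulation name here] --backend multiprocessing-4
```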