.. performance tutorial

.. _performance:

Performance optimisation in pynbody
===================================

`Pynbody` is built on top of `numpy`, which means that learning how to
optimize `numpy` array manipulations is the most important route to writing
efficient code; see
http://scipy-lectures.github.com/advanced/optimizing/index.html for an
introduction. However, there are a few issues which are specific to
`pynbody`.

First, many of `pynbody`'s most common operations are parallelized: make sure
you have set up these routines to behave in a way that matches your needs by
reading the page on :ref:`threads`.

Beyond that, there are some more subtle issues. The most important one we have
come across is the overhead incurred by using `SubSnap`\ s, which is explained
below.

Overheads of SubSnaps
---------------------

A template for performance-critical code
****************************************

To cut a long story short, if your routine does a lot of array access on an
object which might be a `SubSnap` of a certain flavour (explained further
below), you will find that wrapping your code as follows speeds it up.

.. code-block:: python

   def my_expensive_operation(sim_or_subsim):
       with sim_or_subsim.immediate_mode:
           mass_array = sim_or_subsim['mass']
           other_array = sim_or_subsim['other']
           # ... get other arrays required ...

           #
           # perform multiple operations on arrays
           #

           # At the end, copy back results if the arrays have changed
           sim_or_subsim['other'] = other_array

The remainder of this document unpacks what this does and why it is
necessary.

What is a SubSnap, really?
**************************

When you construct a `SubSnap`, the framework records which particles of the
underlying `SimSnap` are included and which are not. Thereafter, when you
access an array from the `SubSnap`, it is constructed in one of two ways.

- If the set of particles indexed by the `SubSnap` is expressible as a slice,
  the arrays constructed are still `numpy` arrays. This will be the case if
  you explicitly slice the simulation (e.g. ``f[2:100:3]``), or if you ask for
  a specific particle family (e.g. ``f.dm``).

- If the set of particles is not expressible in this way, the arrays
  constructed *emulate* `numpy` arrays, and this can become expensive (see
  below). This will be the case if you ask for a list of particles
  (e.g. ``f[[2,10,15,22]]``), use a `numpy`-like indexing trick
  (e.g. ``f[f['x']>10]`` or ``f[numpy.where(f['x']>10)]``), or use a
  :py:mod:`filter <pynbody.filt>`.

In the first case, the optimization considerations are unchanged from the raw
`SimSnap` case. In the second case, the situation is different. To understand
how and why, we need to look at the difference between an indexed and a
sliced `numpy` array.

Numpy's behaviour
*****************

This section explains the reason for the slightly awkward design of
`IndexedSimArray`. The `pynbody` framework requires all sub-arrays to continue
pointing to the original data, but a simple experiment with `numpy` shows that
it does not provide this behaviour in all the cases we want to cover.

Here's what happens when you use a slice of an existing `numpy` array.

.. ipython::

   In [2]: import numpy as np

   In [3]: a = np.zeros(10)

   In [4]: b = a[3:5]

   In [5]: b[1] = 100

   In [6]: a
   Out[6]: array([   0.,    0.,    0.,    0.,  100.,    0.,    0.,    0.,    0.,    0.])

The `a` array has been updated as required, because the `b` and `a` objects
actually point back to the same part of the computer's memory.

On the other hand, when you *index* a `numpy` array, the behaviour is
different.

.. ipython::

   In [7]: c = a[[4,5,6]]

   In [8]: c[1] = 200

   In [9]: a
   Out[9]: array([   0.,    0.,    0.,    0.,  100.,    0.,    0.,    0.,    0.,    0.])

Here, changing `c` has not updated `a`. That's because the construction of `c`
actually *copied* the relevant data instead of just pointing back at it. This
is necessitated by the underlying design of `numpy` arrays, which requires the
data to lie in a predictable pattern in memory.
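If you want to verify which of these two behaviours you are getting in your
own code, `numpy` can report whether two arrays overlap in memory. The
following is a small sketch, plain `numpy` and nothing `pynbody`-specific: it
simply repeats the slice and index operations from above and checks them with
``np.shares_memory``.

.. code-block:: python

   import numpy as np

   a = np.zeros(10)

   b = a[3:5]        # slicing: b is a view onto a's memory
   c = a[[4, 5, 6]]  # fancy indexing: c is an independent copy

   print(np.shares_memory(a, b))  # True  -> writes to b show up in a
   print(np.shares_memory(a, c))  # False -> writes to c never reach a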
Back to pynbody
***************

The `IndexedSimArray` class fixes this problem:

.. ipython::

   In [10]: import pynbody

   In [12]: d = pynbody.array.IndexedSimArray(a, [4,5,6])

   In [13]: d[1] = 200

   In [14]: a
   Out[14]: array([   0.,    0.,    0.,    0.,  100.,  200.,    0.,    0.,    0.,    0.])

Note that `a` has been updated correctly. This is achieved by the
`IndexedSimArray` *emulating*, rather than *wrapping*, a `numpy` array;
internally the syntax ``d[1]=200`` is translated into ``a[[4,5,6][1]]=200``.

The cost of this is that each time you call a function that requires a `numpy`
array as an input, the `IndexedSimArray` has to generate a proxy for this
purpose. This can become slow. Have a look at the following timings:

.. ipython::

   In [22]: %time for i in range(10000) : d+=1
   CPU times: user 0.25 s, sys: 0.03 s, total: 0.28 s
   Wall time: 0.26 s

   In [23]: %time for i in range(10000) : a+=1
   CPU times: user 0.04 s, sys: 0.00 s, total: 0.04 s
   Wall time: 0.04 s

Adding to the subarray is *slower* than adding to the entire array! This is
because of the overhead of continually constructing proxy `numpy` arrays to
pass to the ``__add__`` method.

How to remove this bottleneck
*****************************

We should emphasize that the example above is quite contrived, since it forces
re-construction of the `numpy` proxy 10000 times. In typical user code, the
number of `numpy` proxies that have to be constructed is vastly smaller, so
the fractional overhead is normally small. Nonetheless, it does sometimes
become a problem for performance-critical code.

For that reason, it's possible to avoid constructing `IndexedSimArray`\ s
altogether and force only `numpy` arrays to be returned. This means you must
take responsibility for understanding which operations copy, as opposed to
reference, data. This is known as `immediate mode` and is activated using
python's `with` mechanism.

Let's create a test snapshot and a subview into that snapshot to try it out.

.. ipython::

   In [24]: f = pynbody.new(dm=100)

   In [25]: sub_f = f[[20,21,22]]

Under normal conditions, the type of the arrays returned from ``sub_f`` is
``IndexedSimArray``. Updating one of these arrays will transparently update
the main snapshot.

.. ipython::

   In [36]: sub_mass = sub_f['mass']

   In [30]: type(sub_mass)

   In [37]: sub_mass[:] = 3

   In [35]: f['mass'][[20,21,22]]
   Out[35]: SimArray([ 3.,  3.,  3.])

Conversely, in `immediate mode`, the type of the arrays returned from
``sub_f`` is ``SimArray`` (i.e. just a wrapper around a real `numpy` array).
But updating that returned array has *no effect* on the parent snapshot.

.. ipython::

   In [32]: with f.immediate_mode :
      ....:     sub_mass = sub_f['mass']
      ....:

   In [30]: type(sub_mass)

   In [30]: sub_mass

   In [64]: sub_mass[:] = 5

   In [30]: sub_mass # updated as expected

   In [39]: f['mass'][[20,21,22]] # NOT updated - should still be 3,3,3!
   Out[39]: SimArray([ 3.,  3.,  3.])

So, in this case, it becomes your responsibility to copy the results back, if
required. A template for performance-critical code which might be operating on
a `SubSnap` was given at the start of this document; a minimal worked version,
using the test snapshot created above, is sketched below.
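This sketch is purely illustrative: it reuses the test snapshot ``f`` and the
indexed subview ``sub_f`` created above, and the factor of two is arbitrary.

.. code-block:: python

   # Illustrative only: assumes the test snapshot f and the subview sub_f from above.
   with f.immediate_mode:
       mass = sub_f['mass']   # a plain SimArray copy of the three selected masses
       mass *= 2              # operate on the copy; the parent snapshot is untouched
       sub_f['mass'] = mass   # explicitly write the result back into the snapshot

After the block the doubled values have been written back into ``f``, just as
in the non-immediate case, but the expensive per-operation proxying is avoided
while the arrays are being manipulated.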
.. note:: ``with f_sub.immediate_mode`` is equivalent to ``with
   f.immediate_mode``, where `f_sub` is any `SubSnap` of `f`.

So, in summary, the template code at the start of this document advocates:

- storing a *copy* of the data for the subset of particles;
- working on the copy;
- (if necessary) updating the main snapshot data explicitly before returning.

Overheads of raw SimArrays
--------------------------

.. note:: This information is provided for interest. We have never come
   across a realistic use case where the following is necessary.

In `pynbody`, arrays are implemented by the class
:class:`~pynbody.array.SimArray`. This is a wrapper around a `numpy` array.
There is a small extra cost associated with every operation, to allow units to
be matched and updated. For long arrays such as those found in typical
simulations, this is usually a tiny fraction of the actual computation time.

We have never found it to be a problem, but if you want to disable the unit
tracking you can always do so using `numpy`'s view mechanism to get a raw
`numpy` array. Suppose you have a `SimSnap` ``f``; then ``pos =
f['pos'].view(numpy.ndarray)`` (for example) will return the position array
without any of the `SimArray` wrapping. The new ``pos`` variable can be
manipulated without any unit-handling code being called.
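For the record, here is roughly what that looks like in practice. This is only
a sketch: it assumes a throwaway snapshot made with ``pynbody.new`` (so
``f['pos']`` just contains placeholder values), and the in-place addition
stands in for whatever manipulation you actually want to perform.

.. code-block:: python

   import numpy as np
   import pynbody

   f = pynbody.new(dm=100)          # small test snapshot; arrays hold placeholder values

   pos = f['pos'].view(np.ndarray)  # raw numpy view: same memory, no SimArray unit handling

   pos += 1.0                       # manipulations here bypass the unit-tracking code;
                                    # since the memory is shared, f['pos'] still sees the change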