Performance optimisation in pynbody

Pynbody is built on top of numpy, which means that learning how to optimize numpy array manipulations is the most important route to writing efficient code; see http://scipy-lectures.github.com/advanced/optimizing/index.html for an introduction.

However there are a few issues which are specific to pynbody. First, a large number of pynbody‘s most common operations are parallelized: make sure you have set up these routines to behave in a way that matches your needs by reading the page on Use of multiple processors by pynbody.

Other than that, there are some more subtle issues. The most important of these that we have come across is the overheads incurred by using `SubSnap`s which are explained below.

Overheads of SubSnaps

A template for performance-critical code

To cut a long story short, if your routine does a lot of array access on an object which might be a SubSnap of a certain flavour (explained further below), you will find that wrapping your code as follows speeds it up.

def my_expensive_operation(sim_or_subsim) :
    with sim_or_subsim.immediate_mode :
        mass_array = sim_or_subsim['mass']
        other_array = sim_or_subsim['other']
        # ... get other arrays required...

        #
        # perform multiple operations on arrays
        #

        # At end, copy back results if the arrays have
        # changed
        sim_or_subsim['other'] = other_array

The remainder of this document unpacks what this does and why it should be necessary.

What is a SubSnap, really?

When you construct a SubSnap, the framework records which particles of the underlying SimSnap are included and which are not. Thereafter, if you access an array from the SubSnap, it is constructed in one of two ways.

  • If the set of particles indexed by the SubSnap is expressable a slice, the arrays constructed are still numpy arrays. This will be the case if you explicitly slice the simulation (e.g. f[2:100:3]), or if you ask for a specific particle family (e.g. f.dm).
  • If the set of particles is not expressable in this way, the arrays constructed are emulating numpy arrays and this can become expensive (see below). This will be the cae if you ask for a list of particles (e.g. f[[2,10,15,22]]) use a numpy-like indexing trick (e.g. f[f['x']>10] or f[numpy.where(f['x']>10)]) or use a filter.

In the first case, the optimization is unchanged from the raw SimSnap case. In the second case, the situation is different. To understand how and why, we need to look at the difference between an indexed and a sliced numpy array.

Numpy’s behaviour

This section explains why the reason for the slightly awkward design of IndexedSubArray. The pynbody framework requires all sub-arrays to continue pointing to the original data but a simple experiment with numpy shows that it does not enable this behaviour in all cases that we want to cover.

Here’s what happens when you use a slice of an existing numpy array.

In [1]: import numpy as np

In [2]: a = np.zeros(10)

In [3]: b = a[3:5]

In [4]: b[1] = 100

In [5]: a
Out[5]: array([   0.,    0.,    0.,    0.,  100.,    0.,    0.,    0.,    0.,    0.])

The a array has been updated as required, because the b and a objects actually point back to the same part of the computer memory.

On the other hand, when you index a numpy array, the behaviour is different.

In [6]: c = a[[4,5,6]]

In [7]: c[1] = 200

In [8]: a
Out[8]: array([   0.,    0.,    0.,    0.,  100.,    0.,    0.,    0.,    0.,    0.])

Here changing c has not updated a. That’s because the construction of c actually copied the relevant data instead of just pointing back at it. This is necessitated by the underlying design of numpy arrays requiring the data to lie in a predictable pattern in the memory.

Back to pynbody

The IndexedSubArray class fixes this problem:

In [9]: import pynbody

In [10]: d = pynbody.array.IndexedSimArray(a, [4,5,6])

In [11]: d[1] = 200

In [12]: a
Out[12]: array([   0.,    0.,    0.,    0.,  100.,  200.,    0.,    0.,    0.,    0.])

Note that a has been updated correctly. This is achieved by the IndexedSimArray emulating, rather than wrapping, a numpy array; internally the syntax d[1]=200 is then translated into a[[4,5,6][1]]=200.

The cost of this is that each time you call a function that requires a numpy array as an input, the IndexedSimArray has to generate a proxy for this purpose. This can become slow. Have a look at the following timings:

In [13]: %time for i in xrange(10000) : d+=1
CPU times: user 395 ms, sys: 11.4 ms, total: 407 ms
Wall time: 419 ms

In [14]: %time for i in xrange(10000) : a+=1
CPU times: user 32.5 ms, sys: 5.74 ms, total: 38.3 ms
Wall time: 34.6 ms

Adding to the subarray is slower than adding to the entire array! This is because of the overheads of continually constructing proxy numpy arrays to pass to the __add__ method.

How to remove this bottleneck

We should emphasize that the example above is quite contrived, since it forces re-construction of the numpy proxy 10000 times. In user code, the number of numpy proxies that have to be constructed will be vastly smaller, so the fractional overheads are normally quite small.

Nonetheless, it does sometimes become a problem for performance-critical code. For that reason, it’s possible to avoid constructing IndexedSimArray`s altogether and force only `numpy arrays to be returned. This means you must take responsibility for understanding which operations copy, as opposed to referencing, data.

This is known as immediate mode and is activated using python’s with mechanism. Let’s create a test snapshot and a subview into that snapshot to try it out.

In [15]: f = pynbody.new(dm=100)

In [16]: sub_f = f[[20,21,22]]

Under normal conditions, the type of arrays returned from sub_f is IndexedSimArray. Updating one of these arrays will transparently update the main snapshot.

In [17]: sub_mass = sub_f['mass']

In [18]: type(sub_mass)
Out[18]: pynbody.array.IndexedSimArray

In [19]: sub_mass[:]=3

In [20]: f['mass'][[20,21,22]]
Out[20]: SimArray([ 3.,  3.,  3.])

Conversely, in immediate mode, the type of arrays returned from sub_f is SimArray (so just a wrapper round a real numpy array). But updating that returned numpy array has no effect on the parent snapshot.

In [21]: with f.immediate_mode :
   ....:     sub_mass = sub_f['mass']
   ....: 

In [22]: type(sub_mass)
Out[22]: pynbody.array.SimArray

In [23]: sub_mass
Out[23]: SimArray([ 3.,  3.,  3.])

In [24]: sub_mass[:]=5

In [25]: sub_mass # updated as expected
Out[25]: SimArray([ 5.,  5.,  5.])

In [26]: f['mass'][[20,21,22]] # NOT updated - should still be 3,3,3!
Out[26]: SimArray([ 3.,  3.,  3.])

So it becomes your responsibility to copy the results back in this case, if required. A template for performance critical code which might be operating on a SubSnap follows.

Note

with f_sub.immediate_mode is equivalent to with f.immediate_mode where f_sub is any SubSnap of f.

So in summary, the template code at the start of this document advocates:

  • storing a copy of the data for the subset of particles
  • working on the copy
  • (if necessary) updating the main snapshot data explicitly before returning

Overheads of raw SimArrays

Note

This information is provided for interest. We have never come across a realistic use case where the following is necessary.

In pynbody, arrays are implemented by the class SimArray. This is a wrapper around a numpy array. There is a small extra cost associated with every operation to allow units to be matched and updated. For long arrays such as those found in typical simulations, this is usually a tiny fraction of the actual computation time. We have never found it to be a problem, but if you want to disable the unit tracking you can always do so using numpy‘s view mechanism to get a raw numpy array. Suppose you have a SimSnap f; then pos = f['pos'].view(numpy.ndarray) (for example) will return the position array without any of the SimArray wrapping. The new pos variable can be manipulated without any unit handling code being called.