
Parallelizing Bootstrapped Distributions in IPython

June 13, 2014

I tend to be highly suspicious when I perform statistical tests and they come out significant. Permutation sampling (or bootstrapping distributions to empirically estimate the p-value) tends to allay my fears that the test I have chosen is in some way inappropriate. This page gives a great overview of using bootstrapped distributions to calculate confidence intervals around a statistic, and p-values, in Python. I have used the permutation_resampling function from that page as the basis for what follows.

import numpy as np
import numpy.random as npr

def permutation_resampling(case,
                           control,
                           num_samples,
                           statistic):
    # Observed difference in the chosen statistic
    # between the two groups
    observed_diff = abs(statistic(case) -
                        statistic(control))
    num_case = len(case)
    combined = np.concatenate([case, control])
    diffs = []

    for i in range(num_samples):
        # Shuffle the pooled data and split it back
        # into two groups of the original sizes
        xs = npr.permutation(combined)
        diff = (statistic(xs[:num_case]) -
                statistic(xs[num_case:]))
        diffs.append(diff)

    # Two-sided empirical p-value: the fraction of
    # resampled differences at least as extreme as
    # the observed one
    diffs = np.array(diffs)
    pval = (np.sum(diffs > observed_diff) +
            np.sum(diffs < -observed_diff)) / float(num_samples)
    return pval, observed_diff, diffs

In the particular case I was applying this to, my two original tests, an independent two-sample t-test and the Mann-Whitney U, both gave p-values below 10^-7, so I would need at least 10 000 000 samples before the empirical p-value could resolve anything (the smallest nonzero p-value you can measure with N permutations is 1/N). For the simple case I was working on this was not entirely intractable, but I wanted to take advantage of all the computing power I had available.
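By way of illustration (this example is mine, not from the original post; the group sizes and distributions are arbitrary), the serial version might be exercised like this:

import numpy as np
import numpy.random as npr

# Synthetic data: two groups with slightly
# different means
case = npr.normal(10.0, 2.0, 100)
control = npr.normal(9.5, 2.0, 100)

pval, observed_diff, diffs = permutation_resampling(
    case, control, 10000, np.mean)
print(pval, observed_diff)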

Fortunately, as well as providing an interactive, graphical environment for Python, the IPython notebook also makes it very easy to launch parallel computing engines. Although it seems like the for loop in the above function could be easily parallelized, the use of non-encapsulated functions and variables actually makes it difficult for IPython's native parallelization to handle directly.
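To see the problem concretely, here is a minimal illustration (my own, not from the post) of standard pickle refusing a nested function:

import pickle

def make_sampler(data):
    # A nested function closing over data
    def sample_it(i):
        return data[i % len(data)]
    return sample_it

try:
    pickle.dumps(make_sampler([1, 2, 3]))
except (pickle.PicklingError, AttributeError) as e:
    # pickle serializes functions by reference, so anything
    # not importable at module top level fails
    print("pickle failed:", e)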

IPython uses the Python pickle library to serialize the objects it distributes to each of the engines. Unfortunately, pickle does not properly support serialization of a range of objects, including functions containing closures. To get around this, a drop-in replacement for pickle, dill, has been developed that greatly extends pickle's serialization capabilities. This article gives an overview of the steps required to set up dill for use within IPython; the ones needed here are shown below:


# Importing dill extends pickle's type registry
import pickle
import dill

# Stop IPython from special-casing functions, so they
# fall through to the (dill-extended) pickle instead
from types import FunctionType
from IPython.utils.pickleutil import can_map
can_map.pop(FunctionType, None)

# Point IPython's serializer at the patched pickle
from IPython.kernel.zmq import serialize
serialize.pickle = pickle
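As a quick sanity check (mine, not from the linked article), a closure should now round-trip through the standard pickle calls, since dill extends pickle's registry when it is imported:

def make_adder(n):
    def add(x):
        return x + n
    return add

# With dill loaded this no longer raises
restored = pickle.loads(pickle.dumps(make_adder(3)))
print(restored(4))  # 7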

This snippet replaces IPython's standard serialization with the dill-extended version. That lets us rewrite the permutation_resampling function above so that a nested function creates the random resamples of the data, and calls to it can be farmed out independently to all the parallel computing engines.

import numpy as np
from IPython.parallel import Client

def permutation_resampling(case,
                           control,
                           num_samples,
                           statistic):
    """
    Returns p-value that statistic for case is
    different from statistic for control.
    """

    # Set up the parallel computing client
    c = Client()
    # Create a reference to all the parallel
    # computing engines
    dview = c[:]
    # Block execution while processing
    dview.block = True

    observed_diff = abs(statistic(case) -
                        statistic(control))
    num_case = len(case)
    combined = np.concatenate([case, control])

    # Do permutation sampling as a callable function;
    # the closure over combined, num_case and statistic
    # is what dill lets us ship to the engines
    def sample_it(i):
        import numpy.random as npr
        xs = npr.permutation(combined)
        return (statistic(xs[:num_case]) -
                statistic(xs[num_case:]))

    # Build the array of resampled differences by mapping
    # the function across the engines, one call per
    # requested sample
    diffs = np.array(dview.map(sample_it,
                               range(num_samples)))

    pval = (np.sum(diffs > observed_diff) +
            np.sum(diffs < -observed_diff)) / float(num_samples)

    return pval, observed_diff, diffs

And there it is! Parallelized bootstrapping of distributions, with 10 000 000 samples generated in a short amount of time (barring overheating of the CPU).
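For completeness, a sketch of how this might be run (my own example; the engine count is arbitrary). The parallel engines have to be started first, for instance from a terminal or the notebook's Clusters tab:

ipcluster start -n 4

With the engines running, the function is called exactly as the serial version, just with a much larger number of samples:

pval, observed_diff, diffs = permutation_resampling(
    case, control, 10000000, np.mean)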
