# Parallelizing Bootstrapped Distributions in iPython

I tend to be highly suspicious when I perform statistical tests and they come out significant. Permutation sampling (or bootstrapping distributions to empirically estimate the p-value) tends to allay my fears that the test I have chosen is in some way inappropriate. This page gives a great overview of using bootstrapped distributions to calculate confidence intervals around a statistic, and p-values, in Python. I have used the permutation_resampling function from that page as the basis for what follows.

```python
import numpy as np
import numpy.random as npr

def permutation_resampling(case, control, num_samples, statistic):
    observed_diff = abs(statistic(case) - statistic(control))
    num_case = len(case)
    combined = np.concatenate([case, control])
    diffs = []
    for i in range(num_samples):
        # Shuffle the combined data and split it back into two
        # groups of the original sizes
        xs = npr.permutation(combined)
        diff = statistic(xs[:num_case]) - statistic(xs[num_case:])
        diffs.append(diff)
    # Convert to an array so the elementwise comparisons below work
    diffs = np.array(diffs)
    pval = (np.sum(diffs > observed_diff) +
            np.sum(diffs < -observed_diff)) / float(num_samples)
    return pval, observed_diff, diffs
```
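As a quick sanity check, here is a small self-contained run of the serial version on synthetic data (the group sizes, distributions, and seed are made up for illustration; the function is repeated so the snippet runs on its own):

```python
import numpy as np
import numpy.random as npr

def permutation_resampling(case, control, num_samples, statistic):
    observed_diff = abs(statistic(case) - statistic(control))
    num_case = len(case)
    combined = np.concatenate([case, control])
    diffs = []
    for i in range(num_samples):
        xs = npr.permutation(combined)
        diffs.append(statistic(xs[:num_case]) - statistic(xs[num_case:]))
    diffs = np.array(diffs)
    pval = (np.sum(diffs > observed_diff) +
            np.sum(diffs < -observed_diff)) / float(num_samples)
    return pval, observed_diff, diffs

npr.seed(42)
case = npr.normal(10.0, 1.0, 50)     # hypothetical "case" group
control = npr.normal(9.0, 1.0, 50)   # "control" group, shifted by 1
pval, observed_diff, diffs = permutation_resampling(
    case, control, 1000, np.mean)
# With a true shift of 1.0 and 50 points per group, almost no
# permuted difference exceeds the observed one, so pval is tiny.
```

With only 1000 samples the empirical p-value here bottoms out at zero, which is exactly the resolution problem discussed next.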

In the particular case where I was applying this, my two original tests, an independent two-sample t-test and the Mann-Whitney U, both gave p-values less than 10^-7, so I would need at least 10 000 000 samples before the empirical p-value could resolve anything other than zero. For the simple case I was working on this was not entirely intractable, but I wanted to take advantage of all the computing power I had available to me.
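The sample count follows from simple arithmetic: with N permutation samples, the smallest nonzero empirical p-value is a single exceedance out of N, i.e. 1/N, so resolving p < 10^-7 needs N ≥ 10^7:

```python
num_samples = 10 ** 7

# The empirical p-value is (number of permuted differences more
# extreme than observed) / num_samples, so its smallest nonzero
# value is one exceedance out of num_samples
min_resolvable_pval = 1.0 / num_samples
```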

Fortunately, as well as providing an interactive, graphical environment for Python, the iPython notebook also makes it very easy to launch parallel computing engines. Although the for loop in the function above looks easily parallelizable, its use of non-encapsulated functions and variables makes it difficult for iPython’s native parallelization to handle directly.

iPython uses the Python pickle library to serialize objects to be distributed to each of the engines – unfortunately, pickle does not properly support serialization of a range of objects, including functions containing closures. To get around this, a drop-in replacement for pickle, dill, has been developed that greatly extends pickle’s serialization capabilities. This article gives an overview of the steps required to set up dill for use within iPython – the relevant ones are shown here:
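To see the failure concretely, here is a minimal sketch of standard pickle choking on a closure (the function names are made up for illustration; dill.dumps would succeed on the same object):

```python
import pickle

def make_adder(n):
    def adder(x):
        return x + n   # n is captured from the enclosing scope
    return adder

add_five = make_adder(5)

try:
    pickle.dumps(add_five)
    serialized = True
except (pickle.PicklingError, AttributeError, TypeError):
    # Standard pickle cannot serialize a locally defined function
    # with a closure; it raises here, while dill handles it fine
    serialized = False
```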

```python
import pickle
import dill
from types import FunctionType
from IPython.utils.pickleutil import can_map

# Stop IPython from special-casing plain functions, so that they
# fall through to pickle-based serialization instead
can_map.pop(FunctionType, None)

from IPython.kernel.zmq import serialize
serialize.pickle = pickle
```

This code snippet replaces standard iPython serialization with the dill module. That lets us rewrite the permutation_resampling function above – it now calls a nested function to create random resamples of the data, which can be farmed out independently to all the parallel computing engines.

```python
import numpy as np
from IPython.parallel import Client

def permutation_resampling(case, control, num_samples, statistic):
    """
    Returns p-value that statistic for case is different
    from statistic for control.
    """
    # Set up the parallel computing client
    c = Client()
    # Create a reference to all the parallel computing engines
    dview = c[:]
    # Block execution while processing
    dview.block = True

    observed_diff = abs(statistic(case) - statistic(control))
    num_case = len(case)
    combined = np.concatenate([case, control])

    # Do permutation sampling as a callable function; the closure
    # over combined, num_case and statistic is what dill lets us
    # serialize out to the engines
    def sample_it(i):
        import numpy as np
        import numpy.random as npr
        xs = npr.permutation(combined)
        return statistic(xs[:num_case]) - statistic(xs[num_case:])

    # Create the list of differences by mapping the function across
    # the engines, repeating the call once per required sample
    diffs = np.array(dview.map(sample_it, range(num_samples)))
    pval = (np.sum(diffs > observed_diff) +
            np.sum(diffs < -observed_diff)) / float(num_samples)
    return pval, observed_diff, diffs
```

And there it is! Parallelized bootstrapping of distributions – with 10 000 000 samples created in a short amount of time (barring overheating of the CPU).