I tend to be highly suspicious when I perform statistical tests and they come out significant. Permutation sampling (or bootstrapping distributions to empirically estimate the p-value) tends to allay my fears that the test I have chosen is in some way inappropriate. This page gives a great overview of using bootstrapped distributions to calculate confidence intervals around a statistic, and p-values, in Python. I have used the permutation_resampling function from that page as the basis for the following.

```
import numpy as np
import numpy.random as npr

def permutation_resampling(case, control, num_samples, statistic):
    """
    Returns p-value that statistic for case is different
    from statistic for control.
    """
    observed_diff = abs(statistic(case) - statistic(control))
    num_case = len(case)
    combined = np.concatenate([case, control])
    diffs = []

    for i in range(num_samples):
        # Shuffle the pooled data and split it back into two groups
        xs = npr.permutation(combined)
        diff = np.mean(xs[:num_case]) - np.mean(xs[num_case:])
        diffs.append(diff)

    # Two-sided p-value: fraction of permuted differences at least
    # as extreme as the observed difference
    diffs = np.array(diffs)
    pval = (np.sum(diffs > observed_diff) +
            np.sum(diffs < -observed_diff)) / float(num_samples)
    return pval, observed_diff, diffs
```

In the particular case I was applying this to, my two original tests (an independent two-sample t-test and the Mann-Whitney U) both gave p-values less than 10^-7, so I would need at least 10 000 000 permutation samples for the empirical p-value to resolve anything that small. For the simple case I was working on this was not entirely intractable, but I wanted to take advantage of all the computing power I had available to me.
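As a quick sanity check of the serial approach, here is a minimal self-contained run on synthetic data (the seed, sample sizes, and effect size are all illustrative). It also shows why the sample count matters: the smallest nonzero p-value the method can report is 1/num_samples.

```python
import numpy as np
import numpy.random as npr

npr.seed(0)
case = npr.normal(1.0, 1.0, 50)      # shifted group (illustrative)
control = npr.normal(0.0, 1.0, 50)   # baseline group (illustrative)

num_samples = 1000
observed_diff = abs(case.mean() - control.mean())
combined = np.concatenate([case, control])

diffs = []
for i in range(num_samples):
    # Shuffle the pooled data and recompute the difference in means
    xs = npr.permutation(combined)
    diffs.append(np.mean(xs[:50]) - np.mean(xs[50:]))
diffs = np.array(diffs)

pval = (np.sum(diffs > observed_diff) +
        np.sum(diffs < -observed_diff)) / float(num_samples)
```

With a full standard deviation separating the groups, the empirical p-value lands at or near the 1/num_samples resolution floor, which is exactly the problem when the true p-value is below 10^-7.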

Fortunately, as well as providing an interactive, graphical environment for Python, the iPython notebook also makes it very easy to launch parallel computing engines. Although the for loop in the above function looks easy to parallelize, its use of non-encapsulated functions and variables (i.e. closures) actually makes it difficult for iPython’s native parallelization to handle directly.
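The sticking point is serializing those closures. Here is a small stdlib-only sketch of the failure mode (the function names are purely illustrative): standard pickle serializes functions by reference, so a function defined inside another function cannot be pickled at all.

```python
import pickle

def make_adder(n):
    # add is a closure over n; pickle looks functions up by their
    # qualified name, and a local function cannot be found that way
    def add(x):
        return x + n
    return add

f = make_adder(3)
try:
    pickle.dumps(f)
    picklable = True
except (pickle.PicklingError, AttributeError):
    picklable = False
# picklable ends up False: standard pickle rejects the closure
```

This is the class of object dill, introduced below, is able to serialize.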

iPython uses the Python pickle library to serialize objects to be distributed to each of the engines. Unfortunately, pickle does not properly support serialization of a range of objects, including functions containing closures. To get around this, a drop-in replacement for pickle, dill, has been developed that greatly extends pickle’s serialization capabilities. This article gives an overview of the steps required to set up dill for use within iPython; the ones required here are shown below:

```
import pickle
import dill  # importing dill extends what pickle can serialize

# Remove iPython's special-case handler for functions so that they
# are serialized through the (now dill-extended) pickle machinery
from types import FunctionType
from IPython.utils.pickleutil import can_map
can_map.pop(FunctionType, None)

# Point iPython's ZMQ serializer at the pickle module directly
from IPython.kernel.zmq import serialize
serialize.pickle = pickle
```

This snippet replaces standard iPython serialization with the dill module. That lets us rewrite the permutation_resampling function so that each random resample of the data is produced by a nested function, whose calls can then be farmed out independently to all the parallel computing engines.

```
import numpy as np
from IPython.parallel import Client

def permutation_resampling(case, control, num_samples, statistic):
    """
    Returns p-value that statistic for case is different
    from statistic for control.
    """
    # Set up the parallel computing client
    c = Client()
    # Create a view onto all the parallel computing engines
    dview = c[:]
    # Block execution while processing
    dview.block = True

    observed_diff = abs(statistic(case) - statistic(control))
    num_case = len(case)
    combined = np.concatenate([case, control])

    # One permutation sample as a callable function; it closes over
    # combined and num_case, which is what makes dill necessary
    def sample_it(i):
        import numpy as np
        import numpy.random as npr
        xs = npr.permutation(combined)
        return np.mean(xs[:num_case]) - np.mean(xs[num_case:])

    # Map the function across the engines, repeating the call as
    # many times as required by the number of samples
    diffs = np.array(dview.map(sample_it, range(num_samples)))

    pval = (np.sum(diffs > observed_diff) +
            np.sum(diffs < -observed_diff)) / float(num_samples)
    return pval, observed_diff, diffs
```

And there it is! Parallelized bootstrapping of distributions, with 10 000 000 samples able to be created in a short amount of time (barring overheating of the CPU).
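One practical note: the Client call assumes a set of engines is already running. In the IPython.parallel era they could be started from the notebook’s Clusters tab, or (the engine count here is illustrative; match it to your core count) from a shell:

```shell
# Start a local cluster of 4 parallel computing engines
ipcluster start -n 4
```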
