Chapter 2


This chapter introduces more PyMC syntax and design patterns, and ways to think about how to model a system from a Bayesian perspective. It also contains tips and data visualization techniques for assessing goodness-of-fit for your Bayesian model.

A little more on PyMC

Parent and Child relationships

To assist with describing Bayesian relationships, and to be consistent with PyMC's documentation, we introduce parent and child variables.

  • parent variables are variables that influence another variable.

  • child variables are variables that are affected by other variables, i.e., are the subjects of parent variables.

A variable can be both a parent and child. For example, consider the PyMC code below.

In [1]:
import pymc as pm


parameter = pm.Exponential("poisson_param", 1)
data_generator = pm.Poisson("data_generator", parameter)
data_plus_one = data_generator + 1

parameter controls the parameter of data_generator, hence influences its values. The former is a parent of the latter. By symmetry, data_generator is a child of parameter.

Likewise, data_generator is a parent to the variable data_plus_one (hence making data_generator both a parent and child variable). Although it does not look like one, data_plus_one should be treated as a PyMC variable as it is a function of another PyMC variable, hence is a child variable to data_generator.

This nomenclature is introduced to help us describe relationships in PyMC modeling. You can access a variable's children and parents using the children and parents attributes attached to the variable.

In [2]:
print "Children of `parameter`: "
print parameter.children
print "\nParents of `data_generator`: "
print data_generator.parents
print "\nChildren of `data_generator`: "
print data_generator.children
Children of `parameter`: 
set([<pymc.distributions.Poisson 'data_generator' at 0x0000000018A6D860>])

Parents of `data_generator`: 
{'mu': <pymc.distributions.Exponential 'poisson_param' at 0x000000000A0DA128>}

Children of `data_generator`: 
set([<pymc.PyMCObjects.Deterministic '(data_generator_add_1)' at 0x000000000A0DA0B8>])

Of course a child can have more than one parent, and a parent can have many children.

PyMC Variables

All PyMC variables also expose a value attribute. This attribute holds the current (possibly random) internal value of the variable. If the variable is a child variable, its value changes given the values of its parents. Using the same variables from before:

In [3]:
print "parameter.value =", parameter.value
print "data_generator.value =", data_generator.value
print "data_plus_one.value =", data_plus_one.value
parameter.value = 0.349596095567
data_generator.value = 0
data_plus_one.value = 1

PyMC is concerned with two types of programming variables: stochastic and deterministic.

  • stochastic variables are variables that are not deterministic, i.e., even if you knew all the values of the variable's parents (if it has any), its value would still be random. Included in this category are instances of the classes Poisson, DiscreteUniform, and Exponential.

  • deterministic variables are variables that are not random if the variable's parents are known. This might be confusing at first; a quick mental check is: if I knew all of variable foo's parent values, I could determine exactly what foo's value is.

We will detail each below.

Initializing Stochastic variables

Initializing a stochastic variable requires a name argument, plus additional parameters that are class specific. For example:

some_variable = pm.DiscreteUniform("discrete_uni_var", 0, 4)

where 0, 4 are the DiscreteUniform-specific lower and upper bounds on the random variable. The PyMC docs contain the specific parameters for each stochastic variable. (Or use ?? if you are using IPython!)

The name attribute is used to retrieve the posterior distribution later in the analysis, so it is best to use a descriptive name. Typically, I use the Python variable's name as the name.

For multivariable problems, rather than creating a Python array of stochastic variables, passing the size keyword argument when creating a stochastic variable produces a multivariate array of (independent) stochastic variables. The array behaves like a NumPy array when used like one, and references to its value attribute return NumPy arrays.

The size argument also solves the annoying case where you may have many variables $\beta_i, \; i = 1,...,N$ you wish to model. Instead of creating arbitrary names and variables for each one, like:

beta_1 = pm.Uniform("beta_1", 0, 1)
beta_2 = pm.Uniform("beta_2", 0, 1)
...

we can instead wrap them into a single variable:

betas = pm.Uniform("betas", 0, 1, size=N)
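As a rough analogue of what the size keyword gives you (a NumPy sketch for illustration, not PyMC), betas.value would resemble an array of N independent Uniform(0, 1) draws:

```python
import numpy as np

N = 5

# NumPy analogue (illustration only, not PyMC): betas.value would be an
# array of N independent Uniform(0, 1) draws, one per beta_i.
betas_value = np.random.uniform(0, 1, size=N)

print(betas_value.shape)  # (5,)
```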

Calling random()

We can also call on a stochastic variable's random() method, which (given the parent values) will generate a new, random value. Below we demonstrate this using the texting example from the previous chapter.

In [4]:
lambda_1 = pm.Exponential("lambda_1", 1)  # prior on first behaviour
lambda_2 = pm.Exponential("lambda_2", 1)  # prior on second behaviour
tau = pm.DiscreteUniform("tau", lower=0, upper=10)  # prior on behaviour change

print "lambda_1.value = %.3f" % lambda_1.value
print "lambda_2.value = %.3f" % lambda_2.value
print "tau.value = %.3f" % tau.value
print

lambda_1.random(), lambda_2.random(), tau.random()

print "After calling random() on the variables..."
print "lambda_1.value = %.3f" % lambda_1.value
print "lambda_2.value = %.3f" % lambda_2.value
print "tau.value = %.3f" % tau.value
lambda_1.value = 0.261
lambda_2.value = 1.093
tau.value = 6.000

After calling random() on the variables...
lambda_1.value = 1.607
lambda_2.value = 0.376
tau.value = 4.000

The call to random stores a new value into the variable's value attribute. In fact, this new value is stored in the computer's cache for faster recall and efficiency.

Warning: Don't update stochastic variables' values in-place.

Straight from the PyMC docs, we quote [4]:

Stochastic objects' values should not be updated in-place. This confuses PyMC's caching scheme... The only way a stochastic variable's value should be updated is using statements of the following form:

    A.value = new_value

The following are in-place updates and should never be used:

    A.value += 3
    A.value[2,1] = 5
    A.value.attribute = new_attribute_value

Deterministic variables

Since most variables you will be modeling are stochastic, we distinguish deterministic variables with a pymc.deterministic wrapper. (If you are unfamiliar with Python wrappers (also called decorators), that's no problem. Just prepend the pymc.deterministic decorator before the variable declaration and you're good to go. No need to know more. ) The declaration of a deterministic variable uses a Python function:

@pm.deterministic
def some_deterministic_var(v1=v1):
    # jelly goes here.
    return v1  # i.e., some function of v1

For all purposes, we can treat the object some_deterministic_var as a variable and not a Python function.

Prepending the wrapper is the easiest, but not the only, way to create deterministic variables: elementary operations, like addition and exponentiation, implicitly create deterministic variables. For example, the following returns a deterministic variable:

In [5]:
type(lambda_1 + lambda_2)
Out[5]:
pymc.PyMCObjects.Deterministic

The use of the deterministic wrapper was seen in the previous chapter's text-message example. Recall the model for $\lambda$ looked like:

$$ \lambda = \cases{ \lambda_1 & \text{if } t \lt \tau \cr \lambda_2 & \text{if } t \ge \tau } $$

And in PyMC code:

In [6]:
import numpy as np
n_data_points = 5  # in CH1 we had ~70 data points


@pm.deterministic
def lambda_(tau=tau, lambda_1=lambda_1, lambda_2=lambda_2):
    out = np.zeros(n_data_points)
    out[:tau] = lambda_1  # lambda before tau is lambda1
    out[tau:] = lambda_2  # lambda after (and including) tau is lambda2
    return out

Clearly, if $\tau, \lambda_1$ and $\lambda_2$ are known, then $\lambda$ is known completely, hence it is a deterministic variable.

Inside the deterministic decorator, the Stochastic variables passed in behave like scalars or Numpy arrays (if multivariable), and not like Stochastic variables. For example, running the following:

@pm.deterministic
def some_deterministic(stoch=some_stochastic_var):
    return stoch.value**2

will raise an AttributeError complaining that stoch does not have a value attribute: inside the function, stoch is already a plain value, so the body should simply be stoch**2. During the learning phase, it is the variable's value that is repeatedly passed in, not the variable itself.

Notice in the creation of the deterministic function we added defaults to each variable used in the function. This is a necessary step, and all variables must have default values.

Including observations in the Model

At this point, it may not look like it, but we have fully specified our priors. For example, we can ask and answer questions like "What does my prior distribution of $\lambda_1$ look like?"

In [7]:
%matplotlib inline
from IPython.core.pylabtools import figsize
from matplotlib import pyplot as plt
figsize(12.5, 4)


samples = [lambda_1.random() for i in range(20000)]
plt.hist(samples, bins=70, normed=True, histtype="stepfilled")
plt.title("Prior distribution for $\lambda_1$")
plt.xlim(0, 8);

To frame this in the notation of the first chapter, though this is a slight abuse of notation, we have specified $P(A)$. Our next goal is to include data/evidence/observations $X$ into our model.

PyMC stochastic variables have a keyword argument observed which accepts a boolean (False by default). The keyword observed has a very simple role: it fixes the variable's current value, i.e., makes value immutable. We have to specify an initial value in the variable's creation, equal to the observations we wish to include, typically an array (and it should be a NumPy array for speed). For example:

In [8]:
data = np.array([10, 5])
fixed_variable = pm.Poisson("fxd", 1, value=data, observed=True)
print "value: ", fixed_variable.value
print "calling .random()"
fixed_variable.random()
print "value: ", fixed_variable.value
value:  [10  5]
calling .random()
value:  [10  5]

This is how we include data into our models: initializing a stochastic variable to have a fixed value.

To complete our text message example, we fix the PyMC variable observations to the observed dataset.

In [9]:
# We're using some fake data here
data = np.array([10, 25, 15, 20, 35])
obs = pm.Poisson("obs", lambda_, value=data, observed=True)
print obs.value
[10 25 15 20 35]

Finally...

We wrap all the created variables into a pm.Model class. With this Model class, we can analyze the variables as a single unit. This is an optional step, as the fitting algorithms can be sent an array of the variables rather than a Model class. I may or may not use this class in future examples ;)

In [10]:
model = pm.Model([obs, lambda_, lambda_1, lambda_2, tau])

Modeling approaches

A good way to start Bayesian modeling is to think about how your data might have been generated. Position yourself in an omniscient role, and try to imagine how you would recreate the dataset.

In the last chapter we investigated text message data. We begin by asking how our observations may have been generated:

  1. We started by thinking "what is the best random variable to describe this count data?" A Poisson random variable is a good candidate because it can represent count data. So we model the number of sms's received as sampled from a Poisson distribution.

  2. Next, we think, "Ok, assuming sms's are Poisson-distributed, what do I need for the Poisson distribution?" Well, the Poisson distribution has a parameter $\lambda$.

  3. Do we know $\lambda$? No. In fact, we have a suspicion that there are two $\lambda$ values, one for the earlier behaviour and one for the later behaviour. We don't know when the behaviour switches, so we call the switchpoint $\tau$.

  4. What is a good distribution for the two $\lambda$s? The exponential is good, as it assigns probability to positive real numbers. Well, the exponential distribution has a parameter too; call it $\alpha$.

  5. Do we know what the parameter $\alpha$ might be? No. At this point, we could continue and assign a distribution to $\alpha$, but it's better to stop once we reach a set level of ignorance: whereas we have a prior belief about $\lambda$, ("it probably changes over time", "it's likely between 10 and 30", etc.), we don't really have any strong beliefs about $\alpha$. So it's best to stop here.

    What is a good value for $\alpha$ then? We think that the $\lambda$s are between 10 and 30, so if we set $\alpha$ really low (which corresponds to larger probability on high values) we are not reflecting our prior well. Similarly, a too-high $\alpha$ misses our prior belief as well. A good way for $\alpha$ to reflect our belief is to set its value so that the mean of $\lambda$, given $\alpha$, equals our observed mean. This was shown in the last chapter.

  6. We have no expert opinion of when $\tau$ might have occurred. So we will suppose $\tau$ is from a discrete uniform distribution over the entire timespan.
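The matching rule in step 5 can be made concrete: since an $\text{Exp}(\alpha)$ distribution has mean $1/\alpha$, setting $\alpha$ to the reciprocal of the sample mean matches the prior mean of $\lambda$ to the observed mean. A minimal sketch, with made-up counts:

```python
import numpy as np

# Hypothetical count data with sample mean 20 (illustration only).
count_data = np.array([13, 24, 18, 22, 23])

# An Exponential(alpha) prior on lambda has mean 1/alpha, so matching
# the prior mean of lambda to the observed sample mean gives:
alpha = 1. / count_data.mean()

print(alpha)  # 0.05
```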

Below we give a graphical visualization of this, where arrows denote parent-child relationships (figure produced with the Daft Python library).

PyMC, and other probabilistic programming languages, have been designed to tell these data-generation stories. More generally, B. Cronin writes [5]:

Probabilistic programming will unlock narrative explanations of data, one of the holy grails of business analytics and the unsung hero of scientific persuasion. People think in terms of stories - thus the unreasonable power of the anecdote to drive decision-making, well-founded or not. But existing analytics largely fails to provide this kind of story; instead, numbers seemingly appear out of thin air, with little of the causal context that humans prefer when weighing their options.

Same story; different ending.

Interestingly, we can create new datasets by retelling the story. For example, if we reverse the above steps, we can simulate a possible realization of the dataset.

1. Specify when the user's behaviour switches by sampling from $\text{DiscreteUniform}(0, 80)$:

In [11]:
tau = pm.rdiscrete_uniform(0, 80)
print tau
14

2. Draw $\lambda_1$ and $\lambda_2$ from an $\text{Exp}(\alpha)$ distribution:

In [12]:
alpha = 1. / 20.
lambda_1, lambda_2 = pm.rexponential(alpha, 2)
print lambda_1, lambda_2
22.6794523325 37.138092023

3. For days before $\tau$, represent the user's received SMS count by sampling from $\text{Poi}(\lambda_1)$, and sample from $\text{Poi}(\lambda_2)$ for days after $\tau$. For example:

In [13]:
data = np.r_[pm.rpoisson(lambda_1, tau), pm.rpoisson(lambda_2, 80 - tau)]

4. Plot the artificial dataset:

In [14]:
plt.bar(np.arange(80), data, color="#348ABD")
plt.bar(tau - 1, data[tau - 1], color="r", label="user behaviour changed")
plt.xlabel("Time (days)")
plt.ylabel("count of text-msgs received")
plt.title("Artificial dataset")
plt.xlim(0, 80)
plt.legend();

It is okay that our fictional dataset does not look like our observed dataset: the probability that it would is incredibly small. PyMC's engine is designed to find good parameters, $\lambda_i, \tau$, that maximize this probability.

The ability to generate artificial datasets is an interesting side effect of our modeling, and we will see that it underlies a very important method of Bayesian inference. We produce a few more datasets below:

In [15]:
def plot_artificial_sms_dataset():
    tau = pm.rdiscrete_uniform(0, 80)
    alpha = 1. / 20.
    lambda_1, lambda_2 = pm.rexponential(alpha, 2)
    data = np.r_[pm.rpoisson(lambda_1, tau), pm.rpoisson(lambda_2, 80 - tau)]
    plt.bar(np.arange(80), data, color="#348ABD")
    plt.bar(tau - 1, data[tau - 1], color="r", label="user behaviour changed")
    plt.xlim(0, 80)

figsize(12.5, 5)
plt.title("More examples of artificial datasets")
for i in range(4):
    plt.subplot(4, 1, i + 1)  # subplot indices are 1-based
    plot_artificial_sms_dataset()

Later we will see how we use this to make predictions and test the appropriateness of our models.

Example: Bayesian A/B testing

A/B testing is a statistical design pattern for determining the difference of effectiveness between two different treatments. For example, a pharmaceutical company is interested in the effectiveness of drug A vs drug B. The company will test drug A on some fraction of their trials, and drug B on the other fraction (this fraction is often 1/2, but we will relax this assumption). After performing enough trials, the in-house statisticians sift through the data to determine which drug yielded better results.

Similarly, front-end web developers are interested in which design of their website yields more sales or some other metric of interest. They will route some fraction of visitors to site A, and the other fraction to site B, and record if the visit yielded a sale or not. The data is recorded (in real-time), and analyzed afterwards.

Often, the post-experiment analysis is done using a hypothesis test, like a difference-of-means test or a difference-of-proportions test. This involves often-misunderstood quantities like "Z-scores" and even more confusing "p-values" (please don't ask). If you have taken a statistics course, you have probably been taught this technique (though not necessarily learned this technique). And if you were like me, you may have felt uncomfortable with their derivation -- good: the Bayesian approach to this problem is much more natural.

A Simple Case

As this is a hacker book, we'll continue with the web-dev example. For the moment, we will focus on the analysis of site A only. Assume that there is some true probability, $0 \lt p_A \lt 1$, that users shown site A eventually purchase from the site. This is the true effectiveness of site A. Currently, this quantity is unknown to us.

Suppose site A was shown to $N$ people, and $n$ people purchased from the site. One might conclude hastily that $p_A = \frac{n}{N}$. Unfortunately, the observed frequency $\frac{n}{N}$ does not necessarily equal $p_A$ -- there is a difference between the observed frequency and the true frequency of an event. The true frequency can be interpreted as the probability of an event occurring. For example, the true frequency of rolling a 1 on a 6-sided die is $\frac{1}{6}$. Knowing the true frequency of events like:

  • fraction of users who make purchases,
  • frequency of social attributes,
  • percent of internet users with cats etc.

are common requests we ask of Nature. Unfortunately, often Nature hides the true frequency from us and we must infer it from observed data.

The observed frequency is then the frequency we observe: say rolling the die 100 times you may observe 20 rolls of 1. The observed frequency, 0.2, differs from the true frequency, $\frac{1}{6}$. We can use Bayesian statistics to infer probable values of the true frequency using an appropriate prior and observed data.
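This gap between observed and true frequency is easy to see in simulation; a NumPy sketch (the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

true_freq = 1. / 6
rolls = rng.integers(1, 7, size=100)   # 100 rolls of a fair 6-sided die
observed_freq = (rolls == 1).mean()

# observed_freq fluctuates around, but rarely equals, the true 1/6.
print(observed_freq, true_freq)
```

With more rolls, the observed frequency drifts toward the true one, which is exactly the kind of convergence Bayesian inference exploits.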

With respect to our A/B example, we are interested in using what we know, $N$ (the total trials administered) and $n$ (the number of conversions), to estimate what $p_A$, the true frequency of buyers, might be.

To set up a Bayesian model, we need to assign prior distributions to our unknown quantities. A priori, what do we think $p_A$ might be? For this example, we have no strong conviction about $p_A$, so for now, let's assume $p_A$ is uniform over [0,1]:

In [16]:
import pymc as pm

# The parameters are the bounds of the Uniform.
p = pm.Uniform('p', lower=0, upper=1)

Had we had stronger beliefs, we could have expressed them in the prior above.

For this example, take $p_A = 0.05$ and $N = 1500$ users shown site A; we will simulate whether each user made a purchase or not. To simulate this from $N$ trials, we will use a Bernoulli distribution: if $X \sim \text{Ber}(p)$, then $X$ is 1 with probability $p$ and 0 with probability $1 - p$. Of course, in practice we do not know $p_A$, but we will use it here to simulate the data.

In [17]:
# set constants
p_true = 0.05  # remember, this is unknown.
N = 1500

# sample N Bernoulli random variables from Ber(0.05).
# each random variable has a 0.05 chance of being a 1.
# this is the data-generation step
occurrences = pm.rbernoulli(p_true, N)

print occurrences  # Remember: Python treats True == 1, and False == 0
print occurrences.sum()
[False False False ..., False False False]
87

The observed frequency is:

In [18]:
# occurrences.mean() is equal to n/N.
print "What is the observed frequency in Group A? %.4f" % occurrences.mean()
print "Does this equal the true frequency? %s" % (occurrences.mean() == p_true)
What is the observed frequency in Group A? 0.0580
Does this equal the true frequency? False

We combine the observations into the PyMC observed variable, and run our inference algorithm:

In [19]:
# include the observations, which are Bernoulli
obs = pm.Bernoulli("obs", p, value=occurrences, observed=True)

# To be explained in chapter 3
mcmc = pm.MCMC([p, obs])
mcmc.sample(18000, 1000)
 [-----------------100%-----------------] 18000 of 18000 complete in 1.0 sec

We plot the posterior distribution of the unknown $p_A$ below:

In [20]:
figsize(12.5, 4)
plt.title("Posterior distribution of $p_A$, the true effectiveness of site A")
plt.vlines(p_true, 0, 90, linestyle="--", label="true $p_A$ (unknown)")
plt.hist(mcmc.trace("p")[:], bins=25, histtype="stepfilled", normed=True)
plt.legend()
Out[20]:
<matplotlib.legend.Legend at 0x1927ad30>

Our posterior distribution puts most weight near the true value of $p_A$, but also some weight in the tails. This is a measure of how uncertain we should be, given our observations. Try changing the number of observations, N, and observe how the posterior distribution changes.

A and B Together

A similar analysis can be done on site B's response data to determine the analogous $p_B$. But what we are really interested in is the difference between $p_A$ and $p_B$. Let's infer $p_A$, $p_B$, and $\text{delta} = p_A - p_B$, all at once. We can do this using PyMC's deterministic variables. (We'll assume for this exercise that $p_B = 0.04$, so $\text{delta} = 0.01$, and $N_B = 750$, significantly fewer than $N_A$, and we will simulate site B's data like we did for site A's.)

In [21]:
import pymc as pm
figsize(12, 4)

# these two quantities are unknown to us.
true_p_A = 0.05
true_p_B = 0.04

# notice the unequal sample sizes -- no problem in Bayesian analysis.
N_A = 1500
N_B = 750

# generate some observations
observations_A = pm.rbernoulli(true_p_A, N_A)
observations_B = pm.rbernoulli(true_p_B, N_B)
print "Obs from Site A: ", observations_A[:30].astype(int), "..."
print "Obs from Site B: ", observations_B[:30].astype(int), "..."
Obs from Site A:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0] ...
Obs from Site B:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] ...

In [22]:
print observations_A.mean()
print observations_B.mean()
0.0486666666667
0.0453333333333

In [23]:
# Set up the pymc model. Again assume Uniform priors for p_A and p_B.
p_A = pm.Uniform("p_A", 0, 1)
p_B = pm.Uniform("p_B", 0, 1)


# Define the deterministic delta function. This is our unknown of interest.
@pm.deterministic
def delta(p_A=p_A, p_B=p_B):
    return p_A - p_B

# Set of observations, in this case we have two observation datasets.
obs_A = pm.Bernoulli("obs_A", p_A, value=observations_A, observed=True)
obs_B = pm.Bernoulli("obs_B", p_B, value=observations_B, observed=True)

# To be explained in chapter 3.
mcmc = pm.MCMC([p_A, p_B, delta, obs_A, obs_B])
mcmc.sample(20000, 1000)
 [-----------------100%-----------------] 20000 of 20000 complete in 2.1 sec

Below we plot the posterior distributions for the three unknowns:

In [24]:
p_A_samples = mcmc.trace("p_A")[:]
p_B_samples = mcmc.trace("p_B")[:]
delta_samples = mcmc.trace("delta")[:]
In [25]:
figsize(12.5, 10)

# histogram of posteriors

ax = plt.subplot(311)

plt.xlim(0, .1)
plt.hist(p_A_samples, histtype='stepfilled', bins=25, alpha=0.85,
         label="posterior of $p_A$", color="#A60628", normed=True)
plt.vlines(true_p_A, 0, 80, linestyle="--", label="true $p_A$ (unknown)")
plt.legend(loc="upper right")
plt.title("Posterior distributions of $p_A$, $p_B$, and delta unknowns")

ax = plt.subplot(312)

plt.xlim(0, .1)
plt.hist(p_B_samples, histtype='stepfilled', bins=25, alpha=0.85,
         label="posterior of $p_B$", color="#467821", normed=True)
plt.vlines(true_p_B, 0, 80, linestyle="--", label="true $p_B$ (unknown)")
plt.legend(loc="upper right")

ax = plt.subplot(313)
plt.hist(delta_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label="posterior of delta", color="#7A68A6", normed=True)
plt.vlines(true_p_A - true_p_B, 0, 60, linestyle="--",
           label="true delta (unknown)")
plt.vlines(0, 0, 60, color="black", alpha=0.2)
plt.legend(loc="upper right");

Notice that as a result of N_B < N_A, i.e. we have less data from site B, our posterior distribution of $p_B$ is fatter, implying we are less certain about the true value of $p_B$ than we are of $p_A$.

With respect to the posterior distribution of $\text{delta}$, we can see that the majority of the distribution is above $\text{delta}=0$, implying that site A's response is likely better than site B's. The probability this inference is incorrect is easily computable:

In [26]:
# Count the number of samples less than 0, i.e. the area under the curve
# before 0, representing the probability that site A is worse than site B.
print "Probability site A is WORSE than site B: %.3f" % \
    (delta_samples < 0).mean()

print "Probability site A is BETTER than site B: %.3f" % \
    (delta_samples > 0).mean()
Probability site A is WORSE than site B: 0.384
Probability site A is BETTER than site B: 0.616

If this probability is too high for comfortable decision-making, we can perform more trials on site B (since site B has fewer samples to begin with, each additional data point for site B contributes more inferential "power" than each additional data point for site A).

Try playing with the parameters true_p_A, true_p_B, N_A, and N_B, to see what the posterior of $\text{delta}$ looks like. Notice in all this, the difference in sample sizes between site A and site B was never mentioned: it naturally fits into Bayesian analysis.

I hope the readers feel this style of A/B testing is more natural than hypothesis testing, which has probably confused more than helped practitioners. Later in this book, we will see two extensions of this model: the first to help dynamically adjust for bad sites, and the second will improve the speed of this computation by reducing the analysis to a single equation.

An algorithm for human deceit

Social data has an additional layer of interest as people are not always honest with responses, which adds a further complication into inference. For example, simply asking individuals "Have you ever cheated on a test?" will surely contain some rate of dishonesty. What you can say for certain is that the true rate is less than your observed rate (assuming individuals lie only about not cheating; I cannot imagine one who would admit "Yes" to cheating when in fact they hadn't cheated).

To present an elegant solution to circumventing this dishonesty problem, and to demonstrate Bayesian modeling, we first need to introduce the binomial distribution.

The Binomial Distribution

The binomial distribution is one of the most popular distributions, mostly because of its simplicity and usefulness. Unlike the other distributions we have encountered thus far in the book, the binomial distribution has 2 parameters: $N$, a positive integer representing the number of trials or number of instances of potential events, and $p$, the probability of an event occurring in a single trial. Like the Poisson distribution, it is a discrete distribution, but unlike the Poisson distribution, it only assigns mass to integers from $0$ to $N$. Its probability mass function looks like:

$$P( X = k ) = {{N}\choose{k}} p^k(1-p)^{N-k}$$

If $X$ is a binomial random variable with parameters $p$ and $N$, denoted $X \sim \text{Bin}(N,p)$, then $X$ is the number of events that occur in the $N$ trials (obviously $0 \le X \le N$). The larger $p$ is (while still remaining between 0 and 1), the more events are likely to occur. The expected value of a binomial is equal to $Np$. Below we plot the probability mass function for varying parameters.

In [27]:
figsize(12.5, 4)

import scipy.stats as stats
binomial = stats.binom

parameters = [(10, .4), (10, .9)]
colors = ["#348ABD", "#A60628"]

for i in range(2):
    N, p = parameters[i]
    _x = np.arange(N + 1)
    plt.bar(_x - 0.5, binomial.pmf(_x, N, p), color=colors[i],
            edgecolor=colors[i],
            alpha=0.6,
            label="$N$: %d, $p$: %.1f" % (N, p),
            linewidth=3)

plt.legend(loc="upper left")
plt.xlim(0, 10.5)
plt.xlabel("$k$")
plt.ylabel("$P(X = k)$")
plt.title("Probability mass distributions of binomial random variables");
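As a quick numerical sanity check of the formulas above (a sketch using scipy.stats, with made-up parameters):

```python
import scipy.stats as stats

# Made-up parameters, purely to sanity-check the formulas above.
N, p = 10, 0.4
binomial = stats.binom(N, p)

print(binomial.mean())                             # expected value N * p = 4.0
print(sum(binomial.pmf(k) for k in range(N + 1)))  # total mass sums to 1
```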

The special case when $N = 1$ corresponds to the Bernoulli distribution. There is another connection between Bernoulli and Binomial random variables. If we have $X_1, X_2, ... , X_N$ Bernoulli random variables with the same $p$, then $Z = X_1 + X_2 + ... + X_N \sim \text{Binomial}(N, p )$.

The expected value of a Bernoulli random variable is $p$. This can be seen by noting the more general Binomial random variable has expected value $Np$ and setting $N=1$.
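The Bernoulli-to-Binomial connection can also be checked empirically; a NumPy sketch (seed and sample sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
N, p, trials = 10, 0.4, 50000

# Each row holds N Bernoulli(p) draws; their row sums should then be
# distributed as Binomial(N, p), with mean close to N * p = 4.
X = rng.random((trials, N)) < p
Z = X.sum(axis=1)

print(Z.mean())  # close to 4
```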

Example: Cheating among students

We will use the binomial distribution to determine the frequency of students cheating during an exam. If we let $N$ be the total number of students who took the exam, and assuming each student is interviewed post-exam (answering without consequence), we will receive integer $X$ "Yes I did cheat" answers. We then find the posterior distribution of $p$, given $N$, some specified prior on $p$, and observed data $X$.


This is a completely absurd model. No student, even with a free-pass against punishment, would admit to cheating. What we need is a better algorithm to ask students if they had cheated. Ideally the algorithm should encourage individuals to be honest while preserving privacy. The following proposed algorithm is a solution I greatly admire for its ingenuity and effectiveness:

In the interview process for each student, the student flips a coin, hidden from the interviewer. The student agrees to answer honestly if the coin comes up heads. Otherwise, if the coin comes up tails, the student (secretly) flips the coin again, and answers "Yes, I did cheat" if the coin flip lands heads, and "No, I did not cheat", if the coin flip lands tails. This way, the interviewer does not know if a "Yes" was the result of a guilty plea, or a Heads on a second coin toss. Thus privacy is preserved and the researchers receive honest answers.
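The mechanics of the algorithm can be simulated directly. With two fair coin flips, the long-run rate of "Yes" answers is $\frac{p}{2} + \frac{1}{4}$: half the students answer honestly (contributing $p/2$) and half answer from the second flip (contributing $1/4$). A NumPy sketch with a made-up true cheating frequency:

```python
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.3           # hypothetical true cheating frequency
n_students = 100000    # large, to see the long-run "Yes" rate

cheated = rng.random(n_students) < p_true
first_flip = rng.random(n_students) < 0.5    # heads -> answer honestly
second_flip = rng.random(n_students) < 0.5   # heads -> say "Yes"

# Honest answer on a first-flip heads, second-flip answer otherwise:
says_yes = np.where(first_flip, cheated, second_flip)

# Long-run "Yes" rate is 0.5 * p_true + 0.25 = 0.4 here:
print(says_yes.mean())
```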


I call this the Privacy Algorithm. One could of course argue that the interviewers are still receiving false data since some Yes's are not confessions but instead randomness, but an alternative perspective is that the researchers are discarding approximately half of their original dataset since half of the responses will be noise. But they have gained a systematic data generation process that can be modeled. Furthermore, they do not have to incorporate (perhaps somewhat naively) the possibility of deceitful answers. We can use PyMC to dig through this noisy model, and find a posterior distribution for the true frequency of liars.

Suppose 100 students are being surveyed for cheating, and we wish to find $p$, the proportion of cheaters. There are a few ways we can model this in PyMC. I'll demonstrate the most explicit way, and later show a simplified version. Both versions arrive at the same inference. In our data-generation model, we sample $p$, the true proportion of cheaters, from a prior. Since we are quite ignorant about $p$, we will assign it a $\text{Uniform}(0,1)$ prior.

In [28]:
import pymc as pm

N = 100
p = pm.Uniform("freq_cheating", 0, 1)

Again, thinking of our data-generation model, we assign Bernoulli random variables to the 100 students: 1 implies they cheated and 0 implies they did not.

In [29]:
true_answers = pm.Bernoulli("truths", p, size=N)

If we carry out the algorithm, the next step that occurs is the first coin-flip each student makes. This can be modeled again by sampling 100 Bernoulli random variables with $p=1/2$: denote a 1 as a Heads and 0 a Tails.

In [30]:
first_coin_flips = pm.Bernoulli("first_flips", 0.5, size=N)
print first_coin_flips.value
[ True  True  True  True False False  True False False False False  True
  True False False  True False  True  True  True  True False  True False
 False  True False False  True  True  True  True False  True  True False
  True  True False False  True  True  True  True  True False  True  True
 False  True  True  True  True False False False False  True  True  True
 False False  True  True False False  True  True False  True False False
  True  True  True  True False  True  True  True  True False False False
  True False False  True  True False False  True  True False  True False
 False False  True False]

Although not everyone flips a second time, we can still model the possible realization of second coin-flips:

In [31]:
second_coin_flips = pm.Bernoulli("second_flips", 0.5, size=N)
a = true_answers * first_coin_flips * 1
print '-----t_a----------------------------'
print (true_answers*1).value
print '-----fc----------------------------'
print (first_coin_flips*1).value
print '-----fc*t_a--------------------------'
print a.value
print '-----1-fc--------------------------'
c = 1 - first_coin_flips
print c.value
print '------sc--------------------------'
print (second_coin_flips*1).value
print '------(1-fc)*sc--------------------------'
d =  c*second_coin_flips
print d.value
print '----fc*ta + (1-fc)*sc --------------'
k = a + d
print k.value
-----t_a----------------------------
[0 0 0 1 1 0 0 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 1 1 0 1 1
 0 1 1 1 0 0 1 0 0 1 1 0 1 1 1 1 1 1 0 0 1 1 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1
 0 1 0 0 1 1 1 0 0 1 1 1 0 1 1 0 1 1 0 1 1 1 1 0 1 1]
-----fc----------------------------
[1 1 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 1 1 1 1 0 1 0 0 1 0 0 1 1 1 1 0 1 1 0 1
 1 0 0 1 1 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 1 1 0 0 1 1 0 1 0 0 1 1
 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 0]
-----fc*t_a--------------------------
[0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 1 1 0 1 0 0 1 0 0 0 0 1 1 0 1 0 0 1
 0 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 1
 0 1 0 0 1 1 1 0 0 0 1 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0]
-----1-fc--------------------------
[0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 0 0 1 0 0 1 0
 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 0 0 1 0 1 1 0 0
 0 0 1 0 0 0 0 1 1 1 0 1 1 0 0 1 1 0 0 1 0 1 1 1 0 1]
------sc--------------------------
[0 1 0 0 1 1 1 0 1 1 0 0 1 1 1 0 1 0 0 0 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0
 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 1 1 0 1 0 1 1 0 0 0 0 1 0 0 1 1 1 1 0 1
 0 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 1 0]
------(1-fc)*sc--------------------------
[0 0 0 0 1 1 0 0 1 1 0 0 0 1 1 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 1 0
 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0
 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0]
----fc*ta + (1-fc)*sc --------------
[0 0 0 1 1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 1
 0 1 0 1 0 0 1 0 0 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 0 1 1 0 1 0 1 1 0 1 1 1 1
 0 1 1 0 1 1 1 1 1 1 1 0 0 1 1 0 0 1 0 1 1 1 0 1 1 0]

Using these variables, we can return a possible realization of the observed proportion of "Yes" responses. We do this using a PyMC deterministic variable:

In [32]:
@pm.deterministic
def observed_proportion(t_a=true_answers,
                        fc=first_coin_flips,
                        sc=second_coin_flips):

    observed = fc * t_a + (1 - fc) * sc
    return observed.sum() / float(N)

The line fc*t_a + (1-fc)*sc contains the heart of the Privacy Algorithm. Elements in this array are 1 if and only if i) the first toss is heads and the student cheated or ii) the first toss is tails and the second is heads, and are 0 otherwise. Finally, the last line sums this vector and divides by float(N), producing a proportion.
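To see that this expression really encodes exactly those two "Yes" cases, we can enumerate all eight combinations of (true answer, first flip, second flip) — a quick standalone check, not part of the model:

```python
import itertools

# Enumerate every (true answer, first flip, second flip) combination and
# the response produced by fc * t_a + (1 - fc) * sc.
for t_a, fc, sc in itertools.product([0, 1], repeat=3):
    response = fc * t_a + (1 - fc) * sc
    print(t_a, fc, sc, "->", response)
```

The response is 1 in exactly four of the eight rows: the two where the first flip is heads and the student cheated, and the two where the first flip is tails and the second is heads.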

In [33]:
observed_proportion.value
Out[33]:
0.66000000000000003

Next we need a dataset. After performing our coin-flipped interviews the researchers received 35 "Yes" responses. To put this into a relative perspective, if there truly were no cheaters, we should expect to see on average 1/4 of all responses being a "Yes" (half chance of having first coin land Tails, and another half chance of having second coin land Heads), so about 25 responses in a cheat-free world. On the other hand, if all students cheated, we should expect to see approximately 3/4 of all responses be "Yes".

The researchers observe a Binomial random variable, with N = 100 and p = observed_proportion with value = 35:

To spell out the arithmetic: if there were truly no cheaters, the half of the students whose first flip landed heads would all honestly answer "No" (contributing no "Yes" responses), while among the other half, those whose second flip landed heads would answer "Yes" (probability $\frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$), giving an expected "Yes" proportion of 1/4. If every student cheated, the half who flipped heads first would all confess (contributing 1/2), and among the remaining half, those flipping heads on the second toss would also answer "Yes" (another 1/4), for a total of $\frac{1}{2} + \frac{1}{4} = \frac{3}{4}$.

In [34]:
X = 35

observations = pm.Binomial("obs", N, observed_proportion, observed=True,
                           value=X)

Below we add all the variables of interest to a Model container and run our black-box algorithm over the model.

In [35]:
model = pm.Model([p, true_answers, first_coin_flips,
                  second_coin_flips, observed_proportion, observations])

# To be explained in Chapter 3!
mcmc = pm.MCMC(model)
mcmc.sample(40000, 15000)
 [-----------------100%-----------------] 40000 of 40000 complete in 11.9 sec
In [36]:
figsize(12.5, 3)
p_trace = mcmc.trace("freq_cheating")[:]
plt.hist(p_trace, histtype="stepfilled", normed=True, alpha=0.85, bins=30,
         label="posterior distribution", color="#348ABD")
plt.vlines([.05, .35], [0, 0], [5, 5], alpha=0.3)
plt.xlim(0, 1)
plt.legend();

With regards to the above plot, we are still pretty uncertain about what the true frequency of cheaters might be, but we have narrowed it down to a range between 0.05 and 0.35 (marked by the solid lines). This is pretty good, as a priori we had no idea how many students might have cheated (hence the uniform distribution for our prior). On the other hand, it is also pretty bad since there is a 0.3-wide window in which the true value most likely lives. Have we even gained anything, or are we still too uncertain about the true frequency?

I would argue, yes, we have discovered something. It is implausible, according to our posterior, that there are no cheaters, i.e. the posterior assigns low probability to $p=0$. Since we started with a uniform prior, treating all values of $p$ as equally plausible, but the data ruled out $p=0$ as a possibility, we can be confident that there were cheaters.

This kind of algorithm can be used to gather private information from users and be reasonably confident that the data, though noisy, is truthful.

Alternative PyMC Model

Given a value for $p$ (which from our god-like position we know), we can find the probability the student will answer yes:

\begin{align} P(\text{"Yes"}) = & P( \text{Heads on first coin} )P( \text{cheater} ) + P( \text{Tails on first coin} )P( \text{Heads on second coin} ) \\\\ & = \frac{1}{2}p + \frac{1}{2}\frac{1}{2}\\\\ & = \frac{p}{2} + \frac{1}{4} \end{align}

Thus, knowing $p$ we know the probability a student will respond "Yes". In PyMC, we can create a deterministic function to evaluate the probability of responding "Yes", given $p$:

In [37]:
p = pm.Uniform("freq_cheating", 0, 1)


@pm.deterministic
def p_skewed(p=p):
    return 0.5 * p + 0.25

I could have typed p_skewed = 0.5*p + 0.25 instead for a one-liner, as the elementary operations of addition and scalar multiplication will implicitly create a deterministic variable, but I wanted to make the deterministic boilerplate explicit for clarity's sake.

If we know the probability of respondents saying "Yes", which is p_skewed, and we have $N=100$ students, the number of "Yes" responses is a binomial random variable with parameters N and p_skewed.

This is where we include our observed 35 "Yes" responses. In the declaration of the pm.Binomial, we include value = 35 and observed = True.

In [38]:
yes_responses = pm.Binomial("number_cheaters", 100, p_skewed,
                            value=35, observed=True)
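As a quick sanity check on the data, separate from the PyMC model, we can invert the relation $P(\text{"Yes"}) = \frac{p}{2} + \frac{1}{4}$ by the method of moments: an observed "Yes" frequency of 35/100 suggests the point estimate $\hat{p} = 2(0.35 - 0.25) = 0.2$, which sits in the middle of the credible range found earlier.

```python
# Method-of-moments inversion of P("Yes") = p/2 + 1/4 -- a sanity
# check on the data, separate from the PyMC model.
def estimate_p(yes_count, n):
    freq_yes = yes_count / float(n)
    return 2.0 * (freq_yes - 0.25)

print(estimate_p(35, 100))  # approximately 0.2
```

Unlike the posterior, this single number carries no notion of uncertainty, which is exactly what the Bayesian analysis adds.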

Below we add all the variables of interest to a Model container and run our black-box algorithm over the model.

In [39]:
model = pm.Model([yes_responses, p_skewed, p])

# To Be Explained in Chapter 3!
mcmc = pm.MCMC(model)
mcmc.sample(25000, 2500)
 [-----------------100%-----------------] 25000 of 25000 complete in 1.3 sec
In [40]:
figsize(12.5, 3)
p_trace = mcmc.trace("freq_cheating")[:]
plt.hist(p_trace, histtype="stepfilled", normed=True, alpha=0.85, bins=30,
         label="posterior distribution", color="#348ABD")
plt.vlines([.05, .35], [0, 0], [5, 5], alpha=0.2)
plt.xlim(0, 1)
plt.legend();

More PyMC Tricks

Protip: Lighter deterministic variables with Lambda class

Sometimes writing a deterministic function using the @pm.deterministic decorator can seem like a chore, especially for a small function. I have already mentioned that elementary math operations can produce deterministic variables implicitly, but what about operations like indexing or slicing? Built-in Lambda functions can handle this with the elegance and simplicity required. For example,

beta = pm.Normal("coefficients", 0, 1, size=(N, 1))
x = np.random.randn(N, 1)  # randn takes dimensions, not a tuple
linear_combination = pm.Lambda(lambda x=x, beta=beta: np.dot(x.T, beta))

Protip: Arrays of PyMC variables

There is no reason why we cannot store multiple heterogeneous PyMC variables in a Numpy array. Just remember to set the dtype of the array to object upon initialization. For example:

In [41]:
N = 10
x = np.empty(N, dtype=object)
for i in range(0, N):
    x[i] = pm.Exponential('x_%i' % i, (i + 1) ** 2)
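The same object-dtype trick works with any Python objects; here is a self-contained illustration using plain tuples as stand-ins for heterogeneous PyMC variables:

```python
import numpy as np

N = 5
x = np.empty(N, dtype=object)
for i in range(N):
    # Any Python object can be stored in an object array -- here, tuples
    # standing in for heterogeneous PyMC variables.
    x[i] = ("x_%i" % i, (i + 1) ** 2)

print(x.dtype)
print(x[2])
```

Had we omitted `dtype=object`, NumPy would try to coerce the assigned objects into a homogeneous numeric array, which is not what we want for a collection of distinct PyMC variables.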

The remainder of this chapter examines some practical examples of PyMC and PyMC modeling:

Example: Challenger Space Shuttle Disaster

On January 28, 1986, the twenty-fifth flight of the U.S. space shuttle program ended in disaster when one of the rocket boosters of the Shuttle Challenger exploded shortly after lift-off, killing all seven crew members. The presidential commission on the accident concluded that it was caused by the failure of an O-ring in a field joint on the rocket booster, and that this failure was due to a faulty design that made the O-ring unacceptably sensitive to a number of factors including outside temperature. Of the previous 24 flights, data were available on O-ring failures for 23 (one was lost at sea), and these data were discussed on the evening preceding the Challenger launch, but unfortunately only the data corresponding to the 7 flights on which there was a damage incident were considered important, and these were thought to show no obvious trend. The data are shown below (see [1]):

In [42]:
figsize(12.5, 3.5)
np.set_printoptions(precision=3, suppress=True)
challenger_data = np.genfromtxt("data/challenger_data.csv", skip_header=1,
                                usecols=[1, 2], missing_values="NA",
                                delimiter=",")
# drop the NA values
challenger_data = challenger_data[~np.isnan(challenger_data[:, 1])]

# plot it, as a function of temperature (the first column)
print "Temp (F), O-Ring failure?"
print challenger_data

plt.grid()
plt.scatter(challenger_data[:, 0], challenger_data[:, 1], s=75, color="k",
            alpha=0.5)
plt.yticks([0, 1])
plt.ylabel("Damage Incident?")
plt.xlabel("Outside temperature (Fahrenheit)")
plt.title("Defects of the Space Shuttle O-Rings vs temperature")
Temp (F), O-Ring failure?
[[ 66.   0.]
 [ 70.   1.]
 [ 69.   0.]
 [ 68.   0.]
 [ 67.   0.]
 [ 72.   0.]
 [ 73.   0.]
 [ 70.   0.]
 [ 57.   1.]
 [ 63.   1.]
 [ 70.   1.]
 [ 78.   0.]
 [ 67.   0.]
 [ 53.   1.]
 [ 67.   0.]
 [ 75.   0.]
 [ 70.   0.]
 [ 81.   0.]
 [ 76.   0.]
 [ 79.   0.]
 [ 75.   1.]
 [ 76.   0.]
 [ 58.   1.]]

Out[42]:
<matplotlib.text.Text at 0x1a070780>

It looks clear that the probability of damage incidents occurring increases as the outside temperature decreases. We are interested in modeling the probability here because it does not look like there is a strict cutoff point between temperature and a damage incident occurring. The best we can do is ask "At temperature $t$, what is the probability of a damage incident?". The goal of this example is to answer that question.

We need a function of temperature, call it $p(t)$, that is bounded between 0 and 1 (so as to model a probability) and changes from 1 to 0 as we increase temperature. There are actually many such functions, but the most popular choice is the logistic function.

$$p(t) = \frac{1}{ 1 + e^{ \;\beta t } } $$

In this model, $\beta$ is the variable we are uncertain about. Below is the function plotted for $\beta = 1, 3, -5$.

In [43]:
figsize(12, 3)


def logistic(x, beta):
    return 1.0 / (1.0 + np.exp(beta * x))

x = np.linspace(-4, 4, 100)
plt.plot(x, logistic(x, 1), label=r"$\beta = 1$")
plt.plot(x, logistic(x, 3), label=r"$\beta = 3$")
plt.plot(x, logistic(x, -5), label=r"$\beta = -5$")
plt.legend();

But something is missing. In the plot of the logistic function, the probability changes only near zero, but in our data above the probability changes around 65 to 70. We need to add a bias term to our logistic function:

$$p(t) = \frac{1}{ 1 + e^{ \;\beta t + \alpha } } $$

Some plots are below, with differing $\alpha$.

In [44]:
def logistic(x, beta, alpha=0):
    return 1.0 / (1.0 + np.exp(np.dot(beta, x) + alpha))

x = np.linspace(-4, 4, 100)

plt.plot(x, logistic(x, 1), label=r"$\beta = 1$", ls="--", lw=1)
plt.plot(x, logistic(x, 3), label=r"$\beta = 3$", ls="--", lw=1)
plt.plot(x, logistic(x, -5), label=r"$\beta = -5$", ls="--", lw=1)

plt.plot(x, logistic(x, 1, 1), label=r"$\beta = 1, \alpha = 1$",
         color="#348ABD")
plt.plot(x, logistic(x, 3, -2), label=r"$\beta = 3, \alpha = -2$",
         color="#A60628")
plt.plot(x, logistic(x, -5, 7), label=r"$\beta = -5, \alpha = 7$",
         color="#7A68A6")

plt.legend(loc="lower left");

Adding a constant term $\alpha$ amounts to shifting the curve left or right (hence why it is called a bias).
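Setting $p(t) = 1/2$ gives $\beta t + \alpha = 0$, so the curve crosses one-half at $t = -\alpha/\beta$; the bias $\alpha$ moves that crossing point. A quick standalone check:

```python
import math

def logistic(t, beta, alpha=0.0):
    return 1.0 / (1.0 + math.exp(beta * t + alpha))

# The curve crosses 1/2 exactly where beta * t + alpha = 0,
# i.e. at t = -alpha / beta.
beta, alpha = 3.0, -2.0
midpoint = -alpha / beta
print(logistic(midpoint, beta, alpha))  # approximately 0.5
```

For the Challenger data, then, we need an $\alpha$ that pushes the crossing point into the observed 65–70 degree range.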

Let's start modeling this in PyMC. The $\beta, \alpha$ parameters have no reason to be positive, bounded or relatively large, so they are best modeled by a Normal random variable, introduced next.

Normal distributions

A Normal random variable, denoted $X \sim N(\mu, 1/\tau)$, has a distribution with two parameters: the mean, $\mu$, and the precision, $\tau$. Those familiar with the Normal distribution have probably already seen $\sigma^2$ instead of $\tau^{-1}$. They are in fact reciprocals of each other. The change was motivated by simpler mathematical analysis and is an artifact of older Bayesian methods. Just remember: the smaller $\tau$, the larger the spread of the distribution (i.e. we are more uncertain); the larger $\tau$, the tighter the distribution (i.e. we are more certain). Regardless, $\tau$ is always positive.

The probability density function of a $N( \mu, 1/\tau)$ random variable is:

$$ f(x | \mu, \tau) = \sqrt{\frac{\tau}{2\pi}} \exp\left( -\frac{\tau}{2} (x-\mu)^2 \right) $$
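As a quick standalone check of this formula, we can verify numerically that the density integrates to one (crude trapezoidal integration in plain Python):

```python
import math

def normal_pdf(x, mu, tau):
    # Density of N(mu, 1/tau), parameterized by the precision tau.
    return math.sqrt(tau / (2 * math.pi)) * math.exp(-tau / 2.0 * (x - mu) ** 2)

# Crude trapezoidal integration over a wide interval around mu:
# the total area under the density should be (almost exactly) 1.
mu, tau = 3.0, 2.8
xs = [mu - 10 + 20 * i / 10000.0 for i in range(10001)]
area = sum(0.5 * (normal_pdf(a, mu, tau) + normal_pdf(b, mu, tau)) * (b - a)
           for a, b in zip(xs[:-1], xs[1:]))
print(area)
```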

We plot some different density functions below.

In [45]:
import scipy.stats as stats

nor = stats.norm
x = np.linspace(-8, 7, 150)
mu = (-2, 0, 3)
tau = (.7, 1, 2.8)
colors = ["#348ABD", "#A60628", "#7A68A6"]
parameters = zip(mu, tau, colors)

for _mu, _tau, _color in parameters:
    # scipy's `scale` is the standard deviation: sigma = 1/sqrt(tau)
    plt.plot(x, nor.pdf(x, _mu, scale=1. / np.sqrt(_tau)),
             label="$\mu = %d,\;\\tau = %.1f$" % (_mu, _tau), color=_color)
    plt.fill_between(x, nor.pdf(x, _mu, scale=1. / np.sqrt(_tau)), color=_color,
                     alpha=.33)

plt.legend(loc="upper right")
plt.xlabel("$x$")
plt.ylabel("density function at $x$")
plt.title("Probability distribution of three different Normal random \
variables");

A Normal random variable can take on any real number, but the variable is very likely to be relatively close to $\mu$. In fact, the expected value of a Normal is equal to its $\mu$ parameter:

$$ E[ X | \mu, \tau] = \mu$$

and its variance is equal to the inverse of $\tau$:

$$Var( X | \mu, \tau ) = \frac{1}{\tau}$$
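Equivalently, the standard deviation is $\sigma = 1/\sqrt{\tau}$. A quick Monte Carlo sketch (plain Python with a fixed seed) connecting the precision parameterization back to the familiar $\sigma$:

```python
import math
import random

rng = random.Random(0)
mu, tau = 0.0, 2.8
sigma = 1.0 / math.sqrt(tau)  # standard deviation = 1 / sqrt(precision)

samples = [rng.gauss(mu, sigma) for _ in range(200000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

# The sample mean should sit near mu, and the sample variance near 1/tau.
print(mean, var, 1.0 / tau)
```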

Below we continue our modeling of the Challenger space craft:

In [46]:
import pymc as pm

temperature = challenger_data[:, 0]
print 'temperature',temperature
D = challenger_data[:, 1]  # defect or not?
print 'D',D

# notice the `value` here. We explain why below.
beta = pm.Normal("beta", 0, 0.001, value=0)
print 'beta',beta.value
alpha = pm.Normal("alpha", 0, 0.001, value=0)
print 'alpha',alpha.value

@pm.deterministic
def p(t=temperature, alpha=alpha, beta=beta):
    return 1.0 / (1. + np.exp(beta * t + alpha))
temperature [ 66.  70.  69.  68.  67.  72.  73.  70.  57.  63.  70.  78.  67.  53.  67.
  75.  70.  81.  76.  79.  75.  76.  58.]
D [ 0.  1.  0.  0.  0.  0.  0.  0.  1.  1.  1.  0.  0.  1.  0.  0.  0.  0.
  0.  0.  1.  0.  1.]
beta 0.0
alpha 0.0

We have our probabilities, but how do we connect them to our observed data? A Bernoulli random variable with parameter $p$, denoted $\text{Ber}(p)$, is a random variable that takes value 1 with probability $p$, and 0 else. Thus, our model can look like:

$$ \text{Defect Incident, $D_i$} \sim \text{Ber}( \;p(t_i)\; ), \;\; i=1..N$$

where $p(t)$ is our logistic function and $t_i$ are the temperatures we have observations about. Notice in the above code we had to set the values of beta and alpha to 0. The reason for this is that if beta and alpha are very large, they make p equal to 1 or 0. Unfortunately, pm.Bernoulli does not like probabilities of exactly 0 or 1, though they are mathematically well-defined probabilities. So by setting the coefficient values to 0, we set the variable p to be a reasonable starting value. This has no effect on our results, nor does it mean we are including any additional information in our prior. It is simply a computational caveat in PyMC.

In [47]:
p.value
Out[47]:
array([ 0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,
        0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,
        0.5])
In [48]:
# connect the probabilities in `p` with our observations through a
# Bernoulli random variable.
observed = pm.Bernoulli("bernoulli_obs", p, value=D, observed=True)
print (observed*1).value
model = pm.Model([observed, beta, alpha])

# Mysterious code to be explained in Chapter 3
map_ = pm.MAP(model)
map_.fit()
mcmc = pm.MCMC(model)
mcmc.sample(120000, 100000, 2)
[0 1 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 1 0 1]
 [-----------------100%-----------------] 120000 of 120000 complete in 10.6 sec

We have trained our model on the observed data, now we can sample values from the posterior. Let's look at the posterior distributions for $\alpha$ and $\beta$:

In [49]:
alpha_samples = mcmc.trace('alpha')[:, None]  # best to make them 1d
beta_samples = mcmc.trace('beta')[:, None]

figsize(12.5, 6)

# histogram of the samples:

plt.subplot(211)
plt.title(r"Posterior distributions of the variables $\alpha, \beta$")
plt.hist(beta_samples, histtype='stepfilled', bins=35, alpha=0.85,
         label=r"posterior of $\beta$", color="#7A68A6", normed=True)
plt.legend()
plt.grid()

plt.subplot(212)
plt.hist(alpha_samples, histtype='stepfilled', bins=35, alpha=0.85,
         label=r"posterior of $\alpha$", color="#A60628", normed=True)
plt.legend();
plt.grid()

All samples of $\beta$ are greater than 0. If instead the posterior was centered around 0, we may suspect that $\beta = 0$, implying that temperature has no effect on the probability of defect.

Similarly, all $\alpha$ posterior values are negative and far away from 0, implying that it is correct to believe that $\alpha$ is significantly less than 0.

Regarding the spread of the data, we are very uncertain about what the true parameters might be (though considering the low sample size and the large overlap of defects-to-nondefects this behaviour is perhaps expected).

Next, let's look at the expected probability for a specific value of the temperature. That is, we average over all samples from the posterior to get a likely value for $p(t_i)$.

In [50]:
t = np.linspace(temperature.min() - 5, temperature.max() + 5, 50)[:, None]
print 'beta_samples',len(beta_samples),beta_samples
print 'alpha_samples',len(alpha_samples),alpha_samples
p_t = logistic(t.T, beta_samples, alpha_samples)
print 'p_t',len(p_t),len(p_t[0]),p_t
print 'p_t[0, :]',len(p_t[0, :]),p_t[0, :]

# average across posterior samples for each temperature (mean down each column)
mean_prob_t = p_t.mean(axis=0)
print 'mean_prob_t',mean_prob_t
beta_samples 10000 [[ 0.136]
 [ 0.137]
 [ 0.145]
 ..., 
 [ 0.241]
 [ 0.24 ]
 [ 0.254]]
alpha_samples 10000 [[ -6.71 ]
 [ -9.426]
 [ -9.426]
 ..., 
 [-16.404]
 [-16.404]
 [-16.404]]
p_t 10000 50 [[ 0.54   0.513  0.487 ...,  0.008  0.007  0.007]
 [ 0.945  0.939  0.933 ...,  0.104  0.094  0.085]
 [ 0.923  0.915  0.906 ...,  0.058  0.052  0.047]
 ..., 
 [ 0.992  0.99   0.988 ...,  0.018  0.015  0.013]
 [ 0.992  0.991  0.989 ...,  0.02   0.017  0.014]
 [ 0.985  0.982  0.978 ...,  0.006  0.005  0.004]]
p_t[0, :] 50 [ 0.54   0.513  0.487  0.461  0.434  0.409  0.383  0.359  0.335  0.312
  0.289  0.268  0.248  0.229  0.21   0.193  0.177  0.162  0.149  0.136
  0.124  0.113  0.103  0.093  0.085  0.077  0.07   0.063  0.057  0.052
  0.047  0.042  0.038  0.034  0.031  0.028  0.025  0.023  0.021  0.019
  0.017  0.015  0.014  0.012  0.011  0.01   0.009  0.008  0.007  0.007]
mean_prob_t [ 0.959  0.953  0.947  0.941  0.933  0.924  0.915  0.904  0.891  0.877
  0.861  0.843  0.823  0.801  0.776  0.748  0.718  0.685  0.649  0.61
  0.569  0.526  0.482  0.438  0.395  0.352  0.312  0.275  0.241  0.21
  0.183  0.159  0.138  0.119  0.103  0.09   0.078  0.067  0.059  0.051
  0.044  0.039  0.034  0.029  0.026  0.023  0.02   0.017  0.015  0.013]

In [51]:
figsize(12.5, 4)

plt.plot(t, mean_prob_t, lw=3, label="average posterior \nprobability \
of defect")
plt.plot(t, p_t[0, :], ls="--", label="realization from posterior")
plt.plot(t, p_t[-2, :], ls="--", label="realization from posterior")
plt.scatter(temperature, D, color="k", s=50, alpha=0.5)
plt.title("Posterior expected value of probability of defect; \
plus realizations")
plt.legend(loc="lower left")
plt.ylim(-0.1, 1.1)
plt.xlim(t.min(), t.max())
plt.ylabel("probability")
plt.xlabel("temperature");

Above we also plotted two possible realizations of what the actual underlying system might be. Both are as likely as any other draw. The blue line is what occurs when we average all 10,000 possible dotted lines together.

An interesting question to ask is for what temperatures are we most uncertain about the defect-probability? Below we plot the expected value line and the associated 95% intervals for each temperature.

In [52]:
from scipy.stats.mstats import mquantiles

# vectorized bottom and top 2.5% quantiles for "confidence interval"
qs = mquantiles(p_t, [0.025, 0.975], axis=0)
plt.fill_between(t[:, 0], *qs, alpha=0.7,
                 color="#7A68A6")

plt.plot(t[:, 0], qs[0], label="95% CI", color="#7A68A6", alpha=0.7)

plt.plot(t, mean_prob_t, lw=1, ls="--", color="k",
         label="average posterior \nprobability of defect")

plt.xlim(t.min(), t.max())
plt.ylim(-0.02, 1.02)
plt.legend(loc="lower left")
plt.scatter(temperature, D, color="k", s=50, alpha=0.5)
plt.xlabel("temp, $t$")

plt.ylabel("probability estimate")
plt.title("Posterior probability estimates given temp. $t$");
plt.grid()

The 95% credible interval, or 95% CI, painted in purple, represents the interval, for each temperature, that contains 95% of the distribution. For example, at 65 degrees, we can be 95% sure that the probability of defect lies between 0.25 and 0.75.

More generally, we can see that as the temperature nears 60 degrees, the CI's spread out over [0,1] quickly. As we pass 70 degrees, the CI's tighten again. This can give us insight about how to proceed next: we should probably test more O-rings around 60-65 temperature to get a better estimate of probabilities in that range. Similarly, when reporting to scientists your estimates, you should be very cautious about simply telling them the expected probability, as we can see this does not reflect how wide the posterior distribution is.

What about the day of the Challenger disaster?

On the day of the Challenger disaster, the outside temperature was 31 degrees Fahrenheit. What is the posterior distribution of a defect occurring, given this temperature? The distribution is plotted below. It looks almost guaranteed that the Challenger was going to be subject to defective O-rings.

In [53]:
figsize(12.5, 2.5)

prob_31 = logistic(31, beta_samples, alpha_samples)
print 'prob_31',len(prob_31),prob_31[prob_31<0.999]
plt.xlim(0.995, 1)
plt.hist(prob_31, bins=1000, normed=True, histtype='stepfilled')
plt.title("Posterior distribution of probability of defect, given $t = 31$")
plt.xlabel("probability of defect occurring in O-ring");
plt.grid()
prob_31 10000 [ 0.923  0.994  0.993 ...,  0.999  0.999  0.999]

Is our model appropriate?

The skeptical reader will say "You deliberately chose the logistic function for $p(t)$ and the specific priors. Perhaps other functions or priors will give different results. How do I know I have chosen a good model?" This is absolutely true. To consider an extreme situation, what if I had chosen the function $p(t) = 1,\; \forall t$, which guarantees a defect always occurs: I would have again predicted disaster on January 28th. Yet this is clearly a poorly chosen model. On the other hand, if I did choose the logistic function for $p(t)$, but specified all my priors to be very tight around 0, we would likely have very different posterior distributions. How do we know our model is an expression of the data? This encourages us to measure the model's goodness of fit.

We can think: how can we test whether our model is a bad fit? An idea is to compare the observed data (which, recall, is a fixed stochastic variable) with artificial datasets that we can simulate. The rationale is that if the simulated datasets do not appear similar, statistically, to the observed dataset, then likely our model does not accurately represent the observed data.

Previously in this chapter, we simulated artificial datasets for the SMS example. To do this, we sampled values from the priors. We saw how varied the resulting datasets looked, and rarely did they mimic our observed dataset. In the current example, we should sample from the posterior distributions to create very plausible datasets. Luckily, our Bayesian framework makes this very easy. We only need to create a new Stochastic variable that is exactly the same as the variable that stored the observations, but minus the observations themselves. If you recall, the Stochastic variable that stored our observed data was:

observed = pm.Bernoulli( "bernoulli_obs", p, value=D, observed=True)

Hence we create:

simulated_data = pm.Bernoulli("simulation_data", p)

Let's simulate 10 000 datasets:

In [54]:
print 'p',p.value
simulated = pm.Bernoulli("bernoulli_sim", p)
N = 10000

mcmc = pm.MCMC([simulated, alpha, beta, observed])
mcmc.sample(N)
p [ 0.407  0.199  0.242  0.292  0.347  0.13   0.104  0.199  0.871  0.595
  0.199  0.031  0.347  0.949  0.347  0.065  0.199  0.015  0.051  0.025
  0.065  0.051  0.84 ]
 [-----------------100%-----------------] 10000 of 10000 complete in 1.6 sec
In [55]:
figsize(12.5, 5)

simulations = mcmc.trace("bernoulli_sim")[:]
print simulations.shape

plt.title("Simulated dataset using posterior parameters")
figsize(12.5, 6)
for i in range(4):
    ax = plt.subplot(4, 1, i + 1)
    plt.scatter(temperature, simulations[1000 * i, :], color="k",
                s=50, alpha=0.6)
    plt.grid()
(10000L, 23L)

Note that the above plots are different (if you can think of a cleaner way to present this, please send a pull request and answer here!).
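For readers who want the simulation step without PyMC, the posterior-predictive draw reduces to one Bernoulli flip per data point per posterior sample. The `p_samples` below are uniform placeholders for the real trace of `p`, so this sketch only shows the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

# placeholder for 10000 posterior samples of p at the 23 temperatures
p_samples = rng.uniform(0.0, 1.0, size=(10000, 23))

# one Bernoulli draw per (sample, data point): 10000 artificial datasets
simulated = rng.binomial(1, p_samples)  # shape (10000, 23), entries 0 or 1
```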

We wish to assess how good our model is. "Good" is a subjective term of course, so results must be relative to other models.

We will be doing this graphically as well, which may seem like an even less objective method. The alternative is to use Bayesian p-values. These are still subjective, as the proper cutoff between good and bad is arbitrary. Gelman emphasises that graphical tests are more illuminating than p-value tests [7]. We agree.
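For concreteness, one common Bayesian p-value takes a test statistic, here the total number of defects, and reports the fraction of posterior-predictive datasets whose statistic is at least as large as the observed one. The sketch below uses synthetic simulations with a fixed defect rate as a stand-in for the real trace of simulated datasets; `D_obs` is the observed defect vector from this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# observed defects from the Challenger data (7 defects out of 23 launches)
D_obs = np.array([0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
                  0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1])

# synthetic stand-in for the posterior-predictive datasets
simulations = rng.binomial(1, 0.3, size=(10000, 23))

# test statistic: total defects in each dataset
T_sim = simulations.sum(axis=1)
T_obs = D_obs.sum()

# Bayesian p-value: fraction of simulations at least as extreme as observed
p_value = np.mean(T_sim >= T_obs)
```

A `p_value` very close to 0 or 1 would flag a mismatch between the model's predictions and the data; choosing the cutoff is exactly the subjectivity mentioned above.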

The following graphical test is a novel data-viz approach to logistic regression. The plots are called separation plots [8]. For a suite of models we wish to compare, each model is plotted on an individual separation plot. I leave most of the technical details about separation plots to the very accessible original paper, but I'll summarize their use here.

For each model, we calculate the proportion of times the posterior simulation proposed a value of 1 for a particular temperature, i.e. compute $P( \;\text{Defect} = 1 | t, \alpha, \beta )$ by averaging. This gives us the posterior probability of a defect at each data point in our dataset. For example, for the model we used above:

In [56]:
posterior_probability = simulations.mean(axis=0)
print "posterior prob of defect | realized defect "
for i in range(len(D)):
    print "%.2f                     |   %d" % (posterior_probability[i], D[i])
posterior prob of defect | realized defect 
0.50                     |   0
0.19                     |   1
0.25                     |   0
0.33                     |   0
0.41                     |   0
0.10                     |   0
0.07                     |   0
0.19                     |   0
0.96                     |   1
0.74                     |   1
0.20                     |   1
0.02                     |   0
0.41                     |   0
0.98                     |   1
0.42                     |   0
0.04                     |   0
0.20                     |   0
0.00                     |   0
0.03                     |   0
0.01                     |   0
0.04                     |   1
0.03                     |   0
0.94                     |   1

Next we sort each column by the posterior probabilities:

In [57]:
ix = np.argsort(posterior_probability)
print "probb | defect "
for i in range(len(D)):
    print "%.2f  |   %d" % (posterior_probability[ix[i]], D[ix[i]])
probb | defect 
0.00  |   0
0.01  |   0
0.02  |   0
0.03  |   0
0.03  |   0
0.04  |   1
0.04  |   0
0.07  |   0
0.10  |   0
0.19  |   1
0.19  |   0
0.20  |   0
0.20  |   1
0.25  |   0
0.33  |   0
0.41  |   0
0.41  |   0
0.42  |   0
0.50  |   0
0.74  |   1
0.94  |   1
0.96  |   1
0.98  |   1

We can present the above data better in a figure: I've wrapped this up into a separation_plot function.

In [58]:
from separation_plot import separation_plot


figsize(11., 1.5)
separation_plot(posterior_probability, D)

The snaking-line is the sorted probabilities, blue bars denote defects, and empty space (or grey bars for the optimistic readers) denote non-defects. As the probability rises, we see more and more defects occur. On the right hand side, the plot suggests that as the posterior probability is large (line close to 1), then more defects are realized. This is good behaviour. Ideally, all the blue bars should be close to the right-hand side, and deviations from this reflect missed predictions.

The black vertical line is the expected number of defects we should observe, given this model. This allows the user to see how the total number of events predicted by the model compares to the actual number of events in the data.
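Concretely, the expected count is the sum of the per-point posterior probabilities, $\sum_i P(\text{Defect}_i = 1)$. Summing the rounded probabilities printed in the table above lands close to the 7 defects actually realized:

```python
import numpy as np

# rounded posterior probabilities from the table above (23 data points)
posterior_probability = np.array([
    0.50, 0.19, 0.25, 0.33, 0.41, 0.10, 0.07, 0.19,
    0.96, 0.74, 0.20, 0.02, 0.41, 0.98, 0.42, 0.04,
    0.20, 0.00, 0.03, 0.01, 0.04, 0.03, 0.94])

# expected number of defects under the model; the separation plot's
# vertical line is placed this many events from the right-hand side
expected_defects = posterior_probability.sum()
```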

It is much more informative to compare this to separation plots for other models. Below we compare our model (top) versus three others:

  1. the perfect model, which predicts the posterior probability to be equal to 1 if a defect did occur.
  2. a completely random model, which predicts random probabilities regardless of temperature.
  3. a constant model: where $P(D = 1 \; | \; t) = c, \;\; \forall t$. The best choice for $c$ is the observed frequency of defects, in this case 7/23.
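"Best" needs a scoring rule; one natural choice (an assumption on my part, not something specified above) is the Brier score, $\frac{1}{N}\sum_i (c - D_i)^2$, which a constant prediction minimizes exactly at $c = \bar{D} = 7/23$. A quick numerical check:

```python
import numpy as np

# observed defects: 7 out of 23
D = np.array([0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
              0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1])

def brier(c):
    # mean squared error of the constant prediction c
    return np.mean((c - D) ** 2)

# scan candidate constants; the minimizer sits at the observed frequency
cs = np.linspace(0.0, 1.0, 1001)
best_c = cs[np.argmin([brier(c) for c in cs])]
```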
In [59]:
figsize(11., 1.25)

# Our temperature-dependent model
separation_plot(posterior_probability, D)
plt.title("Temperature-dependent model")

# Perfect model
# i.e. the probability of defect is equal to if a defect occurred or not.
p = D
separation_plot(p, D)
plt.title("Perfect model")

# random predictions
p = np.random.rand(23)
separation_plot(p, D)
plt.title("Random model")

# constant model
constant_prob = 7. / 23 * np.ones(23)
separation_plot(constant_prob, D)
plt.title("Constant-prediction model")
Out[59]:
<matplotlib.text.Text at 0x1ee750f0>

In the random model, we can see that as the probability increases there is no clustering of defects to the right-hand side. Similarly for the constant model.

In the perfect model, the probability line is not well shown, as it is stuck to the bottom and top of the figure. Of course the perfect model is only for demonstration, and we cannot draw any scientific inference from it.

Exercises

1. Try putting in extreme values for our observations in the cheating example. What happens if we observe 25 affirmative responses? 10? 50?

2. Try plotting $\alpha$ samples versus $\beta$ samples. Why might the resulting plot look like this?

In [60]:
# type your code here.
figsize(12.5, 4)

plt.scatter(alpha_samples, beta_samples, alpha=0.1)
plt.title("Why does the plot look like this?")
plt.xlabel(r"$\alpha$")
plt.ylabel(r"$\beta$")
Out[60]:
<matplotlib.text.Text at 0x19299828>

References

  • [1] Dalal, Fowlkes and Hoadley (1989), JASA, 84, 945-957.
  • [2] German Rodriguez. Datasets. In WWS509. Retrieved 30/01/2013, from http://data.princeton.edu/wws509/datasets/#smoking.
  • [3] McLeish, Don, and Cyntha Struthers. STATISTICS 450/850 Estimation and Hypothesis Testing. Winter 2012. Waterloo, Ontario: 2012. Print.
  • [4] Fonnesbeck, Christopher. "Building Models." PyMC-Devs. N.p., n.d. Web. 26 Feb 2013. http://pymc-devs.github.com/pymc/modelbuilding.html.
  • [5] Cronin, Beau. "Why Probabilistic Programming Matters." 24 Mar 2013. Google, Online Posting to Google . Web. 24 Mar. 2013. https://plus.google.com/u/0/107971134877020469960/posts/KpeRdJKR6Z1.
  • [6] S.P. Brooks, E.A. Catchpole, and B.J.T. Morgan. Bayesian animal survival estimation. Statistical Science, 15: 357–376, 2000
  • [7] Gelman, Andrew. "Philosophy and the practice of Bayesian statistics." British Journal of Mathematical and Statistical Psychology. (2012): n. page. Web. 2 Apr. 2013.
  • [8] Greenhill, Brian, Michael D. Ward, and Audrey Sacks. "The Separation Plot: A New Visual Method for Evaluating the Fit of Binary Models." American Journal of Political Science. 55.No.4 (2011): n. page. Web. 2 Apr. 2013.
In [61]:
from IPython.core.display import HTML


def css_styling():
    styles = open("../styles/custom.css", "r").read()
    return HTML(styles)
css_styling()
Out[61]: