Archive for the 'probability' category

The Birthday Paradox

Nov 18 2013 Published by under probability

To me, the thing that makes probability fun is that the results are frequently surprising. We've got very strong instincts about how we expect numbers to work. But when you do anything that involves a lot of computations with big numbers, our intuition goes out the window - nothing works the way we expect it to. A great example of that is something called the birthday paradox.

Suppose you've got a classroom full of people. What's the probability that there are two people with the same birthday? Intuitively, most people expect that it's pretty unlikely. It seems like it shouldn't be likely - 365 possible birthdays, and 20 or 30 people in a classroom? Very little chance, right?

Let's look at it, and figure out how to compute the probability.

Interesting probability problems are all about finding out how to put things together. You're looking at things where there are huge numbers of possible outcomes, and you want to determine the odds of a specific class of outcomes. Finding the solutions is all about figuring out how to structure the problem.

A great example of this is something called the birthday paradox. This is a problem with a somewhat surprising outcome. It's also a problem where finding the right way to structure the problem is has a dramatic result.

Here's the problem: you've got a group of 30 people. What's the probability that two people out of that group of thirty have the same birthday?

We'll look at it with some simplifying assumptions. We'll ignore leap year - so we've got 365 possible birthdays. We'll assume that all birthdays are equally likely - no variation for weekdays/weekends, no variation for seasons, and no holidays, etc. Just 365 equally probable days.

How big is the space? That is, how many different ways are there to assign birthdays to 30 people? It's 36530 or something in the vicinity of 7.4*1076.

To start off, we'll reverse the problem. It's easier to structure the problem if we try to ask "What's the probability that no two people share a birthday". If P(B) is the probability that no two people share a birthday, then 1-P(B) is the probability that at least two people share a birthday.

So let's look at a couple of easy cases. Suppose we've got two people? What's the odds that they've got the same birthday? 1 in 365: there are 3652 possible pairs of birthdays; there are 365 possible pairs. So there's a probability of 365/3652 that the two people have the same birthday. For just two people, it's pretty easy. In the reverse form, there's a 364/365 chance that the two people have different birthdays.

What about 3 people? It's the probability of the first two having different birthdays, and the probability of the third person having a different birthday that either of those first two. There are 365 possible birthdays for the third person, and 363 possible days that don't overlay with the first two. So for N people, the probability of having distinct birthdays is \(1 times (1 - 1/365) times (1 - 2/365) times dots (1 - (n/365))\).

At this point, we've got a nice recursive definition. Let's say that \(f(N)\) is the probability of \(N\) people having distinct birthdays. Then:

  1. For 2 people, the probability of distinct birthdays is 364/365. (\(f(2) = frac{364}{365}\))
  2. For N>2 people, the probability of distinct birthdays is
    \(frac{365-(N-1)}{365} times f(n-1)\).

Convert that to a closed form, and you get: \(f(n) = frac{365!}{(365-(n-1))!365^n}\). For 30 people, that's
\(frac{365!}{(365-29)!*365^{30}}\). Work it out, and that's
0.29 - so the probability of everyone having distinct
birthdays is 29% - which means that the probability of at least
two people in a group of 30 having the same birthday is 71%!

You can see why our intuitions are so bad? We're talking about something where one factor in the computation is the factorial of 365!

Let's look a bit further: how many people do you need to have, before there's a 50% chance of 2 people sharing a birthday? Use the formulae we wrote up above, and it turns out to be 23. Here's the numbers - remember that this is the reverse probability, the probability of all birthdays being distinct.

1 1
2 0.997260273973
3 0.991795834115
4 0.983644087533
5 0.9728644263
6 0.959537516351
7 0.943764296904
8 0.925664707648
9 0.905376166111
10 0.883051822289
11 0.858858621678
12 0.832975211162
13 0.805589724768
14 0.776897487995
15 0.747098680236
16 0.716395994747
17 0.684992334703
18 0.653088582128
19 0.620881473968
20 0.588561616419
21 0.556311664835
22 0.524304692337
23 0.492702765676
24 0.461655742085
25 0.431300296031
26 0.401759179864
27 0.373140717737
28 0.345538527658
29 0.319031462522
30 0.293683757281
31 0.269545366271
32 0.24665247215
33 0.225028145824
34 0.20468313538
35 0.185616761125
36 0.16781789362
37 0.151265991784
38 0.135932178918
39 0.121780335633
40 0.108768190182
41 0.0968483885183
42 0.0859695284381
43 0.0760771443439
44 0.0671146314486
45 0.0590241005342
46 0.0517471566327
47 0.0452255971667
48 0.0394020271206
49 0.0342203906773
50 0.029626420422

With just 23 people, there's a greater than 50% chance that two people will have the same birthday. By the time you get to just 50 people, there's a greater than 97% chance that two people have the same birthday!

As an amusing aside, the first time I saw this problem worked through was in an undergraduate discrete probability theory class, with 37 people in the class, and no duplicate birthdays!

Now - remember at the beginning, I said that the trick to working probability problems is all about how you formulate the problem. There's a much, much better way to formulate this.

Think of the assignment of birthdays as a function from people to birthdays: \(f: P rightarrow B\). The number of ways of assigning birthdays to people is the size of the set of functions from people to birthdays. How many possible functions are there? \(| B | ^{| P |}\). \(| B |\) is the number of days in the year - 365, and \(| P |\) is the number of people in the group.

The set of assignments to unique birthdays is the number of injective functions. (An injective function is a function where \(f(x) = f(y) Leftrightarrow x = y\).) How many injective functions are there? \(frac{| B |!}{(| B | - | P |)!}\).

The probability of all birthdays being unique is the size of the set of injective functions divided by the size of the set of all assignments: \(frac{frac{| B |!}{(| B | - | P |)!}}{ | B | ^{| P |} } = frac{365!}{365^Ptimes (365 - P)!}\).

So we've got the exact same result - but it's a whole lot easier in term of the discrete functions!

21 responses so far

Combining Non-Disjoint Probabilities

Sep 29 2013 Published by under probability

In my previous post on probability, I talked about how you need to be careful about covering cases. To understand what I mean by that, it's good to see some examples.

And we can do that while also introducing an important concept which I haven't discussed yet. I've frequently talked about independence, but equally important is the idea of disjointness.

Two events are independent when they have no ability to influence one another. So two coin flips are independent. Two events are disjoint when they can't possibly occur together. Flipping a coin, the event "rolled a head" and the event "rolled a tail" are disjoint: if you rolled a head, you can't roll a tail, and vice versa.

So let's think about something abstract for a moment. Let's suppose that we've got two events, A and B. We know that the probability of A is 1/3 and the probability of B is also 1/3. What's the probability of A or B?

Naively, we could say that it's P(A) + P(B). But that's not necessarily true. It depends on whether or not the two events are disjoint.

Suppose that it turns out that the probability space we're working in is rolling a six sided die. There are three basic scenarios that we could have:

  1. Scenario 1: A is the event "rolled 1 or 2", and B is "rolled 3 or 4". That is, A and B are disjoint.
  2. Scenario 2: A is the event "rolled 1 or 2", and B is "rolled 2 or 3". A and B are different, but they overlap.
  3. Scenario 3: A is the event "rolled 1 or 2", and B is the event "rolled 1 or 2". A and B are really just different names for the same event.

In scenario one, we've got disjoint events. So P(A or B) is P(A) + P(B). One way of checking that that makes sense is to look at how the probability of events work out. P(A) is 1/3. P(B) is 1/3. The probability of neither A nor B - that is, the probability of rolling either 5 or 6 - is 1/3. The sum is 1, as it should be.

But suppose that we looked at scenario 2. If we made a mistake and added them as if they were disjoint, how would things add up? P(A) is 1/3. P(B) is 1/3. P(neither A nor B) = P(4 or 5 or 6) = 1/2. The total of these three probabilities is 1/3 + 1/3 + 1/2 = 7/6. So just from that addition, we can see that there's a problem, and we did something wrong.

If we know that A and B overlap, then we need to do something a bit more complicated to combine probabilities. The general equation is:

\[ P(A cup B) = P(A) + P(B) - P(A cap B)\]

Using that equation, we'd get the right result. P(A) = 1/3; P(B) =
1/3; P(A and B) = 1/6. So the probability of A or B is 1/3 + 1/3 - 1/6 = 1/2. And P(neither A nor B) = P(4 or 5 or 6) = 1/2. The total is 1, as it should be.

From here, we'll finally start moving in to some more interesting stuff. Next post, I'll look at how to use our probability axioms to analyze the probability of winning a game of craps. That will take us through a bunch of applications of the basic rules, as well as an interesting example of working through a limit case.

And then it's on to combinatorics, which is the main tool that we'll use for figuring out how many cases there are, and what they are, which as we've seen is an essential skill for probability.

2 responses so far

Correction, Sigma Algebras, and Mass functions

Sep 02 2013 Published by under probability

So, I messed up a bit in the previous post. Let me get that out of the way before we move forward!

In measure theory, you aren't just working with sets. You're working with something called σ-algebras. It's a very important distinction.

The problem is, our intuition of sets doesn't always work. Sets, as defined formally, are really pretty subtle. We expect certain things to be true, because they make sense. But in fact, they are not implied by the definition of sets. A σ-algebra is, essentially, a well-behaved set - a set whose behavior matches our usual expectations.

To be formal, a sigma algebra over a set S is a collection Σ of subsets of S such that:

  1. Σ is closed over set complement.
  2. Σ is closed over countable union.

The reason why you need to make this restriction is, ultimately, because of the axiom of choice. Using the axiom of choice, you can create sets which are unmeasurable. They're clearly subsets of a measurable set, and supersets of other measurable sets - and yet, they are, themselves, not measurable. This leads to things like the Banach-Tarski paradox: you can take a measurable set, divide it into non-measurable subsets, and then combine those non-measurable subsets back into measurable sets whose size seem to make no sense. You can take a sphere the size of a baseball, slice it into pieces, and then re-assemble those pieces into a sphere the size of the earth, without stretching them!

These non-measurable sets blow away our expectations about how things should behave. The restriction to σ algebras is just a way of saying that we need to be working in a space where all sets are measurable. When we're looking at measure theory (or probability theory, where we're building on measures), we need to exclude non-measurable sets. If we don't, we're seriously up a creek without a paddle. If we allowed non-measurable sets, then the probability theory we're building would be inconsistent, and that's the kiss of death in mathematics.

Ok. So, with that out of the way, how do we actually use Kolmogorov's axioms? It all comes down to the idea of a sample space. You need to start with an experiment that you're going to observe. For that experiment, there are a set of possible outcomes. The set of all possible outcomes is the sample space.

Here's where, sadly, even axiomatized probability theory gets a bit handwavy. Given the sample space, you can define the structure of the sample space with a function, called the probability mass function, f, which maps each possible event in the sample space to a probability. To be a valid mass function for a sample space S, it's got to have the following properties:

  1. For each event e in S, f(e) ≥ 0 and f(e) <= 1..
  2. The sum of the probabilities in the sample space must be 1: \(Sigma_{e in S} f(e) = 1\)

So we wind up with a sort of circularity: in order to describe the probability of events, we need to start by knowing the probability of events. In fact, this isn't really a problem: we're talking about taking something than we observe in the real world, and mapping it into the abstract space of math. Whenever we do that, we need to take our observations of the real world and create an approximation as a mathematical model.

The point of probability theory isn't to do that primitive mapping. In general, we already understand how rolling a single die works. We know how it should behave, and we know how and why its actual behavior can vary from our expectation. What we want to know is really how many events combine.

We don't need any special theory to figure out what the probability of rolling a 3 on a six-sided die is: that's easy, and it's obvious: it's 1 in 6. But what's the probability of winning a game of craps?

If all days of the year 2001 are equally likely, then we don't need anything fancy to ask what the probability of someone born in 2001's birthday being July 21st. It's easy: 1 in 365. But if I've got a group of 35 people, what's the probability of two of them sharing the same birthday?

Both of those questions start with the assignment of a probability mass function, which is trivial. But they involve combining the probabilities given by those mass functions, and use them with Kolmogorov's axioms to figure out the probabilities of the complicated events.

2 responses so far

Kolmogorov's Axioms of Probability

Aug 24 2013 Published by under probability

The way that I've talked about probability so far is mostly informal. That's the way that probability theory was treated for a long time. You defined probability spaces over collections of equal probability sets. You combined probability spaces by combining their events into other kinds of equally probable events.

The problem with that should be obvious: it's circular. You want to define the probability of events; to do that, you need to start with equally probable events, which means that on some level, you already know the probabilities. If you don't know the probabilities, you can't talk about them. The reality is somewhat worse than that, because this way of looking at things completely falls apart when you start trying to think about infinite probability spaces!

So what can you do?

The answer is to reformulate probability. Mathematicians knew about this kind of problem for a very long time, but what they mostly just ignored it: probability wasn't considered a terribly interesting field.

Then, along came Kolmogorov - the same brilliant guy who's theory of computational complexity is so fascinating to me! Kolmogorov created a new formulation of probability theory. Instead of starting with a space of equally probable discrete events, you start with a measure space.

Before we can look at how Kolmogorov reformulated probability (the Kolmogorov axioms), we need to look at just what a measure space is.

A measure space is just a set with a measure function. So let X be a set. A measure μ on X is a function from a subset of X to a real number: \(mu: 2^X rightarrow R\) with the following properties:

  • Measures are non-negative: \(forall x subseteq X: mu(x) ge 0\)
  • The measure of the empty set is always 0: \(mu(emptyset) = 0\)
  • The measure of a finite sequence of unions is the sum of the individual measures\(mu(x + y) = mu(x) + mu(y)\)

So the idea is pretty simple: a measure space is just a way of defining the size of a subset in a consistent way.

To work with probability, you need a measure space where the measure of the entire set is 1. With that idea in mind, we can put together a proper, formal definition of a probability space that will really allow us to work with, and to combine probabilities in a rigorous way.

Like our original version, a probability space has a set of events, called its event space. We'll use F to represent the set of all possible events, and e to represent an event in that set.

There are three fundamental axioms of probability, which are going to look really similar to the three axioms of a measure space:

  1. Basic measure: the probability of any event is a positive real number: \(forall e in E: P(e) >= 0\).
  2. Unit measure: the probability that some event will occur is 1, which we write as \(P(Omega)=1\) (\(Omega\) is called the unit event, and is the union of all possible events.) Alternatively, the probability of no event occurring is 0: \(P(emptyset)=0\).
  3. Combination: For any two distinct events or sets of events \(e\) and \(f\), the probability of \(e\) or \(f\) is \(P(e) + P(f)\): \(forall e, f subseteq P: e cap f = emptyset Rightarrow P(e cup f) = P(e) + P(f)\). This can be extended to any countable sequence of unions.

This is very similar to the informal version we used earlier. But as we'll see later, this simple formulation from measure theory will give us a lot of additional power.

It's worth taking a moment to point out two implications of these axioms. (In fact, I've seen some presentations that treat some of these as additional axioms, but they're provable from the first three.

  • Monotonicity: if \(e subeq f\), then \(P(e) le P(f)\).
  • Upper Bound: for any event or set of events \(e\), \(P(e) ge 0 land P(e) le 1\).

The brilliance of Kolmogorov was realizing that these rules were everything you need to work out any probability you want - in both finite and infinite spaces. We'll see that there's a lot of complexity in the combinatorics of probability, but it will all always ultimately come back to these three rules.

6 responses so far

Independence and Combining probabilities.

Jun 24 2013 Published by under probability

As I alluded to in my previous post, simple probability is really a matter of observation, to produce a model. If you roll a six sided die over and over again, you'll see that each face comes up pretty much equally often, and so you model it as 1/6 probability for each face. There's not really a huge amount of principle behind the basic models: it's really just whatever works. This is the root of the distinction between interpretations: a frequentist starts with an experiment, and builds a descriptive model based on it, and says that the underlying phenomena being tested has the model as g, a property; a Bayesian does almost the same thing, but says that the model describes the state of their knowledge.

Where probability starts to become interesting in when you combine things. I know the probability of outcomes for rolling one die: how can I use that to see what happens when I roll five dice together? I know the probability of drawing a specific card from a deck: what are the odds of being dealt a particular poker hand?

We'll start with the easiest part: combining independent probabilities. The probability of two events are independent when there's no way for the outcome of one to influence the outcome of the other. For example, if you're flipping a coin several times, the result of one coin flip has no effect on the result of a subsequent flip. On the other hand, dealing 10 cards from a deck is a sequence of dependent events: once you've dealt one card, the next deal must come from the remaining cards: you can't deal the same card twice.

If you know the probability space of your trials, then recognizing an independent situation is easy: if the outcome of one trial doesn't alter the probability space of other trials, then they're independent.

Look back at the coin flip example: we know what the probability space of a coin flip looks like: it's got two, equally probable outcomes. If you've flipped a coin once, and you're going to flip another coin, the result of the first flip can't do anything that alters the probability space of a subsequent flip.

But if you think about dealing cards, that's not true. With a standard deck of cards, the initial probability space has 52 outcomes, each of which is equally likely. So the odds of being dealt the 5 of spades is exactly 1/52.

Now, suppose that you got lucky, and you did get dealt the 5 of spades on the first card. What's the probability of being dealt the 5 of spades's on the second? If they were independent events, it would still be 1/52. But once you've dealt one card, you can't deal it again. The probability of being dealt the 5 of spades as the second card is 0: it's impossible. The probability space only has 51 possible outcomes, and the 5 of spades is not one of them. The space has changed. That's the definition of a dependent event.

When you're faced with dependent probabilities, you need to figure out how the probability space will be changed, and incorporate that into your computation. Once you've incorporated the change in the probability space of the second test, then you've got a new independent probability, and you can combine them. Figuring out how to alter the probability space can be extremely difficult, but that's what makes it interesting.

When you're dealing with independent events, it's easy to combine them. There are two basic ways of combining event probabilities,
and they should be familiar from logic: event1 AND event2, and event1 OR event2.

Suppose you're looking at two test with independent outcomes. I know that the probability of event e is P(e), and the probability of event f is P(f) Then the outcome of e & f - that is, of having e as the outcome of the first trial, and f as the outcome of the second, is P(e)×P(f). The odds of rolling HTTH on a coin is (1/2)*(1/2)*(1/2)*(1/2)=(1/16).

If you're looking at independent alternatives - that is, the probability of e OR F, you combine the probabilities of the event with addition: P(e) + P(f). So, the odds of drawing any heart from a deck: for each card, it's 1/52. There are thirteen different hearts. So the odds of drawing a red are 1/52 + 1/52 + ... = 13/52 = 1/4.

That still doesn't get us to the really interesting stuff. We still can't quite work out something like the odds of being dealt a flush. To get there, we need to learn some combinatorics, which will allow us to formulate the probability spaces that we need for an interesting probability.

7 responses so far

Probability Spaces

Jun 19 2013 Published by under probability

Sorry for the slowness of the blog lately. I finally got myself back onto a semi-regular schedule when I posted about the Adria Richards affair, and that really blew up. The amount of vicious, hateful bile that showed up, both in comments (which I moderated) and in my email was truly astonishing. I've written things which pissed people off before, and I've gotten at least my fair share of hatemail. But nothing I've written before came close to preparing me for the kind of unbounded hatred that came in response to that post.

I really needed some time away from the blog after that.

Anyway, I'm back, and it's time to get on with some discrete probability theory!

I've already written a bit about interpretations of probability. But I haven't said anything about what probability means formally. When I say that the probability of rolling a 3 with a pair of fair six-sided dice is 1/18, how do I know that? Where did that 1/6th figure come from?

The answer lies in something called a probability space. I'm going to explain the probability space in frequentist terms, because I think that that's easiest, but there is (of course) an equivalent Bayesian description.) Suppose I'm looking at a particular experiment. In classic mathematical form, a probability space consists of three components (Ω, E, P), where:

  1. Ω, called the sample space, is a set containing all possible outcomes of the experiment. For a pair of dice, Ω would be the set of all possible rolls: {(1,1), (1,2), (1,3), (1,4), (1,5), (1, 6), (2,1), ..., (6, 5), (6,6)}.
  2. E is an equivalence relation over Ω, which partitions Ω into a set of events. Each event is a set of outcomes that are equivalent. For rolling a pair of dice, an event is a total - each event is the set of outcomes that have the same total. For the event "3" (meaning a roll that totalled three), the set would be {(1, 2), (2, 1)}.
  3. P is a probability assignment. For each event e in E, P(e) is a value between 0 and 1, where:

    \[ Sigma_{ein E} P(e) = 1\]

    (That is, the sum of the probabilities of all of the possible events in the space is exactly 1.)

The probability of an event e being the outcome of a trial is P(e).

So the probability of any particular event as the result of a trial is a number between 0 and 1. What's it mean? If the probability of event e is p, then if we repeat the trial N times, we expect N*p of those trials to have e as their result. If the probability of e is 1/4, and we repeat the trial 100 times, we'd expect e to be the result 25 times.

But in an important sense, that's a cop-out. We've defined probability in terms of this abstract model, where the third component is the probability. Isn't that circular?

Not really. For a given trial, we create the probability assignment by observation and/or analysis. The important point is that this is really just a bare minimum starting point. What we really care about in probability isn't the change associated with a single, simple, atomic event. What we want to do is take the probability associated with a group of single events, and use our understanding of that to allow us to explore a complex event.

If I give you a well-shuffled deck of cards, it's easy to show that the odds of drawing the 3 of diamonds is 1/52. What we want to do with probability is things like ask: What are the odds of being dealt a flush in a poker hand?

The construction of a probability space gives us a well-defined platform to use for building probabilistic models of more interesting things. Give a probability space of two single dice, we can combine them together to create the probability space of the two dice rolled together. Given the probability space of a pair of dice, we can construct the probability space of a game of craps. And so on.

13 responses so far

Probability and Interpretations

May 12 2013 Published by under probability

I'm going to do some writing about discrete probability theory. Probability is an extremely important area of math. We encounter aspects of it every day. It's also a very poorly understood area - it's one that we see abused or just fouled up every day.

I'm going to focus on discrete probability theory. What that means is that we're going to look at things where the space containing the things that we're going to look at contains a countable number of elements. The probability of getting a certain sequence of coin flips, or of getting a certain hand of cards are described by discrete probability theory. On the other hand, the odds of a radioactive isotope decaying at a particular time requires continuous probability theory.

Before getting into the details, there's one important thing to mention. When you're talking about probability, there are two fundamental schools of interpretetation. There are frequentist interpretations, and there are Bayesian interpretations.

In a frequentist interpretation, when you say the probability of an event is 0.6, what you mean is that if you were to perform a series of experiments precisely reproducing the event, then on average, if you did 100 experiments, the event would occur 60 times. In the frequentist interpretation, the probability is an intrinsic property of the event. For a frequentist, it makes sense to say that there is a "real" probability associated with an event.

In a Bayesian interpretation, when you say that the probability of an event is 0.6, what you mean is that based on your current state of knowledge about the event, you have a 60% certainty that the event will occur. In a strict Bayesian interpretation, the event doesn't have any kind of intrinsic probability associated with it. The specific event that you're interested in either will occur, or it won't. There's no real probability involved. What probability measures is how certain you are about whether or not it will occur.

For example, think about flipping a fair coin.

A frequentist would say that you can flip a coin many times, and half of the time, it will land on heads. So the probability of a coin flip landing on the head of the coin is 0.5. A Bayesian would say that the coin will land either on heads or on tails. Since you don't know which, and you have no other information to use to be able to make a better prediction, you can have a certainty of 0.5 that it will land on the head of the coin.

In the real world, I think that most people are really somewhere in between.

I think that all but the most fervent Bayesians do rely on an intuitive notion of the "intrinsic" probability of an event. They may describe it in different terms, but when it comes down to it, they're using the basic frequentist notion. And I don't think that you can find a sane frequentist anywhere who won't use Bayes theorem to update their priors in the face of new information - which is the most fundamental notion in the Bayesian interpretation.

One note before I finish this, and get started on the real meaty posts. In the past, when I've talked about probability, people have started stupid flamewars in the comments. People get downright religious about interpretations of probability. There are religious Bayesians, who think that all frequentists are stupid idiots who should be banished from the field of math; likewise, there are religious frequentists who think that Bayesians are all a crop of arrogant know-it-alls who should be sent to Siberia. I am not going to tolerate any of that nonsense. If you feel that you cannot read posts on probability without going into a diatribe about those stupid frequentists/Bayesians and their deliberately stupid ideas, please go away and don't even read these posts. If you do go into such a diatribe, I will delete your comments without any hesitation.

23 responses so far