The way that I've talked about probability so far is mostly informal. That's the way that probability theory was treated for a long time. You defined probability spaces over collections of equal probability sets. You combined probability spaces by combining their events into other kinds of equally probable events.

The problem with that should be obvious: it's circular. You want to define the probability of events; to do that, you need to start with equally probable events, which means that on some level, you already know the probabilities. If you don't know the probabilities, you can't talk about them. The reality is somewhat worse than that, because this way of looking at things completely falls apart when you start trying to think about infinite probability spaces!

So what can you do?

The answer is to reformulate probability. Mathematicians knew about this kind of problem for a very long time, but they mostly just ignored it: probability wasn't considered a terribly interesting field.

Then, along came Kolmogorov - the same brilliant guy whose theory of computational complexity is so fascinating to me! Kolmogorov created a new formulation of probability theory. Instead of starting with a space of equally probable discrete events, you start with a *measure space*.

Before we can look at how Kolmogorov reformulated probability (the *Kolmogorov axioms*), we need to look at just what a measure space is.

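A measure space is just a set with a *measure function*. So let **X** be a set. A *measure* μ on **X** is a function from subsets of **X** to real numbers, \(\mu: 2^X \rightarrow \mathbb{R}\), with the following properties. (In general, μ is only defined on a σ-algebra of subsets of **X**, not necessarily on every subset.)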

- Measures are non-negative: \(\forall x \subseteq X: \mu(x) \ge 0\)
- The measure of the empty set is always 0: \(\mu(\emptyset) = 0\)
- The measure of a finite union of pairwise disjoint sets is the sum of the individual measures: \(x \cap y = \emptyset \Rightarrow \mu(x \cup y) = \mu(x) + \mu(y)\)

So the idea is pretty simple: a measure space is just a way of defining the *size* of a subset in a consistent way.
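To make that concrete, here's a minimal sketch that brute-force checks the three properties on a small finite set. The names `powerset`, `counting_measure`, and `is_measure` are made up for illustration, and the counting measure (the size of a subset is its number of elements) is just one convenient example. Additivity is only checked on pairs of disjoint subsets.

```python
from itertools import combinations

def powerset(xs):
    """Return every subset of xs as a frozenset."""
    items = list(xs)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def counting_measure(subset):
    """Illustrative measure: the number of elements in the subset."""
    return len(subset)

def is_measure(mu, X):
    """Check the three measure properties on every subset of a finite set X."""
    subsets = powerset(X)
    non_negative = all(mu(s) >= 0 for s in subsets)
    empty_is_zero = mu(frozenset()) == 0
    # Additivity only has to hold when the two subsets do not overlap.
    additive = all(mu(a | b) == mu(a) + mu(b)
                   for a in subsets for b in subsets
                   if not (a & b))
    return non_negative and empty_is_zero and additive

X = {"a", "b", "c"}
print(is_measure(counting_measure, X))       # the counting measure passes
print(is_measure(lambda s: len(s) ** 2, X))  # squaring the size breaks additivity
```

The second call fails because two disjoint one-element subsets each get size 1, but their union gets size 4 instead of 2.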

To work with probability, you need a measure space where the measure of the entire set is 1. With that idea in mind, we can put together a proper, formal definition of a probability space that will really allow us to work with, and to combine, probabilities in a rigorous way.

Like our original version, a probability space has a set of events, called its *event space*. We'll use **E** to represent the set of all possible events, and *e* to represent an event in that set.

There are three fundamental axioms of probability, which are going to look *really* similar to the three axioms of a measure space:

- Basic measure: the probability of any event is a non-negative real number: \(\forall e \in E: P(e) \ge 0\).
- Unit measure: the probability that *some* event will occur is 1, which we write as \(P(\Omega) = 1\). (\(\Omega\) is called the *unit event*, and is the union of all possible events.) Alternatively, the probability of no event occurring is 0: \(P(\emptyset) = 0\).
- Combination: for any two disjoint events or sets of events \(e\) and \(f\), the probability of \(e\) or \(f\) is \(P(e) + P(f)\): \(\forall e, f \subseteq E: e \cap f = \emptyset \Rightarrow P(e \cup f) = P(e) + P(f)\). This can be extended to any *countable* sequence of unions.
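As a rough sketch of what these axioms look like on a finite space, here's one roll of a fair six-sided die. The die, the names `OMEGA`, `WEIGHTS`, `all_events`, and `satisfies_axioms`, and the use of exact fractions are all assumptions made for the example. Every subset of outcomes is treated as an event, which is fine for a finite space.

```python
from fractions import Fraction
from itertools import combinations

# Illustrative example: one roll of a fair six-sided die.
OMEGA = frozenset(range(1, 7))
WEIGHTS = {outcome: Fraction(1, 6) for outcome in OMEGA}

def P(event):
    """Probability of an event, where an event is a subset of OMEGA."""
    return sum((WEIGHTS[o] for o in event), Fraction(0))

def all_events(omega):
    """Every subset of omega, treated as an event."""
    items = sorted(omega)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def satisfies_axioms(P, omega):
    """Brute-force check of the three axioms over all events."""
    events = all_events(omega)
    basic = all(P(e) >= 0 for e in events)
    unit = P(omega) == 1
    # Combination is only required for disjoint events.
    combination = all(P(e | f) == P(e) + P(f)
                      for e in events for f in events
                      if not (e & f))
    return basic and unit and combination

print(satisfies_axioms(P, OMEGA))
evens = frozenset({2, 4, 6})
print(P(evens | frozenset({1})))  # prints 2/3: disjoint events, so 1/2 + 1/6
```

Exact fractions are used instead of floats so that the equality checks in the combination axiom aren't thrown off by rounding.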

This is very similar to the informal version we used earlier. But as we'll see later, this simple formulation from measure theory will give us a lot of additional power.

It's worth taking a moment to point out two implications of these axioms. (In fact, I've seen some presentations that treat some of these as additional axioms, but they're provable from the first three.)

- Monotonicity: if \(e \subseteq f\), then \(P(e) \le P(f)\).
- Upper Bound: for any event or set of events \(e\), \(P(e) \ge 0 \land P(e) \le 1\).
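
Both follow in a couple of lines. As a sketch, assuming the set difference \(f \setminus e\) is itself an event (which holds when the event space is closed under complements and intersections, an assumption not spelled out above):

\[ e \subseteq f \;\Rightarrow\; f = e \cup (f \setminus e), \quad e \cap (f \setminus e) = \emptyset \]

\[ P(f) = P(e) + P(f \setminus e) \ge P(e) \]

The first line splits \(f\) into two disjoint pieces, so the combination axiom applies, and the basic measure axiom gives \(P(f \setminus e) \ge 0\). The upper bound is then just monotonicity applied to \(e \subseteq \Omega\):

\[ 0 \le P(e) \le P(\Omega) = 1 \]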

The brilliance of Kolmogorov was realizing that these rules were *everything you need* to work out any probability you want - in both finite and infinite spaces. We'll see that there's a lot of complexity in the combinatorics of probability, but it will all always ultimately come back to these three rules.

Just to be pedantic, 'cause that's what I do: the unions above need to be *disjoint* unions. Otherwise you get silliness like

X = X ∪ X

so

1 = μ(X) = μ(X∪X) = μ(X)+μ(X) = 1+1 = 2

which is not so good.

Also you probably want "disjoint" instead of "distinct" sets of events in rule 3.

Measure of DISJOINT unions adds. And measures are defined on sigma-subalgebras of the power set, not the power set itself in general

It looks like you defined measure over the power set of X (mu:2^X->R). But under standard assumptions, aren't there non-measurable sets?

There's a problem with that first formula under Monotonicity. Presumably it's meant to be e is a subset of f.

"We'll use F to represent the set of all possible events, and e to represent an event in that set."

I think you mean E

also, " forall_{e,f} subseteq P ", should be "subseteq E", in axiom three. right?