I try to lay out the basic bits1 of probability theory which I find especially useful for epistemology. It’s a bit more technical than what I’ve seen online, but I hope it’s still fairly accessible; my target audience is high-school me. For an interesting alternative exposition, see Jaynes’ Probability Theory: The Logic of Science (Ch. 1-2).2
I also try to show that our primitive talk about beliefs and knowledge shouldn’t become deprecated. I should note that this later section is heavily influenced by Williamson’s Knowledge and Its Limits and by general exposure to Williamsonian Thought (e.g., via the formal epistemology graduate seminar at Oxford).
Probability space
Let W be a set. For familiarity, call its elements worlds, and its subsets propositions. A proposition φ is true at some world w just in case w is in φ. Thus, we’ve identified each proposition with the set of worlds at which it’s true.
Our logical operators (⊤, ⊥, ¬, ∧, ∨) correspond to the set operators (W, ∅, C, ⋂, ⋃). They work as expected:
⊤ is W, which contains all worlds and so is true at all worlds.
⊥ is the empty set, which contains no world and so is true at no worlds.
¬φ is the set of worlds not in φ. So, ¬φ is true at w just in case φ is not true at w.
(φ ∧ ψ) is the set of worlds in both φ and ψ. So, (φ ∧ ψ) is true at w just in case φ and ψ are both true at w.
(φ ∨ ψ) is the set of worlds in either φ or ψ. So, (φ ∨ ψ) is true at w just in case either φ or ψ is true at w.
Let Σ be a logically closed set of propositions. In particular, the result of applying countably many logical operations to propositions in Σ must also be a proposition in Σ.3 Then, Σ is a σ-algebra on W, and the pair (W, Σ) is a measurable space. The smallest σ-algebra is {⊥, ⊤}, and the largest is the set 𝒫(W) of all propositions. We can think of Σ as a set of learnable propositions.
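To make this concrete, here's a minimal Python sketch of the setup on an invented three-world model (the world names, and the choice of the full power set as Σ, are just for illustration): propositions are sets of worlds, and the logical operators are the corresponding set operations.

```python
from itertools import chain, combinations

# A toy model: three worlds, propositions as frozensets of worlds.
W = frozenset({"w1", "w2", "w3"})

def power_set(s):
    """Every subset of s -- here playing the role of the largest sigma-algebra on W."""
    items = list(s)
    return {frozenset(c)
            for c in chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))}

Sigma = power_set(W)              # all 2^3 = 8 propositions are learnable in this toy model

TOP, BOT = W, frozenset()         # ⊤ and ⊥
def neg(phi):        return W - phi       # ¬φ: complement with respect to W
def conj(phi, psi):  return phi & psi     # φ ∧ ψ: intersection
def disj(phi, psi):  return phi | psi     # φ ∨ ψ: union
def true_at(phi, w): return w in phi      # φ is true at w just in case w ∈ φ

phi = frozenset({"w1", "w2"})
assert true_at(phi, "w1") and not true_at(neg(phi), "w1")
assert conj(phi, neg(phi)) == BOT and disj(phi, neg(phi)) == TOP
assert all(neg(p) in Sigma and conj(p, q) in Sigma for p in Sigma for q in Sigma)  # logical closure
```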
Now, let M be a function from Σ to the (extended) real numbers such that:
For all φ, M(φ) ≥ 0.
If φ ∧ ψ = ⊥, then M(φ ∨ ψ) = M(φ) + M(ψ).4
Then, M is a measure on (W, Σ), and the triple (W, Σ, M) is a measure space.
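For a small illustration, continuing the toy model above (the weights are made up): any assignment of non-negative weights to worlds determines such a measure.

```python
# Toy worlds with arbitrary non-negative weights (illustrative only).
weights = {"w1": 2.0, "w2": 1.0, "w3": 1.0}
W = frozenset(weights)

def M(phi):
    """A measure on (W, Sigma): non-negative, and additive over disjoint propositions."""
    return sum(weights[w] for w in phi)

phi, psi = frozenset({"w1"}), frozenset({"w2", "w3"})
assert phi & psi == frozenset()          # φ ∧ ψ = ⊥
assert M(phi | psi) == M(phi) + M(psi)   # so M(φ ∨ ψ) = M(φ) + M(ψ)
assert M(frozenset()) == 0 and M(W) == 4.0
```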
If 0 < M(W) < ∞, then we can define the function P as P(φ) = M(φ) / M(W) for all φ in Σ. Notice that P is a measure on (W, Σ), with P(W) = 1. So, P is a probability measure on (W, Σ), and (W, Σ, P) is a probability space with W as the sample space and Σ as the event space. For any event φ in Σ, P(φ) is the probability of φ. We can deduce familiar properties of P, e.g.:
P(φ ∨ ψ) = P(φ) + P(ψ) - P(ψ ∧ φ)
P(¬φ) = 1 - P(φ)
P(⊥) = 0
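Continuing the sketch with the same made-up weights, normalising by M(W) gives a probability measure, and the familiar identities can be checked directly:

```python
weights = {"w1": 2.0, "w2": 1.0, "w3": 1.0}   # same toy measure as above
W = frozenset(weights)

def M(phi):
    return sum(weights[w] for w in phi)

def P(phi):
    """P(φ) = M(φ) / M(W), so that P(W) = 1."""
    return M(phi) / M(W)

phi, psi = frozenset({"w1", "w2"}), frozenset({"w2", "w3"})
assert P(W) == 1
assert P(phi | psi) == P(phi) + P(psi) - P(phi & psi)   # P(φ ∨ ψ) = P(φ) + P(ψ) − P(φ ∧ ψ)
assert P(W - phi) == 1 - P(phi)                          # P(¬φ) = 1 − P(φ)
assert P(frozenset()) == 0                               # P(⊥) = 0
```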
Now, we’ll take P as a prior probability function, and see what happens when we know things.
Conditional probability
Let Ω be the proposition in Σ containing the worlds which, for all you know, you might be in. These worlds are epistemically accessible.5 You know that φ just in case φ is true in all epistemically accessible worlds (i.e., all worlds in Ω are also in φ). We can think of Ω as (φ1 ∧ φ2 ∧ …), where φ1, φ2, … are the propositions that you know. So, Ω is a proposition representing all of your background knowledge.
We want to use probability to reason about propositions that, for all we know, may or may not be true. So, we’d like to use Ω as our sample space, throwing out all the worlds which we know can’t be actual.
This means that for each proposition φ, we only care about the subset φ ∧ Ω. This induces the event space ΣΩ, which contains φ ∧ Ω for each φ in Σ. Notice that ΣΩ is a subset of Σ. It’s also a σ-algebra on Ω, so we have the measurable space (Ω, ΣΩ).
Happily, P (with its domain restricted to ΣΩ) is a measure on (Ω, ΣΩ). Unfortunately, this isn’t a probability measure on (Ω, ΣΩ) if P(Ω) ≠ 1. But, if P(Ω) > 0, we can get a probability measure PΩ, defined as PΩ(φ) = P(φ) / P(Ω) for each φ in ΣΩ. So, our knowledge Ω induces the probability space (Ω, ΣΩ, PΩ).
Although we’ve defined PΩ as a probability measure on (Ω, ΣΩ), we can also use it as a probability measure on (W, Σ). The key characteristic of our restricted event space ΣΩ is that each event φ is equivalent to φ ∧ Ω.6 That is, PΩ(φ) = P(φ ∧ Ω) / P(Ω) for each φ in ΣΩ. So, we can extend PΩ by defining P(φ | Ω) = P(φ ∧ Ω) / P(Ω) for each φ in Σ. These are the probabilities that would be induced by knowing exactly Ω, i.e., if we threw out all the worlds not in Ω. These are conditional or posterior probabilities; I’ll also call them subjective probabilities.7
It’s easy to see what happens when we gain or lose knowledge. Suppose that we learn ψ, such that our total knowledge becomes Ω ∧ ψ. Then, the probability of φ given our new total knowledge is P(φ | Ω ∧ ψ). If we forget ψ, then our total knowledge reverts to Ω, so the probability of φ on our total knowledge reverts to P(φ | Ω). Simple!
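Here's a sketch of that on the toy prior from before (the knowledge sets Ω and ψ are invented for illustration): learning intersects the background knowledge, and forgetting reverts it.

```python
weights = {"w1": 2.0, "w2": 1.0, "w3": 1.0}   # same toy prior as above

def P(phi):
    return sum(weights[w] for w in phi) / sum(weights.values())

def P_given(phi, omega):
    """P(φ | Ω) = P(φ ∧ Ω) / P(Ω), defined whenever P(Ω) > 0."""
    return P(phi & omega) / P(omega)

phi   = frozenset({"w1"})
omega = frozenset({"w1", "w2"})       # background knowledge: w3 is ruled out
psi   = frozenset({"w1", "w3"})       # a newly learned proposition

print(P(phi))                         # prior: 0.5
print(P_given(phi, omega))            # given Ω: 0.666...
print(P_given(phi, omega & psi))      # given Ω ∧ ψ: 1.0
print(P_given(phi, omega))            # after forgetting ψ, back to 0.666...
```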
Credence-talk doesn’t replace belief-talk
Why bother with restricted probability spaces, instead of just using conditional probability measures on (W, Σ)? Well, notice that P(φ) = 0 doesn’t entail φ = ⊥, and P(φ) = 1 doesn’t entail φ = ⊤, although the converse holds in both cases. Restricting our sample space allows us to capture the distinction between having subjective probability zero and being subjectively impossible, or having subjective probability one and being subjectively certain. For instance, if I know that I’ll observe a countable sequence of fair coin tosses, then I have epistemic probability one that I’m not going to see only heads. Given what I know, I almost surely won’t see all heads; but almost sure isn’t sure. Even with subjective probability one, I don’t yet know that I’m going to see tails.
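To spell out the arithmetic behind the coin example: for every n, the proposition that the first n tosses land heads has probability (1/2)^n on my knowledge, and the all-heads proposition entails each of them; so the all-heads proposition has probability at most (1/2)^n for every n, hence exactly zero. Yet an all-heads world remains epistemically accessible, so the all-heads proposition is not subjectively impossible.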
If one neglects knowledge-talk, then one is liable to make odd mistakes. For instance, one might assume that once φ has subjective probability one, it can never again have a subjective probability lower than one (unless one somehow loses evidence). But then assigning subjective probability one to any substantive claim seems wildly overconfident: how could I be so sure of φ that literally no evidence could make me even a little less confident? And yet, whenever I conditionalize on some proposition, I must assign it a posterior probability of one!
The basic distinction between belief and knowledge helps. Outright belief is necessary but insufficient for knowledge. To make this precise, let O be the proposition in Σ containing the worlds which, for all I outright believe, I might be in. These worlds are doxastically accessible. We can think of O as (β1 ∧ β2 ∧ …), where {β1, β2, …} are the propositions that I outright believe. As above, we can think of Ω as (φ1 ∧ φ2 ∧ …), where {φ1, φ2, …} are the propositions that I know. The claim, then, is that each proposition in {φ1, φ2, …} is also in {β1, β2, …}, but not vice versa; equivalently, each world in O is also in Ω, but not vice versa. We can distinguish two types of subjective probability spaces: doxastic (O, ΣO, PO) and epistemic (Ω, ΣΩ, PΩ). My credence in φ is P(φ | O), and my evidential probability in φ is P(φ | Ω).
I have credence one in my outright beliefs, but can still change my mind about them. To take a simple example, I might learn something that full-on contradicts my beliefs. Suppose I learn φ, but O ∧ φ = ⊥. Conditionalizing would collapse my doxastic state to the degenerate (⊥, {⊥}, …): no doxastically accessible worlds left, and no probability measure to be had. Clearly, this is unacceptable. To make my beliefs non-contradictory, I must instead give up each belief βi such that βi ∧ φ = ⊥. In fact, each such βi goes from credence one to credence zero. Having credence one did not indicate rabid dogmatism. Note that if we neglected to think about outright beliefs, it would be hard to understand how I could respond to a subjectively impossible proposition at all: the conditional credences are undefined, as they would require dividing by zero.
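A rough sketch of this dynamic on the toy model (the particular beliefs and evidence are invented): a belief that individually contradicts the new evidence is retracted, and its credence falls from one to zero.

```python
weights = {"w1": 2.0, "w2": 1.0, "w3": 1.0}   # same toy prior as above

def P_given(phi, omega):
    return sum(weights[w] for w in phi & omega) / sum(weights[w] for w in omega)

beta1, beta2 = frozenset({"w1"}), frozenset({"w1", "w2"})   # outright beliefs
beliefs = [beta1, beta2]
O = frozenset.intersection(*beliefs)                         # O = {w1}

phi = frozenset({"w2", "w3"})                                # learned, but O ∧ φ = ⊥
assert O & phi == frozenset()      # conditionalizing on φ would divide by P(O ∧ φ) = P(⊥) = 0

# Give up each belief βi with βi ∧ φ = ⊥, then take φ on board.
beliefs = [b for b in beliefs if b & phi != frozenset()] + [phi]
O = frozenset.intersection(*beliefs)                         # new O = {w2}

print(P_given(beta1, O))   # 0.0: credence in β1 fell from one to zero
print(P_given(beta2, O))   # 1.0: β2 survives
```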
The moral is that new evidence can dislodge old beliefs, casting them down from subjective certainty. In general, credence is not confidence. Analogously, evidential probability is not reliability. Whereas high credence means a low chance of error among the doxastically accessible worlds, confidence is best modeled as the chance of error being remote from the doxastically accessible worlds. Formally, impose a metric on W, and take confidence in φ to be the largest distance c such that φ contains every world within distance c of some doxastically accessible world. If my credence in φ is less than one, then I don't outright believe φ, and the confidence of my belief in φ is undefined. Confident belief (and reliable knowledge) includes things that I've observed, where the chance of error is remote; unconfident belief (and unreliable knowledge) includes things that I've inferred, where the chance of error is nearby.

The primitive distinction between high-confidence and low-confidence beliefs is illuminating. For instance, we see that the practice of putting (non-maximal) credences on beliefs is really a sign of high discursive standards: only the high-confidence beliefs are allowed to be assumed. But in some contexts, we can and should have lower standards; I may rely on everyday assumptions, but come to question them when my doxastic state shrinks enough to throw up alarm bells or when a friend starts to pester me about them. Although I haven't displayed any calculations here, relying on outright assumptions of course makes reasoning much more tractable. Finally, note that there is still a strong correspondence between credence and confidence: the less confident I am in a belief, the sooner my credence in it drops from one as the evidential standards rise.
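Here's a rough sketch of the confidence measure defined above, on an invented finite model where the metric just lays the worlds out on a line: confidence in an outright belief is the distance from the doxastically accessible worlds to the nearest world where the belief fails.

```python
coords = {"w1": 0.0, "w2": 1.0, "w3": 5.0}    # toy metric: worlds as points on a line
W = set(coords)

def dist_to(world, worlds):
    """Distance from a world to the nearest member of a set of worlds."""
    return min(abs(coords[world] - coords[v]) for v in worlds)

def confidence(phi, O):
    """Largest c such that every world within c of O lies in φ (undefined unless O ⊆ φ)."""
    if not O <= phi:
        return None                 # not outright believed: credence < 1, confidence undefined
    outside = W - phi               # the ¬φ worlds
    if not outside:
        return float("inf")
    return min(dist_to(w, O) for w in outside)   # distance to the nearest chance of error

O = {"w1"}                                    # doxastically accessible worlds
print(confidence({"w1", "w2"}, O))            # 5.0 — error is remote: a confident belief
print(confidence({"w1", "w3"}, O))            # 1.0 — error is nearby: an unconfident belief
print(confidence({"w2", "w3"}, O))            # None — not outright believed
```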
Coda: Bayes’ Theorem
By definition, P(φ|Ω) = P(φ ∧ Ω) / P(Ω). Rearranging, P(φ ∧ Ω) = P(φ|Ω) P(Ω). Swapping φ and Ω, we have P(φ ∧ Ω) = P(Ω ∧ φ) = P(Ω|φ) P(φ). Inserting this into our definition, we have Bayes’ Theorem:

P(φ|Ω) = P(Ω|φ) P(φ) / P(Ω)
This is quite useful! We can find the probability induced by our evidence Ω on some hypothesis φ, using the probability which φ would induce on Ω. But notice that our discussion above didn’t actually involve it; conditional probabilities are the more fundamental tool.
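As a quick sanity check, here's a sketch on the toy prior from earlier (the hypothesis and evidence sets are invented); both sides of the theorem come out the same.

```python
weights = {"w1": 2.0, "w2": 1.0, "w3": 1.0}   # same toy prior as above

def P(phi):
    return sum(weights[w] for w in phi) / sum(weights.values())

def P_given(phi, psi):
    return P(phi & psi) / P(psi)

phi   = frozenset({"w1", "w3"})               # a hypothesis
omega = frozenset({"w1", "w2"})               # the evidence

lhs = P_given(phi, omega)                      # P(φ | Ω)
rhs = P_given(omega, phi) * P(phi) / P(omega)  # P(Ω | φ) P(φ) / P(Ω)
assert abs(lhs - rhs) < 1e-12
print(lhs, rhs)                                # both 0.666...
```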
Of course, I haven’t really tried to demonstrate the power of either; but sometimes neat formalization is its own reward.
2. The internet Bayesians are right to think highly of that book. However, note that the discussions about Jeffreys and Bertrand, although very insightful, are not the final word.
3. (Note that we include ⊥ and ⊤ as nullary operators.) Equivalently, we can require that Σ contains the empty set, that Σ is closed under complementation with respect to W, and that Σ is closed under countable unions.
4. Really, we want countable additivity: M(φ1 ∨ φ2 ∨ …) = M(φ1) + M(φ2) + … for each countable subset {φ1, φ2, …} of Σ with φi ∧ φj = ⊥ whenever i ≠ j. (If we only want to allow finite-length sentences, we can write the big disjunction as a countable union instead.)
5. Although I won’t pursue it here, one can see how to combine this with epistemic logic.
6. Since for every φ* in Σ, we just consider φ = φ* ∧ Ω = (φ* ∧ Ω) ∧ Ω = φ ∧ Ω in ΣΩ.
7. I just intend this to mean that they’re given by some subject’s epistemic state; everything I say here is compatible with objective Bayesianism.
" For an interesting alternative exposition, see Jaynes’ Probability Theory: The Logic of Science (Ch. 1-2)."
Great rec! The hilarious snarky remarks left by Jaynes everywhere in the book do not hurt its appeal.
In a graduate logic seminar at St Ands I learned a really good way to think about sigma-algebras. Imagine you're playing some variant of a game called the Observer Game, where you make observations about what happened according to a ruleset. There are many rulesets, but every ruleset must be logically closed (if "X happened" is a possible observation in your ruleset, then "X didn't happen" must also be a possible observation in your ruleset, and so on). A sigma-algebra is a ruleset for some variant of the Observer Game. This is why the smallest possible sigma-algebra is {⊤, ⊥}, corresponding to the observations "nothing happened"/"something happened".