Here’s what Britannica says about Occam’s Razor:
Occam’s razor is a principle of theory construction or evaluation according to which, other things equal, explanations that posit fewer entities, or fewer kinds of entities, are to be preferred to explanations that posit more. It is sometimes misleadingly characterized as a general recommendation of simpler explanations over more complex ones.
I think this is a pretty common attitude, but it’s mistaken. I’m going to defend the ‘misleading characterization’.
When we compare two words by alphabetical order, we start by considering their first letters; if there’s a tie, we consider their second letters; and so on. (If one of the words ends first, it wins the tie: ‘Oak’ precedes ‘Oakley’.) Thus, each letter takes a strong sort of priority over all subsequent letters, which are only tie-breakers. We call this lexical priority (the converse is lexical inferiority).
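Here's the same rule in miniature, if code helps (a throwaway Python illustration; the word list is obviously just an example):

```python
# Python's built-in string comparison is lexicographic: earlier letters strictly
# dominate later ones, which only ever break ties, and a word that ends first
# wins the tie.
words = ["Oakley", "Oat", "Oak", "Ocean"]
print(sorted(words))  # ['Oak', 'Oakley', 'Oat', 'Ocean']
```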
Across philosophy, various considerations are ruled lexically inferior. David Enoch admits that courts may favor evidence which has a shot at being knowledge of guilt (e.g., eyewitness testimony over statistical patterns), but thinks that accuracy takes lexical priority. Shelly Kagan admits that moral agents may selfishly favor their own well-being, but thinks that promoting aggregate welfare takes lexical priority. I admit that you should keep reading, but think that subscribing takes lexical priority.
Britannica’s Ockham holds that we should treat simplicity as a tie-breaker, lexically inferior to explanatory strength. This is a mistake. In general, we should weigh simplicity on a par with predictive strength, because doing so (i) guards against overfitting, and (ii) takes prior probability into account.
Overfitting
Here’s a simple case. Suppose we’re running a high-school physics experiment, testing how the restoring force of springs (F) depends on how far they’ve been stretched (x). Here are two hypotheses: first, that F is proportional to x; second, that F is proportional to the 1.007th power of x. Suppose the second hypothesis fits our data almost perfectly, while the first fits somewhat less well. Which should we prefer? I think it’s clear that we should prefer the first hypothesis, and do so on grounds of simplicity. Emphasizing simplicity guards against the danger of overfitting to noisy data. We want a theory of the phenomenon, not just a theory of our measurements. Our measurements are often our best guide to the phenomenon, and so are essential to theorizing; but we should be careful not to trust all of their quirks too blindly.
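To see the worry in miniature, here’s a quick sketch in Python (the spring constant, noise level, and stretch values are all made up for illustration): fit both hypotheses to noisy linear data, and the free-exponent model will typically post the better fit simply because it has an extra knob with which to chase noise.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
k_true = 5.0                                 # hypothetical spring constant
x = np.linspace(0.1, 1.0, 10)                # stretch distances
F = k_true * x + rng.normal(0, 0.1, x.size)  # noisy force measurements

def linear(x, k):        # hypothesis 1: F proportional to x
    return k * x

def power(x, k, p):      # hypothesis 2: F proportional to x**p
    return k * x ** p

(k1,), _ = curve_fit(linear, x, F)
(k2, p2), _ = curve_fit(power, x, F, p0=[1.0, 1.0])

def sse(pred):           # sum of squared errors on the training data
    return float(np.sum((F - pred) ** 2))

print(f"linear: k={k1:.3f},            SSE={sse(linear(x, k1)):.4f}")
print(f"power:  k={k2:.3f}, p={p2:.3f}, SSE={sse(power(x, k2, p2)):.4f}")
# The free-exponent model almost always posts a (slightly) lower SSE, because
# it has an extra parameter to soak up noise; that is the overfitting worry.
```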
In philosophy (as in linguistics), our measurements are mostly armchair judgements about example cases. Often, these judgements are generated by fast-and-frugal mental heuristics, or become a bit hazy upon reflection. So, they are noisy; emphasizing simplicity guards against the danger of overfitting. Williamson’s Overfitting and Heuristics in Philosophy (which should be coming out around the same time as this blog post) discusses this in much more detail, with a very nice case study on hyperintensionality.
Prior Probability (for background, see here)
Here’s a simple way of quantifying predictive strength: the degree to which a hypothesis h predicts some evidence e is just the conditional probability of e on h. So, h1 predicts e better than h2 just in case P(e|h1) > P(e|h2). But after collecting our data, we want to compare the two hypotheses given our new knowledge. Applying Bayes’ Theorem (and cancelling the common factor P(e)), P(h1|e) > P(h2|e) just in case P(e|h1) P(h1) > P(e|h2) P(h2). So, if our hypotheses started off equally plausible, i.e., P(h1) = P(h2), we should prefer the one that better predicts our data. That seems right. Similarly, if our hypotheses predict our data equally well, i.e., P(e|h1) = P(e|h2), we should prefer the one that started off more plausible. That also seems right. Notice that neither consideration takes lexical priority; a higher prior plausibility can be traded off against a better prediction. Formally, predictive strength P(e|h) and prior probability P(h) have the same impact on evidential probability P(h|e).
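A toy calculation makes the trade-off concrete (the numbers below are invented purely for illustration, and the posterior is normalized over just these two hypotheses): a hypothesis that starts off far more plausible can come out ahead even though its rival predicts the data slightly better.

```python
# Toy numbers (purely illustrative): h1 starts off more plausible, h2 predicts
# the data slightly better. Posteriors are compared by normalizing
# P(e|h) * P(h) over just these two hypotheses.
prior      = {"h1": 0.50, "h2": 0.05}
likelihood = {"h1": 0.80, "h2": 0.95}

unnormalized = {h: likelihood[h] * prior[h] for h in prior}
total = sum(unnormalized.values())
posterior = {h: round(v / total, 3) for h, v in unnormalized.items()}
print(posterior)  # {'h1': 0.894, 'h2': 0.106}: the higher prior wins out
```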
What does this have to do with simplicity? As Jaynes puts it:
Generations of writers opined vaguely that ‘simple hypotheses are more plausible’ without giving any logical reason for it. We suggest that this should be turned around: we should say rather that ‘more plausible hypotheses tend to be simpler’. An hypothesis that we consider simpler is one that has fewer equally plausible alternatives.
Consider the high-school physics experiment again. We have some background knowledge that physical laws generally have clean exponents; so, it’s plausible that Hooke’s Law will also involve clean exponents. Simple hypotheses are more plausible. At the same time, a linear relationship is on par with no relationship or an inverse relationship, whereas an exponent of 1.007 is on par with exponents of 1.008, 1.009, etc. More plausible hypotheses (with fewer competitors) tend to be simpler. For a more careful and formal treatment, see Jaynes’ Probability Theory (Ch. 20).
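One rough way to make Jaynes’ point concrete (the 50/50 split of prior mass and the particular grids of exponents are assumptions, chosen only for illustration): give half the prior mass to a handful of clean exponents and half to a fine grid of messy near-linear ones, and each clean exponent individually comes out far more plausible than any particular messy one.

```python
import numpy as np

# Assumed split of prior mass, for illustration only.
clean = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])         # a handful of 'clean' exponents
messy = np.round(np.arange(0.950, 1.051, 0.001), 3)  # ~100 'messy' near-linear exponents

prior_per_clean = 0.5 / clean.size   # half the mass shared among few alternatives
prior_per_messy = 0.5 / messy.size   # half the mass shared among many alternatives
print(round(prior_per_clean / prior_per_messy))  # each clean exponent is ~20x more plausible
```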
In philosophy, arbitrary, ad hoc theories have much lower prior plausibility. So, even if they predict our data slightly better, they may still be less plausible overall.
Does Size Matter?
Britannica’s Ockham holds that the relevant consideration is the number of (kinds of) entities. Somewhat ironically, this principle itself might be an extra entity that we can do without. To the extent that postulating fewer (kinds of) entities (i) guards against overfitting or (ii) reflects a higher prior plausibility, theories which postulate fewer (kinds of) entities are preferred. When neither (i) nor (ii) holds, Ockham’s razor seems ill-motivated: ‘I’m a brain in a vat, and nothing else exists’ appears to postulate fewer (kinds of) entities on the whole, and predicts your apparent observations just as well; but it’s a horrible theory. Generally, though, theories which postulate extra (kinds of) entities with redundant or missing roles have lower prior plausibility: roughly, each such theory sits alongside many equally plausible alternatives which give those entities non-redundant roles, so it gets only a small share of the prior.
At best, Ockham’s razor is redundant, since we can directly appeal to (i) or (ii); at worst, it gets things wrong. It’s also simpler to say ‘simpler’ than to invoke the name of some old monk. Has Ockham been cut with his own razor?
Repetition isn’t always redundancy, though: maybe this reminder to subscribe will make you actually do it.
I'm shocked at Britannica's explanation of Occam's razor, which just sounds flagrantly incorrect to me. As I understand it, most philosophers (correctly) agree with you that simplicity is roughly on par with explanatory depth as far as theoretical virtues go.