
Probability Theory: or the Art of Making a Good Guess

Published: 9/22/2025
Authors: Kennon Stewart


First Things First: There’s a Good Kind of Randomness.

Cities, climate, and nature all display a high degree of randomness, but there’s something else there too. The extent to which a process can vary (called variance) breaks into two pieces: explainable and unexplainable error.

Explainable error is a bit of a misleading name, but it quantifies the change that we can model. This isn’t randomness so much as the true underlying motion of the data. And it’s entirely dependent on the (often simplistic) model we choose.

Unexplained error is the randomness that hides the process’ real information. It’s an artifact of only having so much data, so many observations to work with. We’ll never be able to measure the whole world, and so statisticians make a point estimate and quantify the amount by which we’re wrong.
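
Here’s a minimal sketch of that split (the linear trend, noise level, and sample size are all made up for illustration): fit a simple line, then divide the total variation into the share the model explains and the residual it can’t.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a linear trend (the explainable motion)
# plus noise we can't model away (the unexplainable error).
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=3.0, size=200)

# Fit the (often simplistic) model we chose: a straight line.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

total = np.sum((y - y.mean()) ** 2)          # total variation
explained = np.sum((y_hat - y.mean()) ** 2)  # variation the model captures
residual = np.sum((y - y_hat) ** 2)          # what's left over

print(f"explained share: {explained / total:.2f}")
print(f"residual share:  {residual / total:.2f}")
```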

This matters when we interpret noisy sensor data, design algorithms with randomness, or ensure robustness in the face of incomplete information (phrasing intentional, we get into information theory later). It doesn’t do to provide a bare guess for a changing environment; we have to admit the possibility that we’re wrong, as well as the degree to which that is possible.

And so it’s a good thing we have sources like Casella and Berger, who wrote a browsable reference for probability and statistics. The text, first published in 1990 and conveniently free online, is still fundamental for causal inference and prediction. That includes models like ChatGPT.

We highlight recurring themes from probability theory as they appear in our work. A single article won’t do justice to the field, but consider this a nice primer.

The Probability Space

A probability space is a triple $(S, \mathcal{F}, P)$ where:

  • $S$: the sample space (all possible outcomes).
  • $\mathcal{F}$: a $\sigma$-algebra of subsets of $S$ (the measurable events).
  • $P$: a probability function that maps events from our sample space to the real line $\mathbb{R}$.
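
As a toy illustration (a single fair die roll, with the full power set as the $\sigma$-algebra, both chosen for simplicity), the triple can be written out explicitly:

```python
from fractions import Fraction
from itertools import combinations

# Sample space: one roll of a six-sided die.
S = frozenset(range(1, 7))

# Sigma-algebra: for a finite S we can take the full power set.
F = [frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)]

def P(event: frozenset) -> Fraction:
    """Probability measure: uniform weight on each outcome."""
    assert event in F, "P is only defined on measurable events"
    return Fraction(len(event), len(S))

print(P(frozenset({2, 4, 6})))  # P(even) = 1/2
```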

As the experiment progresses, the sample space containing our outcomes becomes increasingly complex. It is segmented into increasingly fine partitions, each piece mapping to a numeric probability via the probability function. Each of those pieces is an event. Distinct from the concept of an outcome, an event is a group of outcomes that we can recognize and map to a value.

As the model becomes more refined, so do the partitions. The model distinguishes between two similar events and can map them to their own probabilities. Each of these partitions generates a $\sigma$-algebra, which describes the model’s information at a single point in time. More refined partitions mean more information.

And the collection of these $\sigma$-algebras together describes the learning process of our model. It essentially maps the model’s journey from naive inference (no outcome is distinguishable from another) to increasingly educated guesses (two seemingly similar events can actually map to wildly different probability values). This is called a filtration, and apparently it’s a big deal for those who study stochastics.
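
A small sketch of a filtration on a toy experiment (two coin flips, assumed here just for illustration): after observing the first $t$ flips, outcomes that agree on those flips sit in the same block of the partition, and the blocks refine as $t$ grows.

```python
from itertools import product

# Outcomes of two coin flips.
S = list(product("HT", repeat=2))  # [('H', 'H'), ('H', 'T'), ...]

def partition_at(t: int):
    """Group outcomes that look identical after observing t flips.

    The blocks of each partition generate the sigma-algebra F_t; as t
    grows the partition refines, so F_0 ⊆ F_1 ⊆ F_2, a filtration.
    """
    blocks = {}
    for outcome in S:
        blocks.setdefault(outcome[:t], []).append(outcome)
    return list(blocks.values())

for t in range(3):
    print(f"t={t}: {partition_at(t)}")
# t=0: one block   (nothing observed, no outcome distinguishable)
# t=1: two blocks  (first flip known)
# t=2: four blocks (everything known)
```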

For us, it’s just a way to measure our model’s progress. We want a model strong enough to map the important events to a probability without overfitting to the sample. We won’t go too much into this, but this is one of those moments where some information is better to forget.

Kolmogorov’s Axioms are the Foundation.

Probability is built on three axioms:

  1. Non-negativity: $P(A) \geq 0$ for all $A \in \mathcal{F}$.
  2. Normalization: $P(S) = 1$.
  3. Countable Additivity: If $A_1, A_2, \dots$ are disjoint, then $P\big(\bigcup_{i=1}^\infty A_i\big) = \sum_{i=1}^\infty P(A_i)$.

From these, we can derive continuity, complements, and conditional probability. They’re famous because they’re entirely derived from set and measure theory, giving them a strong theoretical backing. But more importantly, they allow us to make better guesses.
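
As a quick example of such a derivation, the complement rule falls straight out of normalization and additivity applied to the disjoint pair $A$ and $A^c$:

$$1 = P(S) = P(A \cup A^c) = P(A) + P(A^c) \quad \Rightarrow \quad P(A^c) = 1 - P(A).$$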

Rules like normalization let probabilists compare one random variable to another in terms of their likelihood: am I more likely to see a dolphin at the beach or a blue whale? And countable additivity means that we can compute the probability of a compound event from its disjoint components.

There are cases where these rules don’t hold, but they’re few and far between. The framework largely scales from coin flips to systems the size of Google Gemini.

What are common random variables?

  • Coin Tosses: $P(\{H\}) = 0.5$, $P(\{T\}) = 0.5$.
  • Twins Example: Assign probabilities to “identical twins,” “fraternal twins,” and “female twins” using set intersections and unions.
  • Continuous Example: $S = \{x \in \mathbb{R} : x > 0\}$, with $P$ defined by an exponential distribution.
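
A quick sketch of the first and last examples in code (the exponential rate $\lambda = 1$ is an arbitrary choice for illustration):

```python
import math

# Coin toss: a fair two-outcome sample space.
P = {"H": 0.5, "T": 0.5}

# Continuous example: an exponential distribution on S = (0, inf).
# Its survival function gives P(X > x) = exp(-rate * x).
rate = 1.0  # assumed rate parameter, purely illustrative

def p_greater(x: float) -> float:
    return math.exp(-rate * x)

print(P["H"])          # 0.5
print(p_greater(2.0))  # ~0.135, i.e. P(X > 2) under rate 1
```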

But how does this relate to machine learning?

Because, at their best, ML models are really good probability machines. And at their worst, they’re as good as flipping a coin (a theoretically-sound statement). But an ML engineer who defines the random variables well can outperform complex models at a fraction of the size.

  • Bayesian inference is the probabilistic version of a guess-check loop. Posterior updates are probability measures over parameter sets. As new information is revealed, our belief about the model’s parameters keeps changing, making for a flexible style of learning (see the sketch after this list).
  • Generalization bounds: Models perform notoriously poorly on unseen datasets, especially when the production environment differs from training. But we can measure a model’s ability to adapt to new settings by bounding its expected error over unseen data, a property called generalization.
  • Stochastic optimization: Algorithms like Stochastic Gradient Descent rely on treating gradients as random variables drawn from a probability space. This was a foundational move away from pure empirical risk minimizers (models fitted to data they’ve already seen) and toward models that generalize to unseen data.
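
Here’s a minimal sketch of that Bayesian guess-check loop, using the conjugate Beta-Bernoulli pair and a made-up sequence of coin flips:

```python
# Prior Beta(1, 1) is uniform: maximal ignorance about P(heads).
alpha, beta = 1.0, 1.0

flips = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical observations, 1 = heads

for flip in flips:
    # Each observation updates the posterior measure over the
    # parameter: heads bump alpha, tails bump beta.
    alpha += flip
    beta += 1 - flip

posterior_mean = alpha / (alpha + beta)
print(f"posterior mean of P(heads): {posterior_mean:.3f}")  # 0.700
```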

Whether explicit or implicit, probability is the glue that holds machine learning theory together.

Further Reading

Casella, George, and Roger L. Berger. Statistical Inference. 2nd ed., Chapman and Hall/CRC, 2024. https://doi.org/10.1201/9781003456285.

Li, Ming, and Paul M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. 3rd ed., Springer, 2008. Texts in Computer Science.