Information Theory and Deep Learning
ML needs saving…from ML researchers.
Machine learning hasn’t been a single discipline for a long time. It is a group of statisticians relying on computer scientists, relying on mathematicians, ad infinitum. More recently, however, we have started relying heavily on information theorists, who are the “librarians” of quantitative science.
This is a welcome change since we’re beginning to understand less and less about the internal mechanics of our models. Massive architectures like transformers and diffusion models operate by rules that are not easily described by classical parametric statistics. We can no longer just look at weights and biases; we have to look at the flow of information.
Model interpretability only takes us so far. A linear regression trained on wage data can be understood in terms of currency. But a neural network trained on wage data, freight logs, and logistics records quickly becomes incredibly complex, even in its simplest configurations.
At some point, we stopped asking for interpretability and started measuring a model by its information. How much “relevant” information does the model contain about the future? And what does “relevant” even mean in the space of all variational models?
These questions plagued deep learning for years, until someone finally picked up a book. Naftali Tishby published his paper applying information theory to deep learning, and the field looked to the information scientists for reference.
Side Note: Talk to Other People.
There are many things to be learned from Tishby’s application of entropy to variational models. The strongest takeaway is that he found such use in a theory published in 1948: there is something to be gained from interdisciplinary work.
Reading statistics journals is a wonderful way to spend Friday night, but there is little use in fragmented fields repeating the same theory in seven different languages. It is quite past time that researchers, quantitative scientists especially, talk to other people.
This plays a strong part in later work on information states, which is a control-theoretic problem. Stay tuned for the paper.
The Smallest Unit of Information
Entropy is the bedrock of information theory. Formally, for a discrete random variable $X$ with possible outcomes $x_i$ occurring with probability $p(x_i)$, the Shannon entropy is defined as:

$$H(X) = -\sum_i p(x_i) \log_2 p(x_i)$$

It measures the expected amount of information (or “surprise”) revealed when $X$ takes on a specific value.
The $-\log_2 p(x_i)$ term, often called the surprisal, is crucial. It maps high-probability events (the sun rising) to values near zero, and low-probability events (a solar eclipse) to high values. This metric intentionally inflates low-probability events. But why?
Claude Shannon’s insight was that in a sequence of independent events, common events tell us very little. If our goal is to learn as much about the underlying process as possible, we shouldn’t focus on the events that happen 99% of the time. We watch for the outliers. As calculated by $-\log_2 p(x_i)$, the rarest events carry the most information.
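As a quick illustration (my own sketch, not from the original post; the probabilities are made up), here are both quantities, surprisal and entropy, in a few lines of Python:

```python
import math

def surprisal(p: float) -> float:
    """Information revealed by an event of probability p, in bits: -log2(p)."""
    return -math.log2(p)

def entropy(probs: list[float]) -> float:
    """Shannon entropy: the probability-weighted average surprisal."""
    return sum(p * surprisal(p) for p in probs if p > 0)

# Common events carry almost no information; rare ones carry a lot.
print(surprisal(0.999))  # ~0.0014 bits (the sun rising)
print(surprisal(0.001))  # ~9.97 bits (a solar eclipse)

# Entropy of a fair coin vs. a highly predictable source.
print(entropy([0.5, 0.5]))    # 1.0 bit
print(entropy([0.99, 0.01]))  # ~0.08 bits
```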
This concept is formalized in coding schemes like the Shannon-Fano code. If entropy ties information to probability theory, then Shannon-Fano turns that theory into a recursive algorithm. But this algorithm mimics a much deeper, more complex measure of reality: Kolmogorov Complexity.
The Best Explanation is Brief: Compressing Probable Events.
The Shannon-Fano algorithm is straightforward:
- Line up events in order of probability.
- Assign each event a code word length inversely proportional to its probability: $\ell_i \approx \lceil -\log_2 p(x_i) \rceil$.
- Transmit the code words instead of the events.
This algorithm maps frequently used symbols (like ‘e’ in English) to short binary codes, and rare symbols (like ‘z’) to long ones. It closely tracks Shannon’s entropy and provides a tractable way to efficiently encode information based on statistical expectations.
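To make the recursion concrete, here is a small sketch of the splitting procedure (my own illustration; the symbol frequencies are invented, and this is one of several common variants of the algorithm):

```python
def shannon_fano(symbols):
    """
    symbols: list of (symbol, probability) pairs.
    Returns a dict mapping each symbol to its binary code word.
    """
    codes = {s: "" for s, _ in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(p for _, p in group)
        # Find the split point that divides the probability mass most evenly.
        running, split_at, best_diff = 0.0, 1, float("inf")
        for i in range(1, len(group)):
            running += group[i - 1][1]
            diff = abs(2 * running - total)
            if diff < best_diff:
                best_diff, split_at = diff, i
        left, right = group[:split_at], group[split_at:]
        for s, _ in left:
            codes[s] += "0"   # one half gets a 0 appended
        for s, _ in right:
            codes[s] += "1"   # the other half gets a 1
        split(left)
        split(right)

    # Work on symbols sorted by descending probability.
    split(sorted(symbols, key=lambda sp: sp[1], reverse=True))
    return codes

# Frequent symbols get short codes, rare symbols get long ones.
freqs = [("e", 0.40), ("t", 0.25), ("a", 0.20), ("q", 0.10), ("z", 0.05)]
print(shannon_fano(freqs))
# {'e': '0', 't': '10', 'a': '110', 'q': '1110', 'z': '1111'}
```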
But what if we don’t have a probability distribution? What if we are looking at a single, static object—like a specific string of DNA or a trained neural network weights file?
In 1963, Andrey Kolmogorov proposed a different approach. Instead of measuring information based on the probability of an event occurring, he was interested in the fundamental complexity of the object itself.
Bridging Information Theory to Kolmogorov Complexity
Kolmogorov Complexity, denoted $K(x)$, is defined as the length of the shortest computer program (in a fixed universal language) that can output the object $x$ and then halt:

$$K(x) = \min_{p \,:\, U(p) = x} |p|$$

where $U$ is a universal Turing machine, $p$ is the program, and $|p|$ is the length of that program in bits.
Unlike Shannon entropy, which requires a population of events to calculate probabilities, Kolmogorov complexity looks at the algorithmic structure of a single object.
- A string of 1,000,000 “A”s has low Kolmogorov complexity, because it’s represented by `print "A" * 1000000`.
- A string of 1,000,000 random bits has high complexity; the shortest program is likely just `print "01101..."`.
There is a mathematical connection between information and complexity. For a random variable $X$ drawn from a computable probability distribution $p$, the expected Kolmogorov complexity approximates the Shannon entropy:

$$\mathbb{E}_{x \sim p}\left[K(x)\right] \approx H(X)$$

(up to an additive constant that depends on the distribution, not on individual outcomes).
This tells us that compression is just a way of refining information. If a model can compress the data (low $K$), it has found a distillation of the original data that generalizes well to the future. And more importantly, understanding means the ability to disregard irrelevant information. When all of the irrelevant information has truly been discarded, you’ve found the essence of the object: its minimal possible representation.
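One way to see the connection numerically (my own toy example; the event probabilities are made up): the code lengths $\lceil -\log_2 p(x) \rceil$ from the Shannon-Fano discussion are achievable descriptions, and their expectation sits within one bit of the entropy.

```python
import math

# Hypothetical distribution over daily "sky events".
p = {"sun rises": 0.97, "clouds": 0.02, "eclipse": 0.01}

entropy = -sum(q * math.log2(q) for q in p.values())
# Expected length of a code that spends ceil(-log2 p) bits on each event.
expected_code_len = sum(q * math.ceil(-math.log2(q)) for q in p.values())

print(round(entropy, 3))            # ~0.222 bits
print(round(expected_code_len, 3))  # ~1.16 bits, within one bit of H(X)
```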
MDL is a Sort of Practical Compromise
Kolmogorov complexity is conveniently incomputable (Li et al., 2008). You can never prove you have found the shortest program. But we do have something called the Minimum Description Length (MDL) principle. It describes an object in two parts: the bits used to encode the model, $L(M)$ (structured, repetitive information), and the bits used to encode the data given the model, $L(D \mid M)$ (residual information).
The applications are strong: Bornschein et al. use MDL for learning causal structures (Bornschein et al., 2021), Rissanen demonstrated its use in model selection years ago (Rissanen, 2007), and Blier et al. use variational codelength (an MDL proxy) to profile a series of deep learning models (Blier et al., 2018). Despite the objections I have to the Blier analysis (or, more specifically, their interpretation), there is little doubt about the role of information in statistical models.
Kolmogorov complexity is the mathematical formalization of Occam’s Razor (on which I have rambled extensively).
- Overfitting: $L(D \mid M)$ is low (zero error), but $L(M)$ is huge (complex model). Total length is high.
- Underfitting: $L(M)$ is low (simple model), but $L(D \mid M)$ is high (high error). Total length is high.
- Optimal Learning: We find the minimum total description length: $\min_{M} \left[ L(M) + L(D \mid M) \right]$.
The last expression is one I’ve spent some time understanding. It’s a two-part code breakdown of the optimal (shortest) representation. It says that the optimal code length for any individual object has two pieces: structured information and unstructured randomness.
This sounds obvious but isn’t. It means the description of a process actually carries two kinds of information. The structured information can be highly compressed: it encodes the regularities in the data, if there are any at all. It’s much more efficient to store than the raw data because each code word can correspond to more than a single source word.
The second term is read “the description length of $D$ given $M$,” meaning the length of the code that remains once the structured redundancies have been compressed away. This conditional information is much more difficult to store. There is no regularity to exploit, and so the smallest representation of the information becomes the information itself.
A good learning algorithm will rely strongly on the former term. Along with being more efficient to store, it contains generalizable information to improve prediction.
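As a toy illustration of that trade-off (my own sketch; the synthetic data, the 32-bits-per-parameter cost, and the Gaussian residual code are all illustrative assumptions), here is a two-part code comparison for polynomial models of increasing degree:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 60)
y = 1.5 * x**2 - 0.5 * x + rng.normal(scale=0.1, size=x.size)  # true model: degree 2

def two_part_code_length(degree, bits_per_param=32):
    """Crude MDL score L(M) + L(D|M) in bits; constants shared by all models are dropped."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    model_bits = bits_per_param * (degree + 1)                    # L(M): cost of the parameters
    data_bits = 0.5 * x.size * np.log2(np.mean(residuals**2))     # L(D|M): Gaussian code for residuals
    return model_bits + data_bits

for degree in (1, 2, 6, 12):
    print(degree, round(two_part_code_length(degree), 1))
# Degree 1 underfits: L(M) is small, L(D|M) is large.
# Degree 12 overfits: L(D|M) shrinks a little, L(M) explodes.
# The total is smallest near the true degree.
```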
Deep Learning and the Information Bottleneck
How does this apply to modern Deep Learning? Through the Information Bottleneck (IB) principle.
In deep learning, we don’t just want to compress data; we want to compress it while keeping only the parts that help us predict a target. If $X$ is our input (an image) and $Y$ is our target (a label), we want a neural network to learn a representation $T$ that solves:

$$\min_{p(t \mid x)} \; I(X; T) - \beta \, I(T; Y)$$
In other words, we want to minimize the unstructured randomness of our description in favor of structured information that generalizes well. This becomes much more than a statistical problem: it becomes a problem of information, firmly in the field of our friends, the information scientists.
Specifically, we maximize the Mutual Information between our representation of an object (an image of a cat) and the object itself (our notion of a cat). This requires two things:
- Break it into pieces: Dissecting every image labeled “cat” to identify the attributes associated with such an object: fuzziness, ears, four legs (usually), and a tail.
- Put back the right pieces: After identifying the relevant attributes of “cat”, everything else is extraneous. The model disregards it in favor of more relevant information to be learned in the future.
This explains why deep networks generalize. They aren’t just memorizing data; they are forcing the input through a bottleneck, shedding irrelevant noise (high entropy) and retaining only the low-complexity patterns that matter.
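Here is a small numerical sketch of that trade-off on a made-up discrete problem (the joint distribution, the two encoders, and $\beta$ are all invented for illustration): a lossy encoder that merges inputs scores better on the bottleneck objective than one that memorizes every input.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in bits from a joint probability table p(a, b)."""
    pa = joint.sum(axis=1, keepdims=True)   # marginal p(a), column vector
    pb = joint.sum(axis=0, keepdims=True)   # marginal p(b), row vector
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

# Toy joint p(x, y): four input "images", two labels.
p_xy = np.array([[0.30, 0.00],
                 [0.25, 0.05],
                 [0.05, 0.25],
                 [0.00, 0.10]])
p_x = p_xy.sum(axis=1)

def ib_objective(encoder, beta=5.0):
    """I(X;T) - beta * I(T;Y) for a stochastic encoder p(t|x)."""
    p_xt = p_x[:, None] * encoder   # p(x, t) = p(x) p(t|x)
    p_ty = encoder.T @ p_xy         # p(t, y) = sum_x p(t|x) p(x, y)
    return mutual_information(p_xt) - beta * mutual_information(p_ty)

# Encoder that merges inputs {0,1} and {2,3} into two clusters: it throws away
# input detail but keeps most of what predicts the label.
lossy = np.array([[1.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 1.0]])
memorize = np.eye(4)   # keeps every input distinct

print(ib_objective(lossy))     # ~-1.56: the better (smaller) objective
print(ib_objective(memorize))  # ~-1.01: pays for detail that doesn't help predict Y
```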
A few years ago, this would’ve seemed counterintuitive: a model with more information should perform better than one with less. But deep learning is, among other things, deeply unintuitive. Models are not magic boxes that contain the history of the world; even they lack perfect memory. The best we can do is learn the signal from the noise. Only one of them is worth learning.
For myself as a researcher, this was a shift. We’ve moved from finding parameters to finding compressed representations in the blink of an eye. Almost without our noticing, deep models went from diagnosing causality to learning the fundamental information content of an object itself. I, for one, am deeply excited to see what it is they’ve learned.
Li, Ming, and Paul M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Third edition, Springer, 2008. Texts in Computer Science.
Rissanen, Jorma. Information and Complexity in Statistical Modeling. Springer New York, 2007. Information Science and Statistics. https://doi.org/10.1007/978-0-387-68812-1.
Bornschein, Jorg, et al. “Prequential MDL for Causal Structure Learning with Neural Networks.” arXiv:2107.05481, arXiv, 2 July 2021. https://doi.org/10.48550/arXiv.2107.05481.
Blier, Léonard, and Yann Ollivier. “The Description Length of Deep Learning Models.” arXiv:1802.07044, arXiv, 1 Nov. 2018. https://doi.org/10.48550/arXiv.1802.07044.