Information Theory and Deep Learning

Kennon Stewart

Reading

Information Theory and Deep Learning

Published: 12/26/2025•Authors: Kennon Stewart

📄 View PDF

Cite this work

Show citation formats

APA

Kennon Stewart (2025). Information Theory and Deep Learning.

BibTeX

@inproceedings{information_bottleneck_2025,
  title={Information Theory and Deep Learning},
  author={Kennon Stewart},
  year={2025},
}

📥 Download BibTeX 📥 Download APA

Information Theory and Deep Learning

Machine learning is increasingly interdisciplinary. Modern systems combine ideas from statistics, computer science, optimization, and information theory. That mix is useful: as models become larger and less transparent, we still need to be able to understand the architecture of these deep learning models. This isn’t just to validate the thinking and reasoning of the model, but to even discover new ways to approach problem-solving.

The information bottleneck principle helps us do this. The authors’ core argument is that the strength of the deep learning model is its ability to find the combination of provided (interpretable) parameters, and performing a series of linear transformations to help it reason through problems [Alemi_Poole_2023].

The key here is in the transformations. The learning algorithm is designed to find the absolute most relevant pieces of information from an otherwise noisy training set and discard the irrelevant noise. Doing so not only causes the model to reason well on a training set, but for environments and problems it has never encountered. This is called generalization.

Tishby and colleagues helped popularize this perspective in deep learning through the Information Bottleneck (IB) framework. The key idea is not just to represent data, but to represent it selectively—preserving predictive structure while compressing irrelevant variation.

The Smallest Unit of Information

Entropy is foundational in information theory. For a discrete random variable $X$ with outcomes $x_i$ and probabilities $p_i$ , Shannon entropy is

$H(X) = -\sum_{i} p_i \log_2(p_i).$

It measures expected surprisal: common events contribute less, rare events contribute more. This quantity determines lower bounds for average code length in lossless compression and formalizes the relationship between uncertainty and coding efficiency.

Coding schemes such as Shannon-Fano or Huffman coding operationalize this idea by assigning shorter code words to more probable symbols.

Ways of Describing Information and Complexity.

There are a few ways of quantitatively measuring the “information content” of a random variable, as well as its relative “complexity”.

The information (amount of information obtained of the process when we draw a sample) is indicated by the Shannon Entropy.

The Kolmogorov Complexity is a high-level theoretical concept that is actually intractable to compute directly. It describes the shortest program (on a fixed universal machine) that outputs a specific object $x$ .

The Minimum Description Length is the actual computational form of $K(x),$ and it describes the shortest program (on a fixed universal machine) that outputs a specific object $x$ .

Kolmogorov complexity is defined as $K(x) = \min_{p:U(p)=x}|p|,$

where $U$ is a universal Turing machine. Unlike Shannon entropy, this is about an individual object rather than an ensemble distribution.

A useful (but easy to overstate) connection is that for computable sources, expected algorithmic complexity can track Shannon-style uncertainty up to additive terms. In practice, this motivates compression-based reasoning, but it does not make Shannon entropy and Kolmogorov complexity interchangeable.

MDL as a Practical Compromise

Kolmogorov complexity is not computable in general ([Li_Vitányi_2008] break this down really well), so practical learning uses approximations. MDL frames model selection as minimizing total description length: $\min_M [L(M) + L(D|M)],$ where:

$L(M)$ is the description length of the model,
$L(D|M)$ is the description length of data not captured by the model.

This decomposition gives an intuitive bias-variance style tradeoff:

Overfitting: tiny $L(D|M)$ , large $L(M)$
Underfitting: small $L(M)$ , large $L(D|M)$
Good generalization: balanced total codelength

MDL-based reasoning has been used in model selection and in recent causal/representation learning work [Rissanen_2007][Blier_Ollivier_2018][Bornschein_Chiappa_Malek_Ke_2021].

Deep Learning and the Information Bottleneck

In supervised learning, let $X$ be input, $Y$ the target, and $Z$ an internal representation. The IB objective is often written as:

$\min_Z [I(X;Z) - \beta I(Z;Y)].$

Intuitively:

the $I(X;Z)$ term encourages compression of the input,
the $-\beta I(Z;Y)$ term rewards target-relevant information.

Because the objective is minimized, maximizing $I(Z;Y)$ is implemented by making the negative term $-\beta I(Z;Y)$ as small as possible (i.e., more negative), which pushes representations to retain predictive content.

This formalizes the idea that useful representations discard nuisance variation while retaining target-relevant structure.

A practical way to interpret this: representation learning is selective compression. The goal is not “more information at all costs,” but the right information for the prediction task.

References

Alemi, Alexander A., Poole, Ben (2023). Variational Prediction. arXiv. https://doi.org/10.48550/arXiv.2307.07568
Li, Ming, Vitányi, P M B. (2008). An introduction to Kolmogorov complexity and its applications. Springer.
Rissanen, Jorma (2007). Information and Complexity in Statistical Modeling. Springer New York. https://doi.org/10.1007/978-0-387-68812-1
Blier, Léonard, Ollivier, Yann (2018). The Description Length of Deep Learning Models. arXiv. https://doi.org/10.48550/arXiv.1802.07044
Bornschein, Jorg, Chiappa, Silvia, Malek, Alan, Ke, Rosemary Nan (2021). Prequential MDL for Causal Structure Learning with Neural Networks. arXiv. https://doi.org/10.48550/arXiv.2107.05481