Information States as Minimal Representations.
Kennon Stewart (2025). Information States as Minimal Representations.

@inproceedings{information_states_2025,
  title={Information States as Minimal Representations},
  author={Kennon Stewart},
  year={2025},
}

Motivation
A quiet assumption baked into reinforcement learning and control theory is that some parts of the past can be forgotten. Or more precisely: that there exists a representation of history such that, once learned, the rest of the past no longer matters in sequential decision-making. This representation of history, called a state, is only sufficient when it enables such perfect decision-making in the absence of perfect knowledge.
In classical MDPs, this role is played by the state. In POMDPs, by the belief state. And in modern learning systems, by something far messier: buffers or embeddings or even a corpus of natural language traces.
What follows is what we've learned from the literature in an attempt to make these sufficiency conditions more concrete. We rely on classical control-theory fundamentals like filtering while referencing newer constructions like the approximate information state.
The goal is not to restate the textbook definition of an information state, but to reinterpret it as a minimal representation problem, or a problem of compression under decision-theoretic constraints. Along the way, we will read Sinha & Mahajan carefully, quote them directly, and explain the direct applications to the forefront of reinforcement learning.
The Core Problem: Growing Histories Used to Describe Stable States
In a partially observed system, the agent does not see the true environment state. What it sees instead is a growing history of observations and actions.
A key conceptual challenge for POMDPs is that the data available at the agent---the history of observations and actions---is increasing with time. The standard approach is to compress this increasing data into a finite-dimensional statistic known as the belief state, which is the posterior density of the unobserved state conditioned on the history of observations and actions and, therefore, may be viewed as a generalization of non-linear filtering of controlled processes.
This sentence is doing more work than it appears to.
The problem is not partial observability per se. It is unbounded memory growth. If the agent were allowed to retain the full history forever, optimal control would be trivial in principle (if not in practice).
Every notion of "state" in control theory is, at bottom, an answer to the same question:
What can we throw away without hurting optimal decision-making?
The belief state is the canonical answer---but it comes with two fatal flaws in modern settings:
- It requires a known generative model.
- It is computationally and representationally expensive.
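To make the second flaw concrete, here is the update the classical route requires: a minimal Bayes-filter sketch for a finite POMDP. It only works because the model tensors `P` and `O` are assumed known, which is exactly the first flaw. All names here are illustrative, not from the paper.

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """One Bayes-filter step for a finite POMDP.

    b: belief over states, shape (S,)
    a: action index
    o: observation index
    P: transition tensor, P[a, s, s'] = Pr(s' | s, a)
    O: observation tensor, O[a, s', o] = Pr(o | s', a)
    """
    predicted = b @ P[a]               # predict: Pr(s' | history, a)
    unnorm = predicted * O[a][:, o]    # correct: weight by observation likelihood
    return unnorm / unnorm.sum()       # renormalize to a probability vector

# Tiny 2-state, 1-action example: the update needs the full model (P, O).
P = np.array([[[0.9, 0.1], [0.2, 0.8]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])
b = np.array([0.5, 0.5])
b1 = belief_update(b, a=0, o=0, P=P, O=O)
```

Even in this toy case, the belief is a point in a continuous simplex, and every step touches the whole model; that cost is what the agent-state route below avoids.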
Sinha and Mahajan's contribution is to relax the belief-state requirement while preserving the logic of dynamic programming.
Agent States: History Compression Without a Model
Instead of assuming a belief state, Sinha introduces a more general object: the agent state.
We now describe the agent-state approach where it is assumed that instead of using the entire history of observations and actions to make a decision the agent maintains a local state (which we will refer to as the agent state and denote by $z_t$) and makes a decision as a function of the agent state. The agent starts with an initial state $z_1$ and recursively updates it as follows:

$z_{t+1} = \phi(z_t, Y_{t+1}, A_t).$
This is the conceptual pivot of the paper.
The agent state is not required to be optimal, sufficient, or even Markov. It is simply a recursively updatable compression of history:

$z_{t+1} = \phi(z_t, y_{t+1}, a_t).$
In modern terms: replay buffers, RNN hidden states, evidence graphs, optimizer memory, and cache summaries all qualify.
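All of these share one interface: an initial value plus a recursive update. A deliberately crude sketch, assuming nothing from the paper beyond that interface (the class name and window policy are mine):

```python
from collections import deque

class SlidingWindowAgentState:
    """A deliberately simple agent state: keep only the last k
    (observation, action) pairs. Recursively updatable, bounded
    memory, no model required -- and no sufficiency guarantee."""

    def __init__(self, k):
        self.window = deque(maxlen=k)   # z_1: the empty window

    def update(self, obs, action):
        # z_{t+1} = phi(z_t, y_{t+1}, a_t): append, dropping the oldest pair
        self.window.append((obs, action))

    def value(self):
        return tuple(self.window)

z = SlidingWindowAgentState(k=2)
for step in [("o1", "a1"), ("o2", "a2"), ("o3", "a3")]:
    z.update(*step)
# Only the last two steps survive the compression.
```

Whether this particular compression hurts decision-making is exactly the question the information-state conditions below are designed to answer.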
The key move is philosophical as much as technical: state is no longer given by the environment---it is chosen by the agent designer.
At this level, "agent state" is a permissive concept. Almost anything qualifies. But permissiveness alone does not recover optimality.
For that, we need information states.
Information States: Sufficiency, Not Memory
Sinha defines an information state via two precise sufficiency conditions.
The agent state $z_t$ is an information state if it satisfies the following two properties:
(P1) Sufficient for performance evaluation. There exists a function $\hat r : \mathcal{Z} \times \mathcal{A} \to \mathbb{R}$ such that, for any $t$, $h_t$, and $a_t$,
$\mathbb{E}[R_t \mid H_t = h_t, A_t = a_t] = \hat r(z_t, a_t).$
(P2) Sufficient for predicting itself. There exists a controlled transition probability matrix $\hat P$ such that, for any $t$, $h_t$, and $a_t$,
$\mathbb{P}(Z_{t+1} = z' \mid H_t = h_t, A_t = a_t) = \hat P(z' \mid z_t, a_t).$
This is the precise definition we should use, because it ties "state" directly to control.
Sinha and Mahajan are not saying that $z_t$ is a good summary because it is small, low-entropy, or elegant. They are saying it is good when two concrete substitutions are valid:
- Replace the full history $h_t$ with $z_t$ when evaluating the expected one-step reward.
- Replace the full history $h_t$ with $z_t$ when propagating next-step state dynamics.
In other words, sufficiency here is operational: if these equalities hold, dynamic programming over $z_t$ is behaviorally faithful to the original history process.
The test is not "did we compress?" but "did we preserve decision-relevant prediction?" If both properties hold, then dynamic programming goes through cleanly.
Suppose the agent state $z_t$ is an information state, i.e., satisfies properties (P1) and (P2) with some $(\hat r, \hat P)$. Define the dynamic program:

$V(z) = \max_{a \in \mathcal{A}} \Big[ \hat r(z, a) + \gamma \sum_{z' \in \mathcal{Z}} \hat P(z' \mid z, a) V(z') \Big].$

Then there is no loss of optimality in restricting attention to policies of the form $a_t = \pi(z_t)$; equivalently, the resulting information-state policy is optimal.
This is the first real payoff of the framework.
The theorem says that once $z_t$ satisfies Sinha-Mahajan sufficiency, history dependence is no longer needed for optimal control. You can optimize in the compressed state space and recover an optimal policy for the original problem.
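Under (P1) and (P2), the compressed problem is a standard finite MDP, so off-the-shelf value iteration applies. A minimal sketch, assuming a finite information-state space and discounted rewards; the function name and array layout are mine, not the paper's:

```python
import numpy as np

def info_state_value_iteration(r_hat, P_hat, gamma=0.9, tol=1e-8):
    """Value iteration over an information-state space.

    r_hat: array (Z, A), r_hat[z, a] = expected one-step reward   (P1)
    P_hat: array (A, Z, Z), P_hat[a, z, z'] = Pr(z' | z, a)       (P2)
    Returns V (Z,) and a greedy policy (Z,) defined on z alone --
    no history dependence, as the theorem guarantees.
    """
    Z, A = r_hat.shape
    V = np.zeros(Z)
    while True:
        # Q[z, a] = r_hat[z, a] + gamma * sum_z' P_hat[a, z, z'] V[z']
        Q = r_hat + gamma * np.einsum("azw,w->za", P_hat, V)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Sanity check: with self-loop dynamics, V(z) = r / (1 - gamma).
r_hat = np.array([[1.0, 0.0], [0.0, 1.0]])
P_hat = np.stack([np.eye(2), np.eye(2)])
V, pi = info_state_value_iteration(r_hat, P_hat)
```

The point of the sketch is where the difficulty is not: the loop above is routine once $(\hat r, \hat P)$ exist.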
Put differently: this is not "representation learning because compression feels elegant." It is a formal equivalence result between history-based control and state-based control under (P1) and (P2).
It also draws a clean boundary for engineers putting it into practice: the hard part is not solving the DP over $z$; the hard part is constructing a $z_t$ that reaches sufficiency.
Approximate Information States: Controlled Lossy Compression
Most agent states in practice fail (P1) and (P2). Sinha's response is not to abandon them, but to bound the damage.
Let $\hat z_t = \hat\sigma_t(h_t)$ with functions $\hat r$ and $\hat P$, where $\varepsilon, \delta \ge 0$, satisfy:
(AP1) Approximately sufficient for performance evaluation:
$\big| \mathbb{E}[R_t \mid H_t = h_t, A_t = a_t] - \hat r(\hat z_t, a_t) \big| \le \varepsilon.$
(AP2) Approximately sufficient for predicting itself:
$d\big( \mathbb{P}(\hat Z_{t+1} \in \cdot \mid H_t = h_t, A_t = a_t),\, \hat P(\cdot \mid \hat z_t, a_t) \big) \le \delta,$
where $d$ is a metric on $\Delta(\hat{\mathcal{Z}})$.
Then the tuple $(\hat\sigma_t, \hat r, \hat P)$ is called an $(\varepsilon, \delta)$-approximate information state (AIS).
The practical value of the approximate information state is that it sidesteps an otherwise infeasible requirement: perfect knowledge of the dynamics of a nonstationary system. The canonical belief state lives in a continuous, often infinite, belief space that is difficult to maintain without storing the system dynamics explicitly, action by action and state by state. A well-defined notion of approximation is an appropriate, and far more efficient, substitute.
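One way to act on this is to estimate the (AP1)/(AP2) errors empirically from logged transitions. A sketch under a simplifying assumption: it conditions on the compressed state rather than the full history, which is itself an approximation, and every name here is mine rather than the paper's.

```python
import numpy as np
from collections import defaultdict

def estimate_ais_errors(transitions, r_hat, P_hat, n_z):
    """Estimate the AIS constants (epsilon, delta) from logged data.

    transitions: iterable of (z, a, reward, z_next) tuples, where z is
        already the compressed state (conditioning on z instead of the
        full history h is the simplifying assumption noted above)
    r_hat: dict (z, a) -> predicted expected reward        (AP1 candidate)
    P_hat: dict (z, a) -> length-n_z distribution over z'  (AP2 candidate)
    Returns worst-case empirical errors, with delta measured in
    total-variation distance.
    """
    rewards = defaultdict(list)
    counts = defaultdict(lambda: np.zeros(n_z))
    for z, a, r, z_next in transitions:
        rewards[(z, a)].append(r)
        counts[(z, a)][z_next] += 1

    eps = max(abs(np.mean(rs) - r_hat[k]) for k, rs in rewards.items())
    delta = max(
        0.5 * np.abs(c / c.sum() - P_hat[k]).sum()   # total variation
        for k, c in counts.items()
    )
    return eps, delta

# Toy log where the candidate (r_hat, P_hat) is exactly right.
transitions = [(0, 0, 1.0, 1), (0, 0, 1.0, 1), (1, 0, 0.0, 0)]
r_hat = {(0, 0): 1.0, (1, 0): 0.0}
P_hat = {(0, 0): np.array([0.0, 1.0]), (1, 0): np.array([1.0, 0.0])}
eps, delta = estimate_ais_errors(transitions, r_hat, P_hat, n_z=2)
```

This turns the definition into a measurement: two candidate representations can be compared by the $(\varepsilon, \delta)$ they achieve on the same logs.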
Suppose one design updates a compact latent state $\hat z_t$, while another stores a running natural-language trace. The AIS lens asks a concrete question: which representation yields smaller $(\varepsilon, \delta)$ for the same task and compute budget?
In many settings, latent states better preserve transition-relevant structure (smaller $\delta$), while language traces improve auditability but expand memory and can degrade control fidelity if used as the sole control state.
The engineering answer is usually a mix of the two: an embedded latent state for live inference, with batch runs that convert model inference into interpretable natural language. This is known as a Kappa architecture, which separates the real-time streaming inference that users see from the less urgent interpretability workloads.
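A minimal sketch of that split, assuming nothing beyond the description above (class and method names are illustrative, not from any particular framework):

```python
import numpy as np

class KappaAgentMemory:
    """Sketch of the split described above: a compact latent state on
    the hot path, with raw events logged for batch jobs that produce
    human-readable summaries later."""

    def __init__(self, dim):
        self.z = np.zeros(dim)   # live control state (the AIS candidate)
        self.log = []            # append-only event log for batch jobs

    def observe(self, obs, action):
        # Hot path: O(dim) recursive update, no text processing.
        self.z = np.tanh(0.9 * self.z + np.asarray(obs))
        self.log.append((obs, action))

    def batch_summary(self):
        # Cold path: run offline, as often as interpretability demands.
        return [f"step {t}: action={a!r}" for t, (_, a) in enumerate(self.log)]

m = KappaAgentMemory(dim=2)
m.observe([1.0, 0.0], "accelerate")
m.observe([0.0, 1.0], "brake")
```

The control loop only ever reads `m.z`; the trace exists for humans, on a slower clock.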
I would even argue that this is the implicit basis of Newton and quasi-Newton optimization methods. The methods in this family differ mainly in how they represent and invert an otherwise expensive Hessian.
Optimization Algorithms as Agent States
Newton's method, BFGS, and L-BFGS all maintain internal state across iterations. But precision matters. Nocedal showed that the reduced precision of limited-memory BFGS incurs a controllable loss in accuracy, and Mokhtari and Ribeiro do a similar job for online variants. And this is exactly the approximation of an otherwise intractable information state.
But there is also the converse of approximation, which is refinement. Under this framing, the auxiliary state of the Newton optimizer, the Hessian itself, corresponds to a high-complexity agent state. The low-memory L-BFGS corresponds to a compressed agent state. And online variants correspond to approximate information states, where we work under the assumption that a nonstationary policy may perform better than a stationary one.
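The compressed-state reading of L-BFGS can be made concrete. The standard two-loop recursion computes a quasi-Newton direction from a bounded window of curvature pairs rather than an explicit n-by-n Hessian; this sketch follows the textbook form, with names of my choosing:

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: compute the L-BFGS step direction from a
    bounded memory of curvature pairs (s_k = x_{k+1} - x_k,
    y_k = g_{k+1} - g_k). The (s, y) window is the compressed agent
    state in the reading above; the full Hessian is never stored.
    """
    q = grad.copy()
    stack = []
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        stack.append((rho, alpha, s, y))
    if s_list:                           # scale by an initial Hessian guess
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for rho, alpha, s, y in reversed(stack):
        beta = rho * (y @ q)
        q += (alpha - beta) * s
    return -q                            # descent direction -H_inv @ grad

# On f(x) = x.T @ x (Hessian 2I), one curvature pair recovers the
# exact Newton step -grad / 2.
d = lbfgs_direction(np.array([2.0, 4.0]),
                    [np.array([1.0, 0.0])],
                    [np.array([2.0, 0.0])])
```

Memory is O(m * n) for m stored pairs instead of O(n^2) for the Hessian: a lossy compression with a quantifiable cost, which is the AIS tradeoff in miniature.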
This is not a claim that these methods optimize MDL; reading them that way would be a mistake. I claim instead that they can be reinterpreted as navigating a tradeoff between the quality of memory, prediction, and decision-making.
Information States as Minimal Representations
I'm willing to admit at this point that my disagreement with the framework is narrow. But it is urgent in domains where precision comes at a cost.
Sinha and Mahajan give a rigorous behavioral definition of sufficiency: if two representations induce the same control-relevant reward and transition predictions (exactly or approximately), they are equally valid for dynamic programming. That is a strong theorem and a valuable one.
But it leaves an engineering gap. Two agent states can be equally sufficient in the SinhaβMahajan sense while having very different content, memory footprint, latency, interpretability, and deployment cost. In practice, teams do not choose state representations in a vacuum---they choose under compute budgets, hardware limits, safety review requirements, and maintenance constraints.
So the open problem is not whether Theorem 1 is correct. It is whether control-relevance alone is the right endpoint for representation design.
My claim is that it is not.
A useful next definition of sufficiency should account for both control performance and codelength. Such a definition would mirror the notion of a minimal sufficient statistic in probability theory, the natural benchmark for probabilistic modeling.
This is a direction to watch, particularly as agents are deployed to memory-constrained devices where large agent states are infeasible. It becomes a barrier to a truly smart, reactive city when the models embedded in buses and trains add lag to the user experience.
It becomes imperative that we optimize on more than just control.
Concretely, that suggests adding a second objective to the AIS lens: among representations with comparable , prefer the one with lower descriptive complexity and implementation burden. In upcoming work, I will connect this to MDL/Kolmogorov-style language, where representation quality is judged not only by control fidelity but also by codelength.
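To illustrate the kind of selection rule I have in mind (this scoring function is my own sketch, not anything from Sinha and Mahajan; the bound shape and constants are placeholders):

```python
def representation_score(epsilon, delta, codelength_bits,
                         horizon=100, span=1.0, lam=0.01):
    """Illustrative two-objective score: combine an AIS-style value-loss
    term (which grows with epsilon and delta over the horizon) with a
    codelength penalty in bits. `lam` trades control fidelity against
    description length; all constants here are placeholders."""
    value_loss = horizon * (epsilon + span * delta)   # crude bound shape
    return value_loss + lam * codelength_bits

# A compact latent state vs. a verbose language trace: if both achieve
# similar (epsilon, delta), the shorter description wins.
latent = representation_score(0.05, 0.02, codelength_bits=4_096)
trace = representation_score(0.05, 0.02, codelength_bits=400_000)
```

Nothing here is a theorem; it is only the shape an MDL-flavored sufficiency criterion might take once the codelength term is made precise.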
In that sense, information states are not just filters over history. They are design decisions tailored to the agent's environment and the problem at hand. As such, we would do well to define our constraints appropriately.