A transformer with a genuine time axis. Standard transformers are stateless between forward passes — they re-derive everything from the context window each call. This project injects a fixed, randomly-initialized reservoir into a pretrained transformer's mid-layer attention so its state evolves across passes, accumulating a history of the model's own attention dynamics. This session is a feasibility + dynamics study at small scale: can the injection be done without breaking the base model, and what reservoir regime turns the carried state into usable signal rather than noise?
The reservoir is driven through a fixed input projection W_in and writes its
state back through a learned readout, both at the same layer, every pass, so
state accumulates across passes (“implicit elapsed time”). Only the transformer’s
fine-tune changes; reservoir weights are random and fixed, and the best of N seeds is kept.
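As a rough illustration of state accumulating across passes, the sketch below uses the textbook echo state update r(t+1) = tanh(W r(t) + W_in u(t)). The update rule, the `Reservoir` class, the sizes, and the idea of summarising each pass as one input vector `u` are assumptions made for illustration; the project's actual injection code may differ.

```python
import numpy as np


class Reservoir:
    """Fixed random reservoir whose state persists across forward passes (sketch)."""

    def __init__(self, n_units, n_inputs, spectral_radius=0.9, seed=0):
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((n_units, n_units))
        # Rescale so the largest |eigenvalue| equals the requested spectral radius.
        W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
        self.W = W  # recurrent weights: random, fixed, never trained
        # Fixed input projection (assumed Gaussian, scaled by 1/sqrt(n_inputs)).
        self.W_in = rng.standard_normal((n_units, n_inputs)) / np.sqrt(n_inputs)
        self.state = np.zeros(n_units)  # r(t), carried from one pass to the next

    def step(self, u):
        """One forward pass: fold this pass's input u into the carried state."""
        self.state = np.tanh(self.W @ self.state + self.W_in @ u)
        return self.state


# Three independent "passes" with random stand-in inputs (hypothetical data).
res = Reservoir(n_units=200, n_inputs=64)
rng = np.random.default_rng(1)
states = [res.step(rng.standard_normal(64)).copy() for _ in range(3)]
```

Because `res.state` is never reset, each call to `step` depends on every earlier pass, which is the sense in which the state encodes elapsed passes.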
The question is whether a fixed random reservoir, injected into a pretrained transformer's mid-layer attention, can give it real cross-pass state without breaking the base model, and which reservoir-dynamics regime makes that state useful.
It relocates the proven reservoir-computing recipe (fix the recurrent weights,
train only a readout — Jaeger’s echo state networks, Maass’s
liquid state machines) into a pretrained transformer, and targets a gap the
expressivity literature makes precise: a finite-precision transformer is bounded
to TC⁰/FO(M) per forward pass, while state carried across
passes is the documented lever past that ceiling. The recurrence-augmented
architectures surveyed (Transformer-XL, RMT, Block-Recurrent, Mamba, Titans…) use
trained recurrence carrying state within a sequence; none uses a
fixed-random reservoir with state across independent passes.
Full survey: literature/REVIEW.md.
In progress. First result: the reservoir's echo state property breaks sharply at spectral radius ρ ≈ 1 (the edge-of-chaos boundary) in the autonomous regime — see Findings. Next: model surgery (H1 non-destruction) and the sweep on real GPT-2 attention streams.
At the neuron level, the mechanism is small: the reservoir nodes simply join the attention layer's key/value sequence. The same attention that moves information between token residual streams now also reads from and writes to the reservoir — via a fixed projection in and a learned readout out — and unlike the token streams, the reservoir state carries across forward passes.
The reservoir nodes join the key/value sequence (read via the fixed
W_in, written via the learned W_out), and the
reservoir state, unlike the token streams, persists across passes.
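One way to picture reservoir nodes joining the key/value sequence is the single-head sketch below. Everything specific in it is an assumption for illustration: the function name `attend_with_reservoir`, separate readout matrices `W_out_k` and `W_out_v`, the number of reservoir slots, and the choice to let every token see the reservoir slots while tokens stay causally masked.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attend_with_reservoir(q, k, v, r, W_out_k, W_out_v, n_slots):
    """Single-head attention whose key/value sequence is extended by reservoir slots.

    q, k, v : (T, d) token queries, keys, values
    r       : (N,)   reservoir state carried over from earlier passes
    W_out_k, W_out_v : (n_slots * d, N) readouts (learned in the real system)
    """
    T, d = q.shape
    res_k = (W_out_k @ r).reshape(n_slots, d)
    res_v = (W_out_v @ r).reshape(n_slots, d)
    k_all = np.concatenate([res_k, k], axis=0)  # (n_slots + T, d)
    v_all = np.concatenate([res_v, v], axis=0)
    scores = q @ k_all.T / np.sqrt(d)           # (T, n_slots + T)
    # Reservoir slots are visible to every token; tokens keep a causal mask.
    mask = np.concatenate(
        [np.ones((T, n_slots), dtype=bool), np.tril(np.ones((T, T), dtype=bool))],
        axis=1,
    )
    weights = softmax(np.where(mask, scores, -np.inf))
    return weights @ v_all, weights


# Random stand-in tensors (hypothetical shapes, not GPT-2's).
rng = np.random.default_rng(0)
T, d, N, M = 5, 8, 32, 2
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
r = np.tanh(rng.standard_normal(N))
W_out_k = rng.standard_normal((M * d, N)) / np.sqrt(N)
W_out_v = rng.standard_normal((M * d, N)) / np.sqrt(N)
out, w = attend_with_reservoir(q, k, v, r, W_out_k, W_out_v, M)
```

The first `M` columns of `w` show how much each token attends to the reservoir; a summary of the layer's output could then be fed back through the fixed W_in to update `r` for the next pass.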
The architecture implies a different execution model from standard inference. A
stateless transformer is a request–response handler; the Reservoir Agent is a
persistent, always-alive process whose reservoir state and context buffer are owned
by the runtime and never wiped between passes. A scheduler decides when to run a
forward pass — prompted (new input arrives) or unprompted (an
idle timer fires) — and an output gate decides whether to emit or stay silent.
Forking a standard agent harness into this shape is long-horizon work (tracked in
todo.md); this session targets the model surgery and dynamics beneath it.
The reservoir state (r(t), pinned in
GPU memory, mutated in place each pass) and a never-wiped context buffer persist
across passes. Unprompted passes let the agent keep processing with no new input;
the output gate emits only when confident, otherwise it updates state and schedules
the next pass. (Aspirational / compute-gated; see todo.md.)
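The execution model described above can be sketched as a small tick-driven loop. This is a toy under stated assumptions, not the planned harness: `PersistentAgent`, `toy_forward`, the tick-based idle timer, and the scalar standing in for the reservoir state are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class PersistentAgent:
    # Hypothetical model call: (state, context, text) -> (state, reply, confidence)
    forward: Callable
    idle_ticks: int = 3           # quiet ticks before an unprompted pass fires
    emit_threshold: float = 0.5   # output gate: emit only at or above this confidence
    state: float = 0.0            # stands in for reservoir state r(t); never reset
    context: list = field(default_factory=list)  # never wiped between passes
    quiet: int = field(default=0, init=False)

    def tick(self, incoming: Optional[str] = None) -> Optional[str]:
        """Advance one scheduler tick; return emitted text, or None if silent."""
        if incoming is not None:          # prompted pass: new input arrived
            self.context.append(incoming)
            self.quiet = 0
        else:
            self.quiet += 1
            if self.quiet < self.idle_ticks:
                return None               # no input and timer not yet fired: no pass
            self.quiet = 0                # idle timer fired: unprompted pass
        self.state, reply, confidence = self.forward(self.state, self.context, incoming)
        if confidence >= self.emit_threshold:
            self.context.append(reply)
            return reply
        return None                       # stay silent; state was still updated


def toy_forward(state, context, text):
    """Stand-in model: state evolves every pass; confident only when prompted."""
    confidence = 1.0 if text is not None else 0.0
    return 0.5 * state + 1.0, f"reply#{len(context)}", confidence


agent = PersistentAgent(forward=toy_forward)
outputs = [agent.tick(x) for x in ["hi", None, None, None, "again"]]
print(outputs)  # ['reply#1', None, None, None, 'reply#3']
```

The fourth tick runs an unprompted pass: nothing is emitted, yet `agent.state` still advances, so the later prompted pass starts from a different state than it would in a stateless handler.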
In progress. The full write-up lives in FINDINGS.md and is built into a typeset
PDF on every push. It states the
question, architecture, literature grounding, and method now, and reports experimental
results — the reservoir-dynamics characterization and the H1 non-destruction
regression — as they land. No result is claimed here until it has been measured.
First result (synthetic-input dynamics sweep). Driving a 200-unit reservoir across spectral radius ρ ∈ [0.1, 2.0], the echo state property — the reservoir forgetting its initial condition — holds cleanly for ρ < 1 and breaks sharply at ρ ≈ 1 (gold curve), exactly the edge-of-chaos boundary the classical theory predicts, now measured in this injection-oriented setup. Saturation and effective dimensionality rise smoothly with ρ. A nuance worth flagging: under unit-scale input drive the reservoir forgets its initial state across all ρ (strong input enforces the ESP), so the ρ ≈ 1 boundary is the regime that matters for unprompted, input-free passes — precisely where the agent runs on reservoir state alone.
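A minimal way to probe the echo state property is to run two copies of the same reservoir from different initial states under identical drive and check whether they converge. The sketch below is not `scripts/run.py`: the function names, the tanh update, the scalar Gaussian input, the step count, and the normalised distance used as the gap are all assumptions for illustration.

```python
import numpy as np


def scaled_reservoir(n_units, rho, rng):
    """Random recurrent matrix rescaled to spectral radius rho."""
    W = rng.standard_normal((n_units, n_units))
    return W * (rho / np.max(np.abs(np.linalg.eigvals(W))))


def esp_gap(rho, n_units=200, n_steps=300, input_scale=0.0, seed=0):
    """Final distance between two trajectories that differ only in initial state.

    input_scale=0.0 gives the autonomous (input-free) regime; a gap near zero
    means the reservoir has forgotten its initial condition.
    """
    rng = np.random.default_rng(seed)
    W = scaled_reservoir(n_units, rho, rng)
    w_in = rng.standard_normal(n_units)  # fixed input weights for a scalar input
    a = rng.uniform(-1.0, 1.0, n_units)
    b = rng.uniform(-1.0, 1.0, n_units)
    for _ in range(n_steps):
        drive = input_scale * rng.standard_normal() * w_in  # same drive for both copies
        a = np.tanh(W @ a + drive)
        b = np.tanh(W @ b + drive)
    return float(np.linalg.norm(a - b) / np.sqrt(n_units))


for rho in (0.3, 0.9, 1.1, 2.0):
    auto = esp_gap(rho)
    driven = esp_gap(rho, input_scale=1.0)
    print(f"rho={rho:.1f}  autonomous gap={auto:.2e}  driven gap={driven:.2e}")
```

Comparing the autonomous and driven columns is one way to see the nuance noted above, namely that input drive can mask the ρ ≈ 1 boundary; the exact numbers depend on the seed and on these assumed settings.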
Reproduce with: python scripts/run.py sweep