Gated DeltaNet built from first principles, from linear attention to Mamba
A tutorial article builds up the full theoretical lineage — linear attention, state-space models, Mamba, and the delta rule — to explain how Gated DeltaNet (Yang, Kautz & Hatamizadeh, 2024; ICLR 2025) works from scratch.
Score breakdown
The tutorial makes the architecture of Gated DeltaNet fully derivable from first principles by tracing the exact theoretical lineage through linear attention, SSMs, and Mamba, rather than presenting it as a black-box model.
- 01Gated DeltaNet is described as a precise merger of Mamba2 and DeltaNet, presented at ICLR 2025 (Yang, Kautz & Hatamizadeh, 2024).
- 02The tutorial's unifying mental model: all architectures in this family maintain a matrix memory `St ∈ ℝdv×dk` and differ only in their write rule.
- 03Standard Transformer attention costs O(L²) in training and O(t) memory per inference step; the goal is O(1) memory and compute per generated token.
The article opens by framing the central problem: standard Transformer attention scales quadratically in training (`O(L²)` in sequence length) and requires a KV cache that grows linearly in memory at inference. The goal of the entire sub-quadratic model family — linear attention, state-space models, Mamba, DeltaNet, and Gated DeltaNet — is to behave like an RNN at inference time, maintaining a fixed-size state that gives `O(1)` memory and compute per generated token, while remaining parallelizable during training.
The tutorial introduces a unifying mental model: every architecture in this family maintains a matrix-valued memory `St ∈ ℝdv×dk` and differs only in its write rule.
The tutorial introduces a unifying mental model: every architecture in this family maintains a matrix-valued memory `St ∈ ℝdv×dk` and differs only in its write rule. Linear attention (Katharopoulos et al., 2020) defines the recurrence `St = St−1 + vt kt⊤`, an additive rank-1 outer product update, which the article connects to classical linear associative memory ("fast weights," in Schmidhuber's 1990s terminology). It then identifies two failure modes of this purely additive scheme: memory collision once the number of stored key–value pairs exceeds `dk`, and no recency control, causing degradation on long sequences. These two failures map directly onto the two parent architectures of Gated DeltaNet — gating (multiply `St−1` by a decay before adding, as in Mamba/Mamba2, GLA, and RetNet) and the delta rule (erase the old value at a key before writing the new one, as in DeltaNet).
The Mamba prerequisite section traces the lineage from continuous state-space models through discretization via zero-order hold, arriving at the discrete recurrence `ht = Āht−1 + B̄ut`. It explains that `Ā = exp(ΔA)` encodes exponential decay and that the step size `Δ` acts as a knob between ignoring a token and resetting on it. S4 (Gu et al., 2021) used fixed `A, B, C, Δ` per channel, making the recurrence equivalent to a convolution — fast but unable to treat tokens differently. The source text is truncated before the tutorial reaches the selective (input-dependent) Mamba stage and the full assembly of Gated DeltaNet.
Key facts
- 01Gated DeltaNet is described as a precise merger of Mamba2 and DeltaNet, presented at ICLR 2025 (Yang, Kautz & Hatamizadeh, 2024).
- 02The tutorial's unifying mental model: all architectures in this family maintain a matrix memory `St ∈ ℝdv×dk` and differ only in their write rule.
- 03Standard Transformer attention costs O(L²) in training and O(t) memory per inference step; the goal is O(1) memory and compute per generated token.
- 04Vanilla linear attention uses a purely additive update `St = St−1 + vt kt⊤`, which causes memory interference and prevents recency control.
- 05Gating (multiplying `St−1` by a decay) is the Mamba/Mamba2 repair; the delta rule (erasing the old value before writing) is the DeltaNet repair.
- 06S4 (Gu et al., 2021) used fixed A, B, C, Δ per channel, making the recurrence equivalent to a convolution but treating every token identically.
- 07The discretization step size Δ acts as a knob: small Δ leaves the state unchanged, large Δ makes the state dominated by the current input.
Topics
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 11, 2026 · 08:34 UTC. How this works →