PR #182

open

Non-record: Linearized Neural Memory + TTT (val_bpb=1.1844)

by mihir-s-05
val_bpb
1.1844
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.5 MB

Training Techniques

Architecture
linearized neural memory
Titans-inspired neural memory is added to each transformer block; the cumulative gradient update is linearized into causal linear attention via cumsum/einsum and used as a gated residual between attention and MLP.
parameters: {"layers":10,"params_per_layer_overhead":"~8k"}
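The cumsum/einsum linearization described above can be sketched as follows. This is a minimal illustration (in numpy, not the PR's actual PyTorch code) of how a cumulative outer-product memory state becomes causal linear attention; the function and variable names are assumptions.

```python
import numpy as np

def causal_linear_attention(q, k, v):
    """Causal linear attention: out_t = q_t @ (sum_{s<=t} k_s v_s^T).

    The running sum of outer products k_s v_s^T is the linearized
    cumulative memory state; computing it with a cumsum keeps the
    whole update parallel over time (no Python loop), which is what
    lets the block trace as a single compiled graph.
    Shapes: q, k, v are (T, d).
    """
    # Outer products k_s v_s^T for every timestep: (T, d, d)
    kv = np.einsum('td,te->tde', k, v)
    # Prefix sum over time = memory state available at each step
    state = np.cumsum(kv, axis=0)
    # Read the memory with the query at each step: (T, d)
    return np.einsum('td,tde->te', q, state)
```

A loop that rebuilds the memory state step by step produces the same output; the cumsum form just vectorizes it.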
Test-Time Training
LoRA TTT
parameters: {"rank":8,"targets":["Q","V","lm_head"]}
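A rank-8 LoRA adapter of the kind applied here to Q, V, and lm_head can be sketched as a frozen base weight plus a trainable low-rank delta. This is a generic LoRA sketch, not the PR's implementation; the class name and init scales are assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank delta B @ A.

    Only A and B are updated during test-time training, so each
    adapted layer adds just rank * (d_in + d_out) parameters.
    """
    def __init__(self, W, rank=8, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                    # frozen base weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01
        self.B = np.zeros((d_out, rank))              # zero init: delta starts at 0

    def __call__(self, x):
        # y = x W^T + (x A^T) B^T; identical to the base layer at init
        return x @ self.W.T + (x @ self.A.T) @ self.B.T
```

The zero-initialized B matrix means the adapter is a no-op until TTT updates it, so test-time training starts exactly from the pretrained model.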
Initialization
overtone spectral embedding init
Uses overtone spectral embedding initialization with phase-transition residual mixing.
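The PR does not show its init code, so the following is only a loose, hypothetical sketch of what a harmonic ("overtone") spectral embedding init could look like: each embedding row is a mix of a fundamental sinusoid and its integer-multiple overtones with random phases. Every name and constant here is an assumption; the actual recipe may differ substantially.

```python
import numpy as np

def overtone_embedding_init(vocab_size, dim, n_overtones=4, seed=0):
    """Hypothetical harmonic/spectral init: each row is a sum of a
    fundamental sinusoid plus its overtones (2f, 3f, ...) with random
    phases and 1/h amplitude falloff. Illustrative only."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 2 * np.pi, dim, endpoint=False)
    E = np.zeros((vocab_size, dim))
    for i in range(vocab_size):
        freq = 1 + i % 8                      # per-token fundamental (arbitrary)
        for h in range(1, n_overtones + 1):   # overtone series: f, 2f, 3f, ...
            phase = rng.uniform(0, 2 * np.pi)
            E[i] += np.sin(h * freq * t + phase) / h
    return E / np.sqrt(n_overtones)           # keep rows at O(1) scale
```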
Quantization
int6
bits: 6
scope: middle layers
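Symmetric per-tensor int6 quantization, as applied here to the middle layers, can be sketched as follows. The scheme (symmetric, per-tensor scale) is an assumption; the PR may use per-channel scales or a different rounding rule.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric int6 quantization: 6 bits => integer levels in [-31, 31].

    Returns the quantized integers plus the scale needed to dequantize.
    """
    qmax = 2 ** (6 - 1) - 1                   # 31
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Map the int6 codes back to floats for use in the forward pass."""
    return q.astype(np.float32) * scale
```

With a per-tensor scale, the round-trip error is bounded by half a quantization step (scale / 2) per element.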
Optimizer
Muon
weight_decay: 0.02
momentum: null
other_params: {"decoupled":true}
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"embeddings/scalars"}
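The core step that distinguishes Muon from AdamW is approximately orthogonalizing the (momentum-smoothed) gradient of each weight matrix before applying it. A minimal sketch of that Newton-Schulz iteration, using the coefficients from the commonly used Muon implementation (not necessarily this PR's exact code):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G with a quintic Newton-Schulz
    iteration, as in the Muon optimizer. The result (singular values
    pushed toward ~1) is used as the weight update direction."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)        # normalize: singular values <= 1
    if G.shape[0] > G.shape[1]:
        X = X.T                                # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X    # quintic polynomial update
    return X.T if G.shape[0] > G.shape[1] else X
```

AdamW is kept for embeddings and scalar parameters because orthogonalization only makes sense for 2-D weight matrices.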
LR Schedule
warmdown
parameters: {"warmdown_steps":2500}
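A warmdown schedule with warmdown_steps=2500 holds the learning rate constant and then decays it over the final steps. A minimal sketch, assuming linear decay to zero (the exact decay shape is not stated in the PR):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=2500):
    """Warmdown schedule sketch: constant base_lr, then linear decay
    to zero over the final `warmdown_steps` steps."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps  # 1 -> 0 across the warmdown
    return base_lr * frac
```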
Other
other
FP16 embedding bypass
parameters: null

Novel Contributions

  • Adds a Titans-inspired neural memory module to each transformer block
  • Linearizes the memory update into causal linear attention using cumsum and einsum, keeping the block compatible with fullgraph compilation
  • Places memory between attention and MLP as a gated residual
  • Combines the memory module with LoRA-based test-time training
  • Uses overtone spectral embedding initialization and phase-transition residual mixing
  • Applies FP16 embedding bypass and int6 quantization on middle layers
  • Uses Muon weight decay with AdamW for embeddings/scalars
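The block layout the contributions describe, with the memory module placed between the attention and MLP sublayers as a gated residual, can be sketched as below. The sublayers are stand-in callables and the scalar gate is an assumption; the PR may gate per channel or per head.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def block_forward(x, attn, memory, mlp, gate_param):
    """Hypothetical block layout: the neural-memory output is mixed
    in as a gated residual between the attention and MLP sublayers.
    `attn`, `memory`, and `mlp` stand in for the real sublayers."""
    x = x + attn(x)               # standard attention residual
    g = sigmoid(gate_param)       # learned gate, squashed into (0, 1)
    x = x + g * memory(x)         # gated neural-memory residual
    x = x + mlp(x)                # standard MLP residual
    return x
```

Because the gate starts near 0.5 (or wherever gate_param is initialized), the model can learn to scale the memory contribution up or down per block without disturbing the standard residual path.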