PR #965 (open)

Architectural Record: 1.11837 BPB via KGIIR Trajectory Mixing

by Adam-Jacuch
val_bpb: 1.1184
Architecture: Transformer
Optimizer: Muon
Artifact Size:

Training Techniques

Architecture
  • KGIIR: Kinematic Gated IIR trajectory mixing, added alongside the existing token shifts to model hidden-state momentum with a recursive 4-tap IIR filter.
    parameters: {"taps":4}
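The PR entry names the mechanism but not its internals, so the following is only a minimal sketch of what a gated 4-tap recursive (IIR) trajectory mix could look like. The function name `kgiir_mix`, the static per-channel `gate`, and the per-tap coefficients `a` are illustrative assumptions; the actual "kinematic" gating and the fused CUDA kernel are not shown in this card.

```python
import numpy as np

def kgiir_mix(h_new, hist, a, gate):
    """One KGIIR-style step (sketch): blend the raw hidden state with a
    recursive 4-tap IIR filter over previous filter outputs.

    h_new: current hidden state, shape (batch, dim)
    hist:  list of the last 4 filter outputs, each (batch, dim)
    a:     (4,) feedback coefficients, one per tap (assumed learned)
    gate:  (dim,) mixing gate in [0, 1] (assumed learned; a static gate
           stands in for the unspecified kinematic gating)
    """
    # Recursive part: feedback from the last 4 filter outputs (the IIR taps).
    feedback = sum(ai * hi for ai, hi in zip(a, hist))
    filtered = h_new + feedback
    # Gated blend of the filtered trajectory and the raw hidden state.
    out = gate * filtered + (1.0 - gate) * h_new
    # Slide the tap history: drop the oldest output, append the newest.
    hist = hist[1:] + [filtered]
    return out, hist
```

Because the filter is recursive (each tap is a previous filter output, not a previous raw state), its effective receptive field extends well beyond 4 steps, which is the usual argument for IIR over FIR mixing.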
  • XSA: the existing temporal-shift mechanism from the base architecture.
    parameters: {"last_n":4}
  • VE128: the Value Residual / VE component from the base architecture.
    parameters: {"dim":128,"layers":[9,10]}
  • weight tying: tied embeddings.
    parameters: null
Weight Averaging
  • EMA
    parameters: {"decay":0.997}
  • SWA
    parameters: {"every":50}
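The card lists only the two averaging schemes' hyperparameters; a minimal sketch of how an EMA (decay 0.997) and a step-strided SWA (snapshot every 50 steps) are typically maintained side by side. Function and class names are illustrative, not from the PR.

```python
def ema_update(ema, params, decay=0.997):
    """Exponential moving average of parameters (decay from the card)."""
    for k, v in params.items():
        ema[k] = decay * ema[k] + (1.0 - decay) * v

class SWA:
    """Stochastic weight averaging over snapshots taken every 50 steps."""
    def __init__(self, every=50):
        self.every, self.count, self.avg = every, 0, {}

    def maybe_snapshot(self, step, params):
        if step % self.every != 0:
            return
        self.count += 1
        for k, v in params.items():
            # Running mean over snapshots: avg += (v - avg) / n
            prev = self.avg.get(k, 0.0)
            self.avg[k] = prev + (v - prev) / self.count
```

How the two averages are combined at eval time (EMA of SWA, separate checkpoints, or one chosen by validation) is not stated in the card.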
Quantization
  • late QAT
    bits: null
    scope: model
Test-Time Training
  • full TTT
    parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":0,"momentum":0.9,"batch_seqs":32,"grad_clip":1}
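The TTT entry pins down the optimizer-side hyperparameters, so the adaptation loop they imply can be sketched: momentum SGD (0.9), global-norm gradient clipping at 1, 3 epochs, nothing frozen (freeze_blocks 0). The gradient callback and toy parameters below are illustrative stand-ins; the real run adapts the full transformer on 32768-token evaluation chunks.

```python
def ttt_adapt(params, grad_fn, chunks, lr=0.002, epochs=3,
              momentum=0.9, grad_clip=1.0):
    """Full test-time training sketch: before scoring, run a few epochs
    of momentum SGD over the evaluation chunks, updating every block."""
    vel = [0.0] * len(params)
    for _ in range(epochs):
        for chunk in chunks:
            g = grad_fn(params, chunk)
            # Clip by global norm at grad_clip (= 1 in the card).
            norm = sum(x * x for x in g) ** 0.5
            if norm > grad_clip:
                g = [x * grad_clip / norm for x in g]
            # Momentum SGD update on every parameter (nothing frozen).
            for i in range(len(params)):
                vel[i] = momentum * vel[i] + g[i]
                params[i] -= lr * vel[i]
    return params
```

Whether the adapted weights are reset between documents, and how batch_seqs 32 interacts with chunking, is not specified in the card.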
Optimizer
  • Parallel Muon
    weight_decay: 0.04
    momentum: 0.99
    other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
Regularization
  • LN scale
    parameters: {"ln_scale":1}
LR Schedule
  • warmdown
    parameters: {"warmdown_steps":3500}
Sequence Length
  • sequence_length
    train_length: 32768
    eval_length: null
Evaluation
  • stride-based eval
    parameters: {"stride":64}
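The stride-64 evaluation is not spelled out, so here is a sketch of the standard sliding-window scheme it suggests: the context window advances by the stride, and only tokens not yet scored are counted, so each token gets up to a full window of left context. The window size is an assumption, since eval_length is null in the card.

```python
def stride_windows(n_tokens, window, stride=64):
    """Plan sliding-window eval spans. Each tuple (begin, end, score_from)
    means: run the model on tokens [begin, end) but score only
    [score_from, end), so every token is scored exactly once."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

A smaller stride gives each scored token more context at the cost of proportionally more forward passes; stride 64 against a long window is near the context-rich end of that trade-off.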

Novel Contributions

  • Kinematic Gated IIR (KGIIR) trajectory mixing
  • Recursive 4-tap IIR-style hidden-state momentum filter
  • Fused CUDA kernel implementation achieving 88 ms/step on 8xH100
  • Controlled ablation showing BPB improvement from 1.11923 to 1.11837
  • Trajectory mixing alongside existing temporal shifts to improve Pareto frontier
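For reference, the headline metric can be made concrete. Assuming val_bpb is the standard bits-per-byte conversion of summed token-level cross-entropy (the card does not show the eval code, so this is the conventional definition rather than a confirmed detail):

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert summed cross-entropy over an eval set (in nats) into
    bits per byte (BPB): divide by ln(2) to get bits, then by the
    byte count of the underlying text."""
    return total_nll_nats / (math.log(2) * total_bytes)
```

BPB normalizes by raw bytes rather than tokens, which is what makes scores comparable across tokenizers; the 1.11923 to 1.11837 ablation delta above is measured on this scale.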