PR #965 (open)

Architectural Record: 1.11837 BPB via KGIIR Trajectory Mixing

by Adam-Jacuch
val_bpb: 1.1184
Architecture: Transformer
Optimizer: Muon
Artifact Size:

Training Techniques

Architecture
  • KGIIR: Kinematic Gated IIR trajectory mixing, added alongside the existing token shifts to model hidden-state momentum with a recursive 4-tap IIR filter.
    parameters: {"taps":4}
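The PR entry names the mechanism but not its internals, so the following is only a minimal sketch of what a gated 4-tap recursive (IIR) trajectory mix could look like. The function name `kgiir_mix`, the static per-channel `gate`, and the per-tap coefficients `a` are illustrative assumptions; the actual "kinematic" gating and the fused CUDA kernel are not shown in this card.

```python
import numpy as np

def kgiir_mix(h_new, hist, a, gate):
    """One KGIIR-style step (sketch): blend the raw hidden state with a
    recursive 4-tap IIR filter over previous filter outputs.

    h_new: current hidden state, shape (batch, dim)
    hist:  list of the last 4 filter outputs, each (batch, dim)
    a:     (4,) feedback coefficients, one per tap (assumed learned)
    gate:  (dim,) mixing gate in [0, 1] (assumed learned; a static gate
           stands in for the unspecified kinematic gating)
    """
    # Recursive part: feedback from the last 4 filter outputs (the IIR taps).
    feedback = sum(ai * hi for ai, hi in zip(a, hist))
    filtered = h_new + feedback
    # Gated blend of the filtered trajectory and the raw hidden state.
    out = gate * filtered + (1.0 - gate) * h_new
    # Slide the tap history: drop the oldest output, append the newest.
    hist = hist[1:] + [filtered]
    return out, hist
```

Because the filter is recursive (each tap is a previous filter output, not a previous raw state), its effective receptive field extends well beyond 4 steps, which is the usual argument for IIR over FIR mixing.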
  • XSA: the existing temporal-shift mechanism from the base architecture.
    parameters: {"last_n":4}
  • VE128: the Value Residual / VE component from the base architecture.
    parameters: {"dim":128,"layers":[9,10]}
  • weight tying: tied embeddings.
    parameters: null
Weight Averaging
  • EMA
    parameters: {"decay":0.997}
  • SWA
    parameters: {"every":50}
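The card lists only the two averaging schemes' hyperparameters; a minimal sketch of how an EMA (decay 0.997) and a step-strided SWA (snapshot every 50 steps) are typically maintained side by side. Function and class names are illustrative, not from the PR.

```python
def ema_update(ema, params, decay=0.997):
    """Exponential moving average of parameters (decay from the card)."""
    for k, v in params.items():
        ema[k] = decay * ema[k] + (1.0 - decay) * v

class SWA:
    """Stochastic weight averaging over snapshots taken every 50 steps."""
    def __init__(self, every=50):
        self.every, self.count, self.avg = every, 0, {}

    def maybe_snapshot(self, step, params):
        if step % self.every != 0:
            return
        self.count += 1
        for k, v in params.items():
            # Running mean over snapshots: avg += (v - avg) / n
            prev = self.avg.get(k, 0.0)
            self.avg[k] = prev + (v - prev) / self.count
```

How the two averages are combined at eval time (EMA of SWA, separate checkpoints, or one chosen by validation) is not stated in the card.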
Quantization
  • late QAT
    bits: null
    scope: model
Test-Time Training
  • full TTT
    parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":0,"momentum":0.9,"batch_seqs":32,"grad_clip":1}
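The TTT entry pins down the optimizer-side hyperparameters, so the adaptation loop they imply can be sketched: momentum SGD (0.9), global-norm gradient clipping at 1, 3 epochs, nothing frozen (freeze_blocks 0). The gradient callback and toy parameters below are illustrative stand-ins; the real run adapts the full transformer on 32768-token evaluation chunks.

```python
def ttt_adapt(params, grad_fn, chunks, lr=0.002, epochs=3,
              momentum=0.9, grad_clip=1.0):
    """Full test-time training sketch: before scoring, run a few epochs
    of momentum SGD over the evaluation chunks, updating every block."""
    vel = [0.0] * len(params)
    for _ in range(epochs):
        for chunk in chunks:
            g = grad_fn(params, chunk)
            # Clip by global norm at grad_clip (= 1 in the card).
            norm = sum(x * x for x in g) ** 0.5
            if norm > grad_clip:
                g = [x * grad_clip / norm for x in g]
            # Momentum SGD update on every parameter (nothing frozen).
            for i in range(len(params)):
                vel[i] = momentum * vel[i] + g[i]
                params[i] -= lr * vel[i]
    return params
```

Whether the adapted weights are reset between documents, and how batch_seqs 32 interacts with chunking, is not specified in the card.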
Optimizer
  • Parallel Muon
    weight_decay: 0.04
    momentum: 0.99
    other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
Regularization
  • LN scale
    parameters: {"ln_scale":1}
LR Schedule
  • warmdown
    parameters: {"warmdown_steps":3500}
Sequence Length
  • sequence_length
    train_length: 32768
    eval_length: null
Evaluation
  • stride-based eval
    parameters: {"stride":64}
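The stride-64 evaluation is not spelled out, so here is a sketch of the standard sliding-window scheme it suggests: the context window advances by the stride, and only tokens not yet scored are counted, so each token gets up to a full window of left context. The window size is an assumption, since eval_length is null in the card.

```python
def stride_windows(n_tokens, window, stride=64):
    """Plan sliding-window eval spans. Each tuple (begin, end, score_from)
    means: run the model on tokens [begin, end) but score only
    [score_from, end), so every token is scored exactly once."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

A smaller stride gives each scored token more context at the cost of proportionally more forward passes; stride 64 against a long window is near the context-rich end of that trade-off.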

Novel Contributions

  • Kinematic Gated IIR (KGIIR) trajectory mixing
  • Recursive 4-tap IIR-style hidden-state momentum filter
  • Fused CUDA kernel implementation achieving 88 ms/step on 8xH100
  • Controlled ablation showing BPB improvement from 1.11923 to 1.11837
  • Trajectory mixing alongside existing temporal shifts to improve Pareto frontier
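For reference, the headline metric can be made concrete. Assuming val_bpb is the standard bits-per-byte conversion of summed token-level cross-entropy (the card does not show the eval code, so this is the conventional definition rather than a confirmed detail):

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert summed cross-entropy over an eval set (in nats) into
    bits per byte (BPB): divide by ln(2) to get bits, then by the
    byte count of the underlying text."""
    return total_nll_nats / (math.log(2) * total_bytes)
```

BPB normalizes by raw bytes rather than tokens, which is what makes scores comparable across tokenizers; the 1.11923 to 1.11837 ablation delta above is measured on this scale.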