PR #1232 (open)

feat: Non-record 11L PR940 Stack (no n-gram in use) + 20k Steps + Legal TTT (1.0929 BPB)

by Christopher-Lee-McClendon
val_bpb: 1.0929
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.47 MB / 14.64 MB

Training Techniques

Architecture
Gated Attention
Attention mechanism with gating; QK gain set at initialization.
parameters: {"qk_gain_init":1.5}
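The card gives only qk_gain_init=1.5, not the mechanism itself. As one plausible reading, here is a minimal sketch where Q and K are RMS-normalized per head and scaled by a learnable gain initialized to that value (the gate on the attention output is not shown; all names are hypothetical):

```python
import numpy as np

def qk_norm_with_gain(q, k, gain_init=1.5):
    """RMS-normalize Q and K per head, then apply a learnable gain.

    gain_init mirrors the PR's qk_gain_init=1.5; in a real model the
    gain would be a trained parameter, and the gating itself (omitted
    here) would modulate the attention output.
    """
    def rms_norm(x):
        return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + 1e-8)
    gain = np.full(q.shape[-1], gain_init)  # trained parameter in practice
    return rms_norm(q) * gain, rms_norm(k) * gain

q, k = np.random.randn(4, 8), np.random.randn(4, 8)
qn, kn = qk_norm_with_gain(q, k)
```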
Value Residual
Adds value residual connections to the model.
parameters: null
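Value residual typically means routing the first layer's value vectors into every later layer. A minimal sketch, with a placeholder mixing weight since the PR lists no parameters:

```python
import numpy as np

def value_residual(v_layer, v_first, alpha=0.5):
    """Mix this layer's value vectors with the first layer's.

    alpha=0.5 is a placeholder, not from the PR; a real model would
    typically learn the mixing weight per layer.
    """
    return alpha * v_layer + (1.0 - alpha) * v_first
```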
XSA
Applies XSA to all transformer layers.
parameters: {"layers":11}
BigramHash
Uses hashed bigram embeddings.
parameters: {"buckets":4096,"dim":128}
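With buckets=4096 and dim=128, a hashed bigram embedding maps each (previous token, current token) pair to one of 4096 learned vectors. A sketch under those parameters (the hash multiplier is an arbitrary odd constant, not from the PR):

```python
import numpy as np

BUCKETS, DIM = 4096, 128  # from the PR's parameters

def bigram_hash_embed(token_ids, table, buckets=BUCKETS):
    """Look up a hashed-bigram embedding for each position.

    Each (previous, current) token pair is hashed into one of `buckets`
    rows of `table`; position 0 pairs with a padding id of 0.
    """
    prev = np.concatenate([[0], token_ids[:-1]])
    idx = (prev * 1000003 + token_ids) % buckets
    return table[idx]

table = np.random.randn(BUCKETS, DIM)
emb = bigram_hash_embed(np.array([5, 17, 5, 17]), table)
```

The same bigram always hashes to the same bucket, so repeated pairs share an embedding row.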
GQA
Grouped query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
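With 8 query heads and 4 KV heads, each KV head is shared by 2 consecutive query heads. The standard KV-expansion step can be sketched as:

```python
import numpy as np

def repeat_kv(kv, query_heads=8, kv_heads=4):
    """Expand KV heads to match query heads for grouped-query attention.

    kv: (kv_heads, seq, head_dim). Head counts are the PR's values.
    """
    group = query_heads // kv_heads
    return np.repeat(kv, group, axis=0)

k = np.random.randn(4, 16, 32)
k_expanded = repeat_kv(k)
```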
LeakyReLU
Uses a squared LeakyReLU (LeakyReLU²) activation in the MLP.
parameters: {"slope":0.5}
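One plausible reading of "LeakyReLU squared" with the PR's slope=0.5, squaring both branches (the PR does not say whether the negative branch keeps its sign):

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with slope=0.5 (the PR's value), then squared.

    Sign handling on the negative branch is an assumption.
    """
    y = np.where(x > 0, x, slope * x)
    return y * y
```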
SmearGate
Includes SmearGate in the architecture.
parameters: null
weight tying
Tied input and output embeddings.
parameters: null
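Weight tying means the output head reuses the input embedding matrix, so the model carries no separate unembedding parameters. A toy-sized sketch:

```python
import numpy as np

# Tied embeddings: logits are a dot product against the same matrix
# used for input embedding lookup (vocab/dim are toy values).
vocab, dim = 100, 16
embed = np.random.randn(vocab, dim)

def tied_logits(hidden):
    """Project hidden states to vocab logits with the tied embedding."""
    return hidden @ embed.T
```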
Regularization
LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
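The formula 1/sqrt(layer+1) sets a per-layer norm scale that shrinks with depth. A direct transcription (zero-based layer indexing is an assumption):

```python
import math

def ln_scale_init(layer):
    """Per-layer norm-scale initialization from the PR's formula
    1/sqrt(layer+1); deeper layers start with a smaller scale."""
    return 1.0 / math.sqrt(layer + 1)
```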
logit softcap
parameters: {"value":30}
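Logit softcapping squashes logits smoothly into a bounded range. The tanh form (cap * tanh(x / cap), as popularized by Gemma 2) is assumed here with the PR's cap of 30:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Tanh soft-capping: values are squashed smoothly into (-cap, cap).

    Near zero the map is approximately the identity; large logits
    saturate at +/- cap.
    """
    return cap * np.tanh(logits / cap)
```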
Weight Averaging
EMA
parameters: {"decay":0.997}
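EMA weight averaging keeps a slow-moving copy of the weights for evaluation. One update step with the PR's decay=0.997, sketched over a dict of parameters:

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step: avg <- decay * avg + (1 - decay) * params.

    Evaluation would read from `avg` rather than the live weights.
    """
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}
```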
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"embed_lr":0.035}
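Muon's core step orthogonalizes the (momentum-averaged) gradient matrix with a quintic Newton-Schulz iteration before applying it. A sketch using the commonly published Muon coefficients; the step count and other details are assumptions, not from the PR:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration that maps a matrix toward an
    (approximately) orthogonal one with the same row/column spaces."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
U = newton_schulz(rng.standard_normal((8, 8)))
```

After a few iterations the singular values cluster near 1, which is what makes the update direction scale-free.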
Compression
zstd
level: 16
Quantization
int6
bits: 6
scope: all
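Six bits give integer levels in [-31, 31]. A symmetric quantize/dequantize round trip; per-tensor scaling is an assumption, since the PR only states bits=6 with scope "all":

```python
import numpy as np

def quantize_int6(w):
    """Symmetric int6 quantization: map floats onto [-31, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(64)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

The round-trip error is bounded by half a quantization step per weight.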
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.002,"epochs":10,"chunk_size":32768,"frozen_blocks":2,"grad_clip":1,"stride":64}
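The TTT loop fine-tunes on the eval stream in chunks (SGD at lr=0.002, 10 epochs, first 2 blocks frozen, per the parameters above). Two of the mechanical pieces, chunked iteration and global-norm gradient clipping, can be sketched; the exact role of stride=64 is not spelled out in the card, so only chunking is shown:

```python
import numpy as np

def ttt_chunks(n_tokens, chunk_size=32768):
    """Yield (start, end) boundaries for chunked test-time training;
    chunk_size matches the PR's parameters."""
    for start in range(0, n_tokens, chunk_size):
        yield start, min(start + chunk_size, n_tokens)

def clip_grad(grads, max_norm=1.0):
    """Global-norm gradient clipping, as in the PR's grad_clip=1."""
    norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-8))
    return [g * scale for g in grads]
```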
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"peak_lr_phase_steps":8000,"warmdown_steps":12000,"warmup_steps":20}
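The schedule parameters describe a 20-step warmup, a hold at peak through step 8000, then a 12000-step warmdown to zero. A sketch assuming linear ramps (the card only names the phases):

```python
def lr_at(step, peak_lr, warmup_steps=20, peak_steps=8000, warmdown_steps=12000):
    """LR at a given step: linear warmup, hold at peak, linear warmdown.

    Phase lengths are the PR's values; the linear shapes are assumed.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < peak_steps:
        return peak_lr
    t = (step - peak_steps) / warmdown_steps
    return peak_lr * max(0.0, 1.0 - t)
```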

Novel Contributions

  • 20k-step scaling study of the PR940 architecture stack
  • Legal score-first test-time training achieving 1.0929 BPB
  • FlowRefiner variant showing the auxiliary flow head is essentially neutral at 20k steps
  • All-layer XSA, gated attention, value residual, and LeakyReLU² applied at 20k scale
  • Demonstration that warmdown from 8k to 20k steps drives most of the improvement