PR #1171 (open)

1.1145 BPB: Parallel Muon + INT5 GPTQ + Legal TTT

by EthanYangTW
val_bpb: 1.1145
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.38 MB

Training Techniques

Quantization
• GPTQ (bits: 5, scope: all)
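As a rough illustration of the 5-bit grid this implies, here is a symmetric quantizer using the clip_range=15 mentioned under Novel Contributions. This is a deliberate simplification: GPTQ proper also performs Hessian-weighted rounding with error compensation, which is omitted, and the per-row scales are an assumption.

```python
import numpy as np

def quantize_int5(w, clip_range=15):
    # Symmetric INT5: signed codes in [-15, 15], i.e. 31 of the 32
    # available 5-bit levels (clip_range=15 keeps the grid symmetric).
    # One scale per row is a simplifying assumption for this sketch.
    scale = np.abs(w).max(axis=1, keepdims=True) / clip_range
    q = np.clip(np.round(w / scale), -clip_range, clip_range)
    return q.astype(np.int8), scale

def dequantize_int5(q, scale):
    # Reconstruct approximate weights from codes and scales.
    return q.astype(np.float64) * scale
```

With round-to-nearest on this grid, the per-element reconstruction error is bounded by half a quantization step (scale / 2).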
Architecture
• XSA: cross-sequence attention applied to all layers (parameters: {"layers": 11})
• SmearGate: included as part of the architecture
• U-Net skip connections: skip connections in a U-Net style
• LeakyReLU: LeakyReLU-squared MLP activation
• Partial RoPE: partial rotary positional embeddings (parameters: {"fraction": "16/64"})
Test-Time Training
• score-first TTT (parameters: {"epochs": 5, "learning_rate": 0.0001, "chunks": 262144})
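The "score-first" ordering is what makes this TTT legal: every chunk is scored with the current weights before any gradient step touches it, so no evaluated token was ever trained on. A minimal framework-agnostic sketch, where `score_fn` and `update_fn` are hypothetical hooks standing in for the model's loss computation and one optimization pass:

```python
def score_first_ttt(score_fn, update_fn, chunks, epochs=5):
    # Legal score-first TTT: record each chunk's loss BEFORE adapting
    # on it, so the recorded loss always comes from weights that have
    # never seen that chunk.
    # score_fn(chunk) -> (mean_loss, n_tokens); update_fn(chunk) runs
    # one gradient pass over the chunk. Both are hypothetical hooks.
    total_loss = 0.0
    total_tokens = 0
    for chunk in chunks:
        loss, n = score_fn(chunk)   # evaluate first (legal)
        total_loss += loss * n
        total_tokens += n
        for _ in range(epochs):     # then adapt, helping later chunks
            update_fn(chunk)
    return total_loss / total_tokens
```

The adaptation on chunk *i* can only improve predictions on chunks *i+1* onward, never on tokens already scored.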
Evaluation
• sliding window eval (parameters: {"stride": 64})
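Sliding-window evaluation with stride 64 scores only the trailing 64 tokens of each window, so nearly every token is predicted with close to a full window of left context, at the cost of more forward passes. A sketch of the window bookkeeping (the window size of 1024 is an assumed default, not stated in the card):

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    # Each tuple (start, end, score_from): feed tokens [start, end) to
    # the model, but keep losses only for positions [score_from, end).
    # Every token is scored exactly once, with at least window - stride
    # tokens of left context (except near the start of the stream).
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))
        pos += stride
    return spans
```

Smaller strides raise context per token but multiply compute by roughly window / stride.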
Optimizer
• Parallel Muon (weight_decay: 0.04, momentum: null, other_params: {"ns": 5, "lr": 0.025})
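Muon's core step replaces each 2-D gradient with an approximate orthogonalization computed by a few Newton-Schulz iterations (the ns: 5 above); the "parallel" variant additionally overlaps the inter-device communication with that computation. A minimal sketch using the classic cubic iteration (the production optimizer uses tuned quintic coefficients and runs in low precision on-device):

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Cubic Newton-Schulz iteration: pushes every singular value of G
    # toward 1, i.e. toward the nearest (semi-)orthogonal matrix,
    # using only matrix multiplies (GPU-friendly, no SVD).
    # Dividing by the Frobenius norm puts the spectrum in (0, 1],
    # inside the iteration's convergence region.
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```

Each iteration maps a singular value s to 1.5 s - 0.5 s^3, whose fixed point at 1 is attracting for s in (0, sqrt(3)).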
Weight Averaging
• EMA + SWA (parameters: {"ema_decay": 0.997, "swa_interval_steps": 50})
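A sketch of how the two averages can be maintained side by side, using the parameters above: an exponential moving average updated every step with decay 0.997, plus a stochastic weight average that snapshots every 50 steps. Plain dicts of floats stand in for parameter tensors:

```python
class AveragedWeights:
    # EMA updated every step; SWA takes an equal-weight snapshot
    # every `swa_interval_steps` steps.
    def __init__(self, weights, ema_decay=0.997, swa_interval_steps=50):
        self.ema = dict(weights)
        self.swa_sum = {k: 0.0 for k in weights}
        self.swa_count = 0
        self.decay = ema_decay
        self.interval = swa_interval_steps
        self.step_num = 0

    def step(self, weights):
        self.step_num += 1
        d = self.decay
        for k, w in weights.items():
            self.ema[k] = d * self.ema[k] + (1 - d) * w
        if self.step_num % self.interval == 0:
            for k, w in weights.items():
                self.swa_sum[k] += w
            self.swa_count += 1

    def swa(self):
        # Mean of the snapshots taken so far.
        return {k: s / max(self.swa_count, 1) for k, s in self.swa_sum.items()}
```

How the final submitted weights combine the two averages is not stated in the card.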
Initialization
• OrthoInit: used for model initialization
Regularization
• LN scale (parameters: {"scale": "1/sqrt(layer+1)"})
• magnitude pruning (parameters: {"sparsity": "3%"})
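Both regularizers are simple to state: the LayerNorm gain at layer l is scaled by 1/sqrt(l+1), and the smallest 3% of weights by magnitude are zeroed. A sketch of both (per-tensor pruning is an assumption here; the card does not say whether the 3% is per-layer or global):

```python
import numpy as np

def ln_scale(layer_idx):
    # Depth-dependent LayerNorm gain: 1/sqrt(layer+1),
    # damping deeper layers' contributions.
    return 1.0 / np.sqrt(layer_idx + 1)

def magnitude_prune(w, sparsity=0.03):
    # Zero the smallest `sparsity` fraction of entries by magnitude.
    k = int(round(sparsity * w.size))
    if k == 0:
        return w
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)
```

Pruning 3% of near-zero weights barely moves the loss but makes the weight tensors slightly more compressible for the artifact.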
Compression
• zstd (level: 22)

Novel Contributions

  • INT5 GPTQ quantization with clip_range=15 to fit a larger model under the artifact limit
  • XSA applied to all 11 layers
  • Legal score-first chunked TTT where tokens are scored before any gradient update
  • Coprime-stride data loader without permutation arrays
  • Wallclock-adaptive warmdown schedule
  • Parallel Muon optimizer with overlapping communication and orthogonalization
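The coprime-stride loader rests on a standard number-theory fact: if gcd(stride, n) = 1, then repeatedly adding the stride mod n visits all n indices exactly once, giving a scattered traversal in O(1) memory instead of an n-entry permutation array. A sketch (the actual chunk count and stride are not given in the card):

```python
from math import gcd

def coprime_stride_order(n, stride):
    # Walk i -> (i + stride) % n. With gcd(stride, n) == 1 this is a
    # full cycle over {0, ..., n-1}: every chunk index is yielded
    # exactly once, with no stored permutation.
    assert gcd(stride, n) == 1, "stride must be coprime with n"
    idx = 0
    for _ in range(n):
        yield idx
        idx = (idx + stride) % n
```

Unlike a true shuffle this order is fixed given (n, stride), but it decorrelates neighboring chunks at essentially zero memory cost.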