PR #720

open

Record Submission: 1.1078 BPB — XSA6 + BigramHash4K on Hedge Mixer Stack

by agalimova
val_bpb: 1.1078
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.3 MB

Training Techniques

Architecture
XSA
Applies XSA to the last 6 layers of the model.
parameters: {"layers":6}
BigramHash
Uses hashed bigram embeddings in the Hedge Mixer stack.
parameters: {"vocab_size":4096,"embedding_dim":128}
Partial RoPE
Uses rotary positional embeddings on a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
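
The BigramHash idea can be sketched independently of the Hedge Mixer stack: hash each (previous, current) token pair into a fixed 4096-slot embedding table and accept collisions. The table initialization, the BOS handling at position 0, and the multiplicative hash constant below are illustrative assumptions, not the submission's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
BIGRAM_VOCAB, EMB_DIM = 4096, 128  # the submission's reported parameters
bigram_table = rng.normal(0.0, 0.02, size=(BIGRAM_VOCAB, EMB_DIM))

def bigram_hash_embed(tokens):
    """Hashed bigram embeddings: map each (prev, cur) token pair to one of
    4096 table rows via a cheap multiplicative hash; collisions are
    accepted by design to keep the table small."""
    toks = np.asarray(tokens)
    prev = np.concatenate([[0], toks[:-1]])        # token 0 stands in for BOS
    h = (prev * 1000003 + toks) % BIGRAM_VOCAB     # arbitrary odd multiplier
    return bigram_table[h]                         # shape: (seq_len, 128)
```

Collisions are the price of keeping the table at 4096 entries regardless of the model's true vocabulary size; the bet is that frequent bigrams still get usefully distinct rows.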
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"parameter_banking":true}
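
For context, the core of a single-device Muon step is momentum accumulation followed by Newton–Schulz orthogonalization of the update. The quintic coefficients below are the ones published with Muon; the learning rate and momentum values are placeholder assumptions (the submission reports both as null), and Parallel Muon's parameter-banking/distribution logic is omitted entirely.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1)
    with the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # keep the Gram matrix small
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    """One Muon update on a 2D weight: momentum buffer, then an
    orthogonalized step. lr/momentum here are assumed defaults."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz(buf), buf
```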
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency_steps":50}
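
Both averaging schemes are standard; a minimal sketch with the submission's settings (EMA decay 0.997, SWA snapshot every 50 steps), assuming weights stored as a dict of arrays:

```python
import numpy as np

class EMA:
    """Exponential moving average of the weights (decay = 0.997)."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * v

class SWA:
    """Equal-weight running average of snapshots taken every
    `frequency_steps` optimizer steps."""
    def __init__(self, frequency_steps=50):
        self.freq, self.n, self.avg = frequency_steps, 0, None

    def maybe_update(self, step, params):
        if step % self.freq != 0:
            return                       # not a snapshot step
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.copy() for k, v in params.items()}
        else:
            for k, v in params.items():
                self.avg[k] += (v - self.avg[k]) / self.n
```

How the two averages are combined at evaluation time (EMA then SWA, or one of the two alone) is not specified in the record.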
Quantization
GPTQ-lite
bits: 6
scope: model weights
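
"GPTQ-lite" is not a standard name; the sketch below shows only the baseline it presumably builds on, per-channel symmetric round-to-nearest at 6 bits, without GPTQ's Hessian-based error compensation.

```python
import numpy as np

def quantize_6bit(W):
    """Per-output-channel symmetric round-to-nearest quantization.
    Returns int codes in [-32, 31] plus one float scale per row."""
    qmax = 2 ** 5 - 1                                   # 31 for 6-bit signed
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)            # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from codes and scales."""
    return q.astype(np.float32) * scale
```

At 6 bits the worst-case per-weight error is half a quantization step, i.e. scale/2 for that row.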
Compression
zstd
level: null
Test-Time Training
score-first TTT
parameters: {"epochs":4,"optimizer":"AdamW"}
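
"Score-first" is read here as: score each evaluation segment with the current weights before adapting on it, so the reported loss never benefits from training on that same segment. That reading, and the use of plain gradient steps in place of the reported AdamW optimizer, are interpretive assumptions.

```python
def score_first_ttt(segments, loss_and_grad, theta, lr=1e-3, epochs=4):
    """Test-time training, 'score-first' variant: for each segment,
    record the loss under the current weights first, then take `epochs`
    gradient passes over that segment before moving on."""
    scores = []
    for seg in segments:
        loss, _ = loss_and_grad(theta, seg)
        scores.append(loss)                  # score first...
        for _ in range(epochs):              # ...then adapt on the segment
            _, grad = loss_and_grad(theta, seg)
            theta = theta - lr * grad
    return scores, theta

# toy usage: fit a scalar to segments of constant value 1.0
lg = lambda th, seg: ((th - seg) ** 2, 2 * (th - seg))
scores, theta = score_first_ttt([1.0, 1.0], lg, theta=0.0, lr=0.25)
```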
LR Schedule
cosine decay
parameters: {"warmdown":true}
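
A cosine schedule is a one-liner; "warmdown: true" is taken here to mean the rate decays all the way to its floor by the end of training (an assumption, since the parameter is otherwise unspecified):

```python
import math

def cosine_decay_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at the final
    step; base_lr and min_lr are placeholder values."""
    t = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```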
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
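
The 1/sqrt(layer+1) rule damps each block's normalized output in proportion to depth, so deeper layers contribute progressively smaller residual updates. A minimal sketch (exactly where the scale is applied inside the Hedge Mixer stack is an assumption):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over the last axis, no learned gain or bias."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def scaled_ln(x, layer):
    """LayerNorm output scaled by 1/sqrt(layer+1) for block index `layer`
    (0-based), per the submission's regularization rule."""
    return layer_norm(x) / np.sqrt(layer + 1)
```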

Novel Contributions

  • Systematic combinatorial search over hyperparameters using autoresearch-multi
  • Increasing XSA_LAST_N from 4 to 6
  • Increasing BIGRAM_VOCAB_SIZE from 2048 to 4096
  • Combining XSA_LAST_N=6 with BigramHash vocab size 4096, yielding a superadditive improvement (the combined gain exceeds the sum of the two individual gains)
  • Hedge Mixer stack with BigramHash embeddings and XSA on the last 6 layers