PR #1322

open

Lucky V — 1.08540457 val_bpb (seed 444)

by newjordan
val_bpb
1.0854
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,438,323 B

Training Techniques

Architecture
GQA
11-layer Transformer with 8 attention heads and 4 KV heads
parameters: {"layers":11,"dim":512,"heads":8,"kv_heads":4}
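A minimal numpy sketch of the GQA shapes above (8 query heads sharing 4 KV heads, 64 dims per head from dim=512); illustrative only, not the submission's code:

```python
import numpy as np

def gqa_scores(q, k, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves n_heads // n_kv_heads query heads."""
    group = n_heads // n_kv_heads            # 2 query heads per KV head
    k = np.repeat(k, group, axis=0)          # (n_kv_heads, T, d) -> (n_heads, T, d)
    return q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])

T, head_dim = 16, 64                          # dim=512 / heads=8 -> 64 dims per head
q = np.random.randn(8, T, head_dim)
k = np.random.randn(4, T, head_dim)
scores = gqa_scores(q, k)
print(scores.shape)  # (8, 16, 16)
```

Halving the KV heads halves the KV cache while keeping all 8 query heads.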
weight tying
Tied input and output embeddings
parameters: null
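Weight tying in one picture: the same embedding matrix serves as input lookup and output projection (vocab size here is a placeholder, not from the card):

```python
import numpy as np

vocab, dim = 2048, 512                    # vocab is illustrative; dim matches the card
E = np.random.randn(vocab, dim) * 0.02    # one matrix serves both ends
tokens = np.array([3, 41, 7])
h = E[tokens]                             # input embedding lookup
logits = h @ E.T                          # output projection reuses the same weights
print(logits.shape)  # (3, 2048)
```

Tying removes an entire vocab-by-dim matrix from the artifact, which matters at this size budget.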
RoPE
Rotary positional embeddings with 16 dimensions and base 10000
parameters: {"dimensions":16,"base":10000}
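A sketch of rotary embeddings with base 10000, assuming "16 dimensions" means a partial rotary applied to the first 16 dims of each 64-dim head (the rest pass through unchanged):

```python
import numpy as np

def rope(x, base=10000, rot_dims=16):
    """Rotate the first rot_dims dims of each position by position-dependent angles."""
    T, d = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)          # (half,) frequency ladder
    ang = np.arange(T)[:, None] * inv_freq[None, :]       # (T, half) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]             # paired dims (i, i+half)
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.random.randn(12, 64)        # (seq_len, head_dim)
y = rope(x)
print(y.shape)  # (12, 64)
```

Rotation preserves the norm of the rotated slice, so only relative phase between positions changes.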
LeakyReLU
LeakyReLU-squared custom activation
parameters: null
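The exact form of the custom activation isn't documented in the card; one plausible reading, by analogy with ReLU-squared, is a sign-preserving square of the LeakyReLU output:

```python
import numpy as np

def leaky_relu_squared(x, neg_slope=0.01):
    """One plausible 'LeakyReLU-squared': square the LeakyReLU output, keep its sign.
    The true form used in the PR may differ."""
    l = np.where(x > 0, x, neg_slope * x)
    return l * np.abs(l)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu_squared(x))
```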
XSA
XSA attention used on all 11 layers with Flash Attention 3
parameters: {"layers":11}
BigramHash
Bigram hash table side channel
parameters: {"dimensions":128,"vocab":2048}
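A sketch of the bigram hash side channel under the card's parameters (2048 slots, 128 dims); the hash function and first-position padding here are assumptions:

```python
import numpy as np

def bigram_hash_embed(tokens, table, n_slots=2048):
    """Look up a learned side-channel embedding for each (prev, cur) token bigram."""
    prev = np.concatenate([[0], tokens[:-1]])             # pad the first position (assumed)
    slots = (prev * 31 + tokens) % n_slots                # illustrative hash, not the PR's
    return table[slots]                                    # (T, 128) added to the residual

table = np.random.randn(2048, 128) * 0.02                  # learned parameters in practice
tokens = np.array([5, 17, 5, 99])
emb = bigram_hash_embed(tokens, table)
print(emb.shape)  # (4, 128)
```

The table gives the model a cheap memorized channel for frequent token pairs without growing attention.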
VE128
Value embeddings on layers 9-10
parameters: {"layers":[9,10],"dimensions":128}
Regularization
logit softcap
Soft cap applied to output logits at 30
parameters: {"value":30}
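Logit softcapping is commonly implemented as a scaled tanh, which bounds logits smoothly rather than clipping them; a sketch with the card's value of 30:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap) via tanh; the usual 'logit softcap' form."""
    return cap * np.tanh(logits / cap)

x = np.array([-100.0, -10.0, 0.0, 10.0, 100.0])
print(softcap(x))  # near-linear for small logits, saturating toward +/-30
```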
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_iterations":7,"muoneq_r":true}
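Muon's core step orthogonalizes the momentum matrix with a quintic Newton-Schulz iteration; a sketch with 7 iterations as in the card, assuming the standard reference coefficients (this PR's MuonEq-R variant adds row normalization on top):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=7):
    """Approximately orthogonalize G (push singular values toward 1) via the
    quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315           # coefficients from the reference Muon
    X = G / (np.linalg.norm(G) + 1e-7)          # spectral norm <= Frobenius norm <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

G = np.diag([1.0, 0.5, 0.1, 0.02])              # toy update with a spread-out spectrum
X = newton_schulz_orthogonalize(G)
print(np.linalg.svd(X, compute_uv=False))       # singular values pulled toward 1
```

Extra iterations (7 vs the usual 5) give the small singular values more steps to reach the fixed band around 1.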
Weight Averaging
SWA
parameters: null
EMA
parameters: null
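Both averaging schemes keep a smoothed copy of the weights for evaluation: SWA averages checkpoints uniformly, EMA blends exponentially each step. An EMA sketch (the decay here is a placeholder, not from the card):

```python
import numpy as np

def ema_update(ema_w, w, decay=0.999):
    """Exponential moving average of weights; the averaged copy is used for eval."""
    return decay * ema_w + (1.0 - decay) * w

w = np.zeros(4)
ema = w.copy()
for step in range(1, 101):
    w = w + 0.01                      # stand-in for an optimizer step
    ema = ema_update(ema, w, decay=0.9)
print(w, ema)                         # the EMA copy lags the raw weights
```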
Quantization
late QAT
bits: 6
scope: weights
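Fake int6 quantization in sketch form: weights are rounded to 64 levels but kept in float, so late training adapts to the quantization error (per-tensor absmax scaling is an assumption; the PR may scale per channel):

```python
import numpy as np

def fake_quantize_int6(w):
    """Fake quantization: snap weights to int6 levels but return floats, so the
    network trains against the rounding error it will see after compression."""
    scale = np.abs(w).max() / 31.0            # map max magnitude into int6 range [-32, 31]
    q = np.clip(np.round(w / scale), -32, 31)
    return q * scale

w = np.random.randn(256) * 0.05
wq = fake_quantize_int6(w)
print(np.abs(w - wq).max())                   # worst-case rounding error, at most scale/2
```

Doing this only late in training ("late QAT") keeps early optimization unconstrained while still shrinking the final artifact.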
Compression
brotli
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
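The window schedule behind stride-64 sliding evaluation: every window re-reads earlier tokens as context but only its last 64 tokens are newly scored, so each token gets near-full left context exactly once (the window length here is a placeholder):

```python
def sliding_windows(n_tokens, window=512, stride=64):
    """Yield (start, end, score_from) spans: the window covers [start, end) but
    only tokens [score_from, end) contribute to the loss."""
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        spans.append((start, end, pos))       # score tokens [pos, end)
        pos = end
    return spans

spans = sliding_windows(200, window=128, stride=64)
print(spans)  # [(0, 64, 0), (0, 128, 64), (64, 192, 128), (72, 200, 192)]
```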
Test-Time Training
score-first TTT
parameters: {"steps":32,"learning_rate":0.04}
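"Score-first" plausibly means each chunk is scored before the model adapts on it, so no token is predicted by weights that already trained on that token. A toy sketch on a linear model with the card's lr=0.04 and steps=32 (the model and loss are stand-ins):

```python
import numpy as np

def score_first_ttt(chunks, w, lr=0.04, steps=32):
    """Test-time training, score-first: evaluate each chunk with the current
    weights, then take gradient steps on that chunk before the next one."""
    losses = []
    for x, y in chunks:                       # x: (n, d) features, y: (n,) targets
        pred = x @ w
        losses.append(float(np.mean((pred - y) ** 2)))   # score BEFORE adapting
        for _ in range(steps):                            # then adapt on this chunk
            grad = 2 * x.T @ (x @ w - y) / len(y)
            w = w - lr * grad
    return losses, w

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
chunks = [(x, x @ true_w) for x in (rng.normal(size=(16, 2)) for _ in range(5))]
losses, w = score_first_ttt(chunks, w=np.zeros(2))
print(losses)  # first chunk scored by the unadapted model; later chunks improve
```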

Novel Contributions

  • MuonEq-R row-normalized momentum before Newton-Schulz orthogonalization
  • Per-head QK gain initialized to 5.0
  • More Newton-Schulz backend iterations (7 instead of 5)
  • SLOT32 test-time adaptation with lr=0.04
  • Sliding-window evaluation with SLOT adaptation
  • Late QAT with fake int6 quantization
  • Bigram hash side channel and value embeddings
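Going only by the first bullet's description, the MuonEq-R step can be read as scaling each row of the momentum matrix to unit norm before handing it to Newton-Schulz, equalizing per-row magnitudes; a hypothetical sketch, not the PR's implementation:

```python
import numpy as np

def row_normalize(M, eps=1e-8):
    """Hypothetical MuonEq-R step: give every row of the momentum matrix unit
    L2 norm before Newton-Schulz orthogonalization."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / (norms + eps)

M = np.array([[3.0, 4.0],
              [0.3, 0.4]])
print(row_normalize(M))  # both rows become [0.6, 0.8]
```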