PR #1435

open

Record: 11L Depth Recurrence + BigramHash + EMA 0.9965 — val_bpb 1.0980 (3-seed mean)

by AbhayAnandUCSD
val_bpb
1.0980
Architecture
Transformer
Optimizer
Muon
Artifact Size
~14.6 MB

Training Techniques

Architecture
depth recurrence
11 physical layers with layers 4 and 5 repeated once to create 13 virtual layers; recurrence activated at step 3000.
parameters: {"physical_layers":11,"virtual_layers":13,"repeat_layers":[4,5],"activate_step":3000}
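A minimal sketch of the recurrence schedule (the function name and exact activation logic are assumptions, not this PR's code):

```python
def virtual_schedule(physical_layers, repeat_layers, step, activate_step):
    """Sequence of physical-layer indices executed in one forward pass.

    Before `activate_step` the plain stack runs once; afterwards each layer
    in `repeat_layers` is run a second time with shared weights.
    """
    order = list(range(physical_layers))
    if step < activate_step:
        return order
    schedule = []
    for i in order:
        schedule.append(i)
        if i in repeat_layers:
            schedule.append(i)  # weight-shared second pass through this layer
    return schedule
```

With `physical_layers=11`, `repeat_layers=[4, 5]`, and `activate_step=3000`, the post-activation schedule has 13 entries, with layers 4 and 5 each appearing twice.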
BigramHash
Bigram hash embedding added on top of the recurrence base with SmearGate.
parameters: {"buckets":1536,"dim":112}
SmearGate
Gating mechanism used with BigramHash.
parameters: null
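A sketch of how a hashed bigram embedding with a scalar gate could look; the hash constants, initialization scale, and the exact SmearGate form are assumptions (the record does not spell them out):

```python
import random

class BigramHashEmbedding:
    """Hash the (previous, current) token pair into one of `buckets` rows."""

    def __init__(self, buckets=1536, dim=112, seed=0):
        rng = random.Random(seed)
        self.buckets = buckets
        self.dim = dim
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                      for _ in range(buckets)]

    def lookup(self, prev_tok, cur_tok):
        # multiplicative hash of the token pair; constants are illustrative
        h = (prev_tok * 1000003 + cur_tok * 8191) % self.buckets
        return self.table[h]


def smear_gate(hidden, bigram_vec, gate):
    # SmearGate sketched here as a learned scalar gate blending the
    # bigram embedding into the hidden state
    return [h + gate * b for h, b in zip(hidden, bigram_vec)]
```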
U-Net skip connections
Learnable residual gating on skip connections.
parameters: null
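The learnable gating on the skip path can be sketched as below (a scalar gate and elementwise add are assumptions about the exact form):

```python
def gated_skip(decoder_x, encoder_x, gate):
    # U-Net-style skip: an earlier layer's activation is added back into
    # a matching later layer, scaled by a learnable gate (scalar here)
    return [d + gate * e for d, e in zip(decoder_x, encoder_x)]
```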
parallel residuals
Attention and MLP are run in parallel lanes in later layers.
parameters: {"layers":"7+"}
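For contrast with the standard sequential block, a sketch of the parallel form used in layers 7 and above (toy list arithmetic standing in for real attention/MLP modules):

```python
def sequential_block(x, attn, mlp):
    # standard form: the MLP reads the attention-updated residual stream
    y = [xi + ai for xi, ai in zip(x, attn(x))]
    return [yi + mi for yi, mi in zip(y, mlp(y))]


def parallel_block(x, attn, mlp):
    # parallel lanes: attention and MLP both read the same input,
    # and both outputs are summed into the residual stream
    return [xi + ai + mi for xi, ai, mi in zip(x, attn(x), mlp(x))]
```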
GQA
Grouped query attention with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
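With 8 query heads over 4 KV heads, each pair of query heads shares one key/value head; a one-line sketch of the mapping:

```python
def kv_head_for_query(q_head, heads=8, kv_heads=4):
    # each group of heads // kv_heads consecutive query heads
    # attends with the same shared key/value head
    return q_head // (heads // kv_heads)
```

Relative to full multi-head attention, this halves the KV-cache and KV-projection size.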
Value Embedding
Value embedding added in later layers.
parameters: {"dim":128,"layers":[9,10]}
Partial RoPE
Partial rotary position embeddings applied to a subset of dimensions.
parameters: {"dims":"16/64"}
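With 16 of 64 head dimensions rotary, a sketch of applying the rotation only to the leading dims (the base frequency of 10000 is an assumption):

```python
import math

def partial_rope(vec, pos, rotary_dims=16, base=10000.0):
    """Rotate only the first `rotary_dims` entries of a head vector;
    the remaining dims pass through position-independent."""
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        theta = pos / (base ** (i / rotary_dims))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```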
XSA
XSA applied to all layers.
parameters: {"layers":11}
weight tying
Tied embeddings.
parameters: null
Weight Averaging
EMA
Exponential moving average of model weights.
parameters: {"decay":0.9965}
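The EMA keeps a shadow copy of the weights that is nudged toward the live weights each step; a sketch treating parameters as flat lists:

```python
def ema_update(avg, params, decay=0.9965):
    # shadow weights move a small step (1 - decay) toward the live weights;
    # evaluation uses the averaged copy rather than the live one
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```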
Regularization
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: 0.09
momentum: 0.99
other_params: {"lr":0.02,"backend_steps":5}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.008,"role":"head"}
AdamW
weight_decay: 0.09
momentum: null
other_params: {"lr":0.6,"role":"embeddings"}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"lr":0.02,"role":"scalars"}
LR Schedule
warmdown
parameters: {"warmdown_fraction":0.667}
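A warmdown schedule holds the peak learning rate, then decays it to zero over the final fraction of training; the linear decay shape below is an assumption:

```python
def lr_at(step, total_steps, peak_lr, warmdown_fraction=0.667):
    # constant at peak_lr, then linear decay to 0 over the last
    # warmdown_fraction of training steps
    warmdown_steps = int(total_steps * warmdown_fraction)
    start = total_steps - warmdown_steps
    if step < start:
        return peak_lr
    return peak_lr * (total_steps - step) / warmdown_steps
```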
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Evaluation
sliding window eval
parameters: {"stride":64}
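Sliding-window eval scores each token with near-full left context by advancing the window in small strides and taking losses only on the newly exposed tokens; the bookkeeping below is illustrative:

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Return (start, end, score_from) triples: run the model on
    tokens[start:end] and take losses only for tokens[score_from:end]."""
    spans = []
    scored = 0
    while scored < n_tokens:
        end = min(scored + (window if scored == 0 else stride), n_tokens)
        start = max(0, end - window)
        spans.append((start, end, scored))
        scored = end
    return spans
```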
Quantization
GPTQ
bits: 6
scope: all
Compression
Brotli
level: null
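The arithmetic behind an artifact-size estimate for 6-bit weights plus Brotli can be sketched as below; the parameter count and compression ratio in the test are pure assumptions for illustration, not this run's numbers:

```python
def artifact_mb(n_params, bits=6, brotli_ratio=0.85):
    # pack weights at `bits` bits each, then assume Brotli shrinks the
    # packed stream by `brotli_ratio`; real ratios depend on the data
    raw_bytes = n_params * bits / 8
    return raw_bytes * brotli_ratio / 2**20
```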

Novel Contributions

  • Depth recurrence with layers 4 and 5 repeated once to create 13 virtual layers
  • BigramHash(1536, dim 112) with SmearGate added on top of the recurrence base
  • EMA decay 0.9965 tuning
  • Combination of skip gates, parallel residuals, and MuonEq-R in the stack
  • GPTQ int6 quantization with Brotli compression to fit the artifact budget