PR #952
Ultimate: GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)
Status: closed
by FlashyFlash3011
val_bpb
1.1144
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
—
Training Techniques
Architecture
Gated Attention
Per-head sigmoid gate with near-no-op initialization.
parameters: {"weight_init":0,"bias_init":4}
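A minimal plain-Python sketch of the per-head sigmoid gate with the listed init (function names and the gating input are assumptions; the real implementation operates on tensors):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_heads(head_outputs, gate_weights, gate_biases, features):
    """Scale each attention head's output by a learned sigmoid gate.

    One scalar gate per head, computed from a feature vector via a
    dot product plus bias (a simplification of the real gate input).
    """
    gated = []
    for h, out in enumerate(head_outputs):
        logit = sum(w * f for w, f in zip(gate_weights[h], features)) + gate_biases[h]
        g = sigmoid(logit)
        gated.append([g * x for x in out])
    return gated

# With weight_init=0 and bias_init=4 every gate opens at sigmoid(4) ~ 0.982,
# so freshly initialized gates are a near-no-op, as the description says.
heads = [[1.0, 2.0], [3.0, 4.0]]
weights = [[0.0, 0.0], [0.0, 0.0]]   # weight_init: 0
biases = [4.0, 4.0]                  # bias_init: 4
out = gated_heads(heads, weights, biases, features=[0.5, -0.5])
```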
Value Residual
Layer-0 value injected into all subsequent layers.
parameters: {"lambda_init":[0.5,0.5]}
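The injection step can be sketched as a learned two-way blend; `lambda_init` [0.5, 0.5] weights the current layer's values and layer 0's values equally at initialization (function name hypothetical):

```python
def mix_value(v_layer, v0, lambdas=(0.5, 0.5)):
    """Blend the current layer's value projection with layer 0's.

    v0 is computed once at layer 0 and reused in every later layer;
    both lambda coefficients are learned during training.
    """
    lam_cur, lam_first = lambdas
    return [lam_cur * a + lam_first * b for a, b in zip(v_layer, v0)]

v0 = [2.0, 4.0]            # values cached from layer 0
v5 = [6.0, 8.0]            # current layer's values
mixed = mix_value(v5, v0)  # equal blend at init
```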
BigramHash
Restored larger bigram hash vocabulary.
parameters: {"vocab":2048,"dim":128}
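A sketch of the bucket lookup, assuming a simple multiplicative hash (the actual hash function is not specified in the PR):

```python
def bigram_bucket(prev_tok, cur_tok, vocab=2048):
    """Map a (previous, current) token pair to one of `vocab` hash buckets.

    1000003 is an arbitrary odd mixing constant for this sketch.
    Each position then looks up a 128-dim embedding for its bucket
    (table shape: 2048 x 128) to supplement the token embedding.
    """
    return (prev_tok * 1000003 + cur_tok) % vocab

tokens = [17, 42, 42, 99]
buckets = [bigram_bucket(p, c) for p, c in zip(tokens, tokens[1:])]
```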
XSA
Applied to the last 4 layers.
parameters: {"last_n_layers":4}
Partial RoPE
Rotary position embeddings applied to a subset of dimensions with NTK scaling.
parameters: {"dimensions":"16/64"}
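"16/64" means only 16 of each head's 64 dimensions are rotated; the rest pass through unchanged. A sketch, with NTK scaling modeled as stretching the base frequency (the exact NTK formulation used is not stated):

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0, ntk_factor=1.0):
    """Apply rotary position embedding to the first `rot_dims` dims only.

    ntk_factor > 1 enlarges the effective base, slowing the rotation
    frequencies to extend usable context (NTK-style scaling).
    """
    scaled_base = base * ntk_factor
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos / (scaled_base ** (i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s      # rotate each (even, odd) pair
        out[i + 1] = a * s + b * c
    return out

vec = [1.0] * 64
rotated = partial_rope(vec, pos=5, ntk_factor=2.0)
# dims 16..63 are untouched; dims 0..15 are rotated by position
```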
VE128
Value embedding module with 128 dimensions in selected layers.
parameters: {"dim":128,"layers":[9,10]}
Quantization
QAT
bits: 6
scope: all
late QAT
bits: 6
scope: non-bank params
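Both QAT entries amount to simulating 6-bit quantization in the forward pass so the weights stay accurate after true quantization; "full QAT" applies it from step 0 over all parameters, "late QAT" only in a final phase over non-bank parameters. A symmetric fake-quant sketch (the real run would train through the rounding with a straight-through estimator):

```python
def fake_quant(x, bits=6):
    """Snap each value to a symmetric 6-bit grid (63 signed levels)."""
    levels = 2 ** (bits - 1) - 1                  # 31 for 6-bit signed
    scale = max(abs(v) for v in x) / levels or 1.0
    return [round(v / scale) * scale for v in x]

w = [0.5, -1.0, 0.25, 0.99]
wq = fake_quant(w)   # each value moved by at most half a grid step
```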
Compression
lzma
level: 9
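Compressing the final artifact at lzma's maximum preset trades compression time for a smaller file, freeing artifact-size budget. With the stdlib `lzma` module (the payload below is a stand-in; the actual artifact serialization is not specified):

```python
import lzma

payload = bytes(range(256)) * 64            # stand-in for serialized weights
packed = lzma.compress(payload, preset=9)   # level: 9 (max)
restored = lzma.decompress(packed)          # round-trips losslessly
```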
Regularization
LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
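The listed formula gives deeper layers progressively smaller LayerNorm scales, damping residual-stream growth with depth:

```python
import math

def ln_scale(layer_index):
    """LN scale per the PR's formula 1/sqrt(layer+1)."""
    return 1.0 / math.sqrt(layer_index + 1)

scales = [ln_scale(i) for i in range(4)]  # 1.0, ~0.707, ~0.577, 0.5
```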
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_every":50}
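A sketch of the two averages, assuming "tight SWA" means a running average of checkpoints sampled every `swa_every` steps (the exact averaging window is not stated):

```python
def ema_update(ema, weights, decay=0.997):
    """Exponential moving average of weights (ema_decay: 0.997)."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]

class TightSWA:
    """Running average of checkpoints taken every `every` steps."""
    def __init__(self, every=50):       # swa_every: 50
        self.every, self.n, self.avg = every, 0, None

    def maybe_add(self, step, weights):
        if step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:  # incremental mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]

swa = TightSWA(every=50)
swa.maybe_add(50, [1.0])
swa.maybe_add(100, [3.0])
swa.maybe_add(101, [99.0])  # off-schedule step: ignored
ema = ema_update([1.0], [0.0])
```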
Test-Time Training
score-first TTT
parameters: {"epochs":3,"learning_rate":0.002,"freeze_blocks":0,"momentum":0.9}
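Reading "score-first" as: each evaluation chunk is scored with the current weights *before* the model adapts on it, so the reported loss never sees data the model has already trained on. A sketch under that assumption (the toy score/update functions are placeholders; the real run uses lr 0.002, momentum 0.9, and no frozen blocks):

```python
def score_first_ttt(chunks, score_fn, update_fn, epochs=3):
    """Score each chunk, then take `epochs` adaptation passes on it."""
    losses = []
    for chunk in chunks:
        losses.append(score_fn(chunk))  # score with pre-adaptation weights
        for _ in range(epochs):         # then adapt on the same chunk
            update_fn(chunk)
    return losses

# toy check: a "model" whose loss equals its single weight
state = {"w": 0.0}
losses = score_first_ttt(
    chunks=[1, 2],
    score_fn=lambda c: state["w"],
    update_fn=lambda c: state.__setitem__("w", state["w"] + 1),
)
# first chunk scored before any adaptation, second after 3 update steps
```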
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
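With stride 64 and context 2048, successive windows overlap heavily: only the last 64 tokens of each window are newly scored (the first window scores everything), so every token is scored exactly once with near-maximal context. A sketch of the window schedule (helper name hypothetical):

```python
def sliding_windows(n_tokens, context_length=2048, stride=64):
    """Yield (window_start, score_from) pairs for sliding-window eval."""
    start = 0
    while start + context_length <= n_tokens:
        # score the whole first window, then only the last `stride` tokens
        score_from = start if start == 0 else start + context_length - stride
        yield start, score_from
        start += stride

windows = list(sliding_windows(2176))  # 2048 + 2 * 64 tokens
```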
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
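The `muon_momentum_warmup_*` parameters describe ramping Muon's momentum from 0.92 at step 0 up to the steady-state 0.99 over the first 1500 steps; a linear ramp sketched below (linearity is an assumption):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly warm momentum from `start` to `end`, then hold."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps

ramp = [muon_momentum(s) for s in (0, 750, 1500, 5000)]
```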
Novel Contributions
- GatedAttention with per-head sigmoid gating
- ValueResidual injection from layer 0 into all layers
- Full-step QAT from the start of training
- lzma-9 compression to free artifact budget
- Restored BigramHash vocabulary from 1536 to 2048