PR #487 (open)
Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) on 11L Production Stack
by anantdgoel
val_bpb
1.1720
Architecture
Transformer
Optimizer
Muon
Artifact Size
19.4 MB
Training Techniques
Architecture
MLP3x
Uses a 3x MLP multiplier in the 11-layer production stack.
parameters: {"multiplier":3}
SmearGate
Community stack component included in the model architecture.
parameters: null
BigramHash
Bigram hashing feature with bucketed representation.
parameters: {"buckets":2048,"dim":128}
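A minimal sketch of a bucketed bigram-hash feature with the listed sizes (2048 buckets, 128 dims). The hash mixing constant, the BOS handling, and the random stand-in embedding table are assumptions; the PR's exact hash function is not shown.

```python
import random

BUCKETS, DIM = 2048, 128  # from the listed parameters
random.seed(0)
# Stand-in for a learned embedding table of shape (BUCKETS, DIM).
bucket_emb = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(BUCKETS)]

def bigram_bucket(prev_id, cur_id):
    # Hash the (previous, current) token pair into one of BUCKETS buckets.
    # The odd mixing constant is arbitrary; any well-spread hash works here.
    return ((prev_id * 1000003) ^ cur_id) % BUCKETS

def bigram_features(token_ids, bos_id=0):
    # One bucketed embedding per position; position 0 pairs with a BOS id (an assumption).
    ids = [bos_id] + list(token_ids)
    return [bucket_emb[bigram_bucket(a, b)] for a, b in zip(ids, ids[1:])]
```

The bucketed representation keeps the parameter cost fixed at BUCKETS × DIM regardless of vocabulary size, at the cost of hash collisions between rare bigrams.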
XSA
Applies XSA to the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Uses rotary positional embeddings on only part of the dimensions.
parameters: {"dimensions":16}
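A sketch of partial RoPE with the listed 16 rotated dimensions, applied to a single query/key vector. The half-split pairing and the base frequency of 10000 are common conventions assumed here, not confirmed by the PR.

```python
import math

ROT_DIMS = 16  # only these leading dimensions get rotary embeddings (listed parameter)

def partial_rope(x, pos, base=10000.0):
    # x: one head's query or key vector (list of floats) at sequence position `pos`.
    # Rotate the first ROT_DIMS entries in half-split pairs; pass the rest through.
    half = ROT_DIMS // 2
    out = list(x)
    for i in range(half):
        angle = pos * base ** (-i / half)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Leaving the remaining dimensions unrotated gives the model position-free channels while still encoding relative position in the rotated subspace.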
Value Residual
Caches layer-0 value vectors and mixes them into subsequent layers via learnable scalars.
parameters: {"added_params":22}
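The caching-and-mixing step can be sketched as below. The 22 added parameters over the 11-layer stack suggest roughly two scalars per layer; this sketch uses one mixing scalar per layer with a convex blend, which is an assumption about the exact parameterization.

```python
class ValueResidual:
    # Caches the layer-0 value vectors and blends them into every later layer
    # via a learnable per-layer scalar (here a plain float as a stand-in).
    def __init__(self, n_layers, init=0.5):
        self.lam = [init] * n_layers  # hypothetical init value
        self.v0 = None                # cached layer-0 values

    def mix(self, layer, v):
        if layer == 0:
            self.v0 = list(v)  # cache once, reuse in all subsequent layers
            return v
        a = self.lam[layer]
        return [a * vi + (1.0 - a) * v0i for vi, v0i in zip(v, self.v0)]
```

Because only scalars are added, the technique is nearly free in parameters and memory beyond the one cached value tensor.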
Gated Attention
Adds a per-head sigmoid gate after scaled dot-product attention to suppress attention sinks.
parameters: {"added_params":37000}
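A sketch of the per-head output gate for a single position. Computing the gate from the layer input with one weight vector per head is consistent with the ~37k added parameters (≈ n_heads × d_model), but the exact gate input and shape are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_heads(head_outputs, x, gate_weights):
    # head_outputs: per-head attention output vectors at one position.
    # x: the layer input vector at that position.
    # gate_weights: one weight vector per head.
    gated = []
    for h, w in zip(head_outputs, gate_weights):
        g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # scalar gate in (0, 1)
        gated.append([g * v for v in h])  # scale the whole head's output
    return gated
```

Since softmax attention must place its mass somewhere, heads with nothing useful to attend to tend to dump weight on a "sink" token; the sigmoid gate lets such a head scale its output toward zero instead.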
Initialization
OrthoInit
Orthogonal initialization used in the production stack.
Regularization
weight decay
parameters: {"value":0.04}
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
Weight Averaging
EMA
parameters: {"decay":0.997}
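One EMA step with the listed decay of 0.997, sketched over flat parameter lists (the evaluated weights would be the running average, not the raw parameters):

```python
def ema_update(avg, params, decay=0.997):
    # Exponential moving average of the weights: avg <- decay*avg + (1-decay)*params.
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```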
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_start":0.92,"warmup_steps":1500,"backend":5}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"scalars"}
Compression
zstd
level: null
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_steps":3000,"warmup_steps":20}
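The schedule with the listed parameters (20 warmup steps, 3000 warmdown steps) is trapezoidal: a short linear warmup, a constant plateau, then a linear warmdown to zero. A sketch of the multiplier applied to the base learning rate (linear ramps are an assumption):

```python
def lr_multiplier(step, total_steps, warmup_steps=20, warmdown_steps=3000):
    # Trapezoidal schedule: linear warmup -> flat plateau -> linear warmdown to 0.
    if step < warmup_steps:
        return step / warmup_steps
    if step > total_steps - warmdown_steps:
        return max(0.0, (total_steps - step) / warmdown_steps)
    return 1.0
```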
Novel Contributions
- Value Residual: caches layer-0 value vectors and mixes them into subsequent layers via learnable scalars.
- Gated Attention: per-head sigmoid gate after scaled dot-product attention to suppress attention sinks.
- Demonstrated additive gains from combining Value Residual and Gated Attention on a 9L baseline.
- Integrated both techniques into an 11-layer production meta-stack with multiple community techniques.