val_bpb: 1.0970
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~16.07 MB
Training Techniques
Architecture
depth recurrence
3-layer recurrence over layers 3-5, looped twice per forward pass; activated 35% of the way through training.
parameters: {"layers":[3,4,5],"loops":2,"activated_at_frac":0.35}
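A minimal sketch of how this depth recurrence could work, assuming plain sequential layers. The layer callables, `forward` signature, and activation flag are illustrative, not the submission's code:

```python
# Depth recurrence sketch: layers 3-5 are applied twice per forward
# pass once the technique is activated (after 35% of training steps).
def forward(x, layers, recur_idx=(3, 4, 5), loops=2, active=True):
    """Apply `layers` in order; re-run the recurrent block `loops` times."""
    i = 0
    while i < len(layers):
        if active and i == recur_idx[0]:
            block = [layers[j] for j in recur_idx]
            for _ in range(loops):          # two passes over layers 3-5
                for layer in block:
                    x = layer(x)
            i = recur_idx[-1] + 1           # continue past the block
        else:
            x = layers[i](x)
            i += 1
    return x
```

Before activation, the same function runs with `active=False` and the block executes once like any other layers.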
parallel residuals
GPT-J style parallel residual path where attention and MLP read from the same input in later layers.
parameters: {"start_layer":7}
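A sketch of the GPT-J-style parallel residual used from layer 7 onward, contrasted with a conventional sequential pre-LN block. The single shared norm and the callables are assumptions for illustration:

```python
def parallel_block(x, norm, attn, mlp):
    # Attention and MLP both read the same normalized input, and their
    # outputs are summed into one residual update.
    h = norm(x)
    return x + attn(h) + mlp(h)

def sequential_block(x, norm1, norm2, attn, mlp):
    # Conventional pre-LN block for comparison: the MLP sees the
    # attention output rather than the shared input.
    x = x + attn(norm1(x))
    return x + mlp(norm2(x))
```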
U-Net skip connections
Sigmoid-gated skip connections used in a U-Net-like pattern.
parameters: null
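The card gives no parameters for this technique, so the following is only a plausible shape: early-layer activations are added into mirrored later layers, scaled by a learned scalar gate passed through a sigmoid. The mirroring scheme and per-pair scalar gates are assumptions:

```python
import math

def sigmoid(g):
    return 1.0 / (1.0 + math.exp(-g))

def forward(x, layers, gates):
    """`gates` holds one learnable scalar per skip pair (assumed)."""
    n = len(layers)
    saved = []
    for i in range(n // 2):               # first half: save activations
        x = layers[i](x)
        saved.append(x)
    for i in range(n // 2, n):            # second half: gated skip-in
        skip = saved[n - 1 - i]           # U-Net-style mirrored pairing
        x = layers[i](x + sigmoid(gates[i - n // 2]) * skip)
    return x
```

At gate value 0 the sigmoid contributes each skip at half strength, so the gates can learn to open or close individual skips.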
RoPE
Rotary positional embeddings with 32 dimensions.
parameters: {"dimensions":32}
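A minimal RoPE sketch matching the 32-dimension parameter: only the first 32 dimensions of each head are rotated and the rest pass through. Treating the 32 as per-head rotary dimensions, and the base frequency of 10000, are assumptions (the card states neither):

```python
import numpy as np

def rope(x, rot_dims=32, base=10000.0):
    """x: (seq_len, head_dim). Rotates pairs within the first rot_dims."""
    seq, dim = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)          # (half,)
    angles = np.arange(seq)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    # Each (x1_i, x2_i) pair is rotated by a position-dependent angle.
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```

Position 0 is left unchanged (all angles are zero), and the rotation preserves the norm of the rotated subspace.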
weight tying
Tied input and output embeddings.
parameters: null
KV head count
Grouped attention configuration with fewer KV heads than query heads.
parameters: {"heads":8,"kv_heads":4}
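With 8 query heads sharing 4 KV heads, each KV head serves a group of 2 query heads. A common way to realize this is to broadcast the KV heads before the attention product; shapes here are illustrative:

```python
import numpy as np

def expand_kv(kv, n_heads=8, n_kv_heads=4):
    """kv: (n_kv_heads, seq, head_dim) -> (n_heads, seq, head_dim)."""
    group = n_heads // n_kv_heads          # query heads per KV head (2)
    return np.repeat(kv, group, axis=0)    # [k0,k0,k1,k1,k2,k2,k3,k3]
```

The KV cache stores only the 4 KV heads, halving its size relative to full multi-head attention.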
Regularization
logit softcap
Final logits bounded to a fixed range via tanh softcapping.
parameters: {"value":20}
layerwise LN scale
parameters: null
weight decay
Separate decay rates for Muon-updated matrices, Adam-updated parameters, and embeddings.
parameters: {"muon":0.095,"adam":0.02,"embed":0.085}
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
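At Muon's core, each gradient matrix is orthogonalized with a few Newton-Schulz iterations (5 here) before the update. The sketch below uses the classical cubic iteration; the "MuonEq-R" variant named above may use tuned polynomial coefficients instead:

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Push G toward the nearest orthogonal matrix."""
    X = G / (np.linalg.norm(G) + eps)      # Frobenius-normalize so all
                                           # singular values are <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X    # drives singular values to 1
    return X
```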
AdamW
weight_decay: 0.02
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
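The EMA update with decay 0.9965 averages the weights over roughly 1/(1-0.9965) ≈ 286 recent optimizer steps. A minimal per-step update, with parameters as plain lists for illustration:

```python
def ema_update(avg, params, decay=0.9965):
    """One EMA step: avg <- decay * avg + (1 - decay) * params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

The averaged copy, not the raw weights, is what gets evaluated and exported.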
Quantization
GPTQ
bits: 6
scope: attention and MLP matrices
GPTQ
bits: 8
scope: embeddings
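For intuition about the bit widths, here is a plain symmetric round-to-nearest quantizer per output row. This is not GPTQ: GPTQ additionally uses Hessian information to compensate rounding error column by column, and that machinery is omitted here:

```python
import numpy as np

def quantize_rtn(W, bits=6):
    """W: (out, in). Symmetric per-row quantization to 2**bits levels."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 31 for 6 bits
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale                # dequantize: q * scale
```

At 6 bits each weight takes one of 64 levels, so the round-trip error per weight is at most half a quantization step.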
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
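With stride 64 and context 2048, each window scores only its last 64 tokens, so nearly every token is predicted with close to the full 2048-token context. A sketch of the evaluation loop, where `nll_fn` is a stand-in for the model's per-token negative log-likelihood; note the card reports bits per byte, which would further normalize by byte count rather than token count:

```python
import math

def sliding_window_bits_per_token(tokens, nll_fn, context=2048, stride=64):
    total_nll, n_scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        begin = max(0, start + stride - context)
        window = tokens[begin:start + stride]
        n_new = min(stride, len(tokens) - start)
        # Score only the trailing `n_new` tokens of this window; the
        # earlier tokens serve purely as context.
        total_nll += sum(nll_fn(window)[-n_new:])
        n_scored += n_new
    return total_nll / n_scored / math.log(2)   # nats -> bits
```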
Compression
Brotli
level: null
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Novel Contributions
- SP8192 tokenizer with 8192-vocab SentencePiece BPE
- 3-layer depth recurrence in layers 3-5
- Parallel residuals in later layers
- Sigmoid-gated U-Net skip connections
- Learnable per-head QK gain scaling
- Full-Hessian GPTQ with SDClip quantization
- Brotli-compressed artifact
- Sliding window evaluation without TTT