val_bpb: 1.0901
Architecture: GPT-2
Optimizer: Muon
Artifact Size: 15,976,317 bytes
Training Techniques
Architecture
U-Net skip connections
Encoder-decoder style skip connections with learnable gating to inject shallow states into deeper layers.
parameters: {"layers":null}
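A minimal PyTorch sketch of what such a learnable skip gate could look like (the scalar gate shape and zero init are assumptions, not taken from the artifact):

```python
import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    """Mixes a shallow hidden state into a deeper one via a learnable gate."""
    def __init__(self):
        super().__init__()
        # zero init => sigmoid(0) = 0.5, so half of the shallow signal passes initially
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, deep, shallow):
        return deep + torch.sigmoid(self.gate) * shallow
```

In a U-Net layout, activations from the first half of the stack would be cached and re-injected through such gates in the mirrored second half.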
parallel residuals
Splits later layers into parallel attention and MLP lanes and merges them with a learnable gate.
parameters: {"parallel_start_layer":7}
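A sketch of such a parallel block, assuming a per-channel sigmoid gate that trades off the two lanes (the exact merge rule is not specified in the report):

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Runs attention and MLP on the same input and merges with a learnable gate."""
    def __init__(self, attn, mlp, dim):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        # per-channel gate; zero init gives an even 0.5 / 0.5 mix at start
        self.gate = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        g = torch.sigmoid(self.gate)
        return x + g * self.attn(x) + (1 - g) * self.mlp(x)
```

Per the parameters above, blocks from layer 7 onward would use this form while earlier layers keep the usual sequential residual structure.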
depth recurrence
Traverses selected middle layers multiple times within a single forward pass.
parameters: {"layers":[3,4,5],"start_step":3000}
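The traversal order can be sketched in plain Python (the number of extra loops, here `n_loops=2`, is an assumption; only the layer set and the activation step are given above):

```python
def forward_with_recurrence(blocks, x, recur_layers=(3, 4, 5), n_loops=2,
                            step=0, start_step=3000):
    """Run a forward pass, replaying the selected middle layers after start_step."""
    for i, blk in enumerate(blocks):
        x = blk(x)
        # after finishing the recurrent span once, replay it (n_loops - 1) more times
        if i == max(recur_layers) and step >= start_step:
            for _ in range(n_loops - 1):
                for j in recur_layers:
                    x = blocks[j](x)
    return x
```

Before step 3000 this reduces to a plain forward pass, so the recurrence is switched on mid-training without changing the parameter count.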
Value Residual
Injects auxiliary, token-indexed value embeddings into the attention values of selected layers.
parameters: {"layers":[9,10]}
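One plausible form, sketched here as a learned convex mix between the projected values and a token-indexed value embedding (the mixing rule and init are assumptions):

```python
import torch
import torch.nn as nn

class ValueResidual(nn.Module):
    """Mixes a learned per-token value embedding into attention values."""
    def __init__(self, vocab_size, head_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, head_dim)
        self.lamb = nn.Parameter(torch.tensor(0.5))  # learnable mixing weight

    def forward(self, v, idx):
        # v: (T, head_dim) projected attention values; idx: (T,) token ids
        return (1 - self.lamb) * v + self.lamb * self.embed(idx)
```

Per the parameters above, only layers 9 and 10 would apply this mix; the other layers use their value projections unchanged.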
Optimizer
Muon
weight_decay: 0.095
momentum: 0.99
other_params: {"ns_steps":5,"warmup_momentum_start":0.92,"warmup_steps":1500}
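The `ns_steps: 5` entry refers to the Newton-Schulz orthogonalization at the heart of Muon; the standard quintic iteration from the published Muon implementation (shown in float32 here rather than bfloat16) looks like this:

```python
import torch

def zeropower_via_newtonschulz(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # published Muon coefficients
    X = G.float()
    X = X / (X.norm() + 1e-7)          # bring spectral norm below ~1
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T                        # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X
```

Muon applies this to the momentum buffer of each matrix-shaped weight before the update step; the momentum warmup noted above would ramp momentum from 0.92 to 0.99 over the first 1500 steps.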
AdamW
weight_decay: null
momentum: null
other_params: {"beta1":0.9,"beta2":0.95,"eps":1e-8,"scalar_lr":0.02}
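The `scalar_lr` entry suggests the usual split: Muon takes the matrix-shaped weights, while AdamW handles scalar and vector parameters (gains, biases). A sketch of that grouping, with `torch.optim.SGD` standing in for the Muon optimizer, which is not part of torch:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16))

# Muon handles 2-D matrix weights; everything else goes to AdamW.
matrix_params = [p for p in model.parameters() if p.ndim >= 2]
scalar_params = [p for p in model.parameters() if p.ndim < 2]

# stand-in for Muon (momentum 0.99, weight decay 0.095, 5 Newton-Schulz steps)
muon_like = torch.optim.SGD(matrix_params, lr=0.02, momentum=0.99,
                            weight_decay=0.095)
adamw = torch.optim.AdamW(scalar_params, lr=0.02, betas=(0.9, 0.95),
                          eps=1e-8, weight_decay=0.0)
```

The null `weight_decay` and `momentum` fields above are consistent with AdamW applying no decay to these parameters and using its beta coefficients in place of classical momentum.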
Quantization
GPTQ
bits: 6
scope: all
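For intuition, the 6-bit grid can be sketched with simple per-row round-to-nearest quantization; real GPTQ additionally compensates rounding error column by column using second-order (Hessian) information, which this sketch omits:

```python
import numpy as np

def quantize_rtn(W, bits=6):
    """Simplified per-output-channel symmetric round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1                         # 6 bits: grid [-32, 31]
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

With `scope: all`, every weight matrix is stored on such a grid, which is what makes the 16MB artifact budget reachable before compression.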
Regularization
weight decay
parameters: {"value":0.095}
Compression
custom (Brotli)
level: 11
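Level 11 is Brotli's maximum quality setting; assuming the `brotli` Python package, packing the artifact would look like:

```python
import brotli  # assumes the `brotli` (or compatible) package is installed

raw = bytes(range(256)) * 64          # stand-in for the serialized artifact bytes
packed = brotli.compress(raw, quality=11)  # quality 11 = maximum compression
restored = brotli.decompress(packed)
```

Maximum quality is slow to compress but costs nothing at load time, a reasonable trade for a write-once artifact measured against a size limit.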
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
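The window bookkeeping can be sketched as follows: each window sees up to `context_length` tokens of context, but only the final `stride` positions are newly scored, so every token is evaluated exactly once with near-full context (the assumption that only the trailing stride is scored follows standard strided-perplexity practice):

```python
def sliding_window_targets(n_tokens, context_length=2048, stride=64):
    """Yield (start, end, n_scored) spans covering n_tokens without double-counting."""
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - context_length)  # context for this window
        end = min(pos + stride, n_tokens)
        yield start, end, end - pos                    # only the new tokens are scored
        pos = end
```

A smaller stride gives each scored token more context at the cost of proportionally more forward passes; stride 64 against a 2048-token context is near the context-rich end of that trade-off.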
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"fraction":0.667}
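Reading `fraction: 0.667` as the share of training spent in linear decay (an assumption; the report does not define the field), the schedule is constant LR followed by a linear ramp to zero:

```python
def lr_schedule(step, total_steps, base_lr, warmdown_frac=0.667):
    """Constant LR, then linear decay to zero over the final warmdown fraction."""
    warmdown_steps = int(total_steps * warmdown_frac)
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```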
Novel Contributions
- U-Net style skip connections with learnable gating
- Parallel residual lanes for attention and MLP in later layers
- Depth recurrence over middle layers
- Value embedding enhancements in attention
- Muon optimization for matrix weights with Newton-Schulz steps
- GPTQ 6-bit post-training quantization
- Selective pruning to fit under the 16MB artifact limit
- Brotli-based artifact compression
- Sliding window evaluation with stride 64