val_bpb: 1.1807
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14,461,499 bytes
Training Techniques
Architecture
MLP3x
Uses a 3x-expanded MLP with 1536 hidden units and relu-squared activation.
parameters: {"hidden_units":1536}
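A minimal sketch of the 3x-expanded MLP block with relu-squared activation. The model dimension of 512 is an assumption inferred from 1536 hidden units at 3x expansion; the summary records only the hidden width.

```python
# Sketch of MLP3x: project up 3x, apply relu(x)^2, project back down.
# Dimensions here are toy-sized; d_model = 512 is an inferred assumption.

def relu_squared(x):
    """relu(x)^2: zero for negatives, x^2 otherwise."""
    return [max(v, 0.0) ** 2 for v in x]

def mlp3x(x, w_in, w_out):
    """x: [d_model]; w_in: [hidden][d_model]; w_out: [d_model][hidden]."""
    pre = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_in]
    hidden = relu_squared(pre)
    return [sum(wo * h for wo, h in zip(row, hidden)) for row in w_out]
```

Relu-squared keeps the activation zero for negative inputs but grows quadratically for positive ones, a smoother alternative to plain ReLU.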
BigramHash
Hashes each consecutive token pair into a 10240-bucket embedding table with a learnable scale.
parameters: {"buckets":10240,"dimension":128,"scale":0.05}
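A minimal sketch of the BigramHash augmentation, under the recorded parameters (10240 buckets, dimension 128, scale init 0.05). The hash mixing constants and the pad id for the first position are assumptions; the summary does not specify the hash function.

```python
# Sketch of BigramHash: hash each (prev_token, token) pair into a fixed
# bucket table of embeddings, returned scaled by a learnable scalar
# (init 0.05). Hash constants and pad id 0 are assumptions.

BUCKETS, DIM, SCALE = 10240, 128, 0.05

def bigram_bucket(prev_tok, tok, buckets=BUCKETS):
    # Simple multiplicative hash of the ordered pair; the real mixing
    # function is not specified in the record.
    h = (prev_tok * 1000003 + tok) * 2654435761
    return h % buckets

def bigram_features(tokens, table, scale=SCALE):
    """table: [BUCKETS][DIM]; returns one scaled bucket row per position."""
    feats = []
    for i, tok in enumerate(tokens):
        prev_tok = tokens[i - 1] if i > 0 else 0  # assumed pad id
        row = table[bigram_bucket(prev_tok, tok)]
        feats.append([scale * v for v in row])
    return feats
```

The scaled bucket embedding would typically be added to the per-token embedding, giving the model a cheap n-gram signal without a full bigram vocabulary.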
SmearGate
Per-dimension learned gate blending each token with the previous token embedding.
parameters: null
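A minimal sketch of SmearGate as described: a per-dimension gate that blends each token's embedding with the previous position's. The sigmoid parameterisation and the zero state before the first token are assumptions.

```python
# Sketch of SmearGate: a learned per-dimension gate g in (0, 1) blends
# each token embedding with the previous position's embedding:
#   out[t] = (1 - g) * x[t] + g * x[t-1]
# Sigmoid gating and the zero initial state are assumptions.
import math

def smear_gate(embeddings, gate_logits):
    gate = [1.0 / (1.0 + math.exp(-z)) for z in gate_logits]
    out = []
    prev = [0.0] * len(gate_logits)  # assumed state before first token
    for x in embeddings:
        out.append([(1 - g) * xi + g * pi
                    for g, xi, pi in zip(gate, x, prev)])
        prev = x
    return out
```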
weight tying
Input and output embeddings are tied; the shared table is kept as an FP16 passthrough (left unquantized) during compression.
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02,"warmup_momentum_start":0.92,"warmup_steps":1500}
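The Muon hyperparameters above imply a momentum warmup from 0.92 to 0.99 over the first 1500 steps. A minimal sketch of that schedule follows; linear interpolation between the endpoints is an assumption, since only the endpoints and step count are recorded.

```python
# Momentum warmup implied by the Muon record: ramp from
# warmup_momentum_start=0.92 to momentum=0.99 over warmup_steps=1500.
# Linear interpolation is an assumption.

START, END, WARMUP_STEPS = 0.92, 0.99, 1500

def muon_momentum(step):
    if step >= WARMUP_STEPS:
        return END
    frac = step / WARMUP_STEPS
    return START + frac * (END - START)
```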
AdamW
weight_decay: 0.01
momentum: null
other_params: {"tied_embed_lr":0.03,"scalar_lr":0.02}
Weight Averaging
SWA
parameters: {"checkpoints_averaged":24,"during":"warmdown"}
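A minimal sketch of the SWA pass: a plain running average of the 24 checkpoints saved during warmdown. Checkpoints are shown as flat parameter vectors for brevity; applying this per tensor is the obvious generalisation.

```python
# Sketch of stochastic weight averaging over saved checkpoints:
# the averaged model is the elementwise mean of the parameters.

def average_checkpoints(checkpoints):
    """checkpoints: list of same-length parameter vectors."""
    n = len(checkpoints)
    avg = [0.0] * len(checkpoints[0])
    for ckpt in checkpoints:
        for i, p in enumerate(ckpt):
            avg[i] += p / n
    return avg
```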
Quantization
int5
bits: 5
scope: MLP weights
int6
bits: 6
scope: attention weights
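A minimal sketch of symmetric per-row quantization at b bits, matching the "per-row scaling" noted under Novel Contributions: int5 (b=5) for MLP weights, int6 (b=6) for attention weights. The symmetric, round-to-nearest scheme is an assumption.

```python
# Sketch of symmetric b-bit per-row quantization: each row stores
# small integers plus one floating-point scale.

def quantize_row(row, bits):
    qmax = 2 ** (bits - 1) - 1            # 15 for int5, 31 for int6
    scale = max(abs(v) for v in row) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]
```

Per-row scales keep the quantization error proportional to each row's own magnitude instead of the whole tensor's, which matters when row norms vary widely.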
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"context_length":2048}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
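A minimal sketch of the warmdown schedule: hold the base learning rate, then decay to zero over the final 3000 steps. Linear decay is an assumption; only warmdown_steps is recorded.

```python
# Sketch of the warmdown LR schedule: constant base LR, then an
# assumed linear decay to zero over the last warmdown_steps steps.

def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return base_lr
    return base_lr * remaining / warmdown_steps
```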
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adamw_weight_decay":0.01}
gradient clipping
parameters: {"norm":0.3}
pruning
parameters: {"magnitude_pruning":"3%"}
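A minimal sketch of the 3% magnitude pruning applied before quantization: zero the smallest-magnitude 3% of weights in a tensor, shown here on a flat weight list.

```python
# Sketch of magnitude pruning: zero out the fraction `frac` of weights
# with the smallest absolute value (3% per the record).

def magnitude_prune(weights, frac=0.03):
    k = int(len(weights) * frac)          # number of weights to zero
    if k == 0:
        return list(weights)
    cutoff = sorted(abs(w) for w in weights)[k - 1]
    pruned, zeroed = [], 0
    for w in weights:
        if abs(w) <= cutoff and zeroed < k:
            pruned.append(0.0)
            zeroed += 1
        else:
            pruned.append(w)
    return pruned
```

Pruning before quantization concentrates the quantizer's dynamic range on the surviving weights, and the zeros compress well under zstd.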
Novel Contributions
- 10-layer transformer with U-Net skip connections
- MLP 3x expansion with relu-squared activation
- BigramHash token-pair embedding augmentation
- SmearGate token blending mechanism
- Mixed int5/int6 quantization with per-row scaling
- 3% magnitude pruning before quantization
- SWA over 24 checkpoints during warmdown
- Audited run (seed=42) with a genuine training log and submission artifacts that match it