val_bpb: 1.1318
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.7 MB
Training Techniques
Quantization
mixed int6/int8
bits: 6 (8 for embeddings)
scope: MLP and attention weights in int6; embeddings in int8
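The exact quantization scheme is not spelled out above; a minimal sketch, assuming symmetric per-tensor linear quantization with a single float scale, looks like this:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization (sketch): map floats to signed ints
    in [-2**(bits-1)+1, 2**(bits-1)-1], keeping one float scale per tensor
    for dequantization. The run's actual scheme is an assumption here."""
    qmax = 2 ** (bits - 1) - 1               # 31 for int6, 127 for int8
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# MLP/attention weights use 6 bits; embeddings use 8 bits.
w = np.array([0.5, -0.25, 0.1, -0.05], dtype=np.float32)
q6, s6 = quantize_symmetric(w, bits=6)
w_hat = dequantize(q6, s6)
```

Round-trip error is bounded by half the scale, which is why weight decay (keeping weight magnitudes small and uniform) makes tensors quantization-friendly.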
Architecture
MLP3x
Uses a 3x MLP with hidden size 1536 and relu² activation.
parameters: {"hidden_size":1536}
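A minimal forward pass for the 3x MLP with squared-ReLU activation, with toy dimensions (the bias-free layout is an assumption; only hidden_size=1536 is from the config):

```python
import numpy as np

def relu2(x):
    # relu² activation: squared ReLU, as used in the MLP block.
    return np.maximum(x, 0.0) ** 2

def mlp3x_forward(x, w_in, w_out):
    """3x MLP sketch: project d_model -> 3*d_model hidden (1536 in this run),
    apply relu², project back. Bias-free layout is an assumption."""
    return relu2(x @ w_in) @ w_out

rng = np.random.default_rng(0)
d_model, hidden = 512, 1536   # hidden = 3 * d_model, matching hidden_size 1536
x = rng.standard_normal((4, d_model)).astype(np.float32)
w_in = (rng.standard_normal((d_model, hidden)) / np.sqrt(d_model)).astype(np.float32)
w_out = (rng.standard_normal((hidden, d_model)) / np.sqrt(hidden)).astype(np.float32)
y = mlp3x_forward(x, w_in, w_out)
```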
SmearGate
Learned token-blending gate added to the residual stream.
parameters: {"parameters":512}
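One plausible reading of a token-blending ("smear") gate, sketched below: each token is mixed with its predecessor through a learned per-channel gate, which would account for the 512 parameters if the model has 512 channels. Both the blending rule and the parameter interpretation are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logits):
    """SmearGate sketch (assumption): blend each token with its predecessor
    via a learned per-channel gate g in (0, 1), added in the residual stream."""
    g = sigmoid(gate_logits)        # (d_model,) learned gate
    prev = np.roll(x, 1, axis=0)    # shift tokens right by one position
    prev[0] = 0.0                   # first token has no predecessor
    return x + g * prev

T, d = 5, 8
rng = np.random.default_rng(1)
x = rng.standard_normal((T, d))
y = smear_gate(x, gate_logits=np.zeros(d))   # zero logits -> g = 0.5
```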
BigramHash
Bigram hash embedding that injects token-pair features into the residual stream.
parameters: {"bigram_vocab_size":2048}
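A sketch of the bigram hash lookup: each (previous, current) token pair is hashed into a 2048-row table and the row is added to that position's features. The multiplicative hash constant is illustrative, not the run's actual choice.

```python
import numpy as np

def bigram_hash_embed(tokens, table):
    """BigramHash sketch: hash each (prev, cur) token pair into a fixed table
    (2048 rows in this run) and return per-position feature rows for the
    residual stream. Hash function is an illustrative assumption."""
    tokens = np.asarray(tokens)
    prev = np.concatenate([[0], tokens[:-1]])            # pad first position
    idx = (prev * 1000003 + tokens) % table.shape[0]     # simple pair hash
    return table[idx]                                    # (T, d) features

rng = np.random.default_rng(2)
table = rng.standard_normal((2048, 16))   # bigram_vocab_size=2048, toy d=16
feats = bigram_hash_embed([7, 7, 7], table)
```

Identical bigrams hash to the same row, so repeated pairs share features; hashing trades collision risk for a fixed, small table.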
RoPE
Positional encoding uses NTK-aware RoPE.
parameters: null
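NTK-aware RoPE rescales the rotary base rather than interpolating positions, so high-frequency channels are nearly unchanged while low frequencies stretch. A sketch using the common NTK-aware base formula (the run's exact scaling factor is not given):

```python
import numpy as np

def ntk_rope_freqs(head_dim, base=10000.0, scale=1.0):
    """NTK-aware RoPE sketch: scale the rotary base by the standard
    NTK-aware exponent, then build the per-pair inverse frequencies."""
    base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)

def apply_rope(x, pos, freqs):
    """Rotate each (even, odd) channel pair of x by angle pos * freq."""
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

freqs = ntk_rope_freqs(8)
```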
FlashAttention 3
Uses direct flash_attn_func calls for attention.
parameters: null
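`flash_attn_func` (from the flash-attn package) is a fused kernel; for reference, the computation it performs is standard causal attention, sketched here in numpy without the memory-efficient tiling:

```python
import numpy as np

def causal_attention_reference(q, k, v):
    """Reference for what a fused kernel like flash_attn_func computes:
    softmax(q k^T / sqrt(d)) v under a causal mask. FlashAttention 3 yields
    the same result without materializing the (T, T) score matrix."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # mask future positions
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(4)
q, k, v = (rng.standard_normal((5, 8)) for _ in range(3))
out = causal_attention_reference(q, k, v)
```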
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"warmdown_iters":3000,"adamw_weight_decay":0.04}
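The config implies a momentum warmup from 0.92 to the final 0.99 over 1500 steps; a sketch assuming the ramp is linear (the actual interpolation shape is an assumption):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Muon momentum warmup sketch: ramp momentum linearly from
    muon_momentum_warmup_start (0.92) to the final value (0.99) over the
    first 1500 steps, then hold it constant."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps
```

Lower momentum early in training reduces the influence of noisy initial gradients before settling at the run's steady-state value.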
Weight Averaging
SWA
parameters: {"checkpoint_avg_count":8,"warmdown_lr_scale_threshold":0.5,"checkpoint_interval_steps":200}
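The averaging step itself is a uniform mean over checkpoint parameter dicts. Per the config, the run averages ~8 checkpoints saved every 200 steps; reading `warmdown_lr_scale_threshold: 0.5` as "start saving once the LR scale drops below 0.5" is an assumption.

```python
def average_checkpoints(checkpoints):
    """SWA sketch: uniform average of parameter dicts from the last ~8
    checkpoints collected during warmdown."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}

avg = average_checkpoints([{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}])
```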
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
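Sliding-window evaluation advances a full-length window by the stride and scores only the not-yet-scored tokens in each window, so nearly every token is conditioned on close to the full 2048-token context. A sketch with toy sizes:

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """Sliding-window eval sketch: emit (ctx_start, ctx_end, n_scored) spans.
    Each window scores only the tokens after the previous window's end, so
    every token is scored exactly once with near-full context."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_eval_spans(10, window=4, stride=2)   # toy sizes
```

With the run's stride of 64 and window of 2048, each forward pass contributes loss for just 64 tokens, trading compute for evaluation fidelity.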
Initialization
OrthoInit
Orthogonal plus muP-scaled initialization on large matrices.
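A sketch of orthogonal initialization with a muP-style scale, assuming the multiplier is 1/sqrt(fan_in) (the run's exact muP scaling is not given):

```python
import numpy as np

def ortho_mup_init(fan_out, fan_in, rng):
    """OrthoInit sketch: draw a Gaussian matrix, orthogonalize via QR, and
    apply a muP-style 1/sqrt(fan_in) scale. The exact multiplier used in
    the run is an assumption."""
    a = rng.standard_normal((max(fan_out, fan_in), min(fan_out, fan_in)))
    q, r = np.linalg.qr(a)                 # q has orthonormal columns
    q = q * np.sign(np.diag(r))            # remove QR sign ambiguity
    w = q if fan_out >= fan_in else q.T
    return w / np.sqrt(fan_in)

rng = np.random.default_rng(3)
w = ortho_mup_init(8, 16, rng)             # wide matrix: orthonormal rows
```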
Sequence Length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":1500}
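The config names a 1500-step warmup and a 3000-iteration warmdown; a sketch assuming the common trapezoidal shape (linear warmup, constant plateau, linear decay to zero), with `total_steps` as a placeholder since the run's full step count is not stated:

```python
def lr_scale(step, total_steps, warmup_steps=1500, warmdown_iters=3000):
    """Warmup/warmdown LR schedule sketch (trapezoid shape is an assumption):
    linear warmup, constant plateau, then linear warmdown to zero over the
    final 3000 iterations."""
    if step < warmup_steps:
        return step / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0
```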
Regularization
weight decay
parameters: {"muon_wd":0.04,"adamw_wd":0.04}
Novel Contributions
- Increased depth to 11 transformer layers to gain capacity while staying under the artifact size limit via int6 quantization.
- Applied weight decay 0.04 to keep weights quantization-friendly and improve int6 compression.
- Used stochastic weight averaging over the last ~8 checkpoints (saved every 200 steps) during warmdown.
- Evaluated with sliding-window stride 64 for near-full context scoring.
- Reduced bigram vocabulary from 4096 to 2048 to save artifact space with minimal BPB impact.
- Kept and combined prior techniques including OrthoInit + muP, 3x MLP, SmearGate, BigramHash, and FlashAttention 3.