PR #376 (closed)
Record: 11L Next-Gen Stack + Custom Kernels, val_bpb=1.1399
by anthony-maio
val_bpb: 1.1399
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.79 MB
Training Techniques
Architecture
MLP3x
3x expansion MLP with ReLU² activation
parameters: {"hidden":1536}
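A minimal sketch of the 3x-expansion MLP with squared-ReLU activation. The hidden width of 1536 matches the record's `{"hidden": 1536}`; the model width of 512 is inferred from the 3x ratio, and the init scales are illustrative.

```python
import numpy as np

def relu2(x):
    # Squared ReLU: max(x, 0)^2
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    # x: (n_tokens, d_model); expand 3x, apply ReLU^2, project back.
    return relu2(x @ w_in) @ w_out

d_model, d_hidden = 512, 1536  # 3x expansion, matching {"hidden": 1536}
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_model))
w_in = rng.standard_normal((d_model, d_hidden)) * d_model ** -0.5
w_out = rng.standard_normal((d_hidden, d_model)) * d_hidden ** -0.5
y = mlp3x(x, w_in, w_out)  # (4, 512)
```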
XSA
Exclusive Self Attention applied to the last 4 layers
parameters: {"layers":4}
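The PR does not spell out what "Exclusive Self Attention" does. One plausible reading, sketched here as an assumption, is causal attention that also excludes each token's own position, so tokens in the last 4 layers attend only to strictly earlier tokens:

```python
import numpy as np

def xsa_mask(n):
    # Causal mask that additionally masks the diagonal ("exclusive" reading
    # is an assumption; the PR does not define XSA precisely).
    mask = np.tril(np.ones((n, n), dtype=bool), k=-1)
    mask[0, 0] = True  # token 0 has no predecessor; let it attend to itself
    return mask

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```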
Partial RoPE
Rotary positional embeddings applied to only part of the head dimension with NTK-aware scaling
parameters: {"rope_dims":16,"total_dims":64,"base":50000}
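A sketch of partial RoPE: only the first 16 of 64 head dimensions are rotated, the rest pass through unchanged, using the record's base of 50000. NTK-aware scaling would additionally rescale the base when extending context; that adjustment is not modeled here.

```python
import numpy as np

def partial_rope(q, base=50000.0, rope_dims=16):
    # q: (seq, head_dim). Rotate only the first rope_dims dims.
    seq, head_dim = q.shape
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)          # (half,)
    angles = np.arange(seq)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = q[:, :half], q[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, q[:, rope_dims:]], axis=1)

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 64))
q_rot = partial_rope(q)
```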
SmearGate
Learned sigmoid token blending gate
parameters: {"parameters":512}
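A hedged sketch of one common "smear" gate design: a learned per-channel sigmoid gate (512 parameters, matching the record's count, suggesting a 512-wide model) that blends each token's embedding with its predecessor's. The exact blending rule is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logits):
    # x: (seq, d_model); gate_logits: (d_model,) learned per-channel logits.
    # Each token is blended with the previous token ("smeared").
    g = sigmoid(gate_logits)
    prev = np.vstack([x[:1], x[:-1]])  # shift right; token 0 blends with itself
    return (1.0 - g) * x + g * prev

x = np.arange(6.0).reshape(3, 2)
smeared = smear_gate(x, np.zeros(2))  # gate = 0.5: even token/predecessor mix
```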
BigramHash
Hash embedding for token-pair features
parameters: {"buckets":2048,"dim":128}
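A sketch of the bigram-hash idea: each (previous, current) token pair is hashed into one of 2048 buckets, which indexes a 128-dim embedding table that can be added to the regular token embedding. The hash mixing constant is illustrative, not from the PR.

```python
import numpy as np

def bigram_hash_ids(tokens, buckets=2048):
    # Hash each (previous, current) token pair into a bucket id.
    # The multiplier 1000003 is an illustrative mixing constant.
    ids, prev = [], 0
    for t in tokens:
        ids.append(((prev * 1000003) ^ t) % buckets)
        prev = t
    return ids

buckets, dim = 2048, 128
rng = np.random.default_rng(0)
table = rng.standard_normal((buckets, dim)) * 0.02
feats = table[bigram_hash_ids([5, 17, 17, 900])]  # (4, 128) pair features
```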
KV head count
Grouped-query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
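Grouped-query attention with 8 query heads over 4 KV heads means each KV head is shared by 2 query heads. A minimal sketch of the KV expansion step:

```python
import numpy as np

def expand_kv(kv, heads=8, kv_heads=4):
    # kv: (kv_heads, seq, head_dim). Repeat each KV head so every group of
    # heads // kv_heads query heads shares one KV head.
    group = heads // kv_heads
    return np.repeat(kv, group, axis=0)  # (heads, seq, head_dim)

kv = np.arange(4 * 3 * 2, dtype=float).reshape(4, 3, 2)
full_kv = expand_kv(kv)  # (8, 3, 2)
```

In a real kernel the repeat is usually implicit (indexing, not materialization), which is where the memory savings come from.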
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_start":0.92,"warmup_end":0.99,"warmup_steps":1500}
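Given their placement under the optimizer and the terminal value matching the listed momentum of 0.99, these warmup parameters plausibly describe a momentum ramp rather than a learning-rate ramp; that reading is an assumption. A linear-ramp sketch:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Linear ramp of Muon's momentum from start to end over warmup_steps,
    # then held constant. That these params ramp momentum (not LR) is an
    # assumption based on their placement under the optimizer.
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```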
Weight Averaging
SWA
parameters: {"checkpoint_average":7,"scale_threshold":0.2}
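A sketch of the checkpoint-averaging step: a uniform average over the last k checkpoints (k = 7 in this record). The `scale_threshold` parameter is not explained in the PR and is not modeled here.

```python
def average_checkpoints(checkpoints):
    # checkpoints: list of dicts mapping param name -> list of floats.
    # Returns the uniform (SWA-style) average of all given checkpoints.
    k = len(checkpoints)
    return {name: [sum(c[name][i] for c in checkpoints) / k
                   for i in range(len(checkpoints[0][name]))]
            for name in checkpoints[0]}
```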
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
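Sliding-window evaluation with a short stride scores each token with near-full left context: after the first window, each subsequent window advances by 64 tokens and is scored only on those 64 new tokens. A sketch of the span bookkeeping (window size 2048 from the sequence-length config is an assumption):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    # Returns (ctx_start, ctx_end, score_start) triples: tokens in
    # [score_start, ctx_end) are scored using context [ctx_start, ctx_end).
    spans = [(0, min(window, n_tokens), 0)]
    end = min(window, n_tokens)
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((new_end - window, new_end, end))
        end = new_end
    return spans
```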
Initialization
OrthoInit
Orthogonal initialization with muP scaling
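A sketch of orthogonal init via QR of a Gaussian matrix, sign-corrected, then width-scaled. The exact muP scaling factor used in the PR is not stated; a 1/sqrt(fan_in) factor is assumed here.

```python
import numpy as np

def ortho_init(fan_out, fan_in, rng, scale=1.0):
    # Orthogonal columns via QR of a Gaussian matrix; sign fix makes the
    # distribution uniform over orthogonal matrices. The scale/sqrt(fan_in)
    # factor is an assumed muP-style width scaling.
    rows, cols = max(fan_out, fan_in), min(fan_out, fan_in)
    q, r = np.linalg.qr(rng.standard_normal((rows, cols)))
    q *= np.sign(np.diag(r))
    if fan_out < fan_in:
        q = q.T
    return q * scale / np.sqrt(fan_in)

rng = np.random.default_rng(0)
w = ortho_init(64, 32, rng)  # (64, 32) with orthogonal, scaled columns
```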
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmup_steps":1500,"warmup_start":0.92,"warmup_end":0.99}
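The listed parameters here mirror the optimizer's warmup entry; the "warmdown" shape itself is not given. A generic warmup → constant → linear warmdown ("trapezoid") schedule is sketched as an assumption, with illustrative peak LR and warmdown fraction:

```python
def lr_schedule(step, total_steps, warmup_steps=1500, peak_lr=1.0,
                warmdown_frac=0.3):
    # Warmup -> constant -> linear warmdown. peak_lr and warmdown_frac are
    # illustrative; the record lists only the warmup parameters.
    warmdown_steps = int(total_steps * warmdown_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step > total_steps - warmdown_steps:
        return peak_lr * (total_steps - step) / warmdown_steps
    return peak_lr
```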
Regularization
layerwise LN scale
parameters: {"formula":"1/sqrt(layer_idx+1)"}
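The per-layer scale follows directly from the stated formula, shrinking LayerNorm gain with depth:

```python
import math

def ln_scale(layer_idx):
    # Per-layer LayerNorm scale from the record's formula: 1/sqrt(layer_idx+1).
    # Layer 0 -> 1.0, layer 3 -> 0.5, layer 10 -> ~0.302 for the 11th layer.
    return 1.0 / math.sqrt(layer_idx + 1)
```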
Quantization
int5
bits: 5
scope: mixed precision weights
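A minimal sketch of symmetric 5-bit weight quantization with a per-tensor scale. The record's "mixed precision" scope, the late-QAT STE, and the GPTQ-lite clip search are not modeled here.

```python
import numpy as np

def quantize_int5(w):
    # Symmetric 5-bit quantization: integers in [-15, 15], per-tensor scale.
    max_abs = np.abs(w).max()
    scale = max_abs / 15.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

w = np.array([-1.0, 0.0, 0.5, 1.0])
q, s = quantize_int5(w)
```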
Novel Contributions
- 11-layer transformer with a competitive stack achieving 1.1399 val_bpb
- Exclusive Self Attention on the last 4 layers
- Partial RoPE with NTK-aware base scaling
- SmearGate learned token blending
- BigramHash token-pair feature embedding
- Int5 mixed precision with late QAT STE
- GPTQ-lite clip search during compression
- Muon optimizer with custom warmup schedule
- Tight SWA checkpoint averaging
- Custom Triton/CUDA kernel pipeline for future speedups