PR #1086
openTrack A: 11L U-Net + BigramHash + SmearGate + Partial RoPE + QAT (1.1349 bpb)
by Omrigotlieb
val_bpb
1.1349
Architecture
Transformer
Optimizer
Muon
Artifact Size
16.33MB
Training Techniques
Architecture
U-Net skip connections
11-layer U-Net transformer with encoder-decoder skip connections
parameters: {"layers":11,"encoder_layers":5,"decoder_layers":6}
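A minimal sketch of the 5-encoder / 6-decoder skip wiring. The LIFO pairing (encoder layer i feeds decoder layer last-i), the learned skip scalars, and the model width are assumptions; the card only gives the layer counts.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # model width is illustrative; the card does not state it

def block(x, w):
    # stand-in for a full transformer block: a simple residual nonlinearity
    return x + np.tanh(x @ w)

enc_w = [rng.normal(0.0, 0.02, (D, D)) for _ in range(5)]   # 5 encoder layers
dec_w = [rng.normal(0.0, 0.02, (D, D)) for _ in range(6)]   # 6 decoder layers
skip_gain = [1.0] * 5  # learned scalars in the real model (init assumed)

def unet_forward(x):
    skips = []
    for w in enc_w:                    # encoder: run and remember activations
        x = block(x, w)
        skips.append(x)
    x = block(x, dec_w[0])             # extra decoder layer with no skip partner
    for g, w in zip(skip_gain, dec_w[1:]):
        x = x + g * skips.pop()        # LIFO pairing: enc i <-> dec (last-i)
        x = block(x, w)
    return x

out = unet_forward(rng.normal(size=(4, D)))
```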
GQA
Grouped query attention with fewer KV heads than query heads
parameters: {"query_heads":8,"kv_heads":4,"head_dim":64}
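With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads, halving the KV cache. A NumPy sketch of the grouping (batch dimension omitted):

```python
import numpy as np

T, Hq, Hkv, Dh = 6, 8, 4, 64   # seq len illustrative; head counts from the card
rng = np.random.default_rng(1)
q = rng.normal(size=(Hq, T, Dh))
k = rng.normal(size=(Hkv, T, Dh))
v = rng.normal(size=(Hkv, T, Dh))

def gqa(q, k, v):
    group = Hq // Hkv                       # 2 query heads share each KV head
    k = np.repeat(k, group, axis=0)         # (Hq, T, Dh)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(Dh)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True)) # softmax over keys
    w /= w.sum(-1, keepdims=True)
    return w @ v

out = gqa(q, k, v)
```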
MLP3x
MLP with 3x expansion
parameters: {"expansion":3}
LeakyReLU
LeakyReLU squared activation in the MLP
parameters: {"slope":0.5}
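A sketch of the MLP with 3x expansion and the LeakyReLU-squared activation. Whether the square preserves the sign on the negative branch is an assumption (the sign-preserving form keeps the activation monotone); the width is illustrative.

```python
import numpy as np

D = 64  # illustrative width
rng = np.random.default_rng(2)
W_in = rng.normal(0.0, 0.02, (D, 3 * D))    # 3x expansion
W_out = rng.normal(0.0, 0.02, (3 * D, D))

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then a sign-preserving square
    # (sign preservation is an assumption; the card only names the activation)
    y = np.where(x > 0.0, x, slope * x)
    return np.sign(y) * y * y

def mlp(x):
    return leaky_relu_sq(x @ W_in) @ W_out

out = mlp(rng.normal(size=(4, D)))
```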
BigramHash
BigramHash embeddings with projection
parameters: {"buckets":8192,"projection_dim":128}
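A sketch of hashed bigram embeddings: the (previous, current) token pair is hashed into one of 8192 buckets, looked up in a 128-dim table, and projected to model width. The exact hash function, the position-0 padding, and the model width are assumptions.

```python
import numpy as np

BUCKETS, PROJ, D = 8192, 128, 64   # D is illustrative; buckets/projection from the card
rng = np.random.default_rng(3)
bigram_table = rng.normal(0.0, 0.02, (BUCKETS, PROJ))
proj = rng.normal(0.0, 0.02, (PROJ, D))

def bigram_bucket(prev_tok, tok):
    # hash the (previous, current) token pair into a bucket;
    # the multiplier and this exact hash are assumptions
    return (prev_tok * 1000003 + tok) % BUCKETS

def bigram_embed(tokens):
    prev = np.concatenate([[0], tokens[:-1]])    # padding at position 0 (assumed)
    idx = bigram_bucket(prev, tokens)
    return bigram_table[idx] @ proj              # (T, D), added to token embeddings

emb = bigram_embed(np.array([5, 17, 5, 17]))
```

Identical bigrams collide into the same bucket by construction, so repeated pairs share an embedding row.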
SmearGate
Learned previous-token blend after embedding normalization
parameters: null
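A sketch of the smear gate: after embedding normalization, each position blends in its predecessor through a learned sigmoid gate. The scalar gate and the additive form are assumptions; the card only says "learned previous-token blend".

```python
import numpy as np

def smear_gate(x, gate_logit=-2.0):
    # x_t <- x_t + sigmoid(g) * x_{t-1}, with g learned
    # (scalar gate and additive blend are assumptions)
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                    # position 0 has no previous token
    return x + g * prev

x = np.eye(3, 4)
y = smear_gate(x)
```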
XSA
Exclusive self-attention applied in the last 4 layers
parameters: {"layers":4}
Partial RoPE
Only part of the head dimension uses rotary position embeddings
parameters: {"rotated_dims":16,"total_dims":64}
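Only the first 16 of the 64 head dimensions get the rotary transform; the remaining 48 pass through untouched. A sketch for a single head (the split-half rotation layout and frequency base are assumptions):

```python
import numpy as np

ROT, DH = 16, 64   # rotate only the first 16 of 64 head dims

def partial_rope(x, base=10000.0):
    # x: (T, DH); rotary embedding on dims [0, ROT), identity on the rest
    T = x.shape[0]
    half = ROT // 2
    inv_freq = base ** (-np.arange(half) / half)   # frequency spacing assumed
    ang = np.outer(np.arange(T), inv_freq)         # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:ROT]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, ROT:]], axis=1)

rng = np.random.default_rng(6)
x = rng.normal(size=(5, DH))
out = partial_rope(x)
```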
vocab_bias
Learned per-token logit prior
parameters: null
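The per-token logit prior is just a learned bias added to the output logits, letting the model absorb unigram frequency without spending capacity in the unembedding matrix. A sketch (sizes illustrative, zero init assumed):

```python
import numpy as np

V, D = 1000, 64  # illustrative sizes
rng = np.random.default_rng(4)
W_unembed = rng.normal(0.0, 0.02, (D, V))
vocab_bias = np.zeros(V)   # learned per-token prior; zero init is an assumption

def logits(x):
    # bias shifts every position's distribution toward the corpus unigram prior
    return x @ W_unembed + vocab_bias

out = logits(rng.normal(size=(3, D)))
```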
Regularization
z-loss
parameters: {"weight":0.0001}
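z-loss penalizes the squared log-partition function of the logits, keeping them well-scaled (which also helps later quantization). A minimal sketch with the card's weight of 1e-4:

```python
import numpy as np

def z_loss(logits, weight=1e-4):
    # z = logsumexp over the vocab; penalizing z^2 discourages logit drift
    m = logits.max(axis=-1, keepdims=True)
    z = np.log(np.exp(logits - m).sum(axis=-1)) + m.squeeze(-1)
    return weight * np.mean(z ** 2)
```

For uniform logits over 4 classes, z = log 4 at every position, so the penalty is `1e-4 * (log 4)^2`.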
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_momentum_start":0.92,"warmup_momentum_end":0.99,"warmup_steps":1500,"adam_for":"embeddings/scalars"}
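The momentum warmup ramps Muon's momentum from 0.92 to 0.99 over the first 1500 steps. Linear interpolation is an assumption; the card gives only the endpoints and the step count.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # ramp momentum over the first warmup_steps, then hold at the end value
    # (linear shape is an assumption)
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```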
Weight Averaging
EMA
parameters: {"decay":0.997}
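The EMA keeps a running average of the weights with decay 0.997; the averaged copy is what gets evaluated and shipped. A one-step sketch:

```python
import numpy as np

def ema_update(avg, params, decay=0.997):
    # shadow average of all weights, updated once per optimizer step
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in params}

avg = {"w": np.zeros(2)}
avg = ema_update(avg, {"w": np.ones(2)})
```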
Quantization
late QAT
bits: 6
scope: last 15% of warmdown
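During the last 15% of warmdown, the forward pass sees quantize-dequantized weights so the network adapts to 6-bit rounding before the real quantization. A sketch of the fake-quant forward (the straight-through backward and the per-tensor scale choice are assumptions):

```python
import numpy as np

def fake_quant(w, bits=6):
    # forward: round to the int6 grid and map back to floats;
    # backward would pass gradients straight through (not shown)
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6
    scale = max(np.abs(w).max() / qmax, 1e-12)
    return np.round(w / scale) * scale

w = np.array([0.5, -1.0, 0.25])
wq = fake_quant(w)
```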
mixed int6/int8
bits: null
scope: embeddings, MLP, attention
GPTQ-lite
bits: null
scope: per-row
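A sketch of the per-row quantization step: one symmetric scale per output row, with int6 values stored in int8 containers. Full GPTQ additionally compensates rounding error with second-order weight updates, which this omits.

```python
import numpy as np

def quantize_rows(W, bits):
    # symmetric per-row quantization: one scale per output row
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(W).max(axis=1, keepdims=True) / qmax, 1e-12)
    q = np.round(W / scale).astype(np.int8)   # int6 values fit in int8 storage
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

rng = np.random.default_rng(5)
W = rng.normal(size=(4, 16))
q6, s6 = quantize_rows(W, 6)
```

Mixing bit widths then amounts to calling `quantize_rows` with 6 or 8 depending on the tensor's sensitivity.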
Compression
zstd
level: 22
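Level 22 is zstd's maximum ratio setting and requires the `--ultra` flag on the command line. A hypothetical invocation (the real artifact filename is not stated):

```shell
# levels above 19 need --ultra; filename is hypothetical
zstd --ultra -22 checkpoint.bin -o checkpoint.bin.zst
```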
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"steps":3500,"wallclock_adaptive":true}
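The warmdown holds the learning rate constant, then decays it linearly to zero over the final 3500 steps; the wallclock-adaptive variant would rescale that window at runtime. Linear decay is an assumption; the card gives only the step count.

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_steps=3500):
    # constant LR, then linear decay to 0 over the last warmdown_steps
    # (decay shape assumed; wallclock-adaptive rescaling not shown)
    into = step - (total_steps - warmdown_steps)
    if into <= 0:
        return base_lr
    return base_lr * (1.0 - into / warmdown_steps)
```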
Novel Contributions
- 11-layer U-Net transformer with skip connections
- BigramHash embeddings
- SmearGate token blending
- Partial RoPE
- Exclusive self-attention in the last 4 layers
- Mixed int6/int8 GPTQ-lite quantization
- Late QAT during warmdown
- Muon optimizer with EMA