PR #543
Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804)
by rarce
val_bpb
1.1804
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.95MB
Training Techniques
Quantization
mixed int6/int8 with GPTQ-lite
bits: null
scope: layers 1-9 int6, layers 0 and 10 int8, FP16 embeddings
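A minimal sketch of this mixed-precision layout, assuming plain per-tensor symmetric quantization (helper names are hypothetical; the PR's actual code is not shown on this card):

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Per-tensor symmetric quantization to `bits` signed-integer levels."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale  # int6 values also fit in an int8 container

def bits_for_layer(layer_idx: int, n_layers: int = 11) -> int:
    # Boundary layers (0 and 10) stay at int8; layers 1-9 drop to int6.
    return 8 if layer_idx in (0, n_layers - 1) else 6

# Embeddings stay FP16 and never pass through quantize_symmetric.
for i in range(11):
    q, scale = quantize_symmetric(torch.randn(1408, 512), bits_for_layer(i))
```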
Architecture
Partial RoPE
Rotary embeddings on 16 of 64 head dims; the remaining 75% of dims are position-free
parameters: {"rotary_dims":16,"total_dims":64,"position_free_ratio":0.75}
LN Scale
LayerNorm output scaled by 1/sqrt(layer_idx+1) to damp deeper layers
parameters: null
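The scaling rule is given explicitly, so the sketch below is direct; wrapping it as a module is the only assumption:

```python
import math
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm whose output is scaled by 1/sqrt(layer_idx + 1), so deeper
    layers contribute progressively less to the residual stream."""
    def __init__(self, dim: int, layer_idx: int):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.scale = 1.0 / math.sqrt(layer_idx + 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ln(x) * self.scale
```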
XSA
Exclusive Self Attention on last 4 layers removes self-value bias via GQA-aware orthogonal projection
parameters: {"layers":4}
Shared VE128
Value embedding injection shared across layers 9 and 10
parameters: {"embedding_dim":128,"layers":[9,10]}
SmearGate
Learned per-dim gate blending current and previous token embeddings
parameters: null
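A minimal sketch of SmearGate as described; the sigmoid parameterization and initialization are assumptions:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.full((dim,), -2.0))  # init near "mostly current token"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (seq, dim). Blends each token with its predecessor, per dimension."""
        prev = torch.cat([x[:1], x[:-1]], dim=0)  # shift right; first token blends with itself
        g = torch.sigmoid(self.gate)              # learned per-dim gate in (0, 1)
        return (1 - g) * x + g * prev
```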
U-Net skip connections
5 encoder and 6 decoder skip connections
parameters: {"encoder_skips":5,"decoder_skips":6}
Tied embeddings
Input and output embeddings are tied
parameters: null
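Weight tying in its standard PyTorch form, for reference (one matrix serves both ends, halving embedding storage in the artifact):

```python
import torch.nn as nn

vocab, dim = 50257, 512
wte = nn.Embedding(vocab, dim)
lm_head = nn.Linear(dim, vocab, bias=False)
lm_head.weight = wte.weight  # tied input/output embeddings
```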
MLP hidden size
MLP hidden dimension reduced to 1408 for faster training and to fit within the artifact size budget
parameters: {"hidden_dim":1408}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025,"momentum_warmup":"0.92 to 0.99 over 1500 steps"}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"embed_lr":0.035,"scalar_lr":0.025}
Weight Averaging
Tight SWA
parameters: {"scale_threshold":0.2,"checkpoints_averaged":6,"checkpoint_interval":50,"quality_penalty":"zero"}
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"grad_clip":0.3}
Late QAT
parameters: {"activation_lr_scale_threshold":0.1,"step_activated":4070,"lr_halved_on_activation":true}
Compression
zstd
level: 22
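Packaging with the zstandard Python bindings at level 22, near the codec's maximum (file names are placeholders):

```python
import zstandard as zstd  # pip install zstandard

with open("model_int6_int8.bin", "rb") as f:
    raw = f.read()
compressed = zstd.ZstdCompressor(level=22).compress(raw)
with open("model_int6_int8.bin.zst", "wb") as f:
    f.write(compressed)
```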
Novel Contributions
- MLP hidden=1408 vs 1536: the narrower MLP fits within the 16MB artifact budget while enabling ~33% more training steps, yielding better val_bpb despite reduced per-step capacity
- Tight SWA with scale threshold <0.2 eliminates the quality penalty seen with standard SWA
- Late QAT activated at lr_scale <0.1 avoids disrupting Muon momentum while still providing a small but effective amount of quantization-aware adaptation
- GPTQ-lite clip-ratio search is a zero-training-cost method that reduces quantization reconstruction error by selecting an optimal per-tensor clipping ratio (sketched below)
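A hedged sketch of the clip-ratio search from the last bullet: per tensor, try a small grid of clipping ratios and keep the one minimizing reconstruction error, with no gradient steps taken (hence "zero training cost"). The grid and MSE metric are assumptions:

```python
import torch

def best_clip_ratio(w: torch.Tensor, bits: int, ratios=(1.0, 0.95, 0.9, 0.85, 0.8)):
    """Returns the (ratio, mse) pair minimizing quantization reconstruction error."""
    qmax = 2 ** (bits - 1) - 1
    best = (None, float("inf"))
    for r in ratios:
        scale = r * w.abs().max() / qmax          # clip tail values beyond ratio r
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        err = (q - w).pow(2).mean().item()
        if err < best[1]:
            best = (r, err)
    return best

ratio, mse = best_clip_ratio(torch.randn(1408, 512), bits=6)
```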