PR #816 (open)

Record submission: Poly5 Softcap + BigramHash(3072) + Wider GPTQ-lite…

by jimliu741523
val_bpb: 1.1194
Architecture: 11-layer Transformer
Optimizer: Parallel Muon
Artifact Size:

Training Techniques

Architecture
  • BigramHash: increased bigram hash embedding vocabulary from 2048 to 3072. parameters: {"vocab_size":3072}
  • Partial RoPE: partial rotary positional embeddings applied to part of the dimensions. parameters: {"dimensions":"16/64"}
  • XSA: XSA attention used in the last 4 layers. parameters: {"layers":4}
  • Tied embeddings: input and output embeddings are tied. parameters: null
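
A minimal sketch of how the BigramHash and Partial RoPE entries above could be implemented, assuming BigramHash hashes each (previous, current) token pair into a fixed-size embedding table (2048 raised to 3072 slots here) and that "16/64" means rotary embeddings are applied to the first 16 of 64 head dimensions. The class/function names, the hash function, and the embedding width are illustrative, not taken from the submission.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hash each (previous, current) token pair into a small table and return the
    looked-up vector, to be added to the ordinary token embedding."""
    def __init__(self, hash_vocab: int = 3072, dim: int = 768):
        super().__init__()
        self.hash_vocab = hash_vocab          # table size: raised from 2048 to 3072
        self.table = nn.Embedding(hash_vocab, dim)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # idx: (batch, seq) token ids
        prev = torch.roll(idx, shifts=1, dims=1)
        prev[:, 0] = 0                        # no predecessor at position 0
        h = (prev * 1000003 + idx) % self.hash_vocab   # cheap multiplicative pair hash
        return self.table(h)

def apply_partial_rope(x: torch.Tensor, rot_dims: int = 16, base: float = 10000.0) -> torch.Tensor:
    """Rotate only the first `rot_dims` of each head's dimensions; the rest pass
    through unrotated. x: (batch, heads, seq, head_dim)."""
    t = x.size(-2)
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    ang = torch.arange(t, dtype=x.dtype, device=x.device)[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```
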
Quantization
  • GPTQ-lite (bits: 6, scope: all)
  • STE QAT (bits: 6, scope: all)
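
A hedged sketch of the two quantization passes listed above: straight-through-estimator (STE) fake quantization at 6 bits during training, plus a post-training 6-bit quantizer that searches over clipping percentiles (the "wider GPTQ-lite percentile search with 9 candidates" from the contributions list). Per-tensor symmetric scaling, the candidate grid, and the MSE selection criterion are assumptions.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """QAT helper: fake-quantize weights in the forward pass, let the gradient
    flow straight through to the full-precision weights in the backward pass.
    Typical use: w_q = FakeQuantSTE.apply(w) inside the layer's forward."""
    @staticmethod
    def forward(ctx, w: torch.Tensor, bits: int = 6) -> torch.Tensor:
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                  # straight-through estimator

def percentile_clip_quantize(w: torch.Tensor, bits: int = 6, n_candidates: int = 9) -> torch.Tensor:
    """Post-training quantization with a clip search: try `n_candidates` clipping
    percentiles, quantize under each, keep the clip with the lowest MSE."""
    qmax = 2 ** (bits - 1) - 1
    sorted_abs = w.detach().abs().flatten().sort().values
    best_q, best_err = w, float("inf")
    for p in torch.linspace(0.99, 1.0, n_candidates):
        k = int(float(p) * (sorted_abs.numel() - 1))
        clip = sorted_abs[k].clamp(min=1e-8)
        scale = clip / qmax
        q = torch.round((w / scale).clamp(-qmax - 1, qmax)) * scale
        err = (q - w).pow(2).mean().item()
        if err < best_err:
            best_q, best_err = q, err
    return best_q
```
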
Optimizer
  • Parallel Muon (weight_decay: 0.04, momentum: 0.99, other_params: {"warmup_momentum":0.92,"warmup_steps":1500})
  • AdamW (weight_decay: 0.04, momentum: null, other_params: {"used_for":"embeddings/scalars"})
Weight Averaging
  • EMA (parameters: {"decay":0.997})
  • SWA (parameters: {"interval_steps":50,"condition":"scale < 0.2"})
Compression
  • lzma (level: 9)
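
Packing the final artifact with LZMA at preset 9, matching the listed level. The file name and the choice to compress a serialized state dict are illustrative.

```python
import io
import lzma
import torch

def save_compressed(model: torch.nn.Module, path: str = "model.pt.xz") -> None:
    """Serialize the state dict to memory, then write it through an LZMA stream at preset 9."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    with lzma.open(path, "wb", preset=9) as f:
        f.write(buf.getvalue())
```
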
Evaluation
  • sliding window eval (parameters: {"stride":64,"temperature":0.95})
Test-Time Training
  • score-first TTT (parameters: {"epochs":3,"optimizer":"SGD","all_blocks_unfrozen":true})
LR Schedule
  • warmdown (parameters: {"warmdown_steps":3500})
Regularization
  • z-loss (parameters: {"weight":0.0001})
  • LN scale (parameters: {"scale_rule":"1/sqrt(layer+1)"})

Novel Contributions

  • Poly-5 softcap replacing tanh for better compile fusion (see the sketch after this list)
  • BigramHash vocabulary increased from 2048 to 3072
  • Wider GPTQ-lite percentile search with 9 candidates
  • Temperature scaling at evaluation with T=0.95
  • Z-loss regularization with weight 1e-4
  • LZMA preset 9 compression
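
A hedged sketch of the Poly-5 softcap. Standard logit softcapping computes cap * tanh(x / cap); replacing tanh with an odd degree-5 polynomial keeps the whole op elementwise polynomial arithmetic, which fuses cleanly under torch.compile. The cap value and the coefficients below (the tanh Taylor series) are placeholders; the submission's fitted polynomial is not given.

```python
import torch

def poly5_softcap(x: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    """Softcap via a degree-5 odd polynomial standing in for tanh."""
    u = (x / cap).clamp(-1.0, 1.0)                     # keep the polynomial in its accurate range
    t = u - (u ** 3) / 3.0 + 2.0 * (u ** 5) / 15.0     # degree-5 approximation of tanh(u)
    return cap * t
```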