PR #1291

open

Record: Vocab4096 + MLP4.0x + SLOT - val_bpb 1.0925 (3-seed mean)

by dentity007
val_bpb
1.0925
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.95 MB

Training Techniques

Architecture
GQA
11-layer transformer with 8 query heads and 4 KV heads
parameters: {"layers":11,"d_model":512,"q_heads":8,"kv_heads":4}
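With 8 query heads and 4 KV heads, each KV head is presumably shared by a group of 2 query heads. The sketch below shows that mapping and the resulting K/V projection width. It assumes contiguous grouping and a head dimension of d_model / q_heads, neither of which the PR states; the helper name is hypothetical.

```python
# Hypothetical helper: map each query head to the KV head it shares under GQA.
# Assumes contiguous grouping (query heads 0-1 -> KV 0, heads 2-3 -> KV 1, ...).
def kv_head_for_query_head(q_head: int, q_heads: int = 8, kv_heads: int = 4) -> int:
    if q_heads % kv_heads != 0:
        raise ValueError("q_heads must be a multiple of kv_heads")
    group_size = q_heads // kv_heads  # 2 query heads per KV head here
    return q_head // group_size

mapping = [kv_head_for_query_head(h) for h in range(8)]

# Width of the K (or V) projection output, assuming head_dim = d_model / q_heads.
d_model, q_heads, kv_heads = 512, 8, 4
head_dim = d_model // q_heads   # assumed head dimension
kv_width = kv_heads * head_dim  # half the width full multi-head attention would need
```

Under these assumptions the K and V projections are half as wide as in standard multi-head attention, which also shrinks the quantized artifact.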
XSA
XSA applied to all layers
parameters: {"layers":11}
MLP4x
MLP width expanded to 4.0x
parameters: {"multiplier":4}
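As a rough sense of scale, the arithmetic below works out the MLP hidden width and weight count implied by the 4.0x multiplier. It assumes a plain two-matrix MLP (up and down projection) with no biases or gating, which the PR does not confirm.

```python
# Assumption: plain up-projection + down-projection MLP, no biases, no gating.
d_model = 512
multiplier = 4.0
layers = 11

hidden = int(d_model * multiplier)            # MLP hidden width
mlp_params_per_layer = 2 * d_model * hidden   # up matrix + down matrix
mlp_params_total = layers * mlp_params_per_layer
```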
U-Net skip connections
Sigmoid-gated U-Net style skip connections
parameters: null
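The PR gives no parameters for the skip connections, so the following is only one plausible reading: the activation saved from an earlier layer is scaled by a sigmoid of a learned gate before being added back in a later layer. The function names and the exact form are assumptions.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical form: out = x + sigmoid(gate_logit) * skip, where `skip` is the
# activation saved from the mirrored earlier layer and `gate_logit` is learned.
def gated_skip(x, skip, gate_logit):
    g = sigmoid(gate_logit)
    return [xi + g * si for xi, si in zip(x, skip)]

# A gate logit of 0 passes half of the skip activation through.
out = gated_skip([1.0, 2.0], [0.5, -0.5], gate_logit=0.0)
```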
Weight Averaging
EMA
parameters: {"decay":0.997}
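A minimal sketch of an EMA update with the listed decay of 0.997, using plain lists in place of tensors. How often the PR applies the update, and whether the EMA weights are the ones quantized and scored, is not stated here.

```python
def ema_update(ema, weights, decay=0.997):
    # Each averaged value moves a fraction (1 - decay) toward the live weight.
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

# Toy run: the live weight is held at 1.0 and the average starts at 0.0.
ema = [0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0])
```

With decay 0.997 the average has an effective horizon of roughly 1 / (1 - 0.997), or about 333 updates.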
Optimizer
AdamW
weight_decay: 0.085
momentum: null
other_params: {"lr":0.02}
LR Schedule
warmdown
parameters: {"warmdown_percent":66.7}
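One way to read a 66.7% warmdown is that the learning rate stays at its peak for the first third of training and then decays over the remaining two thirds. The sketch assumes a linear decay to zero and ignores any warmup; both are assumptions, as is the function name.

```python
def lr_at(step, total_steps, peak_lr=0.02, warmdown_frac=0.667):
    # Assumption: constant peak LR, then linear decay to zero over the final
    # `warmdown_frac` of training. Warmup is not modeled.
    warmdown_start = total_steps * (1.0 - warmdown_frac)
    if step < warmdown_start:
        return peak_lr
    remaining = (total_steps - step) / (total_steps - warmdown_start)
    return peak_lr * max(remaining, 0.0)

early = lr_at(300, 1000)   # still in the constant phase
late = lr_at(667, 1000)    # roughly halfway through the warmdown
final = lr_at(1000, 1000)  # end of training
```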
Quantization
GPTQ
bits: 6
scope: all
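To illustrate what a 6-bit grid means, the sketch below quantizes one weight row with a symmetric per-row scale and plain round-to-nearest. This is not GPTQ: GPTQ additionally uses second-order (Hessian) information to compensate rounding error, which is omitted here. The per-row scaling is also an assumption.

```python
def quantize_int6(row):
    # Signed 6-bit integers span [-32, 31]; a symmetric grid uses [-31, 31].
    qmax = 31
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Plain round-to-nearest (NOT GPTQ's error-compensated rounding).
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

q, scale = quantize_int6([0.31, -0.10, 0.0, 0.15])
restored = dequantize(q, scale)
```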
Compression
brotli
level: 11
Evaluation
sliding window eval
parameters: null
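Sliding window evaluation typically scores each token once while giving it as much left context as the window allows. The sketch below only produces the window boundaries; the window and stride values are toy numbers, not the PR's settings, and the function name is hypothetical.

```python
def sliding_windows(num_tokens, window=8, stride=4):
    """Return (start, end, score_from) spans: the model sees tokens[start:end],
    and only positions score_from..end-1 count toward the loss."""
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window] so no token is skipped")
    spans = []
    scored_upto = 0  # first position that has not been scored yet
    start = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans

spans = sliding_windows(16)
```

Every window after the first re-reads earlier tokens as context but scores only the new ones, so later tokens are not penalized for sitting at the start of a chunk.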
Other
other
SLOT eval-time optimization using a per-batch additive delta optimized on frozen hidden states before scoring
parameters: {"steps":8,"learning_rate":0.005,"delta_shape":[1,1,512]}
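The description above can be sketched as follows: hidden states from the frozen model are held fixed, a single delta of shape [1, 1, d_model] is broadcast over batch and sequence, and a few gradient steps reduce the next-token loss before scoring. The sketch uses NumPy with a hand-written gradient and plain gradient descent. The optimizer, the loss, the point where the delta is added, and which tokens supply the loss are all assumptions, and every name is hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_nll(hidden, delta, head, targets):
    # Logits from frozen hidden states plus the broadcast delta.
    probs = softmax((hidden + delta) @ head)   # (B, T, V)
    b, t = np.indices(targets.shape)
    return float(-np.log(probs[b, t, targets]).mean()), probs

def slot_fit_delta(hidden, head, targets, steps=8, lr=0.005):
    delta = np.zeros((1, 1, hidden.shape[-1]))  # one vector per batch
    onehot = np.eye(head.shape[1])[targets]     # (B, T, V)
    for _ in range(steps):
        _, probs = mean_nll(hidden, delta, head, targets)
        # Gradient of mean NLL w.r.t. delta: token average of (probs - onehot) @ head^T.
        grad = ((probs - onehot) @ head.T).mean(axis=(0, 1), keepdims=True)
        delta = delta - lr * grad               # plain gradient step (assumption)
    return delta

rng = np.random.default_rng(0)
B, T, D, V = 2, 8, 16, 32                       # toy sizes; the PR lists d_model 512
hidden = rng.normal(size=(B, T, D))             # stand-in for frozen hidden states
head = rng.normal(size=(D, V)) / np.sqrt(D)     # stand-in for the frozen output head
targets = rng.integers(0, V, size=(B, T))

loss_before, _ = mean_nll(hidden, np.zeros((1, 1, D)), head, targets)
delta = slot_fit_delta(hidden, head, targets)
loss_after, _ = mean_nll(hidden, delta, head, targets)
```

Only the delta is updated; the model weights and hidden states never change, which is what keeps the per-batch cost to a handful of small matrix products.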
Regularization
logit softcap
parameters: null
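Logit softcapping is commonly implemented as a scaled tanh that bounds logits smoothly. The PR lists no parameters, so the cap of 30.0 below is a placeholder, not the PR's value.

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    # Near-identity for small logits, saturating toward +/- cap for large ones.
    # The cap value is a placeholder assumption.
    return cap * math.tanh(logit / cap)

small = softcap(1.0)     # barely changed
large = softcap(1000.0)  # clamped to the cap
```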

Novel Contributions

  • SLOT eval-time optimization with per-batch delta updates on frozen hidden states
  • 4096-token vocabulary (Vocab4096) paired with a 4.0x MLP width
  • Full-Hessian GPTQ quantization to int6, followed by brotli compression
  • 3-seed verified record result with mean val_bpb 1.0925
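For readers unfamiliar with the metric, val_bpb (bits per byte) is usually derived from the mean per-token loss by converting nats to bits and normalizing by the byte length of the evaluated text. The sketch below shows that conversion; the function name is hypothetical and the numbers are toy values, not results from this PR.

```python
import math

def bits_per_byte(mean_nll_nats_per_token, num_tokens, num_bytes):
    # Total nats -> total bits, then normalize by the UTF-8 byte count.
    total_bits = mean_nll_nats_per_token * num_tokens / math.log(2)
    return total_bits / num_bytes

# Toy values only: 1 bit per token, 100 tokens covering 400 bytes of text.
example = bits_per_byte(math.log(2), num_tokens=100, num_bytes=400)
```

Because the denominator is bytes rather than tokens, the metric stays comparable across tokenizers with different vocabulary sizes.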