PR #1325

open

Record Submission: Poly5 Softcap + Z-Loss + YaRN + Zstd-22 + Stride-16 (on PR #549 stack)

by monisha-max
val_bpb: 1.3868
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 7.0 MB

Training Techniques

Regularization
logit softcap
parameters: {"type":"poly5"}
z-loss
parameters: {"type":"z-loss","weight":0.0001}
adaptive focal loss
parameters: {"gamma":1}
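The z-loss and focal terms above can be sketched in a few lines. This is a minimal pure-Python illustration, not the submission's implementation: z-loss penalizes the squared log-partition function with the listed weight 1e-4, and the focal factor (1 - p)**gamma with gamma=1 down-weights easy tokens. How the submission makes gamma "adaptive", and the exact poly5 softcap polynomial, are not specified, so neither is shown.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits], lse

def regularized_loss(logits, target, z_weight=1e-4, gamma=1.0):
    """Focal cross-entropy plus z-loss (sketch).

    z-loss = z_weight * logsumexp(logits)**2 keeps logits from drifting;
    the focal factor (1 - p)**gamma shrinks the loss on confident tokens.
    """
    logp, lse = log_softmax(logits)
    p = math.exp(logp[target])
    focal_ce = (1.0 - p) ** gamma * (-logp[target])
    z_loss = z_weight * lse ** 2
    return focal_ce + z_loss
```

On a confidently-correct token the focal factor makes the contribution near zero, while the z-loss term still nudges the overall logit scale down.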
Architecture
RoPE
YaRN positional encoding for improved frequency interpolation
parameters: {"max_len":2048}
BigramHash
Bigram vocabulary embedding component
parameters: {"size":1536}
SmearGate
SmearGate embedding/attention component
parameters: null
U-Net skip connections
U-Net style encoder-decoder skip connections with learned skip weights
parameters: {"encoders":5,"decoders":6}
LeakyReLU
LeakyReLU squared MLP activation
parameters: {"slope":0.5}
XSA
Cross/self attention variant used in the last 4 layers
parameters: {"last_layers":4}
VE128
Value embeddings at later layers
parameters: {"layers":[9,10]}
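For the YaRN entry, the core idea is per-frequency interpolation of the RoPE inverse frequencies: dimensions that rotate many times within the original context window are left alone (extrapolation), slow dimensions are interpolated by 1/scale, and a linear ramp blends the two. The sketch below follows the standard YaRN "NTK-by-parts" recipe; `dim`, `base`, `scale`, and the ramp boundaries are illustrative defaults, not values from this submission.

```python
import math

def yarn_inv_freqs(dim=64, base=10000.0, orig_len=2048, scale=4.0,
                   beta_fast=32.0, beta_slow=1.0):
    """Sketch of YaRN frequency interpolation for RoPE.

    Returns one adjusted inverse frequency per rotary dimension pair.
    keep=1 -> high-frequency dim, left as-is; keep=0 -> low-frequency
    dim, fully interpolated (divided by `scale`); ramp in between.
    """
    out = []
    for i in range(dim // 2):
        inv_freq = base ** (-2 * i / dim)
        # Number of full rotations this dim pair completes over orig_len.
        rotations = orig_len * inv_freq / (2 * math.pi)
        t = (rotations - beta_slow) / (beta_fast - beta_slow)
        keep = min(1.0, max(0.0, t))
        out.append(inv_freq * (keep + (1.0 - keep) / scale))
    return out
```

The full YaRN method also rescales attention temperature by roughly 0.1*ln(scale) + 1, omitted here for brevity.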
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":16}
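Sliding-window eval with stride=16 means each forward pass re-uses almost a full window of left context but scores only the newest 16 tokens, so nearly every token is predicted with close-to-maximal context at roughly window/stride times the compute of a non-overlapping pass. A sketch of the index bookkeeping (the window/stride defaults match the listed eval_length=2048 and stride=16):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=16):
    """(window_start, window_end, score_start) triples for sliding eval.

    The first window scores all of its tokens; each later window slides
    forward by `stride` and scores only the tokens not yet scored, so
    every token is counted exactly once toward the bpb total.
    """
    spans, scored_to = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, scored_to))
        scored_to = end
        if end == n_tokens:
            break
    return spans
```
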
Other
other
FA3/FA2/SDPA fallback for broader GPU compatibility
parameters: null
other
Residual vector quantization using int6 base plus int4 residual
parameters: null
other
Progressive depth warmup with staged layer freezing/unfreezing during training
parameters: {"stages":3}
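The progressive depth warmup entry (stages=3) suggests a schedule that trains a shallow prefix of the network first and unfreezes deeper layers in stages. The submission does not specify the grouping or ordering, so the following is only a plausible sketch: split the layer stack into `stages` equal groups and unfreeze one more group at each stage boundary.

```python
def trainable_layers(step, total_steps, n_layers=12, stages=3):
    """Hypothetical progressive depth warmup schedule.

    Stage k (0-based) is active for steps in [k/stages, (k+1)/stages) of
    training; stage k trains the first (k+1)*n_layers/stages layers, and
    the final stage always trains the full stack.
    """
    stage = min(stages - 1, step * stages // total_steps)
    if stage == stages - 1:
        unfrozen = n_layers
    else:
        unfrozen = (n_layers // stages) * (stage + 1)
    return list(range(unfrozen))
```
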
Quantization
mixed int6/int4
bits: 6
scope: weights
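The "mixed int6/int4" quantization and the residual-vector-quantization contribution fit together as: a coarse 6-bit code per value plus a finer 4-bit code for what the coarse grid missed. A scalar sketch (scales are caller-supplied here; a real implementation would derive them from per-tensor or per-channel statistics):

```python
def rvq_int6_int4(x, scale6, scale4):
    """Residual quantization sketch: int6 base code plus int4 residual.

    x is reconstructed as q6*scale6 + q4*scale4, with q6 clamped to the
    signed 6-bit range [-32, 31] and q4 to the signed 4-bit range [-8, 7].
    scale4 should be much smaller than scale6 so the residual code
    refines the coarse grid.
    """
    q6 = max(-32, min(31, round(x / scale6)))
    residual = x - q6 * scale6
    q4 = max(-8, min(7, round(residual / scale4)))
    return q6, q4, q6 * scale6 + q4 * scale4
```
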
Test-Time Training
score-first TTT
parameters: {"epochs":3}
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997}
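The two averaging rules listed are standard: an exponential moving average with the stated decay 0.997, and an equal-weight running (SWA-style) average over checkpoints. How the submission combines the two is not specified; the updates themselves look like this:

```python
def ema_update(ema, weights, decay=0.997):
    """One EMA step: shadow <- decay*shadow + (1-decay)*current."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

def swa_update(swa, weights, n_averaged):
    """Equal-weight running average after n_averaged checkpoints."""
    return [(s * n_averaged + w) / (n_averaged + 1)
            for s, w in zip(swa, weights)]
```
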
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
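A "warmdown" schedule holds the learning rate constant and then decays it linearly to zero over the final warmdown_steps=3500 steps. A minimal sketch (any initial warmup the run may have used is omitted):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear warmdown to 0 over the last
    `warmdown_steps` steps of training."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```
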
Sequence Length
sequence_length
train_length: null
eval_length: 2048

Novel Contributions

  • Adaptive focal cross-entropy loss
  • Residual vector quantization
  • Progressive depth warmup
  • Poly5 softcap
  • Z-loss regularization
  • YaRN positional encoding
  • zstd-22 compression
  • Sliding eval stride=16
  • FA3/FA2/SDPA fallback