PR #1415 (open)

Record: SP4096 + 3-Layer Recurrence + GPTQ Embeddings + SDClip + ETLB — val_bpb 1.0913 (3-seed mean)

  • val_bpb: 1.0913
  • Architecture: Transformer
  • Optimizer: Muon
  • Artifact Size: ~14.75 MB
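For reference, the headline metric follows the usual bits-per-byte convention: summed token-level negative log-likelihood converted from nats to bits, divided by the byte length of the evaluated text. This helper is an illustration of that standard definition, not code from the PR:

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed NLL (in nats) over a corpus into bits per byte."""
    return total_nll_nats / math.log(2) / total_bytes

# Sanity check: an NLL of ln(2) nats per byte is exactly 1.0 bpb.
bpb = bits_per_byte(math.log(2) * 800, 800)
```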

Training Techniques

Quantization
  • GPTQ (bits: 8, scope: embeddings)
  • GPTQ (bits: 6, scope: all)
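The quantization entries above can be illustrated with a simplified uniform quantizer. This is a hedged sketch only: real GPTQ additionally runs a Hessian-weighted error-correction pass not shown here, and "SDClip" is assumed to mean clipping the quantization range at a multiple of the weights' standard deviation before rounding:

```python
import math

def sdclip_quantize_row(row, bits=8, sigma_clip=3.0):
    """Uniform symmetric quantize/dequantize of one weight row, with the
    range clipped at sigma_clip standard deviations (assumed SDClip-style
    behavior; a stand-in for the PR's GPTQ + SDClip pipeline)."""
    qmax = 2 ** (bits - 1) - 1
    mean = sum(row) / len(row)
    std = math.sqrt(sum((x - mean) ** 2 for x in row) / len(row))
    scale = (sigma_clip * std + 1e-12) / qmax
    # Round to the integer grid, clip outliers, then dequantize.
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in row]

row = [0.5, -1.2, 0.03, 2.4, -0.7, 0.0, 1.1, -2.0]
deq = sdclip_quantize_row(row, bits=8)
```

At 8 bits the round-trip error stays within half a quantization step for any weight inside the clip range.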
Compression
  • lzma (level: null)
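A minimal example of the LZMA wrapper idea, using Python's standard-library lzma module. The PR's level: null leaves the compression preset unspecified, so preset=9 below is an assumption:

```python
import lzma

payload = b"model weights placeholder " * 100  # stand-in for the artifact bytes
packed = lzma.compress(payload, preset=9)      # preset=9 is an assumed setting
restored = lzma.decompress(packed)
ratio = len(packed) / len(payload)             # artifact savings from the wrapper
```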
Architecture
  • 3-layer depth recurrence applied to layers 3, 4, and 5 (parameters: {"layers":[3,4,5]})
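The depth-recurrence idea can be sketched as a forward pass in which layers 3–5 form a weight-shared block applied repeatedly. The PR does not state how many recurrence iterations are used, so steps=3 below is purely illustrative, and the toy "layers" stand in for real transformer blocks:

```python
def run_with_depth_recurrence(x, layers, recur_ids=(3, 4, 5), steps=3):
    """Forward pass where the layers in recur_ids are applied `steps` times
    in sequence, reusing the same weights each iteration (a sketch of the
    PR's 3-layer depth recurrence; iteration count is an assumption)."""
    i = 0
    while i < len(layers):
        if i == recur_ids[0]:
            for _ in range(steps):
                for j in recur_ids:
                    x = layers[j](x)
            i = recur_ids[-1] + 1  # skip past the recurrent block
        else:
            x = layers[i](x)
            i += 1
    return x

# Toy "layers": each adds its own index, so the trace is easy to check.
layers = [lambda x, k=k: x + k for k in range(6)]
out = run_with_depth_recurrence(0, layers)
```

With these toy layers the output is 0+1+2 plus three passes of 3+4+5, i.e. 39, confirming the block ran three times.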
Evaluation
  • sliding window eval (parameters: {"stride":64})
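Sliding-window evaluation with a stride scores a long sequence in overlapping windows, counting each token's loss exactly once while giving later tokens more context than disjoint chunking would. The window size is not stated in the PR, so window=256 below is an assumption; only stride=64 comes from the parameters:

```python
def sliding_window_spans(seq_len, window=256, stride=64):
    """Return (begin, end, n_scored) spans for sliding-window eval: each
    window advances by `stride` and scores only the tokens not already
    covered by the previous window (window=256 is an assumed size)."""
    spans, prev_end = [], 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return spans

spans = sliding_window_spans(400)
```

Every token is scored exactly once, and no window exceeds the context length.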
Other
  • Eval-time logit bias (ETLB) optimized on context tokens during sliding-window evaluation (parameters: {"method":"ETLB","steps":5,"learning_rate":0.05,"clip":3,"warm_start":true})
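The ETLB entry can be sketched as fitting a single per-vocabulary bias vector, added to the model's logits, by a few gradient steps of cross-entropy on the tokens already seen as context. The hyperparameters below mirror the parameters dict (steps=5, learning_rate=0.05, clip=3, warm start via the `bias` argument); everything else, including the exact loss and averaging, is an assumption about how ETLB works:

```python
import math

def etlb_bias(logit_rows, targets, vocab, steps=5, lr=0.05, clip=3.0, bias=None):
    """Fit a per-vocab logit bias b by gradient descent on softmax
    cross-entropy over context tokens (a sketch of the PR's ETLB; the
    gradient of CE w.r.t. a logit is softmax(z) minus the one-hot target).
    Passing a previous window's bias in via `bias` is the warm start."""
    b = list(bias) if bias is not None else [0.0] * vocab
    for _ in range(steps):
        grad = [0.0] * vocab
        for row, t in zip(logit_rows, targets):
            z = [row[v] + b[v] for v in range(vocab)]
            m = max(z)  # subtract max for numerical stability
            exps = [math.exp(v - m) for v in z]
            s = sum(exps)
            for v in range(vocab):
                grad[v] += exps[v] / s  # softmax probability
            grad[t] -= 1.0              # minus one-hot target
        n = len(targets)
        b = [bv - lr * gv / n for bv, gv in zip(b, grad)]
        b = [max(-clip, min(clip, bv)) for bv in b]  # clip: 3
    return b

# Toy context: uniform logits, target token 0 every time -> b[0] should rise.
rows = [[0.0, 0.0, 0.0, 0.0]] * 8
bias = etlb_bias(rows, [0] * 8, vocab=4)
```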
Regularization
  • weight decay (parameters: {"weight_decay":0.095})
LR Schedule
  • higher LR compensation (parameters: {"matrix_lr":0.022})

Novel Contributions

  • SP4096 vocabulary
  • GPTQ quantization on embeddings
  • SDClip quantization clipping
  • 3-layer depth recurrence
  • Eval-time logit bias (ETLB)
  • QK-Gain 5.0
  • LZMA code wrapper for artifact savings