PR #1130

open

Record: 1.1140 BPB — ResidLambdas + Split-LR + Train-Budget GPTQ + Coprime Loader (12-seed mean)

by Gusanidas
val_bpb
1.1140
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.88 MB

Training Techniques

Architecture
Residual lambdas
Learnable per-sublayer residual scaling with exponential recency bias across layers.
parameters: {"init":"sqrt(1.1)"}
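A minimal sketch of the residual-lambda idea in pure Python: each sublayer's skip path is scaled by a learnable lambda, initialized per the reported sqrt(1.1). The exponential recency-bias decay factor (0.9) and the layer count are illustrative assumptions, not from the PR.

```python
import math

# Hedged sketch of "residual lambdas": each sublayer's residual branch
# is scaled by a learnable lambda. Init follows the reported
# parameters ({"init": "sqrt(1.1)"}); the exponential recency bias
# across layers (decay 0.9, deeper layers kept stronger) is assumed.
N_LAYERS = 12
INIT = math.sqrt(1.1)
DECAY = 0.9  # assumed recency-bias factor, not from the PR

# Earlier layers' lambdas decay exponentially; the last layer keeps INIT.
lambdas = [INIT * DECAY ** (N_LAYERS - 1 - l) for l in range(N_LAYERS)]

def residual_step(x, sublayer_out, lam):
    """One residual connection with a learnable scale on the skip path."""
    return lam * x + sublayer_out
```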
BigramHash
Expanded hash-based embedding table to reduce collision ratio.
parameters: {"buckets":6144}
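A hedged sketch of how a hash-based bigram embedding lookup could work. The bucket count (6144) is from the reported parameters; the mixing multiplier and hashing scheme are illustrative assumptions.

```python
# Hedged sketch of a hash-based bigram embedding lookup. The bucket
# count (6144) is from the PR; the fixed odd multiplier is assumed.
N_BUCKETS = 6144

def bigram_bucket(prev_token: int, token: int) -> int:
    # Mix the two token ids, then reduce modulo the table size.
    # A larger table lowers the collision ratio between bigrams.
    return (prev_token * 1000003 + token) % N_BUCKETS

# Each bucket indexes a row of a learned (N_BUCKETS, d_model) embedding
# table; bigrams that collide share a row.
```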
VE196
Increased value embedding dimension on selected layers.
parameters: {"dimensions":196,"layers":[5,9,10]}
XSA
Exclusive self-attention applied to the last 7 layers.
parameters: {"layers":7}
LeakyReLU
LeakyReLU squared activation in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
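One plausible reading of "LeakyReLU squared" as a sketch: apply LeakyReLU with the reported negative_slope of 0.5, then square while preserving sign so the activation stays monotone. The sign-preserving choice is an assumption; plain squaring would fold the negative branch back to positive values.

```python
def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    # Hedged sketch: LeakyReLU followed by a sign-preserving square.
    # The sign-preserving variant is an assumption, not confirmed
    # by the PR.
    l = x if x >= 0.0 else negative_slope * x
    return l * abs(l)
```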
Cache + backout
Cached layer 7 hidden state is subtracted back via a learnable gate before the LM head.
parameters: {"layer":7}
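The cache-and-backout path can be sketched with scalars standing in for hidden-state vectors: the layer-7 output is cached, and a learnable gate subtracts it back out before the LM head. The gate value (0.5) and toy layers are illustrative assumptions.

```python
# Hedged sketch of "cache + backout": the hidden state after layer 7
# is cached, and before the LM head a learnable gate subtracts it back
# out of the final hidden state. Gate init (0.5) is assumed.

def forward(x, layers, backout_layer=7, gate=0.5):
    cached = None
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == backout_layer:
            cached = x  # cache the layer-7 output
    # subtract the cached state, scaled by the learnable gate,
    # before the LM head sees the representation
    return x - gate * cached

# Toy layers: each adds 1.0 to a scalar "hidden state".
layers = [lambda h: h + 1.0] * 10
out = forward(0.0, layers)  # cached = 8.0, final = 10.0 - 0.5 * 8.0
```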
Partial RoPE
Rotary positional embeddings applied partially.
parameters: {"dimensions":"16/64"}
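A sketch of the 16/64 split: rotary embeddings are applied to the first 16 of 64 head dimensions (eight complex pairs) and the remaining 48 pass through unchanged. The rotation-frequency base (10000) is the usual RoPE default and an assumption here.

```python
import math

# Hedged sketch of partial RoPE: rotate only the first 16 of 64 head
# dims, matching the reported "16/64" split; the rest are untouched.
HEAD_DIM = 64
ROPE_DIM = 16  # dimensions that receive rotary embeddings

def partial_rope(vec, pos, base=10000.0):
    out = list(vec)
    for i in range(0, ROPE_DIM, 2):  # rotate pairs (i, i+1)
        theta = pos * base ** (-i / ROPE_DIM)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out  # dims ROPE_DIM..HEAD_DIM-1 pass through unchanged
```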
MLP3x
3× MLP hidden-dimension expansion.
parameters: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"split_early_late_lr":true,"matrix_lr_early":0.036,"matrix_lr_late":0.044,"scalar_lr_early":0.028,"scalar_lr_late":0.018}
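The split early/late learning-rate banks can be sketched as optimizer parameter groups: matrix (2-D) parameters go to Muon, scalar/vector parameters to Adam, and each bank splits by depth with the LRs reported above. The layer-depth cutoff (6) is an assumption.

```python
# Hedged sketch of split early/late LR banks. The four LRs are from
# the PR's other_params; the early/late cutoff (layer 6) is assumed.
MATRIX_LR_EARLY, MATRIX_LR_LATE = 0.036, 0.044
SCALAR_LR_EARLY, SCALAR_LR_LATE = 0.028, 0.018
EARLY_CUTOFF = 6  # assumed boundary between "early" and "late" layers

def param_groups(params):
    """params: list of (layer_index, ndim) describing each parameter."""
    lrs = {"muon_early": MATRIX_LR_EARLY, "muon_late": MATRIX_LR_LATE,
           "adam_early": SCALAR_LR_EARLY, "adam_late": SCALAR_LR_LATE}
    groups = {name: [] for name in lrs}
    for layer, ndim in params:
        bank = "muon" if ndim >= 2 else "adam"   # matrices -> Muon
        phase = "early" if layer < EARLY_CUTOFF else "late"
        groups[f"{bank}_{phase}"].append((layer, ndim))
    return [{"name": k, "lr": lrs[k], "params": v}
            for k, v in groups.items()]
```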
Quantization
GPTQ
bits: 6
scope: full model
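For orientation, this is the 6-bit grid the weights land on. GPTQ itself performs Hessian-aware, column-by-column error compensation; the sketch below shows only plain round-to-nearest symmetric quantization to the same signed 6-bit range, which is a simplification.

```python
# Hedged sketch of the signed 6-bit grid. Real GPTQ compensates
# rounding error using second-order (Hessian) information; this is
# only round-to-nearest onto the same grid.
BITS = 6
QMAX = 2 ** (BITS - 1) - 1  # 31 for signed 6-bit

def quantize(weights):
    scale = max(abs(w) for w in weights) / QMAX
    q = [max(-QMAX - 1, min(QMAX, round(w / scale))) for w in weights]
    deq = [qi * scale for qi in q]  # dequantized approximation
    return q, deq
```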
late QAT
bits: null
scope: full model
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
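The reported LN scale schedule as a one-liner: the LayerNorm gain at layer l is set to 1/sqrt(l + 1), damping deeper layers.

```python
import math

# Sketch of the reported LayerNorm scale schedule: gain at layer l
# is 1/sqrt(l + 1), so deeper layers are progressively damped.
def ln_scale(layer: int) -> float:
    return 1.0 / math.sqrt(layer + 1)
```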
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
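A sketch of combining the two averages with the reported hyperparameters: an EMA with decay 0.997 updated every step, plus an SWA running mean that folds in a snapshot every 50 steps. How the two averages are merged at the end is not specified in the PR, so the sketch keeps them separate.

```python
# Hedged sketch of EMA + SWA weight averaging. ema_decay=0.997 and
# swa_every=50 are from the PR; scalars stand in for weight tensors.
EMA_DECAY = 0.997
SWA_EVERY = 50

def run_averaging(weights_per_step):
    ema, swa, n_swa = None, None, 0
    for step, w in enumerate(weights_per_step, start=1):
        # EMA: exponentially decayed average, updated every step
        ema = w if ema is None else EMA_DECAY * ema + (1 - EMA_DECAY) * w
        if step % SWA_EVERY == 0:  # SWA: running mean of snapshots
            n_swa += 1
            swa = w if swa is None else swa + (w - swa) / n_swa
    return ema, swa
```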
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
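The stride-64 evaluation can be sketched as follows: each forward pass sees a full context window, but only the last `stride` tokens of the window are scored, so every scored token gets (near-)full left context. The window length (256) is an illustrative assumption.

```python
# Hedged sketch of sliding-window eval with stride 64 (from the PR);
# window length 256 is assumed. Only tokens [score_from, end) of each
# window contribute to the loss, so each keeps long left context.

def eval_windows(n_tokens, window=256, stride=64):
    """Return (start, end, score_from) spans covering all tokens."""
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))  # score tokens [pos, end)
        pos = end
    return spans
```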
Other
other
Coprime-stride multi-shard data loader for batch diversity.
parameters: null
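The coprime-stride idea in miniature: stepping through the shard list with a stride coprime to the shard count visits every shard exactly once per cycle, so consecutive batches draw from well-separated shards. The shard count and stride below are illustrative.

```python
import math

# Hedged sketch of a coprime-stride multi-shard order: because
# gcd(stride, n_shards) == 1, i * stride mod n_shards enumerates
# every shard exactly once per cycle, interleaving distant shards.

def shard_order(n_shards, stride):
    assert math.gcd(n_shards, stride) == 1, "stride must be coprime"
    return [(i * stride) % n_shards for i in range(n_shards)]
```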
other
Train-data GPTQ calibration performed within the training budget.
parameters: {"calibration_time_seconds":14}
other
MiLe margin loss with entropy-weighted cross-entropy and gamma=0.75.
parameters: {"gamma":0.75}
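A sketch of the entropy-weighting component only: each token's cross-entropy is scaled by its predictive entropy raised to the reported gamma=0.75, down-weighting tokens the model is already confident about. The exact MiLe margin term is not specified in the PR and is omitted here.

```python
import math

# Hedged sketch of entropy-weighted cross-entropy in the spirit of the
# described MiLe loss. gamma=0.75 is from the PR; the margin term is
# not specified there, so only the entropy weighting is shown.
GAMMA = 0.75

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def weighted_ce(probs, target):
    """probs: predicted distribution; target: index of the true token."""
    ce = -math.log(probs[target])
    return entropy(probs) ** GAMMA * ce
```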

Novel Contributions

  • Residual lambdas for per-sublayer residual scaling
  • Split early/late learning-rate banks for Muon and Adam parameters
  • Train-budget GPTQ calibration using training data
  • Coprime-stride multi-shard data loader
  • Expanded BigramHash table
  • Larger value embeddings on selected layers
  • XSA extended to the last 7 layers
  • MiLe margin loss
  • Cache-and-backout residual path
  • Flash Attention 3 integration
  • Tuned batch size for the training budget