PR #1319

open

Record: 11L LeakyReLU² XSA-all GPTQ-AR SLOT64 — 0.6951 BPB

by canivel (View on GitHub)
val_bpb: 0.6951
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.69 MB

Training Techniques

Architecture
LeakyReLU
Squared LeakyReLU activation applied in the MLP (feedforward) blocks.
parameters: {"layers":11,"model_dim":512,"mlp_expansion":3}
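A minimal sketch of the activation, assuming the common reading of "LeakyReLU squared" as squaring the LeakyReLU output; the 0.01 negative slope is an assumed default, not stated in the record:

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.01) -> float:
    # LeakyReLU followed by elementwise squaring; the 0.01 slope is an
    # assumed default, not given in the record's parameters.
    y = x if x > 0 else negative_slope * x
    return y * y
```

With model_dim 512 and mlp_expansion 3, this would act elementwise on the 1536-dim hidden activations of each MLP block.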
XSA
Exclusive Self Attention applied to all layers.
parameters: {"layers":11}
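The record does not define XSA beyond the name; one plausible reading of "exclusive" self attention is a causal mask that also excludes each position's own token, sketched here purely as an assumption:

```python
def xsa_mask(seq_len: int) -> list[list[bool]]:
    # Causal attention mask that additionally masks the diagonal, so a token
    # attends only to strictly earlier positions ("exclusive" is read here as
    # excluding self-attention to the own position -- an assumption).
    return [[j < i for j in range(seq_len)] for i in range(seq_len)]
```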
BigramHash
Bigram hash embedding used as an auxiliary architectural component.
parameters: {"buckets":3072,"dimensions":112}
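A hedged sketch of the bucket lookup: the (previous, current) token pair is hashed into one of 3072 buckets, each mapped to a 112-dim embedding that augments the regular token embedding. The mixing constants below are illustrative, not from the submission:

```python
def bigram_bucket(prev_token: int, token: int, buckets: int = 3072) -> int:
    # Multiplicative mix of the token pair reduced to a bucket index; the
    # constant 0x9E3779B1 is an illustrative choice, not from the record.
    h = (prev_token * 0x9E3779B1 + token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % buckets
```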
U-Net skip connections
Encoder-decoder style skip connections in the model.
parameters: {"encoder_layers":5,"decoder_layers":6}
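A sketch of how the 5 encoder and 6 decoder layers could be wired, assuming mirrored additive skips with the extra decoder layer receiving none (the exact pairing is not specified in the record):

```python
def unet_forward(x, encoder, decoder):
    # encoder/decoder are lists of layer callables (5 and 6 per the record).
    # Encoder outputs are pushed on a stack and added back into decoder
    # layers in mirrored order; the first decoder layer gets no skip since
    # there is one more decoder layer than encoder layer (assumed wiring).
    skips = []
    for layer in encoder:
        x = layer(x)
        skips.append(x)
    for i, layer in enumerate(decoder):
        if skips and i >= len(decoder) - len(skips):
            x = x + skips.pop()
        x = layer(x)
    return x
```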
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
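With 8 query heads over 4 KV heads, each KV head is shared by a group of two query heads, halving the KV cache. A minimal sketch of the mapping:

```python
def kv_head_for_query(q_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    # Consecutive query heads share one KV head (standard GQA grouping).
    group_size = heads // kv_heads  # 2 query heads per KV head here
    return q_head // group_size
```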
Quantization
GPTQ
bits: 6
scope: MLP+attention
int8
bits: 8
scope: embeddings
late QAT
bits: null
scope: all
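GPTQ itself adds Hessian-based error compensation (see the novel-contributions list below); as a hedged illustration, the 6-bit grid alone can be sketched with plain symmetric round-to-nearest:

```python
def quantize(values, bits=6):
    # Symmetric round-to-nearest quantization to signed `bits`-bit integers.
    # This only illustrates the int6 grid; GPTQ's column-by-column error
    # compensation is not shown.
    qmax = 2 ** (bits - 1) - 1  # 31 for int6
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]
```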
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"lr":0.025,"momentum_schedule_end":0.99,"momentum_schedule_steps":1500}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"embed_lr":0.035,"scalar_lr":0.025}
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_interval":50}
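A sketch of the averaging loop with the listed hyperparameters; how EMA and SWA are combined is not stated, so this assumes SWA snapshots are taken of the EMA weights every 50 steps:

```python
class WeightAverager:
    # EMA with decay 0.997 plus SWA snapshots every 50 steps, per the record;
    # averaging EMA snapshots into SWA is an assumed combining rule.
    def __init__(self, weights, ema_decay=0.997, swa_interval=50):
        self.ema = list(weights)
        self.swa_sum = [0.0] * len(weights)
        self.swa_count = 0
        self.decay = ema_decay
        self.interval = swa_interval
        self.step = 0

    def update(self, weights):
        self.step += 1
        self.ema = [self.decay * e + (1 - self.decay) * w
                    for e, w in zip(self.ema, weights)]
        if self.step % self.interval == 0:
            self.swa_sum = [s + e for s, e in zip(self.swa_sum, self.ema)]
            self.swa_count += 1

    def averaged(self):
        if self.swa_count == 0:
            return list(self.ema)
        return [s / self.swa_count for s in self.swa_sum]
```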
Compression
lzma
level: 9
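The artifact is lzma-compressed at the maximum level, which is how the quantized weights fit under the ~16 MB budget; in Python this maps directly onto the stdlib:

```python
import lzma

def compress_artifact(blob: bytes) -> bytes:
    # Preset 9 corresponds to the record's "level: 9" -- the slowest,
    # densest stdlib setting.
    return lzma.compress(blob, preset=9)
```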
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
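A hedged sketch of window placement: each 2048-token window slides by 64 tokens, and (an assumption about the scoring rule) only the last 64 positions of each window after the first are newly scored, so nearly every token is predicted with close-to-full left context:

```python
def eval_windows(n_tokens, seq_len=2048, stride=64):
    # Return (window_start, score_from) pairs. The first window scores all
    # its positions; later windows score only their final `stride` tokens.
    windows = []
    start = 0
    while start + seq_len <= n_tokens:
        score_from = start + seq_len - stride if start > 0 else start
        windows.append((start, score_from))
        start += stride
    return windows
```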
Test-Time Training
score-first TTT
parameters: {"steps":64,"warmstart_alpha":0.85,"learning_rate_start":0.01,"learning_rate_end":0.001}
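Two of the listed knobs sketched, with assumptions labeled: the warmstart blends the previous window's adapted weights with the base checkpoint (the blend rule is assumed), and the per-window learning rate decays from 0.01 to 0.001 over the 64 TTT steps (linear decay assumed):

```python
def warmstart(prev_params, base_params, alpha=0.85):
    # Initialize the next window's TTT weights from the previous window's
    # adapted weights; alpha = warmstart_alpha from the record. The linear
    # blend is an assumed rule, not confirmed by the submission.
    return [alpha * p + (1 - alpha) * b for p, b in zip(prev_params, base_params)]

def ttt_lr(step, steps=64, lr_start=0.01, lr_end=0.001):
    # Linear decay over the TTT steps; the decay shape is an assumption.
    t = step / (steps - 1)
    return lr_start + t * (lr_end - lr_start)
```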
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
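The listed scale rule in code, assuming a 0-based layer index; it damps deeper layers' residual contributions:

```python
import math

def ln_scale(layer: int) -> float:
    # Per-layer LayerNorm output scale, 1/sqrt(layer+1) per the record
    # (layer index assumed 0-based).
    return 1.0 / math.sqrt(layer + 1)
```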
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
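A sketch of the schedule, assuming (as is common for "warmdown") a constant learning rate followed by a linear decay to zero over the final 3500 steps; the 0.025 base rate is taken from the Muon entry above, and the total step count is a free parameter:

```python
def warmdown_lr(step, total_steps, base_lr=0.025, warmdown_steps=3500):
    # Hold base_lr, then decay linearly to 0 over the final warmdown_steps.
    # The linear shape and the pairing with the Muon lr are assumptions.
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```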
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048

Novel Contributions

  • SLOT with warmstart across adjacent evaluation windows
  • Autoregressive self-generated GPTQ calibration tokens
  • Full Hessian GPTQ with Cholesky error compensation and Hessian-diagonal column reordering
  • Int6 MLP+attention quantization with Int8 embeddings under a 16 MB artifact budget