PR #1675

Status: open

11L + LN Scale + BigramHash 3072x112 + GPTQ: val_bpb=1.1451

by jayzuccarelli
val_bpb: 1.1451
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.79 MB

Training Techniques

Regularization
LN Scale
parameters: {"attn_scale_init":"1/sqrt(layer_idx+1)","mlp_scale_init":"1/sqrt(layer_idx+1)"}
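A minimal sketch of what these initializers imply, assuming `layer_idx` is zero-based (the helper name is hypothetical, reconstructed from the parameter strings above):

```python
import math

def ln_scale_init(num_layers):
    """Initial values for the learnable LN output scales, one per layer.

    Both the attention and MLP branch scales start at 1/sqrt(layer_idx + 1),
    damping deeper layers' residual contributions at initialization."""
    return [1.0 / math.sqrt(i + 1) for i in range(num_layers)]

scales = ln_scale_init(11)
print(scales[0])  # layer 0 starts at 1.0; layer 10 at 1/sqrt(11) ~ 0.30
```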
Architecture
BigramHash
Larger bigram hash table with smaller per-bucket dimension to keep parameter budget similar while improving short-range context modeling.
parameters: {"vocab_size":3072,"dimension":112}
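A sketch of a hashed bigram embedding at these sizes. The actual hash function used in the PR is not stated; the multiplicative mix below is an illustrative stand-in, and all names are hypothetical:

```python
import random

TABLE_SIZE = 3072   # hash buckets ("vocab_size" in the PR parameters)
BIGRAM_DIM = 112    # per-bucket embedding dimension

random.seed(0)
bigram_table = [[random.gauss(0.0, 0.02) for _ in range(BIGRAM_DIM)]
                for _ in range(TABLE_SIZE)]

def bigram_bucket(prev_tok, tok):
    # Mix the two token ids into one bucket index; collisions are accepted
    # by design -- the table is much smaller than vocab^2.
    return (prev_tok * 1000003 + tok) % TABLE_SIZE

def bigram_embed(tokens):
    # One 112-dim vector per position, looked up from the hashed
    # (previous token, current token) pair.
    out = []
    for i, tok in enumerate(tokens):
        prev_tok = tokens[i - 1] if i > 0 else 0
        out.append(bigram_table[bigram_bucket(prev_tok, tok)])
    return out
```

Growing the table while shrinking the per-bucket dimension trades collision rate against expressiveness at a roughly constant 3072 × 112 parameter cost.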
XSA
Cross-position self-attention applied to all layers.
parameters: {"layers":11}
LeakyReLU
MLP activation uses leaky_relu(x, 0.5)^2.
parameters: {"negative_slope":0.5}
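The activation itself is simple enough to state exactly; a scalar sketch (real code applies it elementwise to tensors):

```python
def relu_sq_leaky(x, negative_slope=0.5):
    # leaky_relu(x, 0.5) squared, per the PR's MLP activation. Note the
    # square makes the function non-monotonic on the negative side.
    y = x if x >= 0 else negative_slope * x
    return y * y

print(relu_sq_leaky(2.0))   # 4.0
print(relu_sq_leaky(-2.0))  # 1.0  (-2 * 0.5 = -1, squared)
```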
Quantization
GPTQ
bits: 6
scope: all
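GPTQ quantizes weights column by column, using Hessian information from calibration data to push each column's rounding error into the not-yet-quantized columns. The sketch below shows only the symmetric 6-bit round-to-nearest grid that GPTQ rounds onto, not the Hessian bookkeeping; the function name is illustrative:

```python
def quantize_rtn_6bit(weights, bits=6):
    # Symmetric grid: integers in [-(2^(b-1)), 2^(b-1)-1], i.e. [-32, 31]
    # at 6 bits, with one scale per weight group.
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [qi * scale for qi in q], scale

deq, scale = quantize_rtn_6bit([0.5, -1.0, 0.25])
```

At 6 bits the grid is fine enough that, per the headline metric, the quantized model keeps val_bpb at 1.1451 in a 15.79 MB artifact.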
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_steps":0}
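Muon's distinguishing step is orthogonalizing each 2-D weight update with a Newton-Schulz iteration before applying it. A NumPy sketch; the quintic coefficients come from the public Muon implementation, the learning rate is assumed, and only momentum=0.99 and weight_decay=0.04 come from this PR:

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Approximately orthogonalize the update matrix (Muon's core step)
    # using the quintic iteration from the public Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    # Momentum buffer + Nesterov-style lookahead + orthogonalized update,
    # with decoupled weight decay (lr=0.02 is an assumed placeholder).
    buf = momentum * buf + grad
    update = newton_schulz(grad + momentum * buf)
    return w * (1 - lr * weight_decay) - lr * update, buf
```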
Weight Averaging
EMA
parameters: {"decay":0.997}
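EMA maintains a decayed average of the weights during training; per the contributions list below, it is this average (taken before GPTQ) that gets quantized. A minimal sketch with the PR's decay, treating parameters as a flat list of scalars:

```python
class EMA:
    """Exponential moving average of model weights (decay=0.997 per the PR)."""

    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)  # averaged copy, updated each step

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p
                       for s, p in zip(self.shadow, params)]
```

At decay 0.997 the average has an effective horizon of roughly 1/(1-0.997) ≈ 333 steps.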
Evaluation
sliding window eval
parameters: {"stride":256}
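Sliding-window evaluation gives each token up to a full window of left context by advancing the window in small strides and scoring only the newly revealed tokens. A sketch of the span bookkeeping, assuming window = eval_length = 2048 (only stride=256 is stated for the eval itself):

```python
def sliding_window_spans(n_tokens, window=2048, stride=256):
    # Returns (context_start, score_start, end) triples: the model reads
    # [context_start, end) but only tokens in [score_start, end) contribute
    # to val_bpb, so every token is scored exactly once.
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + (window if pos == 0 else stride), n_tokens)
        start = max(0, end - window)
        spans.append((start, pos, end))
        pos = end
    return spans
```

With stride 256 this costs roughly window/stride = 8 forward passes per scored token beyond the first window, in exchange for near-full context everywhere.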
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048

Novel Contributions

  • LN Scale initialization for attention and MLP scales
  • BigramHash 3072x112: larger hash table (vocab_size) at a similar parameter budget
  • Full Hessian GPTQ with autoregressive self-generated calibration
  • XSA applied to all 11 layers
  • LeakyReLU(0.5)^2 MLP activation
  • EMA before GPTQ