PR #1446

open

Non-record: 11L gated Krylov + AR GPTQ int6 + lzma, 1.09596 BPB

by LauraGomezjurado
val_bpb
1.0960
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,925,099 bytes

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"gated_krylov_correction":true,"alpha":0.05,"eta_threshold":0.03,"warmup_steps":1000,"decision_every":100,"every":2,"hutchinson_samples":2,"rank_max":4,"rank_scale":1}
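The gating decision above (eta_threshold 0.03, hutchinson_samples 2) can be sketched as follows. This is a minimal illustration, not the submission's code: it assumes the non-normality score is ||W^T W - W W^T||_F relative to ||W^T W||_F, estimated with a few Hutchinson probes so the commutator is never formed.

```python
import numpy as np

def nonnormality_eta(W, num_samples=2, seed=0):
    """Hutchinson estimate of ||C||_F / ||W^T W||_F for the commutator
    C = W^T W - W W^T, using tr(C^T C) = E[z^T C^T C z] with Rademacher z.
    Only matvecs with C are needed; W is assumed square for simplicity."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    acc = 0.0
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=n)
        Cz = W.T @ (W @ z) - W @ (W.T @ z)   # C @ z via two matvec pairs
        acc += Cz @ Cz                        # z^T C^T C z ~ ||C||_F^2
    c_fro = np.sqrt(acc / num_samples)
    g_fro = np.linalg.norm(W.T @ W)          # dense here; cheap in a sketch
    return c_fro / max(g_fro, 1e-12)

def gate_krylov(W, eta_threshold=0.03, hutchinson_samples=2):
    """True when the layer looks non-normal enough to warrant correction,
    mirroring eta_threshold=0.03 and hutchinson_samples=2 above."""
    return nonnormality_eta(W, hutchinson_samples) > eta_threshold
```

A symmetric (normal) matrix has a zero commutator and is never gated; a shift-like, highly non-normal matrix trips the gate.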
Quantization
GPTQ
bits: 6
scope: all
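The rounding grid behind 6-bit GPTQ can be sketched as below. This shows only symmetric per-output-channel int6 quantization; the Hessian-based error compensation that makes it GPTQ (and the submission's autoregressive calibration) is omitted, and the [-31, 31] range is an assumption.

```python
import numpy as np

def quant_int6_per_channel(W):
    """Symmetric per-row int6 quantization: levels in [-31, 31]
    (2**(6-1) - 1) with one float scale per output channel."""
    qmax = 31
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequant(q, scale):
    """Reconstruct float weights from int6 codes and per-row scales."""
    return q.astype(np.float32) * scale
```

The worst-case per-weight reconstruction error is half a quantization step, i.e. scale/2 for that row.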
Compression
lzma
level: 9
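The final artifact compression (lzma, level 9) maps directly onto the Python standard library; a minimal sketch of the round trip, with the serialized checkpoint bytes assumed as input:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress the serialized checkpoint with LZMA at preset 9,
    matching the submission's lzma/level-9 setting."""
    return lzma.compress(raw, preset=9)

def decompress_artifact(blob: bytes) -> bytes:
    """Invert compress_artifact to recover the checkpoint bytes."""
    return lzma.decompress(blob)
```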
Architecture
XSA
XSA attention used across all layers
parameters: {"layers":11}
BigramHash
BigramHash embedding component
parameters: {"dimension":112,"size":3072}
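A BigramHash component with the listed size 3072 and dimension 112 can be sketched as hashing each (previous, current) token pair into a small learned table. The hash multiplier and the zero-padding of position 0 are assumptions; the submission does not specify them.

```python
import numpy as np

def bigram_ids(tokens, table_size=3072):
    """Hash each (previous, current) token pair into a fixed-size table.
    The odd multiplier is arbitrary; the real hash is not specified."""
    tokens = np.asarray(tokens)
    prev = np.concatenate(([0], tokens[:-1]))  # pad the first position
    return (prev * 1000003 + tokens) % table_size

class BigramHashEmbedding:
    """Lookup table of hashed-bigram vectors, added alongside the
    ordinary token embedding (dimension=112, size=3072 above)."""
    def __init__(self, table_size=3072, dim=112, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(0.0, 0.02, size=(table_size, dim))

    def __call__(self, tokens):
        return self.table[bigram_ids(tokens, self.table.shape[0])]
```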
SmearGate
Position-mixing gate
parameters: null
VE128
VE128 used in layers 9-10
parameters: {"layers":[9,10]}
Partial RoPE
RoPE applied to a subset of dimensions
parameters: {"dims":"16/64"}
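Partial RoPE with "16/64" (16 rotary dims out of a 64-dim head) can be sketched as rotating only the first 16 channels and passing the rest through. The frequency base 10000 and the half-split pairing are common defaults, assumed here:

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16):
    """Apply RoPE to the first rot_dims channels of each head and leave
    the remaining channels unrotated. x: (T, head_dim), positions: (T,)."""
    positions = np.asarray(positions)
    d = rot_dims // 2
    inv_freq = 10000.0 ** (-np.arange(d) / d)
    ang = positions[:, None] * inv_freq[None, :]      # (T, d)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2, rest = x[..., :d], x[..., d:rot_dims], x[..., rot_dims:]
    rot = np.concatenate([x1 * cos - x2 * sin,        # 2-D rotation per pair
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, rest], axis=-1)
```

Rotation preserves the norm of the rotary slice, and position 0 is left unchanged, which the test checks.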
U-Net skip connections
Encoder-decoder skip connections
parameters: null
LeakyReLU
Squared LeakyReLU MLP activation
parameters: {"squared":true,"negative_slope":0.5}
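The squared-LeakyReLU activation with negative_slope 0.5 is a one-liner; note that plain squaring (assumed here) makes the output non-negative, in the spirit of the ReLU² activations used in fast GPT training stacks:

```python
import numpy as np

def squared_leaky_relu(x, negative_slope=0.5):
    """LeakyReLU followed by squaring (slope 0.5 as listed above).
    Whether the sign is preserved after squaring is not specified in
    the submission; this sketch squares outright."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```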
MLP3x
Three-layer MLP
parameters: {"width":1536}
Weight Averaging
EMA + Tight SWA
parameters: {"decay":0.997,"swa_every":50}
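The "EMA + Tight SWA" entry (decay 0.997, snapshot every 50 steps) can be sketched as an EMA tracker whose state is periodically folded into a uniform SWA mean. How the submission actually combines the two averages is not stated; averaging the EMA snapshots is an assumption.

```python
import numpy as np

class EmaSwaAverager:
    """EMA with decay 0.997 plus an SWA snapshot every 50 steps,
    mirroring {"decay": 0.997, "swa_every": 50} above."""
    def __init__(self, params, decay=0.997, swa_every=50):
        self.decay, self.swa_every = decay, swa_every
        self.ema = {k: v.copy() for k, v in params.items()}
        self.swa = {k: np.zeros_like(v) for k, v in params.items()}
        self.step = 0
        self.swa_count = 0

    def update(self, params):
        self.step += 1
        for k, v in params.items():
            self.ema[k] = self.decay * self.ema[k] + (1 - self.decay) * v
        if self.step % self.swa_every == 0:
            self.swa_count += 1
            for k in self.swa:  # running uniform mean of EMA snapshots
                self.swa[k] += (self.ema[k] - self.swa[k]) / self.swa_count

    def averaged(self):
        """SWA mean once snapshots exist, otherwise the raw EMA."""
        return self.swa if self.swa_count else self.ema
```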
Evaluation
sliding window eval
parameters: null
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
pruning
parameters: {"type":"selective ±1","criterion":"reconstruction error"}
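The two regularization entries above can be sketched together. The LN scale is exactly the listed 1/sqrt(layer+1); for the ±1 pruning, the per-tensor scale, the pruned fraction, and the elementwise form of the reconstruction-error criterion are all assumptions (the submission only names "selective ±1" with a reconstruction-error criterion).

```python
import numpy as np

def ln_scale(layer_index):
    """Depth-dependent LayerNorm gain, scale = 1/sqrt(layer+1),
    damping the residual contribution of deeper layers."""
    return 1.0 / np.sqrt(layer_index + 1)

def selective_pm1_prune(W, frac=0.1):
    """Snap the fraction of weights whose replacement by sign(w)*scale
    changes the tensor least, shrinking the artifact's symbol alphabet
    ahead of lzma. Scale and fraction are illustrative assumptions."""
    scale = np.abs(W).mean()
    snapped = np.sign(W) * scale
    err = np.abs(snapped - W)             # per-weight reconstruction error
    k = int(frac * W.size)
    idx = np.argsort(err, axis=None)[:k]  # lowest-error weights first
    out = W.copy()
    out.flat[idx] = snapped.flat[idx]
    return out
```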

Novel Contributions

  • Gated Krylov residual correction added selectively on top of Muon
  • Hutchinson-based non-normality detection using the commutator W^T W - W W^T
  • Adaptive Krylov rank correction blended into the Muon direction
  • Full-Hessian GPTQ int6 calibration on autoregressively self-generated data
  • Selective ±1 pruning to fit the 16MB artifact cap
  • Strong 11-layer SentencePiece GPT stack with XSA, BigramHash, SmearGate, VE128, partial RoPE, and U-Net skips
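One way the adaptive-rank Krylov correction could blend into an optimizer direction is sketched below. This is a guess at the mechanism, not the submission's code: the seed vector, the use of the commutator C = W^T W - W W^T to grow the subspace, and the alpha-damped projection (alpha = 0.05, rank_max = 4 from the optimizer parameters above) are all assumptions.

```python
import numpy as np

def krylov_correction(update, W, rank=4, alpha=0.05):
    """Build an orthonormal Krylov basis {v, Cv, C^2 v, ...} with
    C = W^T W - W W^T, then damp the update's component in that
    subspace by alpha. update: (n, m), W: (n, n)."""
    basis = []
    v = update.mean(axis=1)                  # seed direction (an assumption)
    for _ in range(rank):
        for b in basis:                      # Gram-Schmidt against the basis
            v = v - (b @ v) * b
        nv = np.linalg.norm(v)
        if nv < 1e-10:                       # subspace exhausted early
            break
        v = v / nv
        basis.append(v)
        v = W.T @ (W @ v) - W @ (W.T @ v)    # next Krylov vector: C @ v
    if not basis:
        return update
    Q = np.stack(basis, axis=1)              # orthonormal (n, r)
    proj = Q @ (Q.T @ update)                # component in span(Q)
    return update - alpha * proj             # blend: damp it by alpha
```

Since proj is an orthogonal projection and 0 < alpha < 2, the corrected update never has a larger norm than the original.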