PR #926 (open)

Record: 11L EMA + GPTQ-lite + LeakyReLU^2 + QAT@0.15

by NandhuRajRK

val_bpb: 0.8705
Architecture: Transformer
Optimizer:
Artifact Size: 15825448 bytes

Training Techniques

Architecture
  • MLP3x: Transformer with 3x MLP expansion (parameters: {"expansion":3})
  • GQA: 8 attention heads, 4 KV heads (parameters: {"attention_heads":8,"kv_heads":4})
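With grouped-query attention, the 8 query heads share 4 KV heads, so each KV head serves two query heads. A minimal sketch of that head-sharing arithmetic (the function name and grouping convention here are illustrative, not taken from the PR's code):

```python
def kv_head_for(query_head: int, attention_heads: int = 8, kv_heads: int = 4) -> int:
    # Each KV head is shared by attention_heads // kv_heads query heads.
    group_size = attention_heads // kv_heads  # 8 // 4 = 2
    return query_head // group_size

# Query heads 0-7 map onto KV heads 0-3 in contiguous groups of two.
mapping = [kv_head_for(q) for q in range(8)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Halving the KV heads halves the KV-cache size while keeping the full set of query heads.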
  • LeakyReLU: MLP activation changed to LeakyReLU(0.5)^2 (parameters: {"slope":0.5,"squared":true})
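A literal reading of the parameters {"slope":0.5,"squared":true} is LeakyReLU with negative slope 0.5 followed by squaring. The sketch below follows that reading; it is one plausible interpretation, not the PR's exact implementation (note that squaring makes the negative branch non-negative):

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    # LeakyReLU(slope), then square. For x >= 0 this reduces to ReLU(x)^2;
    # for x < 0 it yields (slope * x)^2, which is non-negative.
    y = x if x >= 0 else slope * x
    return y * y

print(leaky_relu_sq(2.0), leaky_relu_sq(-2.0))  # 4.0 1.0
```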
  • XSA: XSA on late layers (parameters: {"layers":"late"})
  • Partial RoPE: partial rotary positional embeddings (parameters: null)
Regularization
  • LN scale (parameters: null)
Weight Averaging
  • EMA (parameters: null)
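EMA weight averaging maintains a shadow copy of the parameters as an exponentially decayed running mean, and that shadow copy is what gets evaluated and exported. A minimal sketch; the decay value is illustrative, since the PR lists parameters: null:

```python
def ema_update(avg, params, decay=0.999):
    # avg <- decay * avg + (1 - decay) * params, elementwise.
    # decay=0.999 is a common default, not a value stated in the PR.
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

shadow = [0.0, 0.0]
for _ in range(3):  # a few steps with constant "trained" params [1.0, 1.0]
    shadow = ema_update(shadow, [1.0, 1.0], decay=0.9)
print(shadow)  # each entry converges toward 1.0 as 1 - 0.9**k
```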
Quantization
  • late QAT (bits: null, scope: model)
  • GPTQ-lite (bits: 6, scope: all)
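The contributions list mentions an int6 export with roundtrip verification. The sketch below shows only the int6 range and the kind of quantize/dequantize roundtrip check that implies, using plain symmetric rounding; GPTQ-lite's error-compensating weight updates are not reproduced here, and all names are illustrative:

```python
def quantize_int6(weights):
    # Signed int6 covers [-32, 31]; symmetric per-tensor scale.
    qmax = 31
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.6, -1.0, 0.25]
q, scale = quantize_int6(w)
w_hat = dequantize(q, scale)
# Roundtrip check: error stays within half a quantization step.
assert all(abs(a - b) <= scale / 2 + 1e-12 for a, b in zip(w, w_hat))
```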
LR Schedule
  • warmdown (parameters: {"warmdown_steps":3500})
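A "warmdown" schedule is commonly a constant learning rate followed by a decay to zero over the final warmdown_steps steps. The sketch below assumes a linear decay over the last 3500 steps; the exact decay shape and run length are not stated in the PR:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    # Hold base_lr, then decay linearly to zero over the final
    # warmdown_steps steps (total_steps here is a hypothetical run length).
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps

print(warmdown_lr(8250, 10000, 1.0))  # halfway through warmdown -> 0.5
```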
Compression
  • zstd (level: null)

Novel Contributions

  • LeakyReLU(0.5)^2 MLP activation in place of ReLU^2
  • EMA-based 11-layer Transformer record attempt
  • GPTQ-lite int6 export with roundtrip verification
  • Late QAT at threshold 0.15
  • Portability fixes for non-FA3 environments