PR #1311

open

Non-record: 11L LeakyReLU² + EMA + LZMA Int6 (val_bpb: 1.1303, 2-seed mean)

by htrung1105
val_bpb: 1.1303
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.87 MB

Training Techniques

Architecture
LeakyReLU
LeakyReLU(0.5)^2 MLP activation in the Transformer stack
parameters: {"slope":0.5,"squared":true}
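The listed activation can be sketched in a few lines: LeakyReLU with slope 0.5, then an element-wise square (a leaky variant of the ReLU² activation). This is a plain-Python, per-scalar sketch; the PR presumably applies it tensor-wise inside the MLP.

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU(slope) followed by squaring, matching
    parameters {"slope": 0.5, "squared": true}.

    Negative inputs are leaked through at half strength and the
    square makes the output non-negative everywhere.
    """
    y = x if x >= 0.0 else slope * x
    return y * y
```

Note that squaring makes the function non-monotonic: f(-2) = 1 while f(0) = 0.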
U-Net skip connections
U-Net style skip connections across encoder/decoder layers
parameters: {"encoder_layers":5,"decoder_layers":6}
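A minimal sketch of the skip wiring with 5 encoder and 6 decoder layers: encoder outputs are stacked and added back, in reverse order, to the inputs of later decoder layers. With one more decoder than encoder layer, some decoder layer goes without a skip; which one is an assumption here (this sketch leaves the earliest decoder layers skip-free).

```python
def unet_forward(x, encoder_layers, decoder_layers):
    """U-Net style skips across a stack of callable blocks.

    Encoder activations are pushed onto a stack; decoder layers pop
    them (latest first) and add them to their input. The first
    len(decoder) - len(encoder) decoder layers receive no skip.
    """
    skips = []
    for layer in encoder_layers:
        x = layer(x)
        skips.append(x)
    extra = len(decoder_layers) - len(encoder_layers)  # layers w/o skip
    for i, layer in enumerate(decoder_layers):
        if i >= extra:
            x = x + skips.pop()
        x = layer(x)
    return x
```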
XSA
Exclusive Self-Attention applied to the last layers
parameters: {"layers":4}
Partial RoPE
Partial rotary position embeddings with NTK-aware scaling
parameters: {"dimensions":16,"total_dimensions":64}
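Partial RoPE rotates only 16 of the 64 head dimensions and leaves the rest position-free. A sketch, assuming standard interleaved-pair rotation; the NTK-aware base rescaling factor is not listed in the PR, so `ntk_factor` below is an assumption (the usual base' = base · s^(d/(d-2)) form):

```python
import math

def partial_rope_freqs(rot_dims=16, base=10000.0, ntk_factor=1.0):
    """Inverse frequencies for the rotated slice of each head.

    NTK-aware scaling stretches the base for longer contexts; with
    ntk_factor=1.0 this reduces to vanilla RoPE frequencies.
    """
    scaled_base = base * ntk_factor ** (rot_dims / max(rot_dims - 2, 1))
    return [scaled_base ** (-2.0 * i / rot_dims) for i in range(rot_dims // 2)]

def apply_partial_rope(q, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims entries of vector q at position pos;
    the remaining dimensions pass through unchanged."""
    inv = partial_rope_freqs(rot_dims, base)
    out = list(q)
    for i, f in enumerate(inv):
        a, b = q[2 * i], q[2 * i + 1]
        c, s = math.cos(pos * f), math.sin(pos * f)
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```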
BigramHash
Bigram hash embedding with reduced bucket count
parameters: {"buckets":2048,"dimensions":128}
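The bigram hash embedding maps each (previous, current) token pair into one of 2048 buckets, each carrying a 128-dim embedding. The PR only specifies the reduced bucket count and width; the mixing constants below are purely illustrative.

```python
def bigram_bucket(prev_token, token, buckets=2048):
    """Hash a token bigram into a bucket index in [0, buckets).

    Any cheap deterministic mix works; collisions are expected and
    absorbed by training the bucket embeddings.
    """
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    h ^= h >> 13
    return h % buckets
```

With 2048 buckets × 128 dims the table costs only 256K parameters, which matters under the artifact-size budget.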
TrigramHash
Trigram hash disabled
parameters: {"enabled":false}
Gated Attention
Gated attention disabled
parameters: {"enabled":false}
Value Residual
Value residual disabled
parameters: {"enabled":false}
weight tying
Tied input and output embeddings
parameters: null
SmearGate
SmearGate component used in the model
parameters: null
KV head count
Grouped query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
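With 8 query heads over 4 KV heads, each KV head serves a contiguous group of 2 query heads, halving the KV projections and cache. The mapping is just integer division:

```python
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    """Grouped-query attention head mapping: each KV head serves
    n_heads // n_kv_heads consecutive query heads."""
    assert n_heads % n_kv_heads == 0
    return q_head // (n_heads // n_kv_heads)
```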
Weight Averaging
EMA
parameters: {"decay":0.997,"start_frac":0.4}
SWA
parameters: {"every_steps":50}
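The two averaging schemes above can be sketched together: an EMA of the weights with decay 0.997, started at 40% of training, plus SWA snapshots every 50 steps. How the PR combines the two averages for the final checkpoint is not stated here.

```python
def ema_update(shadow, params, decay=0.997):
    """One EMA step over flat lists of parameter values,
    per the listed decay; typically begun at start_frac=0.4."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]

def swa_snapshot_due(step, every_steps=50):
    """True on steps where an SWA weight snapshot is taken."""
    return step % every_steps == 0
```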
Quantization
GPTQ-lite
bits: 6
scope: all large weight matrices
late QAT
bits: 6
scope: model
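The "per-row clip search" listed under Novel Contributions can be sketched as follows: for each weight row, try a small grid of clip ratios, quantize symmetrically into the int6 range, and keep the ratio with the lowest reconstruction MSE. The real "GPTQ-lite" presumably adds error feedback across columns; this sketch keeps only the clip search, and the grid values are assumptions.

```python
def quantize_row_int6(row, clip_grid=(1.0, 0.95, 0.9, 0.85, 0.8)):
    """Symmetric int6 quantization of one weight row with a
    per-row clip search over candidate clip ratios.

    Returns (int codes in [-31, 31], scale) minimizing row MSE.
    """
    qmax = 31  # symmetric int6 range
    amax = max(abs(v) for v in row) or 1.0
    best = None
    for r in clip_grid:
        scale = (r * amax) / qmax
        q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
        mse = sum((v - qi * scale) ** 2 for v, qi in zip(row, q)) / len(row)
        if best is None or mse < best[0]:
            best = (mse, q, scale)
    return best[1], best[2]
```

Clipping trades a little error on the largest weight against finer resolution for the rest of the row, which is why a mild clip often beats ratio 1.0.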
Compression
lzma
level: 9
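The artifact compression swap (zlib/zstd → LZMA at level 9) maps directly onto Python's stdlib `lzma` module. Whether the PR also sets the EXTREME flag ("LZMA extreme compression" in the notes) is an assumption flagged below.

```python
import lzma

def compress_artifact(blob: bytes) -> bytes:
    """Compress packed weight bytes with LZMA at preset 9.

    PRESET_EXTREME is an assumption based on the 'extreme
    compression' wording; drop the flag if plain level 9 is meant.
    """
    return lzma.compress(blob, preset=9 | lzma.PRESET_EXTREME)
```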
Evaluation
sliding window eval
parameters: {"stride":16}
temperature scaling
parameters: {"temperature":0.9}
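The temperature-scaled evaluation divides logits by T = 0.9 before the softmax used to score targets (the sliding-window eval then scores positions with a stride-16 window; only the scaling is sketched here). Sharpening with T < 1 can lower measured bits-per-byte when the raw predictions are underconfident, without touching the weights.

```python
import math

def temp_scaled_logprob(logits, target, temperature=0.9):
    """Log-probability of `target` after dividing logits by T.

    Uses a numerically stable log-sum-exp; plain Python lists
    stand in for the model's logit tensor.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return scaled[target] - lse
```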
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"momentum_final":0.99,"momentum_warmup_steps":1500,"lr":0.025}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr":0.035}
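The Muon entry lists a momentum warmup from 0.92 to a final 0.99 over 1500 steps. A linear ramp is an assumption; the listing only gives the endpoints and the warmup length.

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Momentum schedule for Muon: ramp from `start` to `final`
    over the first `warmup_steps` steps, then hold at `final`."""
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + frac * (final - start)
```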
Regularization
LN scale
parameters: {"formula":"1/sqrt(layer_idx+1)"}
logit softcap
parameters: {"value":30}
Initialization
OrthoInit
Orthogonal initialization used for model initialization
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
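A warmdown schedule holds the learning rate constant and then decays it over the final 3500 steps. Linear decay to zero is an assumption; the listing gives only the warmdown length.

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear warmdown to zero over the
    final `warmdown_steps` steps of training."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```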
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048

Novel Contributions

  • LeakyReLU(0.5)^2 MLP activation
  • LZMA extreme compression replacing zlib/zstd
  • Temperature-scaled evaluation at T=0.90
  • GPTQ-lite per-row clip search for int6 quantization
  • Reduced BigramHash size to 2048 buckets
  • Combination of EMA, SWA, and late QAT in an 11-layer Transformer