PR #1512 (open)

Record: Bank QAT + seq4096 + SWA w=256 + QK-Gain 2.5 + PKO — val_bpb 1.1117 (3-seed mean)

by Itssshikhar
val_bpb: 1.1117
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16,116,257 bytes

Training Techniques

Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
Quantization
STE QAT
bits: 6
scope: all F.linear params
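A minimal sketch of the 6-bit straight-through-estimator (STE) fake quantization applied to the linear-layer weights. Function name and the plain-Python representation are illustrative; in the actual PyTorch run this would operate on tensors.

```python
def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization: snap each weight to a
    (2**bits - 1)-level grid spanning [-max|w|, +max|w|].
    In PyTorch the STE trick is `w + (fake_quant(w) - w).detach()`, so
    the forward pass sees quantized weights while gradients flow through
    as if quantization were the identity."""
    qmax = 2 ** (bits - 1) - 1                  # 31 levels per side for 6 bits
    scale = max(abs(x) for x in w) / qmax or 1.0  # guard against all-zero w
    return [round(x / scale) * scale for x in w]
```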
GPTQ
bits: 6
scope: all weights
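A toy sketch of the error-compensation idea behind GPTQ: quantize one column at a time and push each column's rounding error onto the not-yet-quantized columns using the calibration Hessian. The real algorithm works with the inverse Hessian via a Cholesky factorization and lazy batched updates; this simplified stand-in uses raw Hessian rows, and all names are hypothetical.

```python
def gptq_quantize_row(w, H, scale):
    """Quantize one weight row column-by-column. After rounding column j,
    distribute its error onto columns k > j in proportion to H[j][k],
    where H = X^T X is the Hessian of the calibration inputs X
    (simplified: no Cholesky / inverse-Hessian machinery as in full GPTQ)."""
    w = list(w)
    q = [0.0] * len(w)
    for j in range(len(w)):
        q[j] = round(w[j] / scale) * scale
        err = (w[j] - q[j]) / H[j][j]
        for k in range(j + 1, len(w)):
            w[k] -= err * H[j][k]
    return q
```

The "full Hessian" in the contributions list refers to calibrating H over the entire calibration set rather than a subsample.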
Architecture
QK-Gain
Higher initial gain for QK-normalization
parameters: {"gain":2.5}
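A sketch of QK-normalization with the higher initial gain, assuming the usual RMS-norm-then-scale form; the function name and vector representation are illustrative.

```python
import math

def qk_norm(v, gain=2.5, eps=1e-6):
    """RMS-normalize a query/key vector, then multiply by a learnable
    gain. Initializing the gain at 2.5 (per the record) instead of the
    conventional 1.0 makes attention logits sharper from the start."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [gain * x / rms for x in v]
```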
Partial RoPE
Shift stationary RoPE dimensions of K forward by 1 position
parameters: {"offset":1}
SWA
Sliding window attention on lower layers, full attention on upper layers
parameters: {"window_size":256,"full_attention_layers":5}
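A sketch of the layer-dependent attention mask this configuration implies: causal everywhere, windowed to 256 tokens on the lower layers, full attention on the top 5. The mask-builder form is illustrative; a real implementation would use a fused attention kernel rather than a dense boolean mask.

```python
def attention_mask(seq_len, layer, n_layers, window=256, n_full=5):
    """Causal attention mask: the top `n_full` layers attend to all
    previous positions; lower layers only to the last `window` tokens.
    Entry [i][j] is True when position i may attend to position j."""
    full = layer >= n_layers - n_full
    return [[(j <= i) and (full or i - j < window)
             for j in range(seq_len)] for i in range(seq_len)]
```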
weight tying
Tied embeddings
parameters: null
BigramHash
Bigram hash embedding
parameters: {"buckets":3072,"dim":112}
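A sketch of the bigram hash lookup: the (previous, current) token pair is hashed into one of 3072 buckets, each holding a 112-dim embedding that augments the token embedding. The mixing constant is illustrative, not the exact scheme used in the run.

```python
def bigram_bucket(prev_tok, tok, n_buckets=3072):
    """Map a (previous, current) token-id pair to one of `n_buckets`
    rows of a hash-embedding table (each row 112-dim in this record).
    The multiplier is an arbitrary prime for illustration."""
    return (prev_tok * 1000003 + tok) % n_buckets
```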
U-Net skip connections
Encoder-decoder skip connections
parameters: {"encoder":5,"decoder":6}
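A sketch of the U-Net-style wiring with 5 encoder and 6 decoder blocks: encoder outputs are stashed and added back to the decoder blocks in reverse order. The exact pairing (which decoder layer goes skip-less, add vs. concat) is an assumption; here the first decoder layer gets no skip.

```python
def unet_forward(x, encoder_layers, decoder_layers):
    """Run encoder blocks, stash each output, then add the stashed
    activations back to the decoder blocks last-in-first-out. With 5
    encoder and 6 decoder layers (as in the record), the first decoder
    layer receives no skip connection."""
    skips = []
    for layer in encoder_layers:
        x = layer(x)
        skips.append(x)
    extra = len(decoder_layers) - len(encoder_layers)
    for i, layer in enumerate(decoder_layers):
        if i >= extra:
            x = x + skips[len(decoder_layers) - 1 - i]
        x = layer(x)
    return x
```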
SmearGate
SmearGate module
parameters: null
ReLU²
ReLU-squared activation
parameters: null
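The ReLU² activation, shown here element-wise in plain Python for clarity:

```python
def relu_squared(x):
    """ReLU-squared MLP activation: max(x, 0) squared."""
    return max(x, 0.0) ** 2
```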
Compression
lzma
level: 9
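Compressing the serialized artifact with Python's standard-library `lzma` at the highest preset (function name and usage are illustrative):

```python
import lzma

def compress_checkpoint(raw: bytes) -> bytes:
    """Compress serialized model weights with LZMA at preset 9,
    the maximum compression level, as used for the submitted artifact."""
    return lzma.compress(raw, preset=9)
```

Combining with `lzma.PRESET_EXTREME` (`preset=9 | lzma.PRESET_EXTREME`) trades extra CPU time for a slightly smaller output.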
Weight Averaging
SWA
parameters: {"window_size":256}
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_start_momentum":0.92}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"learning_rate":0.025}
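The Muon entry lists a final momentum of 0.99 with `warmup_start_momentum: 0.92`, suggesting a momentum warmup. A linear ramp is sketched below; the warmup length is illustrative, as the record only gives the two endpoints.

```python
def muon_momentum(step, warmup_steps=300, start=0.92, end=0.99):
    """Linearly ramp Muon's momentum from `start` (0.92 per the record)
    to `end` (0.99) over the first `warmup_steps` optimizer steps
    (warmup length is an assumption, not from the record)."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```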
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections
Regularization
logit softcap
parameters: {"value":30}
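Logit softcapping with value 30 is conventionally the scaled-tanh bound shown below, which is near-identity for small logits and saturates at ±30:

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap): cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)
```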
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
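A sketch of the warmdown schedule: constant learning rate, then a linear decay to zero over the final 4,000 steps. The total step count below is a placeholder, not from the record.

```python
def lr_schedule(step, total_steps, base_lr, warmdown_steps=4000):
    """Constant LR, then a linear 'warmdown' to zero over the final
    `warmdown_steps` training steps."""
    into_warmdown = step - (total_steps - warmdown_steps)
    if into_warmdown <= 0:
        return base_lr
    return base_lr * (1.0 - into_warmdown / warmdown_steps)
```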

Novel Contributions

  • seq4096 training
  • bank-weight QAT on all F.linear parameters
  • QK-Gain 2.5
  • partial key offset for stationary RoPE dimensions
  • SWA with window size 256
  • full Hessian GPTQ calibration
  • lzma compression