PR #635 (open)
Non-record: 11L MLP 3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330, 8xH100 SXM)
by aryanbhosale
val_bpb
1.1330
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Quantization
int6 uniform + GPTQ-lite
bits: 6
scope: all except tied embeddings
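A minimal sketch of symmetric per-row int6 uniform quantization (levels -31..31). The GPTQ-lite calibration step and the 5-percentile clipping are not shown, and per-row max-based scaling is an assumption:

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-row uniform quantization to 6 bits (levels -31..31)."""
    qmax = 2 ** (6 - 1) - 1                   # 31 for int6
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)  # reconstruction error is at most scale/2 per row
```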
Architecture
MLP 3.5x with LeakyReLU(0.5)^2
Expanded MLP hidden dimension with squared LeakyReLU activation
parameters: {"expansion_factor":3.5,"activation":"LeakyReLU(0.5)^2","hidden_dim":1792}
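The block above can be sketched as follows; d_model=512 is inferred from hidden_dim 1792 / 3.5, and the weight shapes and scaling are illustrative:

```python
import numpy as np

def leaky_relu(x, negative_slope=0.5):
    return np.where(x >= 0, x, negative_slope * x)

def mlp_block(x, w_in, w_out):
    """MLP with LeakyReLU(0.5)^2 activation: h = LeakyReLU_0.5(x @ W_in) ** 2."""
    h = leaky_relu(x @ w_in) ** 2
    return h @ w_out

d_model, hidden = 512, 1792  # 3.5x expansion, per the PR
x = np.random.randn(2, d_model)
w_in = np.random.randn(d_model, hidden) * (d_model ** -0.5)
w_out = np.random.randn(hidden, d_model) * (hidden ** -0.5)
y = mlp_block(x, w_in, w_out)
```

Squaring the activation keeps the MLP output non-negative before the down-projection, similar in spirit to the squared-ReLU used in earlier speedrun entries.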
SmearGate
Gating mechanism applied in architecture
parameters: null
BigramHash
Bigram hashing with 10240 buckets and 128 dimensions
parameters: {"buckets":10240,"dim":128}
TrigramHash
Trigram hashing with 4096 buckets and 128 dimensions
parameters: {"buckets":4096,"dim":128}
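A hedged sketch of hashed n-gram embedding lookup with the table sizes above; the hash function itself is hypothetical, as the PR does not specify it:

```python
import numpy as np

def ngram_hash_ids(tokens, n, buckets, seed=0x9E3779B1):
    """Map each position's preceding n-gram to a bucket id via a rolling hash."""
    ids = []
    for i in range(len(tokens)):
        h = seed
        for t in tokens[max(0, i - n + 1): i + 1]:
            h = (h * 1000003 ^ t) & 0xFFFFFFFF  # simple polynomial hash
        ids.append(h % buckets)
    return np.array(ids)

# Tables sized as in the PR: bigrams 10240x128, trigrams 4096x128.
bigram_table = np.random.randn(10240, 128) * 0.02
trigram_table = np.random.randn(4096, 128) * 0.02

tokens = [5, 17, 17, 99]
bi = bigram_table[ngram_hash_ids(tokens, 2, 10240)]   # (4, 128)
tri = trigram_table[ngram_hash_ids(tokens, 3, 4096)]  # (4, 128)
```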
Value Residual (ResFormer)
Caches value vectors from layer 0 and blends them into later layers via a learned lambda
parameters: null
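A minimal sketch of the value-residual blend, assuming the learned lambda is a per-layer scalar passed through a sigmoid (the PR says only "learned lambda"):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def blend_values(v_layer, v0, lam_logit):
    """ResFormer-style value residual: mix this layer's values with layer-0's."""
    lam = sigmoid(lam_logit)
    return lam * v_layer + (1.0 - lam) * v0

v0 = np.ones((2, 4))   # cached value vectors from layer 0
v5 = np.zeros((2, 4))  # current layer's value vectors
v = blend_values(v5, v0, lam_logit=0.0)  # lam = 0.5 -> even mix
```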
Gated Attention
Per-head sigmoid gating with bias initialized to 4.0
parameters: null
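A sketch of the per-head gate; with the bias initialized to 4.0, sigmoid(4.0) ≈ 0.982, so heads start almost fully open and training can learn to close individual heads. A static, input-independent gate is assumed here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_heads(attn_out, gate_bias):
    """Per-head sigmoid gate on attention output: out_h *= sigmoid(b_h)."""
    g = sigmoid(gate_bias)              # (num_heads,)
    return attn_out * g[None, :, None]  # (batch, heads, head_dim)

out = np.ones((1, 8, 64))
gated = gate_heads(out, gate_bias=np.full(8, 4.0))
```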
XSA all 11 layers
Exclusive self-attention applied on all 11 layers
parameters: {"layers":11}
Partial RoPE
Rotary positional embeddings applied partially on 16 of 64 head dimensions
parameters: {"dimensions":"16/64"}
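A sketch of partial RoPE on the 16/64 split above; rotating the first 16 dimensions (rather than some other subset) is an assumption:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply RoPE to the first rot_dims of each head dim; pass the rest through."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)  # (half,)
    angles = pos[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)

x = np.random.randn(4, 64)  # (seq, head_dim)
y = partial_rope(x, pos=np.arange(4))
```

The remaining 48 dimensions carry no positional rotation, letting those channels stay purely content-addressed.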
Tied FP16 embeddings
Input and output embedding weights tied, stored in FP16 precision
parameters: null
U-Net skip connections
Skip connections inspired by U-Net architecture
parameters: null
Initialization
OrthoInit
Orthogonal initialization of weights
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"momentum_schedule":"0.92->0.99 over 1500 steps"}
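The momentum schedule can be sketched as a linear ramp; linearity is an assumption, since the PR states only the endpoints (0.92 -> 0.99) and the step count (1500):

```python
def muon_momentum(step: int, start=0.92, end=0.99, ramp_steps=1500) -> float:
    """Ramp Muon momentum from start to end over ramp_steps, then hold."""
    frac = min(step / ramp_steps, 1.0)
    return start + (end - start) * frac
```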
Adam
weight_decay: null
momentum: null
other_params: {"lr_embeddings":0.035,"lr_scalars":0.03}
Weight Averaging
EMA
parameters: {"decay":0.997}
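The EMA update with decay 0.997, shown as one step over a dict of parameter arrays:

```python
def ema_update(avg: dict, params: dict, decay=0.997) -> dict:
    """One EMA step: avg = decay * avg + (1 - decay) * params."""
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in params}

avg = {"w": 0.0}
for _ in range(3):
    avg = ema_update(avg, {"w": 1.0})  # avg approaches 1 - 0.997**n
```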
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
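A sketch of the warmdown schedule, assuming the usual speedrun shape (constant LR, then linear decay to zero over the final 3500 steps); the function name and total_steps are illustrative:

```python
def lr_multiplier(step: int, total_steps: int, warmdown_steps=3500) -> float:
    """Constant LR, then linear warmdown to 0 over the final warmdown_steps."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps
```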
Regularization
weight decay
parameters: {"weight_decay":0.04}
gradient clipping
parameters: {"clip_value":0.3}
Other
training_techniques
Late quantization-aware training (QAT) via the straight-through estimator (STE), applied during the final 15% of training
parameters: null
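A minimal sketch of late QAT via the straight-through estimator: the forward pass uses fake-quantized weights, while the gradient is applied to the full-precision master weights as if quantization were the identity. The toy quadratic loss is illustrative:

```python
import numpy as np

def fake_quantize(w, bits=6):
    """Forward: uniform symmetric quantize/dequantize of the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def qat_step(w, grad_fn, lr=0.1):
    """One QAT step: loss sees quantized weights, STE updates fp32 weights."""
    w_q = fake_quantize(w)
    g = grad_fn(w_q)   # gradient w.r.t. the quantized weights
    return w - lr * g  # STE: apply it directly to the master weights

# Toy loss 0.5 * ||w - target||^2 with gradient (w - target)
target = np.array([1.0, 0.25])
w = np.zeros(2)
for _ in range(200):
    w = qat_step(w, lambda wq: wq - target)
```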
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
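A sketch of stride-64 sliding-window evaluation: each window after the first scores only its newly exposed tokens, so every token is evaluated with near-full left context. Exact bookkeeping in the PR may differ:

```python
def sliding_windows(n_tokens: int, context=2048, stride=64):
    """Enumerate (begin, end, n_scored) windows covering every token once."""
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        n_scored = end - prev_end  # score only tokens not yet scored
        windows.append((begin, end, n_scored))
        prev_end = end
        if end == n_tokens:
            break
    return windows

windows = sliding_windows(2200)
```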
Novel Contributions
- Use of MLP 3.5x expansion with LeakyReLU(0.5)^2 activation
- Integration of SmearGate gating mechanism
- Combination of BigramHash and TrigramHash embeddings
- Value Residual (ResFormer) caching and blending of layer 0 values
- Gated Attention with per-head sigmoid gating and bias initialization
- Exclusive self-attention (XSA) applied on all 11 layers
- Partial RoPE applied on a subset of head dimensions (16/64)
- Late Quantization Aware Training (QAT) via STE in final 15% of training
- Use of Muon optimizer with momentum scheduling
- Orthogonal initialization (OrthoInit) of weights
- U-Net style skip connections in Transformer architecture
- Int6 uniform quantization combined with GPTQ-lite and per-row 5-percentile clipping