PR #1070

Status: open

Non-record: Aweb Ultimate — 1.1190 BPB (10min 8×H100, independent PR #549 reproduction)

by manfromnowhere143
val_bpb: 1.1190
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15,948,863 bytes

Training Techniques

Architecture
LeakyReLU
LeakyReLU squared activation
parameters: {"squared":true,"negative_slope":0.5}
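A minimal sketch of the activation as the parameters describe it ({"squared": true, "negative_slope": 0.5}). Whether the PR preserves the sign of negative inputs after squaring is an assumption; this version does not:

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    # LeakyReLU: pass positives through, scale negatives by the slope.
    y = x if x >= 0.0 else negative_slope * x
    # "squared": true -- square the result, as in squared-ReLU activations.
    return y * y
```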
XSA
Cross-layer attention applied to the last layers
parameters: {"layers":4}
Partial RoPE
Rotary positional encoding applied to a subset of head dimensions
parameters: {"head_dims":16,"total_head_dims":64}
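Partial RoPE rotates only a slice of each head vector (here 16 of 64 dims) and leaves the rest position-agnostic. A sketch assuming the adjacent-pair, standard-frequency convention of vanilla RoPE; the PR's exact pairing and base are not shown here:

```python
import math

def partial_rope(q: list[float], pos: int, rot_dims: int = 16,
                 base: float = 10000.0) -> list[float]:
    """Rotary position encoding on the first `rot_dims` of a head vector.

    Dims are rotated in adjacent pairs; the remaining head dims pass
    through unchanged (the "partial" in Partial RoPE). Sketch only.
    """
    out = list(q)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = q[i], q[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```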
SmearGate
Input enrichment gate
parameters: null
BigramHash
Bigram hash input feature
parameters: {"size":2048}
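BigramHash maps each (previous token, current token) pair into a fixed table of 2048 learned embeddings, giving the model a cheap n-gram feature. A sketch with illustrative hash constants (not taken from the PR):

```python
def bigram_bucket(prev_token: int, cur_token: int, size: int = 2048) -> int:
    # Mix the two token ids with a multiplicative hash, then reduce to a
    # table index. Constants are illustrative, not the PR's.
    h = (prev_token * 1000003 + cur_token) * 2654435761
    return (h & 0xFFFFFFFF) % size
```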
ValueEmbedding
Value embedding input enrichment
parameters: {"dimensions":128}
U-Net skip connections
Encoder-decoder skip connections with learned skip weights
parameters: null
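A learned-weight U-Net skip can be as simple as blending the mirrored encoder-half activation into the decoder stream. The single scalar weight per skip is an assumption; the PR may learn per-channel weights:

```python
def unet_skip(decoder_h: list[float], encoder_h: list[float],
              skip_weight: float) -> list[float]:
    # Blend the saved encoder activation into the decoder stream with a
    # trainable scalar gain (hypothetical parameterization).
    return [d + skip_weight * e for d, e in zip(decoder_h, encoder_h)]
```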
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
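The depth-dependent LayerNorm gain follows directly from the stated formula: deeper layers get smaller scales, damping residual growth.

```python
import math

def ln_scale(layer: int) -> float:
    # "scale": "1/sqrt(layer+1)" -- layer 0 keeps unit gain, deeper
    # layers are progressively attenuated.
    return 1.0 / math.sqrt(layer + 1)
```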
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997}
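Both averages can be maintained in one pass over training: an exponential moving average with the listed decay of 0.997, plus a plain running mean for SWA. Sketch over flat lists of floats; a real run averages full weight tensors:

```python
def update_averages(params, ema, swa, step, ema_decay=0.997):
    # One step of keeping both weight averages.
    for i, p in enumerate(params):
        ema[i] = ema_decay * ema[i] + (1.0 - ema_decay) * p  # EMA
        swa[i] = (swa[i] * step + p) / (step + 1)            # running mean
    return ema, swa
```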
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"phases":3,"overlapped_comms":true}
Quantization
GPTQ-lite
bits: 6
scope: MLP+attn
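GPTQ proper compensates rounding error with second-order (Hessian) information; a "lite" variant plausibly reduces to the 6-bit grid itself. A sketch of symmetric round-to-nearest int6 quantization; the error-compensation step is omitted, and its exact form in the PR is an assumption:

```python
def quantize_int6(w: list[float]) -> tuple[list[int], float]:
    # Symmetric per-group quantization: int6 covers [-32, 31].
    scale = max(abs(x) for x in w) / 31.0 or 1.0  # avoid zero scale
    q = [max(-32, min(31, round(x / scale))) for x in w]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]
```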
STE QAT
bits: 6
scope: late QAT
Compression
lzma
level: null
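With `level: null` the compression preset is unspecified; Python's standard-library `lzma` at the maximum preset is a reasonable stand-in for squeezing the serialized artifact:

```python
import lzma

def compress_artifact(blob: bytes,
                      preset: int = 9 | lzma.PRESET_EXTREME) -> bytes:
    # LZMA-compress the serialized weights. The preset is an assumption,
    # since the PR leaves the level unspecified.
    return lzma.compress(blob, preset=preset)
```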
Test-Time Training
score-first TTT
parameters: {"epochs":3,"optimizer":"SGD","learning_rate":0.002,"momentum":0.9}
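"Score-first" TTT scores each segment with the current weights before adapting on it, so the reported loss never sees weights trained on the tokens being scored (the "legal" part). A toy sketch with the listed settings (SGD, lr 0.002, momentum 0.9, 3 epochs); `score` and `grad` are caller-supplied callables, not part of the PR:

```python
def score_first_ttt(segments, score, grad, w,
                    epochs=3, lr=0.002, momentum=0.9):
    # Score each segment first, then adapt on it with SGD + momentum.
    v = [0.0] * len(w)
    total = 0.0
    for seg in segments:
        total += score(w, seg)       # score BEFORE adapting on this segment
        for _ in range(epochs):      # ...then fine-tune on it
            g = grad(w, seg)
            for i in range(len(w)):
                v[i] = momentum * v[i] + g[i]
                w[i] -= lr * v[i]
    return total / len(segments), w
```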
Evaluation
sliding window eval
parameters: {"stride":64}
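Sliding-window evaluation re-scores the document in overlapping windows, counting each token once: every window contributes only its last `stride` positions (the first window contributes everything, since nothing precedes it). A sketch of the index bookkeeping; handling of a trailing partial window is left out and may differ in the PR:

```python
def sliding_window_targets(n_tokens: int, window: int, stride: int = 64):
    """Yield (start, end, score_from): score tokens in [score_from, end)."""
    start = 0
    while start + window <= n_tokens or start == 0:
        end = min(start + window, n_tokens)
        # First window scores all its tokens; later windows score only the
        # last `stride`, using the earlier tokens purely as context.
        score_from = start if start == 0 else end - stride
        yield start, end, score_from
        if end == n_tokens:
            break
        start += stride
```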

Novel Contributions

  • Independent reproduction of PR #549 SOTA stack
  • 11-layer 512-dimensional Transformer with the full proven stack
  • LeakyReLU squared activation
  • XSA on the last 4 layers
  • Partial RoPE on 16/64 head dimensions
  • EMA plus SWA weight averaging
  • Parallel Muon optimizer with overlapped communications
  • GPTQ-lite mixed int6/int8 quantization with LZMA compression
  • SmearGate, BigramHash, and ValueEmbedding input enrichment
  • Legal score-first test-time training
  • U-Net skip connections with learned skip weights
  • Late QAT with int6 STE