PR #1275

open

Non-record: Asymmetric 1/10 Split — 1.1492 pre-quant BPB on 8xH100 (one-line change)

by ranausmanai
val_bpb
1.1492
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Architecture
U-Net skip connections
Asymmetric encoder-decoder split in the hourglass architecture: the default 50/50 split is replaced by 1 encoder layer, with all remaining layers assigned to the decoder.
parameters: {"num_encoder_layers":1}
XSA
Uses XSA as part of the SOTA stack.
parameters: null
BigramHash
Uses BigramHash as part of the SOTA stack.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: null
SWA
parameters: null
Quantization
QAT
bits: 8
scope: all
Test-Time Training
full TTT
parameters: null

Novel Contributions

  • One-line change to the encoder-decoder split: the encoder depth goes from num_layers // 2 to 1 layer
  • Empirical finding that shifting capacity to the decoder improves BPB monotonically across tested configurations
  • Validation of the asymmetric split on baseline, SOTA RTX 5090, and 8xH100 runs
  • Reported 1.1492 pre-quant val_bpb on 8xH100 before the run was cut short
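
The split described above can be sketched in plain Python. This is a minimal illustration, not the PR's actual code: the helper names `split_layers` and `unet_forward`, the rule that the number of skip connections is the smaller of the two depths, and the layer count of 11 are all assumptions made for the example. Only the `num_layers // 2` default and the `num_encoder_layers = 1` setting come from the PR summary.

```python
def split_layers(num_layers, num_encoder_layers=None):
    """Return (encoder, decoder, skip) counts for a U-Net style layer stack.

    Hypothetical helper: passing None reproduces the symmetric default
    (num_layers // 2); passing 1 gives the asymmetric split from the PR.
    """
    if num_encoder_layers is None:
        num_encoder_layers = num_layers // 2  # default 50/50 split
    if not 0 <= num_encoder_layers <= num_layers:
        raise ValueError("num_encoder_layers must be in [0, num_layers]")
    num_decoder_layers = num_layers - num_encoder_layers
    # Assumption: each encoder output feeds at most one decoder layer,
    # so the number of skips is limited by the shallower side.
    num_skips = min(num_encoder_layers, num_decoder_layers)
    return num_encoder_layers, num_decoder_layers, num_skips


def unet_forward(x, layers, num_encoder_layers):
    """Run x through the stack, adding encoder outputs back in LIFO order."""
    encoder = layers[:num_encoder_layers]
    decoder = layers[num_encoder_layers:]
    skips = []
    for layer in encoder:
        x = layer(x)
        skips.append(x)  # remember the activation for a later skip
    for layer in decoder:
        if skips:
            x = x + skips.pop()  # most recent encoder output first
        x = layer(x)
    return x


# 11 is an illustrative layer count, not a value taken from the PR.
symmetric = split_layers(11)       # default split
asymmetric = split_layers(11, 1)   # one encoder layer, rest decoder
```

With one encoder layer, only the first decoder layer receives a skip connection; every later decoder layer runs as a plain sequential block, which is how the change shifts capacity toward the decoder.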