PR #1275
open
Non-record: Asymmetric 1/10 Split — 1.1492 pre-quant BPB on 8xH100 (one-line change)
by ranausmanai
val_bpb
1.1492
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Architecture
U-Net skip connections
Asymmetric encoder-decoder split in the hourglass architecture: the default 50/50 split (num_layers // 2 encoder layers) is replaced with 1 encoder layer, and all remaining layers become decoder layers.
parameters: {"num_encoder_layers":1}
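The effect of the split on layer counts and skip connections can be sketched as follows. This is a minimal illustration, not the PR's code: the helper names `split_layers` and `skip_sources`, the depth of 12 layers, and the last-in-first-out pairing of encoder outputs with decoder layers are all assumptions chosen for the example. Only `num_encoder_layers` and the `num_layers // 2` default come from the description above.

```python
from typing import List, NamedTuple, Optional


class LayerSplit(NamedTuple):
    num_encoder_layers: int
    num_decoder_layers: int
    num_skip_connections: int


def split_layers(num_layers: int, num_encoder_layers: Optional[int] = None) -> LayerSplit:
    """Divide a layer stack into encoder and decoder halves.

    With num_encoder_layers=None this falls back to the symmetric
    num_layers // 2 default; passing 1 gives the asymmetric split.
    (Hypothetical helper, written for illustration only.)
    """
    if num_encoder_layers is None:
        num_encoder_layers = num_layers // 2
    if not 0 <= num_encoder_layers <= num_layers:
        raise ValueError("num_encoder_layers must be between 0 and num_layers")
    num_decoder_layers = num_layers - num_encoder_layers
    # Each skip needs one encoder output and one decoder consumer.
    num_skips = min(num_encoder_layers, num_decoder_layers)
    return LayerSplit(num_encoder_layers, num_decoder_layers, num_skips)


def skip_sources(split: LayerSplit) -> List[Optional[int]]:
    """For each decoder layer, give the encoder layer index feeding its skip.

    Assumes U-Net-style LIFO pairing: the first decoder layer receives the
    last encoder output. None means the decoder layer has no skip input.
    """
    stack = list(range(split.num_encoder_layers))
    return [stack.pop() if stack else None for _ in range(split.num_decoder_layers)]


symmetric = split_layers(12)        # illustrative depth, not from the PR
asymmetric = split_layers(12, 1)    # the one-line change: 1 encoder layer
```

Under these assumptions, the asymmetric configuration keeps a single skip connection (from encoder layer 0 into the first decoder layer), while every remaining layer runs as a plain decoder layer.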
XSA
Uses XSA as part of the SOTA stack.
parameters: null
BigramHash
Uses BigramHash as part of the SOTA stack.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: null
SWA
parameters: null
Quantization
QAT
bits: 8
scope: all
Test-Time Training
full TTT
parameters: null
Novel Contributions
- One-line change to the encoder-decoder split: the number of encoder layers goes from num_layers // 2 to 1
- Empirical finding that shifting capacity to the decoder improves BPB monotonically across tested configurations
- Validation of the asymmetric split on baseline, SOTA RTX 5090, and 8xH100 runs
- Reported 1.1492 pre-quant val_bpb on 8xH100 before the run was cut short