PR #1275

open

Non-record: Asymmetric 1/10 Split — 1.1492 pre-quant BPB on 8xH100 (one-line change)

val_bpb

1.1492

Architecture

Transformer

Optimizer

Muon

Artifact Size

—

Training Techniques

Architecture

U-Net skip connections

Asymmetric encoder-decoder split in the hourglass (U-Net-style) layer stack: instead of the default 50/50 split (num_layers // 2 encoder layers), 1 layer serves as the encoder and all remaining layers serve as the decoder.

parameters: {"num_encoder_layers":1}
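The split described above can be sketched in a few lines. This is a minimal illustration, not the PR's code: the helper name `split_layers`, the 10-layer depth, and the rule that pairs the most recent encoder output with the first decoder layer are all assumptions made for the example. Only `num_encoder_layers` and the `num_layers // 2` default come from the PR description.

```python
from typing import List, Optional, Tuple


def split_layers(
    num_layers: int, num_encoder_layers: Optional[int] = None
) -> Tuple[List[int], List[int], List[Tuple[int, int]]]:
    """Partition layer indices into encoder and decoder halves (hypothetical helper).

    Returns (encoder_indices, decoder_indices, skip_pairs), where each skip
    pair is (encoder_layer, decoder_layer).
    """
    if num_encoder_layers is None:
        # Default symmetric split, as described in the PR summary.
        num_encoder_layers = num_layers // 2
    if not 0 <= num_encoder_layers <= num_layers:
        raise ValueError("num_encoder_layers must be between 0 and num_layers")

    encoder = list(range(num_encoder_layers))
    decoder = list(range(num_encoder_layers, num_layers))

    # Assumed pairing rule: skips are consumed last-in, first-out, so the
    # number of skip connections is limited by the shorter half.
    num_skips = min(len(encoder), len(decoder))
    skip_pairs = [(encoder[-1 - i], decoder[i]) for i in range(num_skips)]
    return encoder, decoder, skip_pairs


# Symmetric default versus the asymmetric setting, for an illustrative 10-layer stack.
symmetric = split_layers(10)
asymmetric = split_layers(10, num_encoder_layers=1)
```

Under these assumptions the asymmetric setting leaves a single skip connection (layer 0 into layer 1), while the symmetric default has five.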

XSA

Uses XSA as part of the SOTA stack.

parameters: null

BigramHash

Uses BigramHash as part of the SOTA stack.

parameters: null

Optimizer

Muon

weight_decay: null

momentum: null

other_params: null

Weight Averaging

EMA

parameters: null

SWA

parameters: null

Quantization

QAT

bits: 8

scope: all

Test-Time Training

full TTT

parameters: null

One-line change to the encoder-decoder split: the number of encoder layers goes from num_layers // 2 to 1
Empirical finding that shifting capacity from the encoder to the decoder lowers (improves) BPB monotonically across the tested configurations
Validation of the asymmetric split on baseline, SOTA RTX 5090, and 8xH100 runs
Reported 1.1492 pre-quant val_bpb on 8xH100 before the run was cut short
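
For readers unfamiliar with the metric, bits per byte (BPB) is conventionally derived from the mean per-token cross-entropy. The sketch below shows that standard conversion. How this PR's evaluation script computes val_bpb is not shown here, so the function name and its arguments are assumptions for illustration only.

```python
import math


def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean per-token cross-entropy (in nats) to bits per byte.

    num_tokens and num_bytes describe the same validation text: its length
    in tokens and its length in raw bytes.
    """
    if num_tokens <= 0 or num_bytes <= 0:
        raise ValueError("num_tokens and num_bytes must be positive")
    bits_per_token = mean_loss_nats / math.log(2)  # nats -> bits
    tokens_per_byte = num_tokens / num_bytes       # normalize by text size
    return bits_per_token * tokens_per_byte
```

Because the result is normalized by bytes rather than tokens, values stay comparable across runs that use different tokenizers.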