PR #1529

open

Record: ImprovedParallelResiduals, 1.0744 BPB / 2.7752 nats, -0.0034 BPB / -0.0088 nats vs PR #1523

by msisovic
val_bpb: 1.0744
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.96 MB

Training Techniques

Architecture
U-Net skip connections
Decoder U-Net skips are written only into lane0 to preserve a cheaper and more stable skip path.
parameters: null
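As a rough illustration of the lane0-only skip write, a minimal sketch (the lane layout, shapes, and function name here are assumptions for illustration, not the PR's actual code):

```python
import numpy as np

def apply_unet_skip_lane0(lanes, encoder_skip):
    """Add a decoder U-Net skip connection into lane 0 only.

    lanes:        array of shape (num_lanes, seq_len, d_model)
    encoder_skip: array of shape (seq_len, d_model) saved from the
                  matching encoder block.

    Writing the skip into a single lane keeps the skip path cheap
    (one add instead of num_lanes adds) and leaves the other lanes
    untouched.
    """
    out = lanes.copy()
    out[0] = out[0] + encoder_skip  # lane0 receives the skip; other lanes pass through
    return out
```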
other
Parallel residual split-lane decoder where attention and MLP read from different lanes and both outputs are accumulated into both lanes at the end of the block.
parameters: {"parallel_residual_start":8}
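The split-lane routing described above can be sketched as follows. The linear maps stand in for the real attention and MLP sublayers, and the two-lane layout is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Stand-ins for the real attention and MLP sublayers (illustrative only).
W_attn = rng.normal(scale=0.02, size=(d, d))
W_mlp = rng.normal(scale=0.02, size=(d, d))

def parallel_residual_block(lanes):
    """Split-lane parallel residual block.

    lanes: (2, seq_len, d) -- two residual lanes.
    Attention reads lane 0 and the MLP reads lane 1, computed in
    parallel (GPT-J style), and both sublayer outputs are
    accumulated into both lanes at the end of the block.
    """
    attn_out = lanes[0] @ W_attn   # attention reads lane 0
    mlp_out = lanes[1] @ W_mlp     # MLP reads lane 1
    update = attn_out + mlp_out    # shared update from both sublayers
    return lanes + update[None]    # accumulated into both lanes
```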
Test-Time Training
full TTT
parameters: {"enabled":true,"learning_rate":0.01}
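Full TTT with learning_rate 0.01 amounts to taking self-supervised gradient steps on the model at inference time. A toy sketch with a placeholder reconstruction objective (the loss, model, and function name are assumptions; only the 0.01 learning rate comes from the record):

```python
import numpy as np

def ttt_step(W, x, lr=0.01):
    """One test-time training step on a toy self-supervised objective.

    W: (d, d) weight matrix being adapted at inference time.
    x: (d,) test input.
    The objective here is a simple reconstruction loss
    L = ||W x - x||^2, a stand-in for whatever self-supervised
    loss the run actually uses; lr matches the record's 0.01.
    """
    err = W @ x - x
    grad = 2.0 * np.outer(err, x)  # dL/dW for the reconstruction loss
    return W - lr * grad
```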
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: null
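Muon's core step is momentum accumulation followed by approximate orthogonalization of the update via Newton-Schulz iteration. A minimal numpy sketch; the quintic coefficients and iteration count follow the commonly published Muon reference implementation, and the learning rate is a placeholder (only momentum 0.97 and the absent weight decay come from the record):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G (push singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients from the reference Muon impl
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so the spectral norm is <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_update(W, grad, buf, lr=0.02, momentum=0.97):
    """One Muon step: momentum accumulation, then an orthogonalized update.

    lr=0.02 is a placeholder; momentum matches the record's 0.97 and
    no weight decay is applied (weight_decay: null).
    """
    buf = momentum * buf + grad        # momentum buffer
    update = newton_schulz_orth(buf)   # orthogonalize the momentum
    return W - lr * update, buf
```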

Novel Contributions

  • Reintroduced fuller parallel residual routing into the split-lane decoder
  • Kept GPT-J-style parallel attention/MLP computation while restoring richer learned routing between the attention and MLP lanes
  • Preserved the cheaper and more stable decoder U-Net skip path by writing skips only into lane0
  • Moved PARALLEL_RESIDUAL_START from 7 to 8
  • Required cutlass_evt_fusion to recover full throughput under the wallclock cap