PR #1529

open

Record: ImprovedParallelResiduals, 1.0744 BPB / 2.7752 nats, -0.0034 BPB / -0.0088 nats vs PR #1523

by msisovic
val_bpb: 1.0744
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.96 MB

Training Techniques

Architecture
U-Net skip connections
Decoder U-Net skips are written only into lane0 to preserve a cheaper and more stable skip path.
parameters: null
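As a rough illustration of the lane0-only skip write, a minimal sketch (the lane layout, shapes, and function name here are assumptions for illustration, not the PR's actual code):

```python
import numpy as np

def apply_unet_skip_lane0(lanes, encoder_skip):
    """Add a decoder U-Net skip connection into lane 0 only.

    lanes:        array of shape (num_lanes, seq_len, d_model)
    encoder_skip: array of shape (seq_len, d_model) saved from the
                  matching encoder block.

    Writing the skip into a single lane keeps the skip path cheap
    (one add instead of num_lanes adds) and leaves the other lanes
    untouched.
    """
    out = lanes.copy()
    out[0] = out[0] + encoder_skip  # lane0 receives the skip; other lanes pass through
    return out
```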
other
Parallel residual split-lane decoder where attention and MLP read from different lanes and both outputs are accumulated into both lanes at the end of the block.
parameters: {"parallel_residual_start":8}
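The split-lane routing described above can be sketched as follows. The linear maps stand in for the real attention and MLP sublayers, and the two-lane layout is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Stand-ins for the real attention and MLP sublayers (illustrative only).
W_attn = rng.normal(scale=0.02, size=(d, d))
W_mlp = rng.normal(scale=0.02, size=(d, d))

def parallel_residual_block(lanes):
    """Split-lane parallel residual block.

    lanes: (2, seq_len, d) -- two residual lanes.
    Attention reads lane 0 and the MLP reads lane 1, computed in
    parallel (GPT-J style), and both sublayer outputs are
    accumulated into both lanes at the end of the block.
    """
    attn_out = lanes[0] @ W_attn   # attention reads lane 0
    mlp_out = lanes[1] @ W_mlp     # MLP reads lane 1
    update = attn_out + mlp_out    # shared update from both sublayers
    return lanes + update[None]    # accumulated into both lanes
```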
Test-Time Training
full TTT
parameters: {"enabled":true,"learning_rate":0.01}
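Full TTT with learning_rate 0.01 amounts to taking self-supervised gradient steps on the model at inference time. A toy sketch with a placeholder reconstruction objective (the loss, model, and function name are assumptions; only the 0.01 learning rate comes from the record):

```python
import numpy as np

def ttt_step(W, x, lr=0.01):
    """One test-time training step on a toy self-supervised objective.

    W: (d, d) weight matrix being adapted at inference time.
    x: (d,) test input.
    The objective here is a simple reconstruction loss
    L = ||W x - x||^2, a stand-in for whatever self-supervised
    loss the run actually uses; lr matches the record's 0.01.
    """
    err = W @ x - x
    grad = 2.0 * np.outer(err, x)  # dL/dW for the reconstruction loss
    return W - lr * grad
```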
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: null
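Muon's core step is momentum accumulation followed by approximate orthogonalization of the update via Newton-Schulz iteration. A minimal numpy sketch; the quintic coefficients and iteration count follow the commonly published Muon reference implementation, and the learning rate is a placeholder (only momentum 0.97 and the absent weight decay come from the record):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G (push singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients from the reference Muon impl
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so the spectral norm is <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_update(W, grad, buf, lr=0.02, momentum=0.97):
    """One Muon step: momentum accumulation, then an orthogonalized update.

    lr=0.02 is a placeholder; momentum matches the record's 0.97 and
    no weight decay is applied (weight_decay: null).
    """
    buf = momentum * buf + grad        # momentum buffer
    update = newton_schulz_orth(buf)   # orthogonalize the momentum
    return W - lr * update, buf
```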

Novel Contributions

  • Reintroduced fuller parallel residual routing into the split-lane decoder
  • Kept GPT-J-style parallel attention/MLP computation while restoring richer learned routing between the attention and MLP lanes
  • Preserved the cheaper and more stable decoder U-Net skip path by writing skips only into lane0
  • Moved PARALLEL_RESIDUAL_START from 7 to 8
  • Required cutlass_evt_fusion to recover full throughput under the wallclock cap