PR #530
Open
Non-record: Basis Block Interpolation (novel negative result) + Hyperparameter Sweep (MATRIX_LR=0.03 improves SOTA by 0.059 bpb)
by j420
val_bpb
1.4963
Architecture
Transformer
Optimizer
Muon
Artifact Size
10.88MB
Training Techniques
Architecture
depth recurrence
Basis Block Interpolation stores K basis transformer blocks and reuses them, via learned depth embeddings, across N effective layers, yielding greater effective depth with fewer parameters.
parameters: {"basis_blocks":5,"unrolls":3,"effective_layers":15,"dim":576}
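A minimal sketch of the idea, not the PR's implementation: the class name, the softmax mixing rule, and the linear stand-ins for transformer blocks are all assumptions. K basis blocks are shared across N effective layers, with a learned per-depth embedding deciding how each depth mixes the basis blocks:

```python
import torch
import torch.nn as nn

class BasisBlockInterpolation(nn.Module):
    """Sketch of BBI: K basis blocks reused across N effective layers.

    Each effective layer mixes the basis blocks with softmax weights
    derived from a learned depth embedding (one plausible reading of
    "learned depth embeddings"; the PR's exact mixing rule may differ).
    nn.Linear stands in for a full transformer block to keep the
    sketch self-contained.
    """

    def __init__(self, dim, basis_blocks=5, effective_layers=15):
        super().__init__()
        # K shared basis "blocks" (stand-ins for transformer blocks)
        self.blocks = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(basis_blocks)
        )
        # One learned logit vector per effective depth -> mixing weights
        self.depth_logits = nn.Parameter(
            torch.zeros(effective_layers, basis_blocks)
        )

    def forward(self, x):
        for logits in self.depth_logits:
            w = torch.softmax(logits, dim=-1)
            # Residual update: depth-specific weighted sum of basis blocks
            x = x + sum(wi * blk(x) for wi, blk in zip(w, self.blocks))
        return x
```

With the PR's configuration (dim=576, 5 basis blocks, 15 effective layers), the parameter cost is roughly that of 5 blocks plus a tiny 15x5 depth-embedding table, versus 15 independent blocks.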
Optimizer
Muon
weight_decay: 0.02
momentum: 0.995
other_params: null
Weight Averaging
SWA
parameters: {"start_frac":0.3}
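The SWA setting above (start_frac 0.3) can be sketched as a plain weight average over the last 70% of checkpoints; the function name and checkpoint-as-dict representation here are illustrative, not from the PR:

```python
def swa_average(checkpoints, start_frac=0.3):
    """Sketch of SWA as configured here: uniformly average the weights
    of all checkpoints from start_frac of training onward.

    checkpoints: list of state dicts (parameter name -> value),
    ordered by training step. Representation is an assumption.
    """
    start = int(len(checkpoints) * start_frac)
    tail = checkpoints[start:]
    return {
        name: sum(ckpt[name] for ckpt in tail) / len(tail)
        for name in tail[0]
    }
```

In practice one would average running weights during training (e.g. with `torch.optim.swa_utils.AveragedModel`) rather than storing every checkpoint; the sketch only shows which portion of training contributes.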
Regularization
weight decay
parameters: {"weight_decay_values_tested":[0.02,0.06]}
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
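A warmdown schedule with warmdown_iters=4000 typically means holding the learning rate flat and then decaying linearly to zero over the final 4000 steps. The exact decay shape is an assumption here, and base_lr=0.03 is the swept MATRIX_LR from this PR:

```python
def warmdown_lr(step, total_iters, base_lr=0.03, warmdown_iters=4000):
    """Sketch of a warmdown schedule (shape assumed, not confirmed by
    the PR): constant base_lr, then linear decay to 0 over the last
    warmdown_iters steps."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    # Fraction of the warmdown window remaining
    frac = (total_iters - step) / warmdown_iters
    return base_lr * frac
```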
Evaluation
stride-based eval
parameters: {"EVAL_STRIDE":0,"description":"Standard evaluation, not sliding window, for fast iteration"}
Novel Contributions
- Basis Block Interpolation (BBI): a novel architecture that reuses a small set of basis transformer blocks with learned depth embeddings to create more effective layers, documented as an informative negative result due to a torch.compile speed bottleneck.
- Systematic hyperparameter sweep on the SOTA model identifying MATRIX_LR=0.03 as a significant improvement over the default 0.02, improving val_bpb by 0.059.