PR #530
Open
Non-record: Basis Block Interpolation (novel negative result) + Hyperparameter Sweep (MATRIX_LR=0.03 improves SOTA by 0.059 bpb)
by j420
val_bpb
1.4963
Architecture
Transformer
Optimizer
Muon
Artifact Size
10.88MB
Training Techniques
Architecture
depth recurrence
Basis Block Interpolation stores K basis transformer blocks and reuses them, via learned depth embeddings, across N effective layers, yielding greater effective depth with fewer parameters.
parameters: {"basis_blocks":5,"unrolls":3,"effective_layers":15,"dim":576}
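A minimal sketch of the idea, not the PR's implementation: the class name, the softmax mixing rule, and the linear stand-ins for transformer blocks are all assumptions. K basis blocks are shared across N effective layers, with a learned per-depth embedding deciding how each depth mixes the basis blocks:

```python
import torch
import torch.nn as nn

class BasisBlockInterpolation(nn.Module):
    """Sketch of BBI: K basis blocks reused across N effective layers.

    Each effective layer mixes the basis blocks with softmax weights
    derived from a learned depth embedding (one plausible reading of
    "learned depth embeddings"; the PR's exact mixing rule may differ).
    nn.Linear stands in for a full transformer block to keep the
    sketch self-contained.
    """

    def __init__(self, dim, basis_blocks=5, effective_layers=15):
        super().__init__()
        # K shared basis "blocks" (stand-ins for transformer blocks)
        self.blocks = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(basis_blocks)
        )
        # One learned logit vector per effective depth -> mixing weights
        self.depth_logits = nn.Parameter(
            torch.zeros(effective_layers, basis_blocks)
        )

    def forward(self, x):
        for logits in self.depth_logits:
            w = torch.softmax(logits, dim=-1)
            # Residual update: depth-specific weighted sum of basis blocks
            x = x + sum(wi * blk(x) for wi, blk in zip(w, self.blocks))
        return x
```

With the PR's configuration (dim=576, 5 basis blocks, 15 effective layers), the parameter cost is roughly that of 5 blocks plus a tiny 15x5 depth-embedding table, versus 15 independent blocks.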
Optimizer
Muon
weight_decay: 0.02
momentum: 0.995
other_params: null
Weight Averaging
SWA
parameters: {"start_frac":0.3}
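The SWA setting above (start_frac 0.3) can be sketched as a plain weight average over the last 70% of checkpoints; the function name and checkpoint-as-dict representation here are illustrative, not from the PR:

```python
def swa_average(checkpoints, start_frac=0.3):
    """Sketch of SWA as configured here: uniformly average the weights
    of all checkpoints from start_frac of training onward.

    checkpoints: list of state dicts (parameter name -> value),
    ordered by training step. Representation is an assumption.
    """
    start = int(len(checkpoints) * start_frac)
    tail = checkpoints[start:]
    return {
        name: sum(ckpt[name] for ckpt in tail) / len(tail)
        for name in tail[0]
    }
```

In practice one would average running weights during training (e.g. with `torch.optim.swa_utils.AveragedModel`) rather than storing every checkpoint; the sketch only shows which portion of training contributes.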
Regularization
weight decay
parameters: {"weight_decay_values_tested":[0.02,0.06]}
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
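A warmdown schedule with warmdown_iters=4000 typically means holding the learning rate flat and then decaying linearly to zero over the final 4000 steps. The exact decay shape is an assumption here, and base_lr=0.03 is the swept MATRIX_LR from this PR:

```python
def warmdown_lr(step, total_iters, base_lr=0.03, warmdown_iters=4000):
    """Sketch of a warmdown schedule (shape assumed, not confirmed by
    the PR): constant base_lr, then linear decay to 0 over the last
    warmdown_iters steps."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    # Fraction of the warmdown window remaining
    frac = (total_iters - step) / warmdown_iters
    return base_lr * frac
```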
Evaluation
stride-based eval
parameters: {"EVAL_STRIDE":0,"description":"Standard evaluation, not sliding window, for fast iteration"}
Novel Contributions
- Basis Block Interpolation (BBI): a novel architecture that reuses a small set of basis transformer blocks with learned depth embeddings to create more effective layers, documented as an informative negative result due to a torch.compile speed bottleneck.
- Systematic hyperparameter sweep on the SOTA model identifying MATRIX_LR=0.03 as a significant improvement over the default 0.02, improving val_bpb by 0.059.