PR #104

open

Non-record: Stacked hyperparameter tuning + eval2048 (RTX 5090, val_bpb 1.336)

by gwelinder
val_bpb: 1.3358
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.8 MB

Training Techniques

Optimizer: Muon
  • weight_decay: null
  • momentum: 0.99
  • matrix_lr: 0.06
LR Schedule: warmdown
  • warmdown_iters: 3000
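A warmdown schedule of this shape can be sketched as a constant learning rate followed by a linear decay to zero over the final warmdown_iters steps. The linear decay shape and the 6000-iteration total below are illustrative assumptions; only warmdown_iters=3000 comes from this PR:

```python
def get_lr(step: int, base_lr: float, total_iters: int,
           warmdown_iters: int = 3000) -> float:
    """Constant LR, then linear warmdown to zero over the final steps."""
    if step < total_iters - warmdown_iters:
        return base_lr
    # Fraction of the warmdown remaining, in [0, 1].
    frac = (total_iters - step) / warmdown_iters
    return base_lr * frac

# Example: 6000 total iters, warmdown over the last 3000.
lr_start = get_lr(0, 0.06, 6000)      # constant phase
lr_mid = get_lr(4500, 0.06, 6000)     # halfway through warmdown
lr_end = get_lr(6000, 0.06, 6000)     # fully decayed
```

A too-short warmdown (the broken 1200 setting) leaves the schedule stuck at base_lr for most of the wallclock-limited run.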
Sequence Length: decoupled train/eval lengths
  • train_length: 1024
  • eval_length: 2048
Evaluation
  • long context eval (context_length: 2048)
  • sliding window eval (stride: null)
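Sliding-window evaluation scores a long sequence in overlapping windows, so every token is scored exactly once while earlier tokens in each window serve as context. The PR leaves the stride unset; the window of 1024 and stride of 512 below are placeholder assumptions:

```python
def sliding_windows(seq_len: int, window: int = 1024, stride: int = 512):
    """Plan sliding-window eval spans over a sequence of length seq_len.

    Returns (start, end, score_from) triples: the model sees tokens
    [start, end) but only tokens [score_from, end) contribute to the
    loss, so each token is scored exactly once.
    """
    spans = []
    start, scored_to = 0, 0
    while scored_to < seq_len:
        end = min(start + window, seq_len)
        spans.append((start, end, scored_to))
        scored_to = end
        start += stride
    return spans

# A 2048-token eval sequence split into three overlapping windows.
spans = sliding_windows(2048)
```

After the first window, each window scores only its final stride tokens and reuses the preceding window - stride tokens as context, which is how a model trained at length 1024 can be evaluated at length 2048.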
Compression: zlib
  • level: null
Quantization
  • int8 (bits: 8, scope: all)
  • mixed int6/int8 (bits: 6, scope: all block matrices)
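A minimal sketch of the mixed-precision scheme, assuming symmetric per-tensor quantization (numpy stands in for the real tensor library; the scale handling and the block-key selection are illustrative, not the PR's implementation):

```python
import numpy as np

def quantize(w: np.ndarray, bits: int = 8):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                        # 127 for int8, 31 for int6
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

def quantize_state(state: dict, block_keys: set) -> dict:
    """Mixed int6/int8: block matrices get 6 bits, everything else 8."""
    return {name: quantize(w, 6 if name in block_keys else 8)
            for name, w in state.items()}
```

Per element, the reconstruction error is bounded by half the scale, so dropping block matrices from 8 to 6 bits trades a ~4x larger rounding error on those weights for a smaller artifact.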
Architecture: depth recurrence
  • Reuses a smaller set of unique layers across multiple recurrent passes.
  • num_unique_layers: 4, num_recurrence: 3
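As parameterized here (num_unique_layers=4, num_recurrence=3), depth recurrence applies the same four layers three times, giving twelve effective layers while storing only four sets of weights. A minimal sketch, with plain callables standing in for transformer blocks:

```python
def recurrent_forward(x, unique_layers, num_recurrence: int = 3):
    """Run `num_recurrence` passes over the same stack of unique layers."""
    for _ in range(num_recurrence):
        for layer in unique_layers:
            x = layer(x)
    return x

# Four toy "layers" that record their index; reused over three passes
# gives twelve layer applications in total.
layers = [lambda xs, i=i: xs + [i] for i in range(4)]
trace = recurrent_forward([], layers)
```

Parameter count scales with the number of unique layers, while effective depth scales with unique layers times recurrence passes.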
Other
  • Alias-aware serialization to store shared weights once.
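With depth recurrence, the same weight object appears under several layer names, so a naive save would write it once per alias. Alias-aware serialization can be sketched by deduplicating on object identity; the names and structure here are illustrative, not the PR's code:

```python
def alias_aware_pack(state: dict):
    """Keep one copy per unique object; aliases store only the owner's name."""
    storage, refs, owner_of = {}, {}, {}
    for name, tensor in state.items():
        key = id(tensor)
        if key not in owner_of:
            owner_of[key] = name      # first name to see this object owns the storage
            storage[name] = tensor
        refs[name] = owner_of[key]    # every name points at its owner
    return storage, refs

def alias_aware_unpack(storage: dict, refs: dict) -> dict:
    """Rebuild the full state dict, re-aliasing shared weights."""
    return {name: storage[owner] for name, owner in refs.items()}
```

For a recurrent model whose passes share layers, the packed storage holds only the unique weights, which is what keeps the artifact at 15.8 MB despite twelve effective layers.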

Novel Contributions

  • Identified that WARMDOWN_ITERS=1200 was broken under the 600 s wallclock limit and fixed it by increasing it to 3000.
  • Stacked multiple hyperparameter fixes to improve val_bpb without changing the architecture.
  • Decoupled training and evaluation sequence lengths, using train length 1024 and eval length 2048.
  • Added alias-aware serialization so shared weights are stored once.
  • Implemented mixed int6/int8 quantization support for block matrices.
  • Implemented sliding-window evaluation support.
  • Added depth recurrence support.
  • Ran extensive autoresearch over 40+ experiments and reported several negative results.