PR #104
Status: open
Non-record: Stacked hyperparameter tuning + eval2048 (RTX 5090, val_bpb 1.336)
by gwelinder
val_bpb
1.3358
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.8MB
Training Techniques
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.06}
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
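A minimal sketch of a warmdown schedule: hold the learning rate flat, then decay linearly to zero over the final `warmdown_iters` steps. Only `warmdown_iters: 3000` comes from this PR; the flat-then-linear shape and the total-iteration count are illustrative assumptions.

```python
def lr_scale(step, total_iters, warmdown_iters=3000):
    """Return the LR multiplier at `step`.

    Assumed shape: constant 1.0 until the warmdown window, then a
    linear ramp down to 0.0 at `total_iters` (warmdown_iters=3000
    is the value this PR settled on).
    """
    warmdown_start = total_iters - warmdown_iters
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)
```

Multiply the optimizer's base LR (e.g. the Muon `matrix_lr` of 0.06) by this factor each step.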
Sequence Length
sequence_length
train_length: 1024
eval_length: 2048
Evaluation
long context eval
parameters: {"context_length":2048}
sliding window eval
parameters: {"stride":null}
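Sliding-window evaluation can be sketched as below: overlapping windows of up to `context_length` tokens, where each window scores only the tokens not covered by the previous one, so every token is evaluated exactly once with as much left context as the window allows. The `stride` value is illustrative; the PR records it as unspecified (null).

```python
def sliding_windows(tokens, context_length=2048, stride=1024):
    """Yield (window, n_scored) pairs for sliding-window evaluation.

    Each window is at most `context_length` tokens; `n_scored` is the
    number of trailing tokens in that window to actually score (the
    rest serve only as context). `stride` < `context_length` gives
    every scored token extra left context. Stride is a hypothetical
    default here -- the PR leaves it unspecified.
    """
    scored_up_to = 0
    start = 0
    while scored_up_to < len(tokens):
        end = min(start + context_length, len(tokens))
        yield tokens[start:end], end - scored_up_to
        scored_up_to = end
        start += stride
```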
Compression
zlib
level: null
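Since the compression level is recorded as null, the artifact was presumably compressed with zlib defaults. A minimal sketch (the payload here is a placeholder, not the actual checkpoint):

```python
import zlib

# Stand-in payload; what exactly is compressed in this PR's pipeline
# is not specified. zlib.compress with no level argument uses the
# library default (Z_DEFAULT_COMPRESSION, i.e. level 6).
blob = bytes(1_000_000)
compressed = zlib.compress(blob)
restored = zlib.decompress(compressed)
```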
Quantization
int8
bits: 8
scope: all
mixed int6/int8
bits: 6
scope: all block matrices
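A generic symmetric per-tensor quantizer that covers both cases above (int8 at `bits=8`, int6 at `bits=6`). This is a sketch of the technique, not the PR's implementation; the actual scheme for block matrices may differ in granularity (per-block vs per-tensor) and rounding.

```python
def quantize_symmetric(weights, bits=8):
    """Symmetric quantization of a flat list of floats to signed
    `bits`-bit integers. Returns (quantized_ints, scale)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 31 for int6
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid 0 scale
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from quantized integers."""
    return [qi * scale for qi in q]
```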
Architecture
depth recurrence
Support for reusing a smaller set of unique layers across multiple recurrent passes.
parameters: {"num_unique_layers":4,"num_recurrence":3}
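The depth-recurrence idea above can be sketched as a forward pass that loops a small stack of unique layers several times, giving an effective depth of `num_unique_layers * num_recurrence` (4 × 3 = 12 here) while storing only 4 layers' worth of weights. The layer interface is a placeholder assumption.

```python
def depth_recurrent_forward(x, layers, num_recurrence=3):
    """Apply `layers` (the unique transformer blocks) `num_recurrence`
    times in sequence. With 4 unique layers and 3 passes this yields
    an effective 12-layer network from 4 layers of parameters."""
    for _ in range(num_recurrence):
        for layer in layers:
            x = layer(x)
    return x
```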
Other
other
Alias-aware serialization to store shared weights once.
parameters: null
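Alias-aware serialization can be sketched as deduplication by object identity: each unique tensor is stored once, and the manifest maps parameter names to storage indices, so weights shared across recurrent passes do not inflate the artifact. This is a generic sketch, not the PR's code; the tensors here are plain lists for illustration.

```python
def serialize_aliased(state):
    """Split a name->tensor mapping into (storage, manifest), where
    aliased tensors (same object) occupy a single storage slot."""
    storage, manifest, seen = [], {}, {}
    for name, tensor in state.items():
        key = id(tensor)
        if key not in seen:
            seen[key] = len(storage)
            storage.append(tensor)
        manifest[name] = seen[key]
    return storage, manifest

def deserialize_aliased(storage, manifest):
    """Rebuild the name->tensor mapping, preserving aliasing."""
    return {name: storage[idx] for name, idx in manifest.items()}
```

Note that real checkpoint formats (e.g. pickle-based ones) already memoize shared objects; the point here is making the dedup explicit so shared weights are stored once by construction.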
Novel Contributions
- Identified that WARMDOWN_ITERS=1200 was mis-calibrated for the 600s wallclock budget and fixed it by increasing it to 3000.
- Stacked multiple hyperparameter fixes to improve val_bpb without changing the architecture.
- Decoupled training and evaluation sequence lengths, using train length 1024 and eval length 2048.
- Added alias-aware serialization so shared weights are stored once.
- Implemented mixed int6/int8 quantization support for block matrices.
- Implemented sliding-window evaluation support.
- Added depth recurrence support.
- Ran extensive autoresearch over 40+ experiments and reported several negative results.