PR #104

open

Non-record: Stacked hyperparameter tuning + eval2048 (RTX 5090, val_bpb 1.336)

by gwelinder
val_bpb: 1.3358
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.8 MB

Training Techniques

Optimizer: Muon
  • weight_decay: null
  • momentum: 0.99
  • matrix_lr: 0.06
LR Schedule: warmdown
  • warmdown_iters: 3000
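A warmdown schedule of this shape can be sketched as a constant learning rate followed by a linear decay to zero over the final warmdown_iters steps. The linear decay shape and the 6000-iteration total below are illustrative assumptions; only warmdown_iters=3000 comes from this PR:

```python
def get_lr(step: int, base_lr: float, total_iters: int,
           warmdown_iters: int = 3000) -> float:
    """Constant LR, then linear warmdown to zero over the final steps."""
    if step < total_iters - warmdown_iters:
        return base_lr
    # Fraction of the warmdown remaining, in [0, 1].
    frac = (total_iters - step) / warmdown_iters
    return base_lr * frac

# Example: 6000 total iters, warmdown over the last 3000.
lr_start = get_lr(0, 0.06, 6000)      # constant phase
lr_mid = get_lr(4500, 0.06, 6000)     # halfway through warmdown
lr_end = get_lr(6000, 0.06, 6000)     # fully decayed
```

A too-short warmdown (the broken 1200 setting) leaves the schedule stuck at base_lr for most of the wallclock-limited run.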
Sequence Length: decoupled train/eval lengths
  • train_length: 1024
  • eval_length: 2048
Evaluation
  • long context eval (context_length: 2048)
  • sliding window eval (stride: null)
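Sliding-window evaluation scores a long sequence in overlapping windows, so every token is scored exactly once while earlier tokens in each window serve as context. The PR leaves the stride unset; the window of 1024 and stride of 512 below are placeholder assumptions:

```python
def sliding_windows(seq_len: int, window: int = 1024, stride: int = 512):
    """Plan sliding-window eval spans over a sequence of length seq_len.

    Returns (start, end, score_from) triples: the model sees tokens
    [start, end) but only tokens [score_from, end) contribute to the
    loss, so each token is scored exactly once.
    """
    spans = []
    start, scored_to = 0, 0
    while scored_to < seq_len:
        end = min(start + window, seq_len)
        spans.append((start, end, scored_to))
        scored_to = end
        start += stride
    return spans

# A 2048-token eval sequence split into three overlapping windows.
spans = sliding_windows(2048)
```

After the first window, each window scores only its final stride tokens and reuses the preceding window - stride tokens as context, which is how a model trained at length 1024 can be evaluated at length 2048.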
Compression: zlib
  • level: null
Quantization
  • int8 (bits: 8, scope: all)
  • mixed int6/int8 (bits: 6, scope: all block matrices)
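A minimal sketch of the mixed-precision scheme, assuming symmetric per-tensor quantization (numpy stands in for the real tensor library; the scale handling and the block-key selection are illustrative, not the PR's implementation):

```python
import numpy as np

def quantize(w: np.ndarray, bits: int = 8):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                        # 127 for int8, 31 for int6
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

def quantize_state(state: dict, block_keys: set) -> dict:
    """Mixed int6/int8: block matrices get 6 bits, everything else 8."""
    return {name: quantize(w, 6 if name in block_keys else 8)
            for name, w in state.items()}
```

Per element, the reconstruction error is bounded by half the scale, so dropping block matrices from 8 to 6 bits trades a ~4x larger rounding error on those weights for a smaller artifact.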
Architecture: depth recurrence
  • Reuses a smaller set of unique layers across multiple recurrent passes.
  • num_unique_layers: 4, num_recurrence: 3
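As parameterized here (num_unique_layers=4, num_recurrence=3), depth recurrence applies the same four layers three times, giving twelve effective layers while storing only four sets of weights. A minimal sketch, with plain callables standing in for transformer blocks:

```python
def recurrent_forward(x, unique_layers, num_recurrence: int = 3):
    """Run `num_recurrence` passes over the same stack of unique layers."""
    for _ in range(num_recurrence):
        for layer in unique_layers:
            x = layer(x)
    return x

# Four toy "layers" that record their index; reused over three passes
# gives twelve layer applications in total.
layers = [lambda xs, i=i: xs + [i] for i in range(4)]
trace = recurrent_forward([], layers)
```

Parameter count scales with the number of unique layers, while effective depth scales with unique layers times recurrence passes.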
Other
  • Alias-aware serialization to store shared weights once.
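With depth recurrence, the same weight object appears under several layer names, so a naive save would write it once per alias. Alias-aware serialization can be sketched by deduplicating on object identity; the names and structure here are illustrative, not the PR's code:

```python
def alias_aware_pack(state: dict):
    """Keep one copy per unique object; aliases store only the owner's name."""
    storage, refs, owner_of = {}, {}, {}
    for name, tensor in state.items():
        key = id(tensor)
        if key not in owner_of:
            owner_of[key] = name      # first name to see this object owns the storage
            storage[name] = tensor
        refs[name] = owner_of[key]    # every name points at its owner
    return storage, refs

def alias_aware_unpack(storage: dict, refs: dict) -> dict:
    """Rebuild the full state dict, re-aliasing shared weights."""
    return {name: storage[owner] for name, owner in refs.items()}
```

For a recurrent model whose passes share layers, the packed storage holds only the unique weights, which is what keeps the artifact at 15.8 MB despite twelve effective layers.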

Novel Contributions

  • Identified that WARMDOWN_ITERS=1200 was broken under the 600 s wallclock limit and fixed it by increasing it to 3000.
  • Stacked multiple hyperparameter fixes to improve val_bpb without changing the architecture.
  • Decoupled training and evaluation sequence lengths, using train length 1024 and eval length 2048.
  • Added alias-aware serialization so shared weights are stored once.
  • Implemented mixed int6/int8 quantization support for block matrices.
  • Implemented sliding-window evaluation support.
  • Added depth recurrence support.
  • Ran extensive autoresearch over 40+ experiments and reported several negative results.