val_bpb: 1.2014
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15868326 bytes
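For context, val_bpb is bits per byte on the validation text. Below is a minimal sketch of converting summed cross-entropy (in nats) into bits per byte; the exact evaluation harness behind the number above is not part of this card, so the numbers in the usage comment are hypothetical.

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed validation cross-entropy (nats over all eval tokens)
    into bits per byte of the underlying validation text."""
    total_bits = total_loss_nats / math.log(2)  # nats -> bits
    return total_bits / total_bytes

# Hypothetical numbers, not the submission's actual counts:
# bits_per_byte(total_loss_nats=3.3e8, total_bytes=4 * 10**8)
```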
Training Techniques
Architecture
tied embeddings
Input and output embeddings are tied.
parameters: null
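A minimal PyTorch-style sketch of the tied-embedding setup described above; module and class names are illustrative and not taken from the submission's code.

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Toy language-model skeleton whose input and output embeddings
    share a single weight matrix (transformer blocks omitted)."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tying: one shared parameter

    def forward(self, idx):
        h = self.embed(idx)      # (batch, seq, d_model)
        return self.lm_head(h)   # logits over the vocabulary
```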
Sequence Length
sequence_length
train_length: 4096
eval_length: null
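The card does not include the data pipeline, so the following is only a sketch of how a flat token stream could be cut into train_length=4096 sequences; the actual loader may differ.

```python
import numpy as np

def chunk_tokens(tokens: np.ndarray, train_length: int = 4096) -> np.ndarray:
    """Drop the ragged tail and reshape a flat token stream into
    fixed-length training sequences of train_length tokens each."""
    usable = (len(tokens) // train_length) * train_length
    return tokens[:usable].reshape(-1, train_length)
```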
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"momentum_warmup_steps":1500,"momentum_warmup_start":0.92,"tied_embed_lr":0.03,"matrix_lr":0.02,"scalar_lr":0.02,"train_batch_tokens":393216}
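A small sketch of how the momentum warmup and per-group learning rates above could fit together. The linear shape of the warmup and the split into tied-embedding, matrix, and scalar parameter groups are assumptions; only the endpoint values and step counts come from the card. Note also that train_batch_tokens 393216 at sequence length 4096 corresponds to 96 sequences per optimizer step.

```python
def muon_momentum(step: int,
                  warmup_steps: int = 1500,
                  start: float = 0.92,
                  final: float = 0.99) -> float:
    """Warm momentum up from momentum_warmup_start to the final value over
    momentum_warmup_steps; a linear ramp is assumed here."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (final - start)

# Per-group learning rates as listed in other_params; how parameters are
# assigned to these groups is an assumption, not read from the submission.
LR_GROUPS = {"tied_embed": 0.03, "matrix": 0.02, "scalar": 0.02}

SEQUENCES_PER_STEP = 393216 // 4096  # = 96 sequences per optimizer step
```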
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
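A sketch of a warmdown schedule that holds the learning rate constant and then decays it over the final warmdown_steps; only warmdown_steps=3000 comes from the card, and the linear shape of the decay is an assumption.

```python
def lr_scale(step: int, total_steps: int, warmdown_steps: int = 3000) -> float:
    """Return the LR multiplier: 1.0 until the warmdown window begins,
    then a linear ramp down to 0.0 at the final step."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return 1.0
    return max(steps_left / warmdown_steps, 0.0)
```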
Quantization
int8
bits: 8
scope: all
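A minimal int8 quantize/dequantize roundtrip matching bits: 8 with scope: all; whether the submission uses per-tensor or per-channel scales is not stated, so the symmetric per-tensor scheme shown here is an assumption.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: int8 values plus one float scale."""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct float weights, e.g. before measuring post-quantization BPB."""
    return q.astype(np.float32) * scale
```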
Compression
zlib
level: null
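A sketch of how the quantized state could be serialized and zlib-compressed into the artifact whose size is reported above; the serialization format is illustrative, and level: null is read here as zlib's default compression level.

```python
import pickle
import zlib

def compress_artifact(quantized_state: dict) -> bytes:
    """Serialize the int8 tensors and their scales, then zlib-compress the blob."""
    blob = pickle.dumps(quantized_state)
    return zlib.compress(blob)  # default level, since level is null in the card

def artifact_size_bytes(quantized_state: dict) -> int:
    """Byte count of the compressed artifact (cf. Artifact Size above)."""
    return len(compress_artifact(quantized_state))
```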
Novel Contributions
- Longer training context with sequence length 4096
- Aggressively tuned Muon optimizer momentum and learning rates
- Reduced training batch tokens to increase update frequency
- Extended momentum warmup to stabilize high-momentum training
- Longer warmdown schedule for the shorter, wallclock-limited run
- Int8-quantized roundtrip submission with improved post-quantization BPB