val_bpb: 1.1764
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.88 MB
Training Techniques

Optimizer: Muon
  weight_decay: null
  momentum: 0.99
  other_params: {"matrix_lr": 0.02, "scalar_lr": 0.02, "tied_embed_lr": 0.03, "momentum_warmup_start": 0.92, "momentum_warmup_steps": 1500}
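The momentum warmup implied by `momentum_warmup_start=0.92` and `momentum_warmup_steps=1500` can be sketched as a schedule function. The linear ramp shape is an assumption; the entry only records the start value, the final momentum (0.99), and the step count.

```python
def muon_momentum(step: int,
                  start: float = 0.92,
                  final: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Ramp momentum linearly from `start` to `final` over `warmup_steps`,
    then hold it constant. Linear interpolation is an assumption here;
    only the endpoints and step count come from the entry."""
    if step >= warmup_steps:
        return final
    return start + (step / warmup_steps) * (final - start)
```

Each optimizer step would read its momentum from this schedule rather than using the fixed 0.99 from step zero.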
Sequence Length: sequence_length
  train_length: 2048
  eval_length: 2048
Evaluation: sliding window eval
  parameters: {"stride": 512, "context_length": 2048}
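A minimal sketch of how the `stride=512`, `context_length=2048` windows could be laid out. It assumes the common convention that the first window scores all of its tokens and every later window scores only its final `stride` tokens, so each scored token sees at least `context - stride` tokens of left context; the entry itself records only the two parameter values.

```python
def sliding_windows(n_tokens: int, context: int = 2048, stride: int = 512):
    """Yield (window_start, window_end, score_start) triples.
    The model sees tokens [window_start, window_end), but loss is
    accumulated only on [score_start, window_end). Assumes
    n_tokens >= context and n_tokens aligned to the stride."""
    windows = []
    start = 0
    while True:
        end = min(start + context, n_tokens)
        score_start = start if start == 0 else start + (context - stride)
        windows.append((start, end, score_start))
        if end == n_tokens:
            break
        start += stride
    return windows
```

With this layout the scored spans tile the sequence exactly once, so the sliding-window bpb is comparable across runs trained at different sequence lengths.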
LR Schedule: warmdown
  parameters: {"warmdown_iters": 3000}
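The warmdown schedule can be sketched as a constant learning rate followed by a linear decay to zero over the final `warmdown_iters` steps. The constant-then-linear shape is an assumption; the entry records only `warmdown_iters=3000`.

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_iters: int = 3000) -> float:
    """Hold base_lr constant, then decay linearly to zero over the last
    `warmdown_iters` steps. The shape is an assumption; only
    warmdown_iters=3000 comes from the entry."""
    if step < total_steps - warmdown_iters:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_iters
```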
Regularization: gradient clipping
  parameters: {"norm": 0.3}
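Standard global-norm gradient clipping with the entry's `norm=0.3`, sketched on a flat list of floats for illustration (a real run would clip the concatenated gradients of all parameters, e.g. via a framework utility).

```python
import math

def clip_grad_norm(grads, max_norm: float = 0.3):
    """Scale all gradients so their global L2 norm is at most max_norm.
    `grads` is a flat list of floats standing in for all parameter
    gradients; max_norm=0.3 is the value recorded in the entry."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```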
Architecture: tied embeddings
  Uses tied embedding parameters with a separate learning rate.
  parameters: null
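A minimal sketch of weight tying, assuming the standard scheme in which a single matrix W (vocab_size x d_model) serves both as the input embedding table and as the output projection. Under that assumption, the separate `tied_embed_lr=0.03` from the optimizer config would apply to this one parameter while other matrices use `matrix_lr=0.02`; the param-group names below are illustrative placeholders.

```python
# One shared matrix: vocab_size=3, d_model=2 (toy values for illustration).
W = [[0.1, 0.2],   # embedding row for token 0
     [0.3, 0.4],   # token 1
     [0.5, 0.6]]   # token 2

def embed(token_id):
    """Input side: look up the token's row of the shared matrix W."""
    return W[token_id]

def logits(hidden):
    """Output side: project the hidden state against the same matrix W."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in W]

# Hypothetical optimizer param groups reflecting the entry's learning rates.
param_groups = [
    {"params": ["W"], "lr": 0.03},               # tied_embed_lr
    {"params": ["other_matrices"], "lr": 0.02},  # matrix_lr
]
```

Tying halves the embedding-related parameter count, which matters for the small artifact size reported above; the separate learning rate lets the shared matrix be tuned independently of the other weight matrices.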
Novel Contributions
- Training at 2048 tokens performs identically to 4096 tokens under sliding-window evaluation, so shorter training sequences are preferable within the time budget.
- A narrow gradient-clipping sweet spot was found for long-sequence training, with a max-norm of 0.3 outperforming the other values tested.
- Batch size 786,432 tokens was identified as the best tradeoff for training at 2048-token sequences.
- Quantization-aware warmdown from an earlier PR reduces the post-quantization penalty, but only at higher base learning rates.