PR #136

open

Record: Seq2048 training + eval (val_bpb=1.2101)

val_bpb

1.2101

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.87 MB

Training Techniques

Quantization

int8

bits: 8

scope: all

Architecture

tied embeddings

Uses tied input/output embeddings as part of the baseline configuration.

parameters: null

KV head count

Baseline configuration uses 8 heads and 4 KV heads.

parameters: {"heads":8,"kv_heads":4}
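
With 8 query heads and 4 KV heads, each KV head is shared by two query heads, a layout commonly called grouped-query attention. The sketch below shows one way the sharing could be indexed. The contiguous grouping and the helper name `kv_head_for_query_head` are assumptions for illustration, not taken from this PR.

```python
def kv_head_for_query_head(q_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Return the KV head index used by a query head.

    Assumes contiguous grouping (query heads 0..g-1 share KV head 0, and so on);
    the PR does not state which grouping the model uses.
    """
    if heads % kv_heads != 0:
        raise ValueError("heads must be divisible by kv_heads")
    if not 0 <= q_head < heads:
        raise ValueError("q_head out of range")
    group_size = heads // kv_heads  # 8 // 4 = 2 query heads per KV head
    return q_head // group_size


# Mapping for the configuration listed above (heads=8, kv_heads=4).
mapping = [kv_head_for_query_head(h) for h in range(8)]
```

Under this assumption the K and V projections are half the size of the Q projection.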

Optimizer

Muon

weight_decay: null

momentum: 0.95

other_params: {"matrix_lr":0.04}

Compression

zlib

level: null
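
The artifact is described as int8-quantized and zlib-compressed. A minimal sketch of that kind of pipeline follows, using only the standard library. The symmetric per-tensor scale, the byte layout, and the function names are assumptions for illustration; the PR does not specify its quantization scheme. Since `level` is null above, the sketch uses zlib's default level.

```python
import struct
import zlib


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127] (assumed scheme)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, q


def pack(scale, q):
    """Serialize as a float32 scale followed by int8 values, then zlib-compress."""
    raw = struct.pack("<f", scale) + struct.pack(f"<{len(q)}b", *q)
    return zlib.compress(raw)  # default compression level


def unpack(blob):
    """Invert pack() and dequantize back to floats."""
    raw = zlib.decompress(blob)
    (scale,) = struct.unpack("<f", raw[:4])
    q = struct.unpack(f"<{len(raw) - 4}b", raw[4:])
    return [scale * v for v in q]


weights = [0.5, -1.0, 0.25, 0.0]  # toy values, not model weights
scale, q = quantize_int8(weights)
restored = unpack(pack(scale, q))
```

Each restored value differs from the original by at most about half a quantization step.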

Evaluation

long context eval

parameters: {"context_length":2048}

Sequence Length

sequence_length

train_length: 2048

eval_length: 2048

LR Schedule

warmdown

parameters: {"warmdown_iters":1200}
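
A warmdown schedule typically holds the learning rate constant and then decays it over the final iterations. The sketch below assumes a linear decay to zero over the last `warmdown_iters` steps. The decay shape, the function name `lr_scale`, and the total of 10000 iterations are hypothetical; the PR only gives `warmdown_iters` = 1200.

```python
def lr_scale(step: int, total_iters: int, warmdown_iters: int = 1200) -> float:
    """Multiplier applied to the base learning rate at a given step.

    Assumes a linear ramp from 1.0 to 0.0 across the final warmdown_iters steps.
    """
    if warmdown_iters <= 0:
        return 1.0
    warmdown_start = total_iters - warmdown_iters
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)


TOTAL_ITERS = 10_000  # hypothetical run length, not from the PR
MATRIX_LR = 0.04      # matrix_lr listed under the optimizer settings

lr_midway = MATRIX_LR * lr_scale(9_400, TOTAL_ITERS)  # halfway through warmdown
```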

Train and evaluate at sequence length 2048 instead of 1024.
Training at the longer context means every position used at evaluation was also seen during training, so evaluation interpolates within the trained RoPE range rather than extrapolating beyond it.
Tokens per step are unchanged: each step uses half as many sequences, each twice as long.
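
The batch arithmetic can be checked with a short sketch. The tokens-per-step value of 524288 and the helper names are hypothetical, chosen only to make the numbers concrete; the PR does not state the actual budget.

```python
def sequences_per_step(tokens_per_step: int, seq_len: int) -> int:
    """Number of sequences per step for a fixed token budget."""
    if tokens_per_step % seq_len != 0:
        raise ValueError("tokens_per_step must be a multiple of seq_len")
    return tokens_per_step // seq_len


def eval_within_trained_range(train_len: int, eval_len: int) -> bool:
    """True when every evaluation position index also occurs during training."""
    return eval_len <= train_len


TOKENS_PER_STEP = 524_288  # hypothetical budget, not from the PR

seqs_at_1024 = sequences_per_step(TOKENS_PER_STEP, 1024)
seqs_at_2048 = sequences_per_step(TOKENS_PER_STEP, 2048)
```

Doubling the sequence length halves the sequence count, so the token budget per step stays the same. With `train_length` and `eval_length` both 2048, `eval_within_trained_range(2048, 2048)` holds, whereas training at 1024 and evaluating at 2048 would not.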