PR #136 (open)

Record: Seq2048 training + eval (val_bpb=1.2101)

by ibarrajo
val_bpb: 1.2101
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.87 MB
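The 15.87 MB artifact size is consistent with the int8 quantization and zlib compression listed under Training Techniques below. A minimal sketch of that packing path, assuming NumPy float32 weights and symmetric per-tensor scales (the PR's actual serialization format is not shown here):

```python
import zlib
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization (scope: all tensors)."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def pack_artifact(weights: dict) -> bytes:
    """Quantize every tensor to int8, then zlib-compress the concatenated bytes."""
    blobs = []
    for name in sorted(weights):
        q, scale = quantize_int8(weights[name])
        # store the scale alongside the int8 payload so it can be dequantized later
        blobs.append(np.float32(scale).tobytes() + q.tobytes())
    return zlib.compress(b"".join(blobs))

weights = {"wte": np.random.randn(256, 64).astype(np.float32)}
packed = pack_artifact(weights)
# int8 alone gives ~4x over float32; zlib adds further savings on real weights
print(len(packed) < weights["wte"].nbytes)  # -> True
```

The tensor name `wte` and the per-tensor scale layout are illustrative assumptions, not details from the record.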

Training Techniques

  • Quantization: int8 (bits: 8, scope: all)
  • Architecture: tied embeddings (tied input/output embeddings as part of the baseline configuration; parameters: null)
  • Architecture: KV head count (baseline configuration uses 8 attention heads and 4 KV heads; parameters: {"heads": 8, "kv_heads": 4})
  • Optimizer: Muon (weight_decay: null, momentum: 0.95, other_params: {"matrix_lr": 0.04})
  • Compression: zlib (level: null)
  • Evaluation: long context eval (parameters: {"context_length": 2048})
  • Sequence Length: sequence_length (train_length: 2048, eval_length: 2048)
  • LR Schedule: warmdown (parameters: {"warmdown_iters": 1200})
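The Muon entry above (momentum 0.95, matrix_lr 0.04, no weight decay) can be illustrated with a NumPy sketch of the published Muon update: a momentum buffer whose contents are approximately orthogonalized by a quintic Newton-Schulz iteration before being applied. This is a simplified single-matrix version, not the PR's implementation; `muon_step` is a hypothetical helper for illustration:

```python
import numpy as np

def newton_schulz5(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Quintic Newton-Schulz iteration that approximately orthogonalizes G."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon write-up
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, buf, lr=0.04, momentum=0.95):
    """One simplified Muon update for a single 2D weight matrix."""
    buf = momentum * buf + grad    # momentum: 0.95 in this record
    update = newton_schulz5(buf)   # orthogonalize the momentum buffer
    return param - lr * update, buf  # matrix_lr: 0.04, no weight decay
```

Orthogonalizing the update equalizes its singular values, which is why Muon takes a separate `matrix_lr` for 2D weights rather than reusing the scalar-parameter learning rate.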

Novel Contributions

  • Train and evaluate at sequence length 2048 instead of 1024.
  • Train on a longer context so that evaluation at 2048 tokens stays within the trained RoPE position range (interpolation) rather than extrapolating beyond it.
  • Hold tokens per step constant by halving the number of sequences per batch while doubling their length.
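The last bullet is simple arithmetic: doubling the sequence length while holding tokens per step fixed halves the number of sequences per batch. The token budget below is a hypothetical placeholder; only the 1024 and 2048 lengths come from this record:

```python
tokens_per_step = 524_288  # hypothetical fixed token budget per optimizer step

baseline_len, record_len = 1024, 2048
baseline_seqs = tokens_per_step // baseline_len  # sequences per step at 1024
record_seqs = tokens_per_step // record_len      # sequences per step at 2048

# tokens per step is unchanged even though sequence count and length both changed
assert baseline_seqs * baseline_len == record_seqs * record_len
print(baseline_seqs, record_seqs)  # -> 512 256
```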