PR #231

Status: open

Record: SEQ_LEN=4096 training

by lenguyen1807
val_bpb
1.2036
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Architecture
tied embeddings
The model's input and output embeddings share a single weight matrix.
parameters: null
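A minimal sketch of what weight tying means here (illustrative, not the PR's code): one matrix serves as both the token-embedding table and the output projection, so the embedding parameters are counted once. Sizes below are arbitrary.

```python
import numpy as np

vocab_size, d_model = 16, 8
rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(vocab_size, d_model))  # shared table

def embed(token_ids):
    # input side: row lookup into the shared matrix
    return W[token_ids]

def logits(hidden):
    # output side: the same weights, transposed
    return hidden @ W.T

h = embed(np.array([3, 7]))   # (2, d_model)
out = logits(h)               # (2, vocab_size)
```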
RoPE
NTK-aware RoPE scaling for longer-context evaluation/training.
parameters: null
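The PR reports no RoPE parameters, so the following is only a hedged sketch of the standard NTK-aware trick: rescale the rotary base by `scale ** (d / (d - 2))` so low frequencies stretch to cover a longer context while high frequencies barely move. The base of 10000, head_dim of 64, and 4x extension factor are illustrative assumptions.

```python
def ntk_scaled_base(base: float, head_dim: int, scale: float) -> float:
    # Standard NTK-aware base rescaling for longer contexts.
    return base * scale ** (head_dim / (head_dim - 2))

def rope_freqs(head_dim: int, base: float):
    # Per-pair rotary frequencies base^(-2i/d), i = 0..d/2-1.
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

new_base = ntk_scaled_base(10000.0, 64, 4.0)  # e.g. extending 4096 -> 16384
freqs = rope_freqs(64, new_base)
```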
KV head count
Uses grouped-query style attention with fewer KV heads than query heads.
parameters: {"num_heads":8,"num_kv_heads":4}
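With the PR's settings (num_heads=8, num_kv_heads=4), each K/V head is shared by two query heads. A small numpy sketch of that sharing (head_dim and sequence length are arbitrary; this is not the PR's implementation):

```python
import numpy as np

num_heads, num_kv_heads, head_dim, seq = 8, 4, 16, 4
group = num_heads // num_kv_heads  # query heads per KV head -> 2

rng = np.random.default_rng(0)
q = rng.normal(size=(num_heads, seq, head_dim))
k = rng.normal(size=(num_kv_heads, seq, head_dim))
v = rng.normal(size=(num_kv_heads, seq, head_dim))

# Expand KV along the head axis so shapes line up with the query heads.
k_full = np.repeat(k, group, axis=0)  # (8, seq, head_dim)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_full  # (8, seq, head_dim)
```

The KV cache only stores 4 heads, halving its size versus full multi-head attention.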
phase-transition resid_mix
Applies phase-transition residual mixing in the architecture.
parameters: null
Optimizer
Muon
weight_decay: 0.02
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.04}
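A hedged sketch of a Muon-style step with the PR's hyperparameters (momentum 0.99, matrix_lr 0.02): momentum is accumulated on the gradient, then the 2-D update is approximately orthogonalized by Newton-Schulz iteration. The quintic coefficients follow the commonly published Muon recipe; the PR's exact implementation, weight decay handling, and per-group learning rates are not shown here.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration that drives singular values toward ~1.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum_buf, lr=0.02, momentum=0.99):
    momentum_buf = momentum * momentum_buf + grad
    update = newton_schulz(momentum_buf)
    return param - lr * update, momentum_buf

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
g = rng.normal(size=(8, 4))
W2, buf = muon_step(W, g, np.zeros_like(g), lr=0.02, momentum=0.99)
```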
Evaluation
sliding window eval
parameters: {"stride":64}
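Sliding-window evaluation with the PR's stride of 64 can be sketched as the usual strided-perplexity loop: overlapping windows slide forward 64 tokens at a time, and only tokens not already scored by an earlier window count toward val_bpb, so most tokens are predicted with near-maximal left context. Window size and token counts below are illustrative; the model call itself is omitted.

```python
def sliding_windows(n_tokens: int, window: int = 4096, stride: int = 64):
    """Yield (begin, end, n_scored) spans covering tokens[0:n_tokens]."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        n_scored = end - prev_end  # only tokens not yet scored
        spans.append((begin, end, n_scored))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(10, window=4, stride=2)
```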
Initialization
overtone init
Uses overtone embedding initialization.
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
extended warmup
parameters: {"warmup_steps":1500}
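The extended warmup (warmup_steps=1500 from the PR) amounts to a longer linear ramp on the learning-rate multiplier. The post-warmup shape is an assumption here (linear decay to zero, with an assumed total step count), since the PR does not state it:

```python
def lr_scale(step: int, warmup_steps: int = 1500, total_steps: int = 20000) -> float:
    if step < warmup_steps:
        return (step + 1) / warmup_steps  # linear ramp 0 -> 1
    # assumed linear decay to zero after warmup
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```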
Quantization
mixed-bit lowbit export
bits: null
scope: selected block weights
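The PR leaves the bit-width unspecified (`bits: null`), so the following is only a generic sketch of symmetric per-tensor low-bit quantize/dequantize as it might apply to the selected block weights; `bits=8` is an illustrative choice, not the PR's setting.

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int):
    # Symmetric per-tensor quantization; int8 storage assumes bits <= 8.
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(w).max()) / qmax
    q = np.round(w / scale).clip(-qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=(4, 4)).astype(np.float32)
q, scale = quantize_dequantize(w, bits=8)
w_hat = q.astype(np.float32) * scale  # dequantized reconstruction
```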

Novel Contributions

  • Long-context training with sequence length 4096
  • Sliding-window evaluation with stride 64
  • FP16 tied embedding export
  • Overtone embedding initialization
  • Phase-transition residual mixing
  • NTK-aware RoPE scaling
  • Lower learning rates with higher Muon momentum and extended warmup
  • Optional mixed-bit lowbit export for deeper models