PR #166

open

Record: Long Context + All Optimizations submission

by chinesepowered
val_bpb
1.1550
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Evaluation
sliding window eval
parameters: {"stride":64,"eval_seq_len":1024}
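The sliding-window eval advances a 1024-token window by 64 tokens at a time, scoring only the tokens not already covered so most predictions get near-full left context. A minimal index-bookkeeping sketch (the function name and return shape are illustrative, not from the submission):

```python
def sliding_window_spans(n_tokens, eval_seq_len=1024, stride=64):
    """Index spans for sliding-window evaluation: each window covers up
    to eval_seq_len tokens, windows advance by `stride`, and loss is
    scored only on tokens not covered by a previous window, so most
    tokens are predicted with near-full left context."""
    spans, start, covered = [], 0, 0
    while covered < n_tokens:
        end = min(start + eval_seq_len, n_tokens)
        spans.append((start, end, covered))  # (ctx_start, ctx_end, loss_start)
        covered = end
        start += stride
    return spans
```

Each span would be run through the model once, with loss accumulated only over `[loss_start, ctx_end)`.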
Quantization
fp16
bits: 16
scope: tied embeddings
Architecture
tied embeddings
Exports the tied embedding in FP16 so int8 quantization error does not compound through the shared input and output paths.
parameters: null
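The FP16 carve-out for the tied embedding can be sketched as an export pass that int8-quantizes everything except the shared matrix (tensor names and the symmetric-quantization scheme here are assumptions):

```python
import numpy as np

def export_weights(state, tied_key="embed.weight"):
    """Export sketch: every tensor goes to symmetric int8 except the
    tied embedding, which stays FP16. Because one matrix serves both
    the input lookup and the output logits, int8 error would otherwise
    be applied on both paths."""
    out = {}
    for name, w in state.items():
        if name == tied_key:
            out[name] = w.astype(np.float16)
        else:
            scale = max(float(np.abs(w).max()), 1e-8) / 127.0
            q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
            out[name] = (q, scale)  # int8 values plus dequant scale
    return out
```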
Transformer depth
Increases model depth to 10 transformer layers.
parameters: {"layers":10}
Optimizer
Muon
weight_decay: 0.02
momentum: 0.99
other_params: {"extended_momentum_warmup":{"start":0.92,"steps":1500}}
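The extended momentum warmup ramps Muon's momentum from 0.92 to its final 0.99 over 1500 steps. A sketch of the schedule (the linear ramp is an assumption; the submission only specifies the endpoints and the horizon):

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Extended momentum warmup: ramp Muon's momentum from `start` to
    `final` over `warmup_steps` optimizer steps, then hold."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (final - start)
```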
Initialization
spectral init
Overtone spectral embedding initialization with power-law spectrum shaping.
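One way to realize power-law spectrum shaping is to draw a Gaussian matrix, take its SVD, and overwrite the singular values with a power-law decay before recomposing. A hypothetical sketch (the exponent `alpha` and the normalization are my assumptions, not details from the submission):

```python
import numpy as np

def spectral_embed_init(vocab_size, dim, alpha=0.5, seed=0):
    """Power-law spectral init sketch: SVD a Gaussian matrix and
    replace its singular values with s_k ~ (k + 1) ** (-alpha)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((vocab_size, dim))
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    s = (np.arange(dim) + 1.0) ** (-alpha)
    s *= np.sqrt(dim) / np.linalg.norm(s)  # keep the overall scale O(1)
    return (u * s) @ vt
```

Because `u` and `vt` stay orthonormal, the recomposed matrix has exactly the shaped spectrum.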
resid mix
Phase-transition residual mixing with sigmoid-scheduled initialization.
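A sigmoid-scheduled initialization for residual mixing might assign each layer a coefficient from a sigmoid over normalized depth, so early layers start near the pure residual path and later layers mix in more of the block output. A hypothetical sketch (`midpoint` and `temperature` are assumed knobs):

```python
import math

def resid_mix_init(n_layers=10, midpoint=0.5, temperature=0.1):
    """Sigmoid-scheduled residual-mix init sketch: coefficients follow
    a sigmoid over normalized depth, giving a sharp 'phase transition'
    from identity-dominated to block-dominated layers."""
    coeffs = []
    for layer in range(n_layers):
        x = (layer / (n_layers - 1) - midpoint) / temperature
        coeffs.append(1.0 / (1.0 + math.exp(-x)))
    return coeffs
```

A forward pass would then use something like `h = (1 - c) * h + c * block(h)` per layer, with `c` learnable after this initialization.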
Sequence Length
sequence_length
train_length: 2048
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":3600,"momentum_warmup_steps":1500,"momentum_warmup_start":0.92}
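The warmdown schedule holds the base learning rate and then decays linearly to zero over the final 3600 iterations. A sketch of the LR multiplier (`total_iters` is an assumed knob; the momentum warmup listed alongside lives in the optimizer, not here):

```python
def lr_scale(step, total_iters, warmdown_iters=3600):
    """Warmdown schedule: hold the base learning rate, then decay
    linearly to zero over the final `warmdown_iters` steps."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)
```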
Regularization
weight decay
parameters: {"value":0.02}
Other
other
Uses a smaller training batch of 393K tokens to fit more optimizer steps into the fixed wallclock budget.
parameters: {"train_batch_tokens":393000}
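The trade-off is simple arithmetic: at a fixed token budget, a smaller per-step batch yields proportionally more optimizer steps. An illustrative sketch (`total_tokens` is an assumed budget, not a figure from the submission):

```python
def optimizer_steps(total_tokens, batch_tokens=393_000):
    """Illustrative arithmetic: steps available under a fixed token
    budget; halving the batch doubles the step count."""
    return total_tokens // batch_tokens
```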

Novel Contributions

  • Combines long-context training with the strongest prior SOTA evaluation and quantization tricks.
  • Uses 2048-token training sequences instead of 1024 to improve pre-quantization quality.
  • Applies conservative learning rates and higher Muon momentum to reduce quantization gap.
  • Uses FP16 tied embedding export to avoid int8 quantization error compounding.
  • Keeps sliding-window evaluation while adding training optimizations from the Seq4096 submission.
  • Increases model depth to 10 layers while staying within budget.