PR #166

open

Record: Long Context + All Optimizations submission

by chinesepowered
val_bpb
1.1550
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Evaluation
sliding window eval
parameters: {"stride":64,"eval_seq_len":1024}
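The sliding-window eval advances a 1024-token window by 64 tokens at a time, scoring only the tokens not already covered so most predictions get near-full left context. A minimal index-bookkeeping sketch (the function name and return shape are illustrative, not from the submission):

```python
def sliding_window_spans(n_tokens, eval_seq_len=1024, stride=64):
    """Index spans for sliding-window evaluation: each window covers up
    to eval_seq_len tokens, windows advance by `stride`, and loss is
    scored only on tokens not covered by a previous window, so most
    tokens are predicted with near-full left context."""
    spans, start, covered = [], 0, 0
    while covered < n_tokens:
        end = min(start + eval_seq_len, n_tokens)
        spans.append((start, end, covered))  # (ctx_start, ctx_end, loss_start)
        covered = end
        start += stride
    return spans
```

Each span would be run through the model once, with loss accumulated only over `[loss_start, ctx_end)`.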
Quantization
fp16
bits: 16
scope: tied embeddings
Architecture
tied embeddings
Exports the tied embedding in FP16 so int8 quantization error does not compound through the shared input and output paths.
parameters: null
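The FP16 carve-out for the tied embedding can be sketched as an export pass that int8-quantizes everything except the shared matrix (tensor names and the symmetric-quantization scheme here are assumptions):

```python
import numpy as np

def export_weights(state, tied_key="embed.weight"):
    """Export sketch: every tensor goes to symmetric int8 except the
    tied embedding, which stays FP16. Because one matrix serves both
    the input lookup and the output logits, int8 error would otherwise
    be applied on both paths."""
    out = {}
    for name, w in state.items():
        if name == tied_key:
            out[name] = w.astype(np.float16)
        else:
            scale = max(float(np.abs(w).max()), 1e-8) / 127.0
            q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
            out[name] = (q, scale)  # int8 values plus dequant scale
    return out
```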
Transformer depth
Increases model depth to 10 transformer layers.
parameters: {"layers":10}
Optimizer
Muon
weight_decay: 0.02
momentum: 0.99
other_params: {"extended_momentum_warmup":{"start":0.92,"steps":1500}}
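The extended momentum warmup ramps Muon's momentum from 0.92 to its final 0.99 over 1500 steps. A sketch of the schedule (the linear ramp is an assumption; the submission only specifies the endpoints and the horizon):

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Extended momentum warmup: ramp Muon's momentum from `start` to
    `final` over `warmup_steps` optimizer steps, then hold."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (final - start)
```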
Initialization
spectral init
Overtone spectral embedding initialization with power-law spectrum shaping.
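One way to realize power-law spectrum shaping is to draw a Gaussian matrix, take its SVD, and overwrite the singular values with a power-law decay before recomposing. A hypothetical sketch (the exponent `alpha` and the normalization are my assumptions, not details from the submission):

```python
import numpy as np

def spectral_embed_init(vocab_size, dim, alpha=0.5, seed=0):
    """Power-law spectral init sketch: SVD a Gaussian matrix and
    replace its singular values with s_k ~ (k + 1) ** (-alpha)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((vocab_size, dim))
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    s = (np.arange(dim) + 1.0) ** (-alpha)
    s *= np.sqrt(dim) / np.linalg.norm(s)  # keep the overall scale O(1)
    return (u * s) @ vt
```

Because `u` and `vt` stay orthonormal, the recomposed matrix has exactly the shaped spectrum.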
resid mix
Phase-transition residual mixing with sigmoid-scheduled initialization.
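A sigmoid-scheduled initialization for residual mixing might assign each layer a coefficient from a sigmoid over normalized depth, so early layers start near the pure residual path and later layers mix in more of the block output. A hypothetical sketch (`midpoint` and `temperature` are assumed knobs):

```python
import math

def resid_mix_init(n_layers=10, midpoint=0.5, temperature=0.1):
    """Sigmoid-scheduled residual-mix init sketch: coefficients follow
    a sigmoid over normalized depth, giving a sharp 'phase transition'
    from identity-dominated to block-dominated layers."""
    coeffs = []
    for layer in range(n_layers):
        x = (layer / (n_layers - 1) - midpoint) / temperature
        coeffs.append(1.0 / (1.0 + math.exp(-x)))
    return coeffs
```

A forward pass would then use something like `h = (1 - c) * h + c * block(h)` per layer, with `c` learnable after this initialization.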
Sequence Length
sequence_length
train_length: 2048
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":3600,"momentum_warmup_steps":1500,"momentum_warmup_start":0.92}
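The warmdown schedule holds the base learning rate and then decays linearly to zero over the final 3600 iterations. A sketch of the LR multiplier (`total_iters` is an assumed knob; the momentum warmup listed alongside lives in the optimizer, not here):

```python
def lr_scale(step, total_iters, warmdown_iters=3600):
    """Warmdown schedule: hold the base learning rate, then decay
    linearly to zero over the final `warmdown_iters` steps."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)
```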
Regularization
weight decay
parameters: {"value":0.02}
Other
other
Uses a smaller training batch of 393K tokens to fit more optimizer steps into the fixed wallclock budget.
parameters: {"train_batch_tokens":393000}
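The trade-off is simple arithmetic: at a fixed token budget, a smaller per-step batch yields proportionally more optimizer steps. An illustrative sketch (`total_tokens` is an assumed budget, not a figure from the submission):

```python
def optimizer_steps(total_tokens, batch_tokens=393_000):
    """Illustrative arithmetic: steps available under a fixed token
    budget; halving the batch doubles the step count."""
    return total_tokens // batch_tokens
```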

Novel Contributions

  • Combines long-context training with the strongest prior SOTA evaluation and quantization tricks.
  • Uses 2048-token training sequences instead of 1024 to improve pre-quantization quality.
  • Applies conservative learning rates and higher Muon momentum to reduce quantization gap.
  • Uses FP16 tied embedding export to avoid int8 quantization error compounding.
  • Keeps sliding-window evaluation while adding training optimizations from the Seq4096 submission.
  • Increases model depth to 10 layers while staying within budget.