val_bpb: 1.2154
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.91MB
Training Techniques
Quantization
int8
bits: 8
scope: all
Architecture
tied embeddings
Keeps tok_emb.weight tied and stores it in fp16 during int8 export to reduce quantization damage.
parameters: null
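A minimal sketch of what this export could look like, assuming symmetric per-tensor int8 quantization and NumPy arrays. The function and key names (`quantize_int8`, `export_checkpoint`, `"tok_emb.weight"` as the tied-embedding key) are illustrative, not the project's actual API:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def export_checkpoint(params, tied_embedding_key="tok_emb.weight"):
    """Quantize every weight to int8 except the tied embedding.

    The embedding doubles as the output head, so quantization error
    there hurts twice; it is kept in fp16 instead (per the record above).
    """
    out = {}
    for name, w in params.items():
        if name == tied_embedding_key:
            out[name] = w.astype(np.float16)
        else:
            out[name] = quantize_int8(w)
    return out
```

Keeping only the embedding in fp16 costs little: at 15.91MB total artifact size, one fp16 tensor adds far less than quantizing it would save.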
RoPE
Uses NTK-RoPE extrapolation; the best-performing evaluation length (1408) is shorter than the maximum context.
parameters: null
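A sketch of NTK-aware RoPE frequency scaling under the usual base-rescaling formulation. The training length is `null` in the record, so the `train_len` value in any call is an assumption; only the eval length of 1408 comes from the record:

```python
import numpy as np

def ntk_rope_freqs(head_dim, train_len, eval_len, base=10000.0):
    """NTK-aware RoPE: stretch the rotary base so eval-length positions
    map into the frequency range seen during training."""
    scale = max(eval_len / train_len, 1.0)
    # Standard NTK exponent: base' = base * scale^(d / (d - 2))
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / ntk_base ** (np.arange(0, head_dim, 2) / head_dim)
    t = np.arange(eval_len)
    return np.outer(t, inv_freq)  # rotation angles, shape (eval_len, head_dim // 2)
```

The record's finding that eval length 1408 beats the maximum context is consistent with NTK extrapolation degrading gradually as `scale` grows.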
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"backend_steps":5}
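Muon's core step orthogonalizes each 2D gradient via Newton-Schulz iteration; a plausible reading of `backend_steps: 5` is the number of those iterations, though the record does not say so explicitly. A NumPy sketch of the standard quintic iteration:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G (the heart of a Muon update).

    `steps` is assumed to correspond to backend_steps in the record;
    coefficients are the commonly used quintic-iteration constants.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values < 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # push singular values toward 1
    return X.T if transposed else X
```

Fewer iterations give a rougher orthogonalization; the record's observation that 5 beats 7 under aggressive warmdown suggests the extra smoothing is not always worth it late in training.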
Evaluation
long context eval
parameters: {"context_length":1408}
Sequence Length
sequence_length
train_length: null
eval_length: 1408
LR Schedule
warmdown
parameters: {"warmdown_iters":20000}
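The record lists a warmdown schedule with `warmdown_iters: 20000` and the contributions below call it "always-decaying"; one plausible reading is a schedule with a mild decay in the main phase instead of a flat plateau, followed by a linear warmdown to zero. A sketch under that assumption (`base_decay` and `total_iters` are hypothetical, not from the record):

```python
def lr_scale(step, total_iters, warmdown_iters=20000, base_decay=0.1):
    """Always-decaying schedule sketch: gentle linear decay during the
    main phase, then a steeper linear warmdown to zero over the final
    `warmdown_iters` steps. Returns a multiplier on the base LR."""
    main_end = total_iters - warmdown_iters
    if step < main_end:
        return 1.0 - base_decay * step / main_end
    # Warmdown phase: continues from (1 - base_decay) down to 0.
    return (1.0 - base_decay) * (total_iters - step) / warmdown_iters
```

The two segments meet continuously at `main_end`, so the warmdown starts from wherever the main-phase decay left off.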
Other
other
Uses FP16 tied embeddings during int8 export and reduces MLP hidden size to 992 to offset the added memory cost.
parameters: {"mlp_hidden":992}
Novel Contributions
- Always-decaying LR schedule with WARMDOWN_ITERS=20000 to reduce post-training quantization penalty.
- Keeping tied embeddings in fp16 during int8 export to preserve accuracy.
- Using NTK-RoPE extrapolation at eval length 1408 as the best setting for well-trained models.
- Finding an optimizer-warmdown interaction where MUON_BACKEND_STEPS=5 outperforms 7 under aggressive warmdown.