PR #147

open

Record/smaller batch sota, val_bpb 1.16314679 (post-quant, int6+zlib, sliding eval)

by ankitmaloo
val_bpb: 1.1631
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,934,552 bytes

Training Techniques

Quantization
int6
bits: 6
scope: all
int8
bits: 8
scope: all
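The int6 entry above can be sketched as symmetric quantization to the signed 6-bit range [-31, 31]. This is an illustrative reconstruction, not the PR's actual code; the function name and per-tensor scale are assumptions.

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor quantization to the signed 6-bit range [-31, 31].

    Sketch only; the PR applies quantization with scope "all", i.e. to
    every weight tensor. Values are stored in int8 containers.
    """
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale) -> np.ndarray:
    """Map quantized integers back to approximate float weights."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -0.2, 0.3, -0.5], dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

With rounding to the nearest level, the reconstruction error per weight is bounded by half the scale step.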
Compression
  • zlib (level: null)
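The artifact size comes from zlib-compressing the serialized quantized weights; with level: null above, the library default level is presumably used. A minimal sketch, with hypothetical int6 values stored one per byte:

```python
import zlib
import numpy as np

# Hypothetical quantized weights: int6 values held in int8 containers.
q = np.random.default_rng(0).integers(-31, 32, size=10_000, dtype=np.int8)

raw = q.tobytes()
packed = zlib.compress(raw)   # level omitted -> zlib's default, matching "level: null"

restored = np.frombuffer(zlib.decompress(packed), dtype=np.int8)
```

Because int6 values only span 63 of the 256 possible byte values, even the default entropy coder shrinks the byte stream losslessly.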
Evaluation
  • sliding window eval (parameters: null)
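Sliding window evaluation typically means scoring a long sequence with overlapping fixed-size windows so every token is evaluated with substantial left context, counting each token only once. The PR does not state its parameters, so this sketch, including the hypothetical `nll_fn` model call, is an assumption:

```python
import math

def sliding_window_nlls(tokens, window, stride, nll_fn):
    """Score every token by sliding a fixed-size window with overlap.

    Only tokens not already scored by the previous window contribute.
    `nll_fn(ctx)` is a hypothetical model call returning per-token
    negative log-likelihoods (in nats) for the context `ctx`.
    """
    nlls = []
    prev_end = 0
    for start in range(0, len(tokens), stride):
        end = min(start + window, len(tokens))
        per_tok = nll_fn(tokens[start:end])
        nlls.extend(per_tok[prev_end - start:])  # keep only newly scored tokens
        prev_end = end
        if end == len(tokens):
            break
    return nlls

def bits_per_byte(nlls, n_bytes):
    """Convert total nats over the sequence into bits per byte (bpb)."""
    return sum(nlls) / (math.log(2) * n_bytes)
```

Dividing by the byte count rather than the token count is what makes bpb comparable across tokenizers.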
Architecture
  • tied embeddings: input and output embeddings are tied (parameters: null)
  • KV head count: uses fewer KV heads than attention heads (parameters: {"num_heads": 8, "num_kv_heads": 4})
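With num_heads=8 and num_kv_heads=4, each key/value head serves two query heads (grouped-query attention). One common way to realize this, sketched here as an assumption about the mechanism rather than the PR's code, is to repeat each KV head across its query group:

```python
import numpy as np

num_heads, num_kv_heads = 8, 4           # from the parameters above
group = num_heads // num_kv_heads        # query heads per KV head -> 2

# Hypothetical key tensor of shape (kv_heads, seq_len, head_dim).
k = np.random.default_rng(0).normal(size=(num_kv_heads, 16, 32))

# Each KV head is reused by `group` consecutive query heads.
k_expanded = np.repeat(k, group, axis=0)
```

The KV cache stays half the size of full multi-head attention while the query side keeps all 8 heads.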
Optimizer
  • Muon (weight_decay: null, momentum: 0.99, other_params: {"muon_momentum_warmup_start": 0.92, "muon_momentum_warmup_steps": 1500})
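The momentum warmup parameters above suggest ramping Muon's momentum from 0.92 to its final 0.99 over the first 1,500 steps. Linear interpolation is an assumption; the PR records only the endpoints and step count:

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Warm momentum from `start` to `final` over `warmup_steps` steps.

    The linear shape is an assumption; the listed parameters give only
    the start value, the final value (0.99), and the warmup length.
    """
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + frac * (final - start)
```

Starting with lower momentum keeps early, noisy updates from being amplified before the loss surface statistics stabilize.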
LR Schedule
  • warmdown (parameters: {"warmdown_steps": 3000})
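A warmdown schedule holds the learning rate constant and then decays it over the final steps of training. The linear shape and the base LR here are illustrative assumptions; the PR specifies only warmdown_steps=3000:

```python
def lr_at(step, total_steps, base_lr=0.02, warmdown_steps=3000):
    """Constant LR, then a linear "warmdown" to zero over the final
    `warmdown_steps` steps. `base_lr` and the linear decay shape are
    illustrative; only warmdown_steps comes from the PR.
    """
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```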
Sequence Length
  • train_length: 4096, eval_length: 4096
Other
  • Tighter int8 clipping to retain more of the weight-distribution tail (parameters: {"int8_clip_percentile": 99.99995})
  • Higher-precision per-row scales to reduce scale quantization error (parameters: {"int8_per_row_scale_dtype": "float32"})
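The two int8 tweaks above can be combined in one sketch: clip each row at the 99.99995th percentile of |w| instead of the absolute max, and keep the resulting per-row scales in float32 rather than a lower-precision dtype. This is a reconstruction of the described technique, not the PR's exact implementation:

```python
import numpy as np

def quantize_int8_rows(w, clip_percentile=99.99995):
    """Per-row symmetric int8 quantization with the two tweaks above:

    (a) the clipping threshold is a high percentile of |w| rather than the
        row max, so a single outlier weight cannot inflate the scale, while
        99.99995% of the tail is still represented;
    (b) the per-row scales are stored in float32 to reduce scale
        quantization error.
    """
    clip = np.percentile(np.abs(w), clip_percentile, axis=1, keepdims=True)
    scale = (clip / 127.0).astype(np.float32)   # higher-precision per-row scales
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 1024)).astype(np.float32)
q, scale = quantize_int8_rows(w)
```

A looser percentile would clip more of the tail; 99.99995 trades essentially no clipped weights for a slightly finer quantization grid.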

Novel Contributions

  • Tighter int8 clipping percentile to preserve more tail weights during quantization
  • Higher-precision per-row int8 scales using float32
  • Strong Muon optimizer tuning with momentum warmup and extended warmdown
  • Sliding window evaluation
  • Smaller batch training setup on seq4096 trunk