PR #63

RECORD · closed

Record: 10L Int6 QAT + Zstd MLP2.6x Muon0.99 Sliding Window (val_bpb 1.1598)

by yahya010
val_bpb: 1.1598
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.56 MB

Training Techniques

Architecture
Transformer depth
Increased model depth from 9 to 10 transformer layers.
parameters: {"layers":10}
MLP 2.625x
Widened the MLP hidden size to 1344 (2.625x the model dimension), paid for by quantization and compression savings.
parameters: {"hidden_size":1344,"multiplier":2.625}
tied embeddings
Kept the tied input/output embeddings in FP16 and passed them through unquantized.
parameters: null
Quantization
STE QAT
bits: 6
scope: all 2D block weights
fp16
bits: 16
scope: tied embeddings passthrough
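A minimal sketch of the 6-bit symmetric fake quantization used in STE QAT, in numpy. The per-tensor scale and the exact clip range are assumptions; in an autograd framework the straight-through estimator would be applied as `w + stop_gradient(wq - w)` so gradients flow past the rounding.

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Symmetric per-tensor fake quantization onto the int6 grid.

    Forward: round w onto 64 levels in [-32, 31] and dequantize.
    During QAT the straight-through estimator copies the gradient of
    the dequantized weights directly onto w (identity backward).
    """
    qmax = 2 ** (bits - 1) - 1                  # 31 for int6
    scale = np.abs(w).max() / qmax              # per-tensor scale (assumed)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.astype(np.int8), scale
```

Because evaluation then runs on exactly the rounded weights seen during training, the rounding error stays within half a quantization step.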
Compression
zstd
level: 22
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_from":0.92,"warmup_steps":1500}
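The momentum warmup above can be sketched as a small schedule function. Linear interpolation is an assumption; the record only specifies the endpoints (0.92 to 0.99) and the 1500-step window.

```python
def muon_momentum(step, warmup_from=0.92, target=0.99, warmup_steps=1500):
    """Warm Muon's momentum from 0.92 to 0.99 over the first 1500
    steps, then hold it at 0.99 (linear ramp is an assumption)."""
    frac = min(step / warmup_steps, 1.0)
    return warmup_from + frac * (target - warmup_from)
```

Starting at a lower momentum keeps early updates less correlated while gradients are still noisy, then leans harder on the running direction once training stabilizes.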
LR Schedule
warmdown
parameters: {"warmdown_steps":3600}
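A sketch of the warmdown schedule: constant learning rate, then a decay over the final 3600 steps. The linear-to-zero shape and unit base LR are assumptions; only `warmdown_steps: 3600` comes from the record.

```python
def lr_warmdown(step, total_steps, warmdown_steps=3600, base_lr=1.0):
    """Hold base_lr, then decay linearly to zero over the last
    warmdown_steps (shape and base_lr are assumptions)."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```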
Regularization
gradient clipping
parameters: {"max_norm":0.3}
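The clipping rule at `max_norm: 0.3` is presumably the standard clip-by-global-norm; a numpy sketch:

```python
import numpy as np

def clip_global_norm(grads, max_norm=0.3):
    """Rescale all gradients by max_norm / total_norm when the global
    L2 norm over every tensor exceeds max_norm."""
    total = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    if total <= max_norm:
        return grads, total
    scale = max_norm / total
    return [g * scale for g in grads], total
```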
Evaluation
sliding window eval
parameters: {"stride":64}
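A sketch of the sliding-window evaluation span logic, assuming the common scheme where each 2048-token window advances by the stride and only the newest tokens are scored, so every token after the first window sees near-full left context. The exact windowing details are an assumption; only `stride: 64` is from the record.

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """Return (start, end, score_from) spans: loss is computed only on
    tokens in [score_from, end), so each token past the first window
    is scored with at least window - stride tokens of context."""
    spans = [(0, min(window, n_tokens), 0)]
    end = spans[0][1]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        start = max(0, new_end - window)
        spans.append((start, new_end, end))   # score only the new tokens
        end = new_end
    return spans
```

This trades eval compute (window/stride forward passes per token, roughly 32x here) for a lower, context-faithful val_bpb.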

Novel Contributions

  • STE int6 QAT that trains against the quantized weights, removing the usual post-training quantization gap
  • Full int6 quantization of block weights with zstd-22 compression
  • Wider MLP hidden size enabled by compression savings
  • 10-layer Transformer variant
  • Muon momentum tuning with warmup from 0.92 to 0.99
  • Sliding window evaluation with stride 64
  • FP16 tied embedding passthrough