PR #128

open

Record: Int6 MLP3x + STE QAT + Sliding Window (val_bpb=1.1594)

by rsavitt
val_bpb: 1.1594
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,162,777 bytes

Training Techniques

Quantization
int6
bits: 6
scope: MLP and attention weights; tied embeddings kept fp16
STE QAT
bits: 6
scope: weights
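The PR does not include the QAT code itself; below is a minimal numpy sketch of per-row symmetric int6 fake quantization, i.e. the forward pass of STE QAT. During training, the straight-through estimator treats the round/clip as identity in the backward pass so gradients flow to the fp weights. The function name and the symmetric [-31, 31] grid are assumptions.

```python
import numpy as np

def fake_quant_int6(w, eps=1e-8):
    """Per-row symmetric fake int6 quantization (STE QAT forward pass).

    Uses the symmetric int6 range [-31, 31] (assumption; -32 is unused to
    keep the grid symmetric). In training, the backward pass would pass
    gradients straight through the round/clip (straight-through estimator).
    """
    qmax = 31  # 2**(6-1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax + eps  # per-row scale
    q = np.clip(np.round(w / scale), -qmax, qmax)              # snap to int6 grid
    return q * scale                                           # dequantize

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
w_q = fake_quant_int6(w)  # same shape, values on a 63-level per-row grid
```

Per-row scaling (rather than per-tensor) keeps outlier rows from crushing the resolution of the rest, which matters at 6 bits.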
Architecture
MLP3x
Expanded MLP hidden size to 3x baseline using int6 savings
parameters: {"mlp_mult":3,"hidden":1536}
tied embeddings
Kept the tied token embedding/output head as an fp16 passthrough (excluded from quantization) to avoid a quantization penalty on the output head
parameters: {"tie_embeddings":1}
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
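The hyperparameters above imply a Muon momentum warmup from 0.92 to 0.99 over 1,500 steps; a small sketch of that schedule follows. Linear interpolation is an assumption — the PR only lists the endpoints and the warmup length.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Warm Muon's momentum from `start` to `end` over `warmup_steps`.

    Linear interpolation is an assumption; the PR specifies only
    muon_momentum_warmup_start=0.92 and muon_momentum_warmup_steps=1500,
    with momentum: 0.99 as the final value.
    """
    frac = min(step / warmup_steps, 1.0)  # clamp after warmup completes
    return start + frac * (end - start)
```

Starting with lower momentum while the loss surface is changing fastest, then raising it, is a common stabilization trick for momentum-heavy optimizers.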
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":4096}
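A sketch of how stride-64 sliding-window scoring could partition a token stream: each window spans up to 4,096 tokens, but only the tokens after the previous window's end are scored, so every scored token (after the first window) sees nearly full context. This is the standard stride-based evaluation pattern; the exact implementation in the PR may differ.

```python
def sliding_windows(n_tokens, context=4096, stride=64):
    """Return (begin, end, score_from) spans for sliding-window eval.

    Tokens in [score_from, end) are scored; tokens in [begin, score_from)
    are context only. With stride << context, each scored token sees
    roughly `context - stride` tokens of history.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

The cost is one forward pass per 64 scored tokens instead of per 4,096, trading roughly 64x more compute for near-full-context predictions.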
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
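A sketch of the warmdown schedule implied by warmdown_iters=3000: hold the base LR, then decay over the final 3,000 iterations. The linear shape and decay-to-zero endpoint are assumptions, as is the total iteration count in the example.

```python
def warmdown_lr(step, total_iters, base_lr, warmdown_iters=3000):
    """Constant LR, then a linear 'warmdown' to zero over the final iters.

    The linear shape and zero endpoint are assumptions; the PR only
    specifies warmdown_iters=3000.
    """
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```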
Other
other
Tuned training dynamics under a wallclock-limited budget with a large per-batch token count
parameters: {"train_batch_tokens":393216,"max_wallclock_seconds":600}

Novel Contributions

  • Int6 per-row quantization plus zstd-22 compression to fit a wider model within the 16MB budget
  • 3x MLP expansion enabled by quantization savings
  • STE fake int6 quantization-aware training to improve post-quantization robustness
  • fp16 tied embedding passthrough to preserve output head quality
  • Sliding window evaluation with stride 64 for near-full-context scoring
  • Co-optimized training dynamics including Muon momentum tuning and warmdown schedule
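To make the 16MB budget concrete: int6 values pack 4-to-3 into bytes before the zstd-22 pass. The on-disk layout below is hypothetical — the PR does not specify its packing format — but it shows the 0.75 bytes/weight arithmetic that funds the 3x MLP expansion.

```python
def pack_int6(q):
    """Pack int6 values (each in [-32, 31]) into bytes, 4 values per 3 bytes.

    Hypothetical layout (the PR does not document its on-disk format):
    values are biased to unsigned [0, 63] and bit-packed big-endian.
    """
    assert len(q) % 4 == 0
    out = bytearray()
    for i in range(0, len(q), 4):
        a, b, c, d = ((v + 32) & 0x3F for v in q[i:i + 4])  # bias to unsigned
        out.append((a << 2) | (b >> 4))
        out.append(((b & 0xF) << 4) | (c >> 2))
        out.append(((c & 0x3) << 6) | d)
    return bytes(out)

def unpack_int6(buf):
    """Inverse of pack_int6: recover signed int6 values from packed bytes."""
    vals = []
    for i in range(0, len(buf), 3):
        x, y, z = buf[i], buf[i + 1], buf[i + 2]
        vals += [x >> 2, ((x & 0x3) << 4) | (y >> 4),
                 ((y & 0xF) << 2) | (z >> 6), z & 0x3F]
    return [v - 32 for v in vals]
```

At 6 bits per weight plus per-row fp scales, the pre-compression payload is already ~2.7x smaller than fp16, and zstd-22 squeezes out remaining redundancy in the packed stream.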