PR #204 (open)

Add record: INT6 10L SWA NorMuon, val_bpb=1.2320

  • val_bpb: 1.2320
  • Architecture: GPT
  • Optimizer: NorMuon
  • Artifact Size: 14.2 MB

Training Techniques

Quantization
  • int6 (bits: 6, scope: all model weights)
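The record only states 6-bit quantization over all weights; a minimal sketch, assuming a symmetric per-tensor scheme (the actual scale granularity and rounding are not given in the PR):

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization to the range [-31, 31].

    Assumption: the PR only says "int6, all model weights"; the
    symmetric per-tensor scale used here is illustrative.
    """
    max_abs = float(np.abs(w).max())
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize_int6(q, scale)
```

With this scheme the worst-case per-weight error is half a quantization step (scale / 2).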
Architecture
  • tied embeddings: uses tied input/output embeddings (parameters: null)
  • KV head count: grouped-query attention with fewer KV heads than attention heads
    (parameters: {"layers":10,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_hidden":1088})
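A minimal sketch of the grouped-query attention score computation with this record's dims (model_dim=512, num_heads=8, num_kv_heads=4); everything beyond those numbers is illustrative:

```python
import numpy as np

# Dims from this record: model_dim=512, num_heads=8, num_kv_heads=4.
model_dim, num_heads, num_kv_heads, seq = 512, 8, 4, 16
head_dim = model_dim // num_heads  # 64

def gqa_scores(q, k):
    """Grouped-query attention: each of the 4 KV heads is shared by
    8 // 4 = 2 query heads, halving the K/V projection and cache size."""
    groups = q.shape[0] // k.shape[0]        # query heads per KV head
    k_rep = np.repeat(k, groups, axis=0)     # (num_heads, seq, head_dim)
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(q.shape[-1])

q = np.random.randn(num_heads, seq, head_dim)
k = np.random.randn(num_kv_heads, seq, head_dim)
scores = gqa_scores(q, k)
```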
Optimizer
  • NorMuon (weight_decay: 0.02, momentum: null, other_params: {"beta2":0.95})
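The contributions list calls out decoupled weight decay (weight_decay: 0.02 here). A minimal sketch of the decoupled form, where the decay is applied directly to the weights rather than folded into the gradient; the `update` argument stands in for whatever direction NorMuon computes, since the optimizer's internals are not specified in this record:

```python
import numpy as np

def step_with_decoupled_wd(param, update, lr, weight_decay=0.02):
    """Decoupled weight decay: shrink the weights directly, independent
    of the gradient-based step. `update` is a placeholder for the
    NorMuon update direction (not specified in this record)."""
    param = param - lr * weight_decay * param  # decay, separate from the step
    return param - lr * update                 # optimizer step

p = np.ones(4)
p_next = step_with_decoupled_wd(p, np.zeros(4), lr=0.1)
```

With a zero update, one step shrinks each weight by lr * weight_decay, i.e. 1.0 → 0.998 here.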
Weight Averaging
  • SWA (parameters: {"snapshots":50,"every_steps":200})
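A minimal sketch of the averaging itself: an equal-weight running mean over 50 snapshots taken every 200 steps, per the parameters above. The snapshot-dict layout is illustrative:

```python
import numpy as np

class SWA:
    """Equal-weight running average of parameter snapshots
    (50 snapshots, every 200 steps, per this record)."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, params):
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.astype(np.float64).copy() for k, v in params.items()}
        else:
            for k, v in params.items():
                self.avg[k] += (v - self.avg[k]) / self.n  # incremental mean

swa = SWA()
for step in range(200, 10001, 200):          # 50 snapshots, every 200 steps
    params = {"w": np.full(3, float(step))}  # stand-in for model weights
    swa.update(params)
```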
Compression
  • zlib (level: 9)
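A sketch of the compression step: the record only names zlib at level 9, so storing the int6 codes one per byte before compressing is an assumption about the artifact layout:

```python
import zlib
import numpy as np

# Quantized int6 codes (one per int8 byte) compressed with zlib level 9.
rng = np.random.default_rng(0)
q = np.clip(np.round(rng.normal(0, 10, (512, 512))), -31, 31).astype(np.int8)

blob = zlib.compress(q.tobytes(), level=9)
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(512, 512)
```

Since the 6-bit codes use only 63 of 256 byte values, zlib's entropy coding recovers part of the 2 wasted bits per byte.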
Evaluation
  • sliding window eval (parameters: {"stride":64,"batch_seqs":32,"context_length":4096})
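A minimal sketch of the window arithmetic for stride-64 evaluation at context 4096 (the parameters above): each window advances by the stride and only the tokens new to that window are scored, so scored tokens see (near-)full context. Batching (batch_seqs: 32) and the model forward pass are left out:

```python
def sliding_windows(n_tokens, context_length=4096, stride=64):
    """(window_start, score_start, score_end) spans for sliding-window
    eval with stride 64 and context 4096, per this record.
    Assumes n_tokens is a multiple of stride."""
    spans = []
    prev_end = 0
    first_end = min(context_length, n_tokens)
    for end in range(first_end, n_tokens + 1, stride):
        start = max(0, end - context_length)
        spans.append((start, prev_end, end))
        prev_end = end
    return spans

spans = sliding_windows(8192)
```

Every token is scored exactly once, but each window after the first re-runs 4096 tokens to score only 64 new ones, which is why stride-based eval is slow but gives better bpb than chunked eval.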
Sequence Length
  • sequence_length (train_length: 2048, eval_length: 4096)
LR Schedule
  • warmdown (parameters: {"warmdown_iters":20000})
Regularization
  • weight decay (parameters: {"value":0.02})
Other
  • Aggressive warmdown from step 0 to encourage tighter weight distributions for quantization
    (parameters: {"warmdown_iters":20000})
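The schedule described above can be sketched as a linear decay starting at step 0; the linear shape and the base LR are assumptions, since the record only specifies warmdown_iters=20000:

```python
def warmdown_lr(step, base_lr, warmdown_iters=20000):
    """Warmdown beginning at step 0, per this record's note: LR decays
    to 0 over warmdown_iters. The linear shape and base_lr are
    illustrative; the record only gives warmdown_iters."""
    frac = min(step, warmdown_iters) / warmdown_iters
    return base_lr * (1.0 - frac)

lrs = [warmdown_lr(s, base_lr=0.01) for s in (0, 10000, 20000)]
```

Decaying from the very first step keeps late-training weight updates small, which is consistent with the stated goal of a tighter weight distribution before int6 quantization.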

Novel Contributions

  • INT6 quantization enabling a larger 10-layer architecture within the 16MB budget
  • Stochastic Weight Averaging with 50 snapshots before quantization
  • NorMuon optimizer with decoupled weight decay
  • Aggressive warmdown schedule starting from step 0
  • Use of NTK RoPE evaluation at 4096 context, though it degraded post-quant performance