val_bpb: 1.4222
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,576,677 bytes
Training Techniques

- Quantization: mixed int5/int6 with QAT
  - bits: mixed (5 and 6; see scope)
  - scope: int5 for MLP weights, int6 for attention/bigram-sensitive weights
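A minimal sketch of the fake-quantization step behind a QAT setup like this (the function name and the symmetric per-tensor rounding scheme are our assumptions, not the submission's actual code): weights are rounded to a 5- or 6-bit grid in the forward pass so the training loss sees the quantization error.

```python
def fake_quantize(w, bits):
    """Symmetric per-tensor fake quantization: snap weights to a signed
    `bits`-bit integer grid, then map back to floats so the quantization
    error is visible to the loss during training (QAT)."""
    qmax = 2 ** (bits - 1) - 1              # 15 for int5, 31 for int6
    scale = max(abs(x) for x in w) / qmax or 1.0
    return [round(x / scale) * scale for x in w]

# int5 for MLP weights, int6 for attention/bigram-sensitive weights
mlp_w  = fake_quantize([0.31, -0.07, 0.88], bits=5)
attn_w = fake_quantize([0.31, -0.07, 0.88], bits=6)
```

Per the contributions list below, QAT is only enabled for the final fraction of training (QAT_FINAL_FRAC=0.15), so this rounding would be a no-op for the first 85% of steps.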
- Architecture: BigramHash embedding added to the model
  - parameters: {"BIGRAM_VOCAB_SIZE":10240,"BIGRAM_DIM":128}
- Architecture: 3x MLP expansion
  - parameters: {"MLP_MULT":3,"NUM_LAYERS":10}
- Weight Averaging: EMA
  - parameters: {"decay":0.9999}
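The EMA mechanics are standard and can be sketched in a few lines (the class name is ours; per the contributions list below, the averaged copy is what gets exported, not the raw training weights):

```python
class EMA:
    """Exponential moving average of model weights. With decay=0.9999 the
    shadow copy moves only 0.01% toward the live weights per update, so it
    averages over roughly the last ~10k steps."""
    def __init__(self, weights, decay=0.9999):
        self.decay = decay
        self.shadow = list(weights)

    def update(self, weights):
        d = self.decay
        self.shadow = [d * s + (1 - d) * w
                       for s, w in zip(self.shadow, weights)]

ema = EMA([0.0, 1.0], decay=0.9999)
ema.update([1.0, 1.0])
```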
- Optimizer: Muon
  - weight_decay: 0.04
  - momentum: 0.99
  - other_params: {"MATRIX_LR":0.02,"SCALAR_LR":0.04,"TIED_EMBED_LR":0.04}
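Muon applies SGD-momentum to each 2D weight matrix and then approximately orthogonalizes the update via a quintic Newton-Schulz iteration. A rough pure-Python sketch follows; the (a, b, c) coefficients are those of the public Muon reference implementation, while the helper names and the exact decoupled-weight-decay placement are our assumptions about this run.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(x * y for x, y in zip(row, col)) for col in Bt] for row in A]

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1)
    with the quintic iteration X <- aX + (b*A + c*A^2) X, A = X X^T."""
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = sum(x * x for row in G for x in row) ** 0.5 + 1e-7
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, transpose(X))
        AA = matmul(A, A)
        B = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(A, AA)]
        X = [[a * p + q for p, q in zip(r1, r2)]
             for r1, r2 in zip(X, matmul(B, X))]
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon update on a 2D weight matrix: momentum accumulation,
    orthogonalization, then the step with decoupled weight decay.
    Defaults mirror MATRIX_LR, momentum, and weight_decay above."""
    for i in range(len(buf)):
        for j in range(len(buf[0])):
            buf[i][j] = momentum * buf[i][j] + grad[i][j]
    O = newton_schulz_orth(buf)
    for i in range(len(W)):
        for j in range(len(W[0])):
            W[i][j] = W[i][j] * (1 - lr * weight_decay) - lr * O[i][j]
```

The separate SCALAR_LR and TIED_EMBED_LR groups above suggest non-matrix parameters are handled by a different rule (typically a plain momentum/Adam-style update), which this sketch does not cover.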
- Regularization: weight decay
  - parameters: {"weight_decay":0.04}
- Sequence Length
  - train_length: 2048
  - eval_length: null
- LR Schedule: warmdown
  - parameters: {"warmdown_iters":160}
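The shape of the warmdown schedule is not spelled out in this card; a common reading (assumed, not confirmed here) is a constant learning rate followed by a linear decay to zero over the final `warmdown_iters` steps:

```python
def lr_scale(step, total_steps, warmdown_iters=160):
    """Multiplier on the base LR: 1.0 for most of training, then a
    linear 'warmdown' to 0 over the last `warmdown_iters` steps."""
    steps_left = total_steps - step
    if steps_left >= warmdown_iters:
        return 1.0
    return steps_left / warmdown_iters

# e.g. effective matrix LR at a given step: MATRIX_LR * lr_scale(step, total)
```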
Novel Contributions
- Mixed quantization using int5 for MLP weights and int6 for attention/bigram-sensitive weights
- Use of EMA (Exponential Moving Average) for export-time weights with a high decay (0.9999)
- Final-fraction QAT (Quantization Aware Training) with QAT_FINAL_FRAC=0.15
- Incorporation of BigramHash embedding with large vocab size and dimension
- 3x MLP expansion in a 10-layer Transformer model
- Use of the Muon optimizer with per-group learning rates (matrix 0.02; scalar and tied-embedding 0.04) and momentum tuned to 0.99
- Compression of final artifact under 16MB using int8+zlib
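The int8+zlib packing in the last bullet could look like the following sketch (symmetric per-tensor int8 with a stored float32 scale is our assumption about the format; the byte layout is illustrative):

```python
import struct
import zlib

def pack_int8_zlib(weights):
    """Quantize floats to symmetric int8 and zlib-compress the bytes.
    The scale is stored up front so the artifact can be decoded."""
    qmax = 127
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = bytes(round(w / scale) & 0xFF for w in weights)   # two's complement
    return struct.pack("<f", scale) + zlib.compress(q, level=9)

def unpack_int8_zlib(blob, n):
    scale = struct.unpack("<f", blob[:4])[0]
    raw = zlib.decompress(blob[4:])
    # reinterpret unsigned bytes as signed int8
    q = [b - 256 if b > 127 else b for b in raw[:n]]
    return [x * scale for x in q]
```

zlib on top of int8 exploits the skewed distribution of quantized weight values; combined with the int5/int6 QAT above, that is presumably how the artifact lands at 15,576,677 bytes, under the 16MB budget.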