PR #547

Status: open

Record: Int5/Int6+Zstd+MLP3x — mean val_bpb=1.1752 (10L, seq4096, sliding window)

by shajalahamedcse
• val_bpb: 1.1752
• Architecture: Transformer
• Optimizer: Muon
• Artifact size: ≤ 16,000,000 B

Training Techniques

• Quantization: int5/int6
  scope: MLP matrices (int5), attention matrices (int6), embeddings (int6)
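The PR does not include its quantizer, so as a sketch of what quantization to these odd bit widths typically looks like: a symmetric per-tensor scheme (one scale, zero-point fixed at 0). The function names and the per-tensor scaling choice here are assumptions, not taken from the submission.

```python
def quantize_symmetric(weights, bits):
    """Quantize a list of floats to a signed integer grid of `bits` width.

    Symmetric per-tensor scheme: one scale, zero-point fixed at 0.
    For int5 the grid is [-15, 15]; for int6 it is [-31, 31].
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max((abs(w) for w in weights), default=0.0) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero tensor; any scale round-trips correctly
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to floats."""
    return [v * scale for v in q]
```

With this scheme the worst-case rounding error per weight is scale/2, which is the precision-vs-size trade the int5/int6 split is balancing.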
• Architecture: MLP3x
  MLP hidden units raised from the baseline 1024 to 1536 (expansion factor 3), enabled by quantization savings
  parameters: {"mlp_hidden_units": 1536, "expansion_factor": 3}
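Back-of-envelope arithmetic for how quantization can fund the wider MLP. The record does not state the model width or the baseline weight precision, so the d_model of 512 and the int8 baseline below are purely illustrative assumptions; only the hidden sizes (1024 → 1536) and bit width (int5) come from the PR.

```python
def matrix_bytes(rows, cols, bits):
    """Raw storage for a rows x cols matrix packed at `bits` bits per weight."""
    return (rows * cols * bits + 7) // 8

# Per-layer MLP (up- and down-projection), hypothetical d_model = 512.
d_model = 512
baseline = matrix_bytes(d_model, 1024, 8) + matrix_bytes(1024, d_model, 8)  # int8, hidden 1024
expanded = matrix_bytes(d_model, 1536, 5) + matrix_bytes(1536, d_model, 5)  # int5, hidden 1536
```

Under these assumptions the 1.5x-wider MLP at int5 is still smaller per layer than the narrower int8 one, which is the sense in which the expansion is "funded" by quantization.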
• Compression: zstd
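A sketch of the storage pipeline the record implies: bit-pack the quantized integers into a contiguous byte stream, then entropy-code the result. The packing scheme below (LSB-first, values biased to be non-negative) is an assumption. zstd is only in the Python stdlib from 3.14, so zlib stands in here just to show the pipeline shape; the PR's point is that zstd compresses these streams better than zlib.

```python
import zlib

def pack_bits(values, bits):
    """Pack small signed ints into a bytes object, `bits` bits per value."""
    offset = 1 << (bits - 1)          # bias so every value is non-negative
    acc, nbits, out = 0, 0, bytearray()
    for v in values:
        acc |= (v + offset) << nbits  # append `bits` bits to the accumulator
        nbits += bits
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)        # flush the final partial byte
    return bytes(out)

# 1000 int5 values -> 625 bytes before entropy coding.
values = [(i % 31) - 15 for i in range(1000)]
packed = pack_bits(values, 5)
compressed = zlib.compress(packed)    # the PR uses zstd at this step instead
```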
• Sequence Length: train 4096, eval 4096
• Evaluation: sliding-window eval (stride 64)
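With stride 64 over 4096-token windows, each token is scored once with up to 4095 tokens of left context instead of being truncated at chunk boundaries. A sketch of the window schedule that implies; the function name and the exact edge handling are assumptions, not from the PR.

```python
def sliding_eval_spans(n_tokens, window, stride):
    """Yield (start, end, score_from) triples for sliding-window eval.

    Each window covers tokens [start, end); only tokens at index
    >= score_from contribute to the loss, so every token is scored
    exactly once with the longest available left context.
    """
    spans, scored_to = [], 0
    for end in range(min(window, n_tokens), n_tokens + 1, stride):
        start = max(0, end - window)
        spans.append((start, end, scored_to))
        scored_to = end
    # cover a ragged tail if the stride doesn't divide the remainder evenly
    if scored_to < n_tokens:
        spans.append((max(0, n_tokens - window), n_tokens, scored_to))
    return spans
```

The trade-off is cost: a small stride like 64 means each token's logits are recomputed roughly window/stride times.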
• Optimizer: Muon (momentum 0.95)
• LR Schedule: warmdown (warmdown_iters 3600)
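"Warmdown" in this lineage of speedrun configs usually means holding the learning rate constant and then decaying it linearly to zero over the final iterations. A minimal sketch under that assumption; the record gives warmdown_iters = 3600 but not the total iteration count, so num_iters below is a placeholder.

```python
def get_lr(it, base_lr, num_iters, warmdown_iters=3600):
    """Constant LR, then linear warmdown to 0 over the final iterations."""
    if it < num_iters - warmdown_iters:
        return base_lr
    frac = (num_iters - it) / warmdown_iters  # 1.0 -> 0.0 across the warmdown
    return base_lr * frac
```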

Novel Contributions

  • Int5 quantization of the MLP weight matrices, saving roughly 1.5 MB
  • Int6 quantization of the attention matrices, balancing precision against size
  • zstd compression in place of zlib, for a better compression ratio on the quantized integer arrays
  • 3x MLP expansion (hidden = 1536), funded by the quantization savings without exceeding the 16 MB artifact limit
  • Training at sequence length 4096, with sliding-window evaluation (stride 64) for full-context scoring