PR #272

open

Non-record: 10L mixed int5/int6 export reaches ~10.4MB with strong throughput

by simon-marcus
val_bpb
1.2427
Architecture
Transformer
Optimizer
SGD
Artifact Size
10.4MB

Training Techniques

Quantization
mixed int5/int6/int8
bits: null
scope: MLP matrices int5, attention matrices int6, elsewhere int8
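The PR does not include the export code itself; the following is a minimal sketch of what symmetric per-tensor quantization at the stated bit widths (int5 for MLP matrices, int6 for attention matrices, int8 elsewhere) could look like. Function names and the `BITS` mapping are illustrative, not from the submission.

```python
def quantize(values, bits):
    """Map floats to signed integers in [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid div-by-zero scale
    return [round(v / scale) for v in values], scale

def dequantize(quants, scale):
    """Recover approximate floats from quantized integers."""
    return [q * scale for q in quants]

# Bit widths per the PR's stated scope (hypothetical grouping keys).
BITS = {"mlp": 5, "attn": 6, "other": 8}

w = [0.31, -0.72, 0.05, 0.66]
q, s = quantize(w, BITS["mlp"])   # int5: values land in [-15, 15]
w_hat = dequantize(q, s)          # reconstruction error bounded by scale/2
```

At int5 the per-tensor range is only [-15, 15], which is why the PR confines it to MLP matrices while keeping attention at int6 and everything else at int8.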
Architecture
weight tying
Tied output and input embeddings
parameters: null
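Weight tying means the output projection reuses the input embedding matrix, so only one copy of that matrix lands in the exported artifact. A toy sketch (class and field names are illustrative):

```python
class TiedLM:
    """Toy model where the LM head shares storage with the embedding."""
    def __init__(self, vocab, dim):
        self.embed = [[0.0] * dim for _ in range(vocab)]  # (vocab, dim)
        self.head = self.embed  # same object: logits use embed transposed

model = TiedLM(vocab=4, dim=2)
model.embed[0][0] = 1.0  # one update is visible through both views
```

Because `head` and `embed` are the same tensor, the serialized size drops by one vocab-by-dim matrix, which matters for a size-focused submission like this one.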
Optimizer
SGD
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.08,"scalar_lr":0.04,"embed_lr":0.05}
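The `other_params` field implies plain SGD (no momentum or weight decay reported) with separate learning rates per parameter group. A hedged sketch of such a grouped step, with the rates taken from the PR and everything else illustrative:

```python
# Per-group learning rates from the PR's other_params field.
LRS = {"matrix": 0.08, "scalar": 0.04, "embed": 0.05}

def sgd_step(params, grads, group):
    """Vanilla SGD update p -= lr * g, with lr chosen by parameter group."""
    lr = LRS[group]
    return [p - lr * g for p, g in zip(params, grads)]

updated = sgd_step([1.0], [0.5], "matrix")  # 1.0 - 0.08 * 0.5
```

In frameworks like PyTorch this corresponds to passing multiple parameter groups, each with its own `lr`, to a single optimizer.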
Compression
zlib
level: null
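The compression level is unreported (`level: null`); the idea itself is standard: pack the quantized integers into bytes and zlib-compress the result. A self-contained sketch using the stated int8 case:

```python
import struct
import zlib

q = [6, -15, 1, 14] * 256                      # toy quantized int8 payload
raw = struct.pack(f"{len(q)}b", *q)            # one signed byte per value
blob = zlib.compress(raw)                      # default level, since the PR
                                               # leaves it unspecified
restored = list(struct.unpack(f"{len(q)}b", zlib.decompress(blob)))
```

Sub-byte widths like int5/int6 would additionally need bit-packing before compression; that step is omitted here.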
Test-Time Training
tiny eval-time SGD
parameters: {"targets":["q_gain","attn_scale","mlp_scale","resid_mix","skip_weights"]}
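"Tiny eval-time SGD" here targets only a handful of scalar control parameters (`q_gain`, `attn_scale`, `mlp_scale`, `resid_mix`, `skip_weights`, per the PR) while all weight matrices stay frozen. A sketch of one such adaptation step; the gradients and learning rate are illustrative placeholders:

```python
# Only scalar control parameters are adapted at eval time; the quantized
# weight matrices are never touched.
controls = {"q_gain": 1.0, "attn_scale": 1.0, "mlp_scale": 1.0}

def eval_time_step(controls, grads, lr=0.01):
    """One SGD step over the small control-parameter subset."""
    return {k: v - lr * grads[k] for k, v in controls.items()}

grads = {"q_gain": 0.2, "attn_scale": -0.1, "mlp_scale": 0.0}
controls = eval_time_step(controls, grads)
```

Restricting adaptation to a few scalars keeps the eval-time cost negligible and avoids storing any extra per-weight state in the artifact.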
LR Schedule
warmdown
parameters: {"warmdown_iters":500,"warmup_steps":20}
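With `warmup_steps: 20` and `warmdown_iters: 500`, the schedule presumably ramps up linearly, holds, then decays linearly to zero over the final 500 iterations. A sketch of that multiplier; `total_iters` is an assumed placeholder, not a value from the PR:

```python
def lr_mult(step, total_iters, warmup_steps=20, warmdown_iters=500):
    """LR multiplier: linear warmup, constant plateau, linear warmdown to 0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps            # warmup ramp
    if step >= total_iters - warmdown_iters:
        return (total_iters - step) / warmdown_iters  # warmdown decay
    return 1.0                                       # plateau

mults = [lr_mult(s, 1000) for s in (0, 100, 999)]
```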
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Mixed-precision export with int5 for MLP matrices and int6 for attention matrices
  • Tiny eval-time adaptation on a small control-parameter subset
  • Demonstration of a valid 10L submission with strong throughput and much smaller artifact size
  • Exploration of the size/quality frontier using aggressive mixed quantization