PR #272

open

Non-record: 10L mixed int5/int6 export reaches ~10.4MB with strong throughput

by simon-marcus
val_bpb
1.2427
Architecture
Transformer
Optimizer
SGD
Artifact Size
10.4MB

Training Techniques

Quantization
mixed int5/int6/int8
bits: null
scope: MLP matrices int5, attention matrices int6, elsewhere int8
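The PR does not include the export code itself; the following is a minimal sketch of what symmetric per-tensor quantization at the stated bit widths (int5 for MLP matrices, int6 for attention matrices, int8 elsewhere) could look like. Function names and the `BITS` mapping are illustrative, not from the submission.

```python
def quantize(values, bits):
    """Map floats to signed integers in [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid div-by-zero scale
    return [round(v / scale) for v in values], scale

def dequantize(quants, scale):
    """Recover approximate floats from quantized integers."""
    return [q * scale for q in quants]

# Bit widths per the PR's stated scope (hypothetical grouping keys).
BITS = {"mlp": 5, "attn": 6, "other": 8}

w = [0.31, -0.72, 0.05, 0.66]
q, s = quantize(w, BITS["mlp"])   # int5: values land in [-15, 15]
w_hat = dequantize(q, s)          # reconstruction error bounded by scale/2
```

At int5 the per-tensor range is only [-15, 15], which is why the PR confines it to MLP matrices while keeping attention at int6 and everything else at int8.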
Architecture
weight tying
Tied output and input embeddings
parameters: null
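Weight tying means the output projection reuses the input embedding matrix, so only one copy of that matrix lands in the exported artifact. A toy sketch (class and field names are illustrative):

```python
class TiedLM:
    """Toy model where the LM head shares storage with the embedding."""
    def __init__(self, vocab, dim):
        self.embed = [[0.0] * dim for _ in range(vocab)]  # (vocab, dim)
        self.head = self.embed  # same object: logits use embed transposed

model = TiedLM(vocab=4, dim=2)
model.embed[0][0] = 1.0  # one update is visible through both views
```

Because `head` and `embed` are the same tensor, the serialized size drops by one vocab-by-dim matrix, which matters for a size-focused submission like this one.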
Optimizer
SGD
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.08,"scalar_lr":0.04,"embed_lr":0.05}
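The `other_params` field implies plain SGD (no momentum or weight decay reported) with separate learning rates per parameter group. A hedged sketch of such a grouped step, with the rates taken from the PR and everything else illustrative:

```python
# Per-group learning rates from the PR's other_params field.
LRS = {"matrix": 0.08, "scalar": 0.04, "embed": 0.05}

def sgd_step(params, grads, group):
    """Vanilla SGD update p -= lr * g, with lr chosen by parameter group."""
    lr = LRS[group]
    return [p - lr * g for p, g in zip(params, grads)]

updated = sgd_step([1.0], [0.5], "matrix")  # 1.0 - 0.08 * 0.5
```

In frameworks like PyTorch this corresponds to passing multiple parameter groups, each with its own `lr`, to a single optimizer.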
Compression
zlib
level: null
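The compression level is unreported (`level: null`); the idea itself is standard: pack the quantized integers into bytes and zlib-compress the result. A self-contained sketch using the stated int8 case:

```python
import struct
import zlib

q = [6, -15, 1, 14] * 256                      # toy quantized int8 payload
raw = struct.pack(f"{len(q)}b", *q)            # one signed byte per value
blob = zlib.compress(raw)                      # default level, since the PR
                                               # leaves it unspecified
restored = list(struct.unpack(f"{len(q)}b", zlib.decompress(blob)))
```

Sub-byte widths like int5/int6 would additionally need bit-packing before compression; that step is omitted here.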
Test-Time Training
tiny eval-time SGD
parameters: {"targets":["q_gain","attn_scale","mlp_scale","resid_mix","skip_weights"]}
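"Tiny eval-time SGD" here targets only a handful of scalar control parameters (`q_gain`, `attn_scale`, `mlp_scale`, `resid_mix`, `skip_weights`, per the PR) while all weight matrices stay frozen. A sketch of one such adaptation step; the gradients and learning rate are illustrative placeholders:

```python
# Only scalar control parameters are adapted at eval time; the quantized
# weight matrices are never touched.
controls = {"q_gain": 1.0, "attn_scale": 1.0, "mlp_scale": 1.0}

def eval_time_step(controls, grads, lr=0.01):
    """One SGD step over the small control-parameter subset."""
    return {k: v - lr * grads[k] for k, v in controls.items()}

grads = {"q_gain": 0.2, "attn_scale": -0.1, "mlp_scale": 0.0}
controls = eval_time_step(controls, grads)
```

Restricting adaptation to a few scalars keeps the eval-time cost negligible and avoids storing any extra per-weight state in the artifact.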
LR Schedule
warmdown
parameters: {"warmdown_iters":500,"warmup_steps":20}
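With `warmup_steps: 20` and `warmdown_iters: 500`, the schedule presumably ramps up linearly, holds, then decays linearly to zero over the final 500 iterations. A sketch of that multiplier; `total_iters` is an assumed placeholder, not a value from the PR:

```python
def lr_mult(step, total_iters, warmup_steps=20, warmdown_iters=500):
    """LR multiplier: linear warmup, constant plateau, linear warmdown to 0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps            # warmup ramp
    if step >= total_iters - warmdown_iters:
        return (total_iters - step) / warmdown_iters  # warmdown decay
    return 1.0                                       # plateau

mults = [lr_mult(s, 1000) for s in (0, 100, 999)]
```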
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Mixed-precision export with int5 for MLP matrices and int6 for attention matrices
  • Tiny eval-time adaptation on a small control-parameter subset
  • Demonstration of a valid 10L submission with strong throughput and much smaller artifact size
  • Exploration of the size/quality frontier using aggressive mixed quantization