PR #306

Status: open

Non-record: QAT Int5/Int6 on #1 architecture (1.14476 BPB)

by xuafeng
val_bpb: 1.1448
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,793,963

Training Techniques

Quantization
  • STE QAT (bits: 5, scope: MLP)
  • STE QAT (bits: 6, scope: attention)
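The QAT entries describe straight-through-estimator fake quantization: int5 for MLP weights, int6 for attention. A minimal forward-pass sketch, assuming symmetric per-tensor scaling (the PR does not state the scaling scheme):

```python
def fake_quantize(weights, bits):
    """STE-style fake quantization: round weights to a signed integer
    grid (int5 -> levels in [-16, 15]) and map them back to floats.
    In training, the straight-through estimator passes gradients
    through round() unchanged; only the forward pass is shown.
    Symmetric per-tensor scaling is an assumption."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 15 for int5, 31 for int6
    qmin = -(2 ** (bits - 1))            # e.g. -16 for int5
    scale = max(abs(w) for w in weights) / qmax or 1.0
    quantized = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]

w = [0.31, -0.07, 0.99, -1.0]
w_q5 = fake_quantize(w, bits=5)   # MLP weights: int5 grid
w_q6 = fake_quantize(w, bits=6)   # attention weights: int6 grid
```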
Architecture
  • BigramHash: hash embedding table used alongside token embeddings; parameters: {"vocab_size":10240,"dim":128}
  • SmearGate: gating component in the architecture
  • MLP3x: 3x expansion MLP; parameters: {"expansion":3,"hidden_dim":1536}
  • Tied embeddings: input and output embeddings are tied
  • KV head count: grouped-query attention with fewer KV heads than attention heads; parameters: {"heads":8,"kv_heads":4}
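The BigramHash component keys an extra 10240 x 128 embedding table on token bigrams. A sketch of the index computation, with an illustrative mixing constant (the PR does not specify the hash function):

```python
def bigram_hash_index(prev_token, token, table_size=10240):
    """Hash a (prev_token, token) bigram into a row of the hash
    embedding table; that row is added alongside the normal token
    embedding. The multiplier 1000003 is an illustrative mixing
    constant, not taken from the PR."""
    return (prev_token * 1000003 + token) % table_size

# Each position i looks up hash_table[bigram_hash_index(t[i-1], t[i])]
tokens = [17, 42, 42, 9]
rows = [bigram_hash_index(a, b) for a, b in zip(tokens, tokens[1:])]
```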
Optimizer
  • Muon (weight_decay: 0.04, momentum: 0.99, matrix_lr: 0.02)
Weight Averaging
  • SWA (start_frac: 0.4, every: 50, averaged_checkpoints: 24)
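The SWA entry averages checkpoints saved every 50 steps once training passes 40% of its total steps. A minimal sketch over flat weight lists (a real run would average full state dicts):

```python
def swa_average(checkpoints, total_steps, start_frac=0.4, every=50):
    """Stochastic weight averaging: uniformly average checkpoint
    weights saved every `every` steps after start_frac of training,
    matching start_frac=0.4, every=50 above. `checkpoints` maps
    step -> flat list of weights; sketch only."""
    start = total_steps * start_frac
    selected = [w for step, w in sorted(checkpoints.items())
                if step >= start and step % every == 0]
    return [sum(vals) / len(selected) for vals in zip(*selected)]

ckpts = {100: [1.0, 2.0], 150: [3.0, 4.0], 200: [5.0, 6.0]}
avg = swa_average(ckpts, total_steps=250)
```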
Compression
  • zstd (level: 22)
Evaluation
  • Sliding window eval (stride: 64)
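Sliding-window evaluation scores each token with a long left context by advancing the window a fixed stride at a time and scoring only the newly revealed tokens. A sketch of the span planning; window=512 is an illustrative context length, only stride=64 comes from the PR:

```python
def sliding_window_spans(n_tokens, window=512, stride=64):
    """Plan (context_start, score_start, score_end) spans for
    sliding-window eval: the first window scores every token it
    covers, each later window shifts by `stride` and scores only
    the `stride` new tokens, so every token is scored exactly once
    with up to `window` tokens of left context."""
    spans = [(0, 0, min(window, n_tokens))]
    score_start = min(window, n_tokens)
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        spans.append((score_end - window, score_start, score_end))
        score_start = score_end
    return spans

spans = sliding_window_spans(600)
```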
Test-Time Training
  • LoRA TTT (rank: 8, targets: Q, V, LM head)
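The TTT adapters follow the standard LoRA form y = x @ (W + A @ B), where only the low-rank factors A and B are updated at test time while W stays frozen. A toy plain-list sketch (the PR uses rank-8 adapters on Q, V and the LM head; dimensions here are illustrative):

```python
def lora_forward(x, W, A, B):
    """Apply a LoRA-adapted linear layer: y = x @ (W + A @ B).
    W is the frozen base weight (d_in x d_out); A (d_in x r) and
    B (r x d_out) form the trainable low-rank update."""
    def matmul(X, Y):
        return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
                for row in X]
    delta = matmul(A, B)
    W_adapted = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
    return matmul(x, W_adapted)

# Toy rank-1 example with 2x2 weights:
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [0.0]]          # d_in x r
B = [[0.0, 2.0]]            # r x d_out
y = lora_forward([[1.0, 1.0]], W, A, B)
```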
Initialization
  • OrthoInit: orthogonal initialization
Regularization
  • 3% magnitude pruning (prune_fraction: 0.03)
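Magnitude pruning zeroes the smallest-magnitude 3% of weights. A minimal sketch, assuming global pruning over a flat weight list (per-layer pruning is also common; the PR does not say which):

```python
def magnitude_prune(weights, prune_fraction=0.03):
    """Zero out the smallest-magnitude prune_fraction of weights,
    matching prune_fraction=0.03 above. Ties at the threshold are
    all pruned; global pruning is an assumption."""
    k = int(len(weights) * prune_fraction)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [0.01, -0.02, 0.03] + [1.0] * 97
pruned = magnitude_prune(weights)   # zeroes the three smallest entries
```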

Novel Contributions

  • Applied STE fake-quantization QAT on top of the #1 architecture
  • Used mixed int5 MLP and int6 attention quantization during training
  • Compared QAT against post-training quantization (PTQ) and found that PTQ plus SWA performed better
  • Explored trigram hash embeddings as an additional feature
  • Implemented TTT LoRA adapters for potential test-time adaptation