PR #306

Status: open

Non-record: QAT Int5/Int6 on #1 architecture (1.14476 BPB)

by xuafeng
val_bpb: 1.1448
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,793,963

Training Techniques

Quantization
  • STE QAT (bits: 5, scope: MLP)
  • STE QAT (bits: 6, scope: attention)
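The QAT entries describe straight-through-estimator fake quantization: int5 for MLP weights, int6 for attention. A minimal forward-pass sketch, assuming symmetric per-tensor scaling (the PR does not state the scaling scheme):

```python
def fake_quantize(weights, bits):
    """STE-style fake quantization: round weights to a signed integer
    grid (int5 -> levels in [-16, 15]) and map them back to floats.
    In training, the straight-through estimator passes gradients
    through round() unchanged; only the forward pass is shown.
    Symmetric per-tensor scaling is an assumption."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 15 for int5, 31 for int6
    qmin = -(2 ** (bits - 1))            # e.g. -16 for int5
    scale = max(abs(w) for w in weights) / qmax or 1.0
    quantized = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]

w = [0.31, -0.07, 0.99, -1.0]
w_q5 = fake_quantize(w, bits=5)   # MLP weights: int5 grid
w_q6 = fake_quantize(w, bits=6)   # attention weights: int6 grid
```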
Architecture
  • BigramHash: hash embedding table used alongside token embeddings; parameters: {"vocab_size":10240,"dim":128}
  • SmearGate: gating component in the architecture
  • MLP3x: 3x expansion MLP; parameters: {"expansion":3,"hidden_dim":1536}
  • Tied embeddings: input and output embeddings are tied
  • KV head count: grouped-query attention with fewer KV heads than attention heads; parameters: {"heads":8,"kv_heads":4}
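The BigramHash component keys an extra 10240 x 128 embedding table on token bigrams. A sketch of the index computation, with an illustrative mixing constant (the PR does not specify the hash function):

```python
def bigram_hash_index(prev_token, token, table_size=10240):
    """Hash a (prev_token, token) bigram into a row of the hash
    embedding table; that row is added alongside the normal token
    embedding. The multiplier 1000003 is an illustrative mixing
    constant, not taken from the PR."""
    return (prev_token * 1000003 + token) % table_size

# Each position i looks up hash_table[bigram_hash_index(t[i-1], t[i])]
tokens = [17, 42, 42, 9]
rows = [bigram_hash_index(a, b) for a, b in zip(tokens, tokens[1:])]
```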
Optimizer
  • Muon (weight_decay: 0.04, momentum: 0.99, matrix_lr: 0.02)
Weight Averaging
  • SWA (start_frac: 0.4, every: 50, averaged_checkpoints: 24)
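The SWA entry averages checkpoints saved every 50 steps once training passes 40% of its total steps. A minimal sketch over flat weight lists (a real run would average full state dicts):

```python
def swa_average(checkpoints, total_steps, start_frac=0.4, every=50):
    """Stochastic weight averaging: uniformly average checkpoint
    weights saved every `every` steps after start_frac of training,
    matching start_frac=0.4, every=50 above. `checkpoints` maps
    step -> flat list of weights; sketch only."""
    start = total_steps * start_frac
    selected = [w for step, w in sorted(checkpoints.items())
                if step >= start and step % every == 0]
    return [sum(vals) / len(selected) for vals in zip(*selected)]

ckpts = {100: [1.0, 2.0], 150: [3.0, 4.0], 200: [5.0, 6.0]}
avg = swa_average(ckpts, total_steps=250)
```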
Compression
  • zstd (level: 22)
Evaluation
  • Sliding window eval (stride: 64)
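Sliding-window evaluation scores each token with a long left context by advancing the window a fixed stride at a time and scoring only the newly revealed tokens. A sketch of the span planning; window=512 is an illustrative context length, only stride=64 comes from the PR:

```python
def sliding_window_spans(n_tokens, window=512, stride=64):
    """Plan (context_start, score_start, score_end) spans for
    sliding-window eval: the first window scores every token it
    covers, each later window shifts by `stride` and scores only
    the `stride` new tokens, so every token is scored exactly once
    with up to `window` tokens of left context."""
    spans = [(0, 0, min(window, n_tokens))]
    score_start = min(window, n_tokens)
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        spans.append((score_end - window, score_start, score_end))
        score_start = score_end
    return spans

spans = sliding_window_spans(600)
```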
Test-Time Training
  • LoRA TTT (rank: 8, targets: Q, V, LM head)
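The TTT adapters follow the standard LoRA form y = x @ (W + A @ B), where only the low-rank factors A and B are updated at test time while W stays frozen. A toy plain-list sketch (the PR uses rank-8 adapters on Q, V and the LM head; dimensions here are illustrative):

```python
def lora_forward(x, W, A, B):
    """Apply a LoRA-adapted linear layer: y = x @ (W + A @ B).
    W is the frozen base weight (d_in x d_out); A (d_in x r) and
    B (r x d_out) form the trainable low-rank update."""
    def matmul(X, Y):
        return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
                for row in X]
    delta = matmul(A, B)
    W_adapted = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
    return matmul(x, W_adapted)

# Toy rank-1 example with 2x2 weights:
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [0.0]]          # d_in x r
B = [[0.0, 2.0]]            # r x d_out
y = lora_forward([[1.0, 1.0]], W, A, B)
```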
Initialization
  • OrthoInit: orthogonal initialization
Regularization
  • 3% magnitude pruning (prune_fraction: 0.03)
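Magnitude pruning zeroes the smallest-magnitude 3% of weights. A minimal sketch, assuming global pruning over a flat weight list (per-layer pruning is also common; the PR does not say which):

```python
def magnitude_prune(weights, prune_fraction=0.03):
    """Zero out the smallest-magnitude prune_fraction of weights,
    matching prune_fraction=0.03 above. Ties at the threshold are
    all pruned; global pruning is an assumption."""
    k = int(len(weights) * prune_fraction)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [0.01, -0.02, 0.03] + [1.0] * 97
pruned = magnitude_prune(weights)   # zeroes the three smallest entries
```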

Novel Contributions

  • Applied STE fake-quantization QAT on top of the #1 architecture
  • Used mixed int5 MLP and int6 attention quantization during training
  • Compared QAT against post-training quantization (PTQ) and found that PTQ plus SWA performed better
  • Explored trigram hash embeddings as an additional feature
  • Implemented TTT LoRA adapters for potential test-time adaptation