PR #348 (open)

Submission/qat bigram12k stride32

by EthanYangTW
val_bpb: 1.1444
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.90 MB

Training Techniques

Quantization
STE QAT (bits: 5, scope: MLP)
STE QAT (bits: 6, scope: attention)
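The QAT code itself is not part of this summary; below is a minimal sketch of STE fake quantization with symmetric per-tensor scaling, assuming PyTorch and the stated bit widths (int5 for MLP weights, int6 for attention weights). Class and variable names are illustrative, not the submission's.

```python
import torch
import torch.nn as nn

class FakeQuantSTE(torch.autograd.Function):
    """Symmetric per-tensor fake quantization; backward is a straight-through estimator."""

    @staticmethod
    def forward(ctx, w, bits):
        qmax = 2 ** (bits - 1) - 1                    # 15 for int5, 31 for int6
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                         # pass gradients straight through the rounding

class QATLinear(nn.Linear):
    """Linear layer trained against fake-quantized weights; master weights stay full precision."""

    def __init__(self, in_features, out_features, bits, bias=False):
        super().__init__(in_features, out_features, bias=bias)
        self.bits = bits

    def forward(self, x):
        w_q = FakeQuantSTE.apply(self.weight, self.bits)
        return nn.functional.linear(x, w_q, self.bias)

# Mixed precision per the submission: int5 for MLP projections, int6 for attention projections.
mlp_fc = QATLinear(768, 3 * 768, bits=5)
attn_qkv = QATLinear(768, 3 * 768, bits=6)
```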
Architecture
BigramHash
Increased bigram hash bucket count to improve bigram coverage.
parameters: {"buckets":12288,"bigram_dim":128}
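The bigram table itself is not shown here; a rough sketch of a hashed bigram embedding with 12288 buckets and a 128-dim embedding, assuming PyTorch. The hash constants and the projection into the residual stream are assumptions, not the submission's actual wiring.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hashes each (previous token, current token) pair into one of `buckets` learned vectors."""

    def __init__(self, buckets=12288, bigram_dim=128, d_model=768):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.Embedding(buckets, bigram_dim)
        self.proj = nn.Linear(bigram_dim, d_model, bias=False)   # lift into the model dimension

    def forward(self, idx):                                      # idx: (B, T) token ids
        prev = torch.roll(idx, shifts=1, dims=1)
        prev[:, 0] = 0                                           # no previous token at position 0
        h = (prev * 1000003 + idx * 999983) % self.buckets       # cheap multiplicative hash (illustrative)
        return self.proj(self.emb(h))                            # (B, T, d_model), added to token embeddings
```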
SmearGate
Uses a 3x-wide MLP together with a SmearGate in the transformer block.
parameters: {"mlp_multiplier":3}
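SmearGate is not defined in this summary. One plausible reading is a learned gate that "smears" each position with the previous position's representation; the sketch below is hypothetical under that assumption and shows it next to the 3x-wide MLP.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Hypothetical: blend each position with the previous one via a learned sigmoid gate."""

    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)

    def forward(self, x):                                   # x: (B, T, d_model)
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)      # shift right by one position
        g = torch.sigmoid(self.gate(x))                     # per-position smear strength in (0, 1)
        return x + g * prev

class MLP(nn.Module):
    """Feed-forward block with the submission's 3x hidden width (mlp_multiplier = 3)."""

    def __init__(self, d_model, mlp_multiplier=3):
        super().__init__()
        self.fc1 = nn.Linear(d_model, mlp_multiplier * d_model)
        self.act = nn.GELU()                                # activation choice is an assumption
        self.fc2 = nn.Linear(mlp_multiplier * d_model, d_model)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))
```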
Weight Averaging
SWA
parameters: {"every_steps":25}
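Averaging weights every 25 steps maps directly onto PyTorch's AveragedModel; a minimal sketch, assuming that is what the SWA entry refers to (the toy model and optimizer are placeholders for the actual transformer and Muon setup):

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

model = nn.Linear(16, 16)                        # stand-in for the transformer
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
swa_model = AveragedModel(model)                 # running average of the parameters

for step in range(200):
    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()                # dummy objective
    loss.backward()
    opt.step()
    opt.zero_grad()
    if step % 25 == 0:                           # parameters: {"every_steps": 25}
        swa_model.update_parameters(model)

# swa_model.module now holds the averaged weights (presumably what goes into the artifact).
```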
Evaluation
stride-based eval
parameters: {"stride":32}
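Stride-based evaluation usually means sliding a fixed context window forward by `stride` tokens and scoring only the tokens not already covered by earlier windows; a sketch of a bits-per-byte loop with stride 32 (the window size, the model's call signature, and the byte accounting are assumptions):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_bpb(model, tokens, n_bytes, block_size=512, stride=32):
    """tokens: 1-D LongTensor; model(x) is assumed to return logits of shape (B, T, vocab)."""
    total_nll, prev_end = 0.0, 0
    for begin in range(0, len(tokens) - 1, stride):
        end = min(begin + block_size, len(tokens) - 1)
        n_new = end - max(begin, prev_end)                 # targets not scored by earlier windows
        x = tokens[begin:end].unsqueeze(0)
        y = tokens[begin + 1:end + 1].unsqueeze(0)
        nll = F.cross_entropy(model(x)[0], y[0], reduction="none")
        total_nll += nll[-n_new:].sum().item()             # count only the fresh targets
        prev_end = end
        if end == len(tokens) - 1:
            break
    return total_nll / math.log(2) / n_bytes               # nats -> bits, then per byte
```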
Compression
zstd
level: 22
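The 15.90 MB artifact size presumably refers to the zstd-compressed checkpoint. A sketch using the `zstandard` Python package at level 22; the serialization format (a plain torch.save of the state dict) is an assumption.

```python
import io
import torch
import zstandard as zstd

def save_artifact(state_dict, path):
    """Serialize the state dict and compress it with zstd at level 22."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    with open(path, "wb") as f:
        f.write(zstd.ZstdCompressor(level=22).compress(buf.getvalue()))

def load_artifact(path):
    with open(path, "rb") as f:
        raw = zstd.ZstdDecompressor().decompress(f.read())
    return torch.load(io.BytesIO(raw))
```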
Regularization
magnitude pruning
parameters: {"sparsity":0.05}
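A 5% magnitude prune zeroes the smallest-magnitude 5% of entries, which also gives the zstd stage more redundancy to exploit. A minimal per-tensor sketch; whether the submission prunes globally or per layer is not stated here.

```python
import torch

def magnitude_prune_(weight, sparsity=0.05):
    """Zero the smallest-magnitude `sparsity` fraction of entries, in place."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    weight.masked_fill_(weight.abs() <= threshold, 0.0)
    return weight

w = torch.randn(256, 256)
magnitude_prune_(w)
print((w == 0).float().mean())   # ~0.05
```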

Novel Contributions

  • Applied QAT with STE fake quantization to reduce post-quantization degradation
  • Used mixed-precision quantization (int5 for the MLP, int6 for attention)
  • Expanded BigramHash from 10240 to 12288 buckets
  • Reduced evaluation stride from 64 to 32
  • Applied 5% magnitude pruning
  • Used SWA during training