PR #379

open

Record: 11L GPTQ-lite + Int6 MLP3x (val_bpb=1.1257)

by dannywillowliu-uchi
val_bpb: 1.1257
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.99 MB

Training Techniques

Quantization
GPTQ-lite
6-bit quantization applied to all weights
parameters: {"bits":6,"scope":"all weights"}
Architecture
MLP3x
3x MLP expansion with relu-squared activation
parameters: {"expansion":3}
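A minimal sketch of the MLP3x block as described above: hidden width is 3x the model width and the activation is relu-squared. The list-of-lists weights are purely illustrative; a real implementation would use fused matmuls.

```python
def relu_squared(x):
    # relu(x)^2: zero for negatives, quadratic for positives.
    return max(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    # 3x-expansion MLP: w_in has 3 * len(x) rows, so the hidden vector is
    # three times wider than the input; w_out projects back down.
    hidden = [relu_squared(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
```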
XSA
Efficient Partial XSA on the last 4 layers
parameters: {"last_n_layers":4}
RoPE
Partial RoPE with NTK-aware scaling
parameters: {"dimensions":"16/64"}
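A sketch of the rotary frequency table under one reading of the entry above: "16/64" is taken to mean 16 of the 64 head dimensions are rotated, and the NTK-aware scaling is the standard base-stretching formula. The base and scale values here are assumptions, not the record's actual settings.

```python
def ntk_rope_frequencies(rot_dims=16, base=10000.0, ntk_scale=1.0):
    # Partial RoPE: only rot_dims of the head dimensions receive rotary
    # position embeddings; the rest attend position-free. NTK-aware
    # scaling stretches the base so the low frequencies extrapolate to
    # longer contexts without retraining.
    scaled_base = base * ntk_scale ** (rot_dims / (rot_dims - 2))
    return [scaled_base ** (-2.0 * i / rot_dims) for i in range(rot_dims // 2)]
```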
SmearGate
SmearGate gating mechanism
parameters: null
BigramHash
BigramHash with 2048 buckets and dim=128
parameters: {"buckets":2048,"dim":128}
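A sketch of the bucketing side of BigramHash: each (previous, current) token pair is hashed into one of 2048 buckets, and each bucket index would select a row of a learned [2048, 128] embedding table. The record does not specify the hash function, so the mixing constants below are illustrative only.

```python
def bigram_bucket(prev_id, cur_id, buckets=2048):
    # Hash the (previous, current) token pair to a fixed bucket. The
    # constants are illustrative, not the record's actual hash.
    h = (prev_id * 1000003 + cur_id) * 2654435761 % (2 ** 32)
    return h % buckets

def bigram_features(token_ids, buckets=2048):
    # One bucket index per position; position 0 has no predecessor, so a
    # padding id (assumed 0 here) stands in for it.
    pad = 0
    prev = [pad] + list(token_ids[:-1])
    return [bigram_bucket(p, c, buckets) for p, c in zip(prev, token_ids)]
```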
KV head count
Grouped-query attention with 8 heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
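The query-to-KV-head mapping implied by the 8/4 configuration above, assuming the usual contiguous grouping:

```python
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    # Grouped-query attention: each KV head is shared by a contiguous
    # group of n_heads // n_kv_heads query heads, halving the K/V
    # projections (and KV cache) relative to full multi-head attention.
    group = n_heads // n_kv_heads  # 2 query heads per KV head here
    return q_head // group
```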
Value Embedding
Shared value embedding used in later layers
parameters: {"dim":128,"layers":[9,10]}
Weight Averaging
SWA
Stochastic weight averaging over periodic training checkpoints
parameters: {"every_steps":50,"checkpoint_count":12,"scale_threshold":0.2}
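A sketch of the weight averaging per the parameters above: snapshot every 50 steps, keep the last 12 snapshots, serve their uniform average. The record's scale_threshold presumably gates when snapshots count, but its exact rule isn't stated, so it is omitted here.

```python
from collections import deque

class CheckpointAverager:
    # Keep the last `checkpoint_count` parameter snapshots, taken every
    # `every_steps` optimizer steps, and serve their uniform average.
    def __init__(self, every_steps=50, checkpoint_count=12):
        self.every_steps = every_steps
        self.snapshots = deque(maxlen=checkpoint_count)

    def maybe_snapshot(self, step, params):
        # Snapshot a flat copy of the parameters on schedule.
        if step % self.every_steps == 0:
            self.snapshots.append(list(params))

    def averaged(self):
        # Element-wise mean across the retained snapshots.
        n = len(self.snapshots)
        return [sum(ws) / n for ws in zip(*self.snapshots)]
```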
Compression
zstd
parameters: {"level":22}
Evaluation
sliding window eval
parameters: {"stride":64}
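A sketch of how stride=64 sliding-window evaluation typically partitions a document: each 64-token chunk is scored with as much left context as the window allows. The window size is an assumption (taken to match train_length=2048).

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    # Returns (context_start, score_start, score_end) triples: only the
    # tokens in [score_start, score_end) contribute to val_bpb, and each
    # is conditioned on the context starting at context_start, so every
    # scored token sees (near-)maximal left context.
    spans = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_start, score_end))
        score_start = score_end
    return spans
```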
Test-Time Training
self-distillation TTT
parameters: {"temperature":2,"freeze_blocks":4,"epochs":2,"learning_rate":0.001}
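A minimal sketch of the self-distillation loss at temperature 2, assuming the standard temperature-scaled KL formulation; in the record's TTT, the teacher is a frozen copy of the model and 4 of the student's blocks stay frozen (freeze_blocks=4; which blocks is not stated).

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable temperature-scaled softmax.
    m = max(l / temperature for l in logits)
    exps = [math.exp(l / temperature - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) at temperature T, scaled by T^2 as in
    # standard distillation so gradient magnitudes match the T=1 loss.
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    kl = sum(ti * math.log(ti / si) for ti, si in zip(t, s))
    return temperature ** 2 * kl
```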
Initialization
Orthogonal init
Orthogonal initialization with projection scaling
Sequence Length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
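The record only gives warmdown_iters=3000; a common shape for this schedule (constant LR, then linear decay to zero) is assumed in the sketch below.

```python
def lr_scale(step, total_steps, warmdown_iters=3000):
    # Constant LR for most of training, then a linear "warmdown" to zero
    # over the final warmdown_iters steps. (Only warmdown_iters is from
    # the record; the constant-then-linear shape is an assumption.)
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return 1.0
    return max(0.0, remaining / warmdown_iters)
```

Under this shape, the late-QAT trigger (LR scale < 0.1) would fire in the last 300 steps of a 3000-step warmdown.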
Regularization
layerwise LN scale
parameters: {"scale_rule":"1/sqrt(layer_idx+1)"}
Other
Late int6 QAT
Quantization-aware training with a straight-through estimator, enabled once the LR scale drops below 0.1
parameters: {"lr_scale_threshold":0.1}
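A sketch of the late-QAT forward path: full precision while the LR scale is above the threshold, fake int6 quantization once it decays below. The per-weight scale handling is simplified (one scalar scale) and the backward pass is only described in comments, since the straight-through estimator needs an autograd framework to show properly.

```python
def fake_quantize_int6(w, scale):
    # Fake int6 quantization: round onto a symmetric 6-bit grid
    # ([-31, 31] * scale) but keep the result in float for training.
    q = max(-31, min(31, round(w / scale)))
    return q * scale

def qat_forward(weights, scale, lr_scale, threshold=0.1):
    # Late QAT: full precision while lr_scale >= threshold; once the LR
    # has decayed below it, forward passes go through the quantizer.
    # Backward uses a straight-through estimator: gradients flow as if
    # the rounding were the identity.
    if lr_scale >= threshold:
        return list(weights)
    return [fake_quantize_int6(w, scale) for w in weights]
```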

Novel Contributions

  • GPTQ-lite: per-layer optimal clip percentile search during int6 quantization
  • Self-distillation TTT using a frozen teacher to preserve XSA attention patterns
  • Late QAT with straight-through-estimator int6, enabled during the low-LR phase of training
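The first contribution, sketched under assumptions: symmetric int6 quantization plus a per-layer search over clip percentiles of |w|, keeping the percentile that minimizes reconstruction MSE. The candidate percentile grid is illustrative; the record doesn't list the one actually searched.

```python
def quantize_int6(weights, clip):
    # Symmetric int6: clip to [-clip, clip], round onto 63 levels, and
    # return the dequantized (float) values for error measurement.
    scale = clip / 31 if clip > 0 else 1.0
    out = []
    for w in weights:
        q = max(-31, min(31, round(w / scale)))
        out.append(q * scale)
    return out

def best_clip(weights, percentiles=(0.99, 0.995, 0.999, 1.0)):
    # Per-layer optimal clip percentile search: try each candidate clip
    # (a percentile of |w|) and keep the one with the lowest MSE between
    # the original and dequantized weights.
    mags = sorted(abs(w) for w in weights)
    best_p, best_mse = None, float("inf")
    for p in percentiles:
        clip = mags[min(len(mags) - 1, int(p * len(mags)))]
        deq = quantize_int6(weights, clip)
        mse = sum((w - d) ** 2 for w, d in zip(weights, deq)) / len(weights)
        if mse < best_mse:
            best_p, best_mse = p, mse
    return best_p, best_mse
```

Clipping below the max trades large error on a few outlier weights for a finer grid (and smaller error) on the bulk; which side wins varies per layer, hence the per-layer search.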