PR #410

closed

Record: 11L XSA4 + Tight SWA + FA3 + Two-Phase TTT (val_bpb=1.1216)

by EthanYangTW on GitHub
val_bpb: 1.1216
Architecture: Transformer
Optimizer: Adam
Artifact Size: 15,762,005 bytes

Training Techniques

Quantization
QAT
bits: 6
scope: attention; int5 for MLP layers
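QAT here means the forward pass sees quantized weights while gradients flow in float. A minimal sketch of symmetric fake quantization at the int6/int5 widths listed above (per-tensor scaling is an assumption; the PR does not state its granularity):

```python
def fake_quant(x, bits=6):
    """QAT-style fake quantization (sketch): round values to the nearest
    representable signed-int level, then dequantize, so the model trains
    against the quantization error. bits=6 matches the attention scope
    here; bits=5 would match the MLP layers."""
    qmax = 2 ** (bits - 1) - 1                 # 31 levels each side for int6
    # Per-tensor symmetric scale; `or 1.0` guards the all-zero tensor.
    scale = max(abs(v) for v in x) / qmax or 1.0
    return [round(v / scale) * scale for v in x]
```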
Architecture
XSA
Uses XSA in the last 4 layers of an 11-layer Transformer.
parameters: {"layers":4}
SmearGate
MLP gating mechanism used in 3x MLP blocks.
parameters: null
MLP3x
Three-layer MLP blocks.
parameters: {"layers":3}
Partial RoPE
Applies RoPE partially across dimensions.
parameters: {"dimensions":"16/64"}
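With a 16/64 split, only the first quarter of each head dimension is rotated and the rest passes through untouched. A sketch under the usual adjacent-pair RoPE convention (the pairing scheme and frequency base are assumptions, not stated in the PR):

```python
import math

def partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Partial RoPE (sketch): rotate only the first `rope_dims` entries
    of a head vector by position-dependent angles; dims beyond that are
    left unrotated, matching the 16/64 split reported here."""
    out = list(x)
    for i in range(0, rope_dims, 2):
        theta = pos * base ** (-i / rope_dims)   # per-pair frequency
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s                   # 2-D rotation of the pair
        out[i + 1] = a * s + b * c
    return out
```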
BigramHash
Bigram hashing feature for token pair coverage.
parameters: {"buckets":2048}
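The bigram feature can be as simple as hashing each (previous, current) token pair into a fixed embedding table of 2048 buckets. A minimal sketch (the mixing constants are assumptions, not the PR's):

```python
def bigram_bucket(prev_tok, tok, buckets=2048):
    """Hash a (prev, current) token pair into one of `buckets` slots
    (sketch). The bucket indexes a learned embedding that supplements
    the unigram token embedding with pair coverage."""
    h = (prev_tok * 1000003 + tok) * 2654435761   # multiplicative mix
    return (h ^ (h >> 16)) % buckets               # fold high bits, bucket
```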
KV head count
Uses grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
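With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads, halving KV-cache size. A sketch of the standard consecutive-group mapping (the exact grouping order used here is an assumption):

```python
def kv_head_for(q_head, heads=8, kv_heads=4):
    """Grouped-query attention (sketch): consecutive query heads share
    one KV head, so 8 query heads attend over only 4 K/V projections."""
    return q_head // (heads // kv_heads)
```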
Weight Averaging
SWA
parameters: {"frequency":"tight"}
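SWA maintains a running mean of checkpoints; reading `frequency: "tight"` as averaging at short intervals (an interpretation, not spelled out in the PR), one accumulation step looks like:

```python
def swa_update(avg, weights, n_models):
    """One SWA step (sketch): fold the current weights into the running
    mean of the `n_models` checkpoints averaged so far. 'Tight' SWA
    just calls this at short step intervals."""
    return [(a * n_models + w) / (n_models + 1) for a, w in zip(avg, weights)]
```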
Compression
zstd
level: 22
Evaluation
stride-based eval
parameters: {"stride":32}
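Stride-based eval slides the context window forward 32 tokens at a time and scores only the newly exposed tokens, so most tokens are evaluated with near-full left context. A sketch of the window bookkeeping (the window size of 256 is an assumption; stride=32 is from this PR):

```python
def eval_windows(n_tokens, window=256, stride=32):
    """Return (context_start, score_start, score_end) spans (sketch).
    Each token is scored exactly once, with up to `window - stride`
    tokens of preceding context."""
    spans, start = [], 0
    while start < n_tokens:
        lo = max(0, start + stride - window)        # context start
        spans.append((lo, start, min(start + stride, n_tokens)))
        start += stride
    return spans
```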
Test-Time Training
two-phase TTT
parameters: {"phase_1":{"method":"norm-only recalibration","epochs":100,"optimizer":"Adam","learning_rate":0.01,"trainable_params":"LayerNorm weights, scales, final_norm"},"phase_2":{"method":"selective-freeze block adaptation","epochs":15,"optimizer":"SGD","learning_rate":0.003,"trainable_params":"last 2 transformer blocks, norms, scales, lm_head"}}
Initialization
OrthoInit
Orthogonal initialization used for model weights.
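Orthogonal init draws a random Gaussian matrix and orthonormalizes it; libraries typically do this via QR with sign correction. A dependency-free Gram-Schmidt sketch of the same idea:

```python
import random

def ortho_init(n, seed=0):
    """Orthogonal initialization (sketch): Gram-Schmidt on an n-by-n
    Gaussian matrix, yielding orthonormal rows. Real implementations
    use QR with sign correction, but the resulting property is the
    same: W @ W.T == I."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
    basis = []
    for r in rows:
        for b in basis:                          # subtract projections
            dot = sum(x * y for x, y in zip(r, b))
            r = [x - dot * y for x, y in zip(r, b)]
        norm = sum(x * x for x in r) ** 0.5      # then normalize
        basis.append([x / norm for x in r])
    return basis
```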
Regularization
layerwise LN scale
parameters: {"ln_scale":true}

Novel Contributions

  • Two-phase test-time training combining norm-only recalibration and selective-freeze block adaptation
  • Recalibration of activation distributions damaged by int6 quantization
  • Selective adaptation of the last two transformer blocks while preserving SWA-averaged early layers
  • Tight SWA combined with late QAT and pruning
  • Increased BigramHash bucket count and reduced evaluation stride