PR #415
closed
Record: 11L XSA4 + Tight SWA + FA3 + Two-Phase TTT (val_bpb=1.1216)
by EthanYangTW
val_bpb
1.1216
Architecture
Transformer
Optimizer
Adam
Artifact Size
15,704,756 bytes
Training Techniques
Quantization
QAT
bits: 6
scope: attention
QAT
bits: 5
scope: MLP
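The two QAT entries above (6-bit for attention weights, 5-bit for MLP weights) can be sketched as symmetric per-tensor fake quantization. This is a minimal numpy illustration of the rounding step only; real QAT also uses a straight-through gradient estimator, which is omitted here, and the function name is illustrative, not from the PR.

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor fake quantization: snap weights to a
    b-bit integer grid, then dequantize back to float."""
    qmax = 2 ** (bits - 1) - 1              # 31 for int6, 15 for int5
    scale = np.abs(w).max() / qmax
    if scale == 0:
        scale = 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

Training with fake-quantized weights lets the model adapt its activation statistics to the quantization grid before the final artifact is actually stored in low precision.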
Architecture
XSA
Uses XSA in the last 4 layers.
parameters: {"layers":4}
SmearGate
Adds SmearGate to the MLP blocks.
parameters: null
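The PR gives no parameters for SmearGate. One common reading of "smearing" is mixing each position with the previous position through a learned per-channel sigmoid gate; the sketch below is that assumed form, not a confirmed description of this submission's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear_gate(x: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Hypothetical SmearGate: add a gated copy of the previous position's
    activations to each position. x: (T, D), gate_logits: (D,)."""
    prev = np.concatenate([np.zeros_like(x[:1]), x[:-1]], axis=0)  # shift right by one
    return x + sigmoid(gate_logits) * prev
```

With the gate driven to zero the block is an identity, so it can be learned as a cheap residual refinement.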
BigramHash
Hashed bigram embedding table that gives the model direct bigram coverage.
parameters: {"buckets":12288}
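A hashed bigram feature like the one above typically hashes each (previous token, current token) pair into a fixed number of buckets (12288 per the PR's parameters) and looks up an extra embedding for that bucket. A hedged pure-python sketch, with an arbitrary hash multiplier not taken from the PR:

```python
import numpy as np

BUCKETS = 12288  # from the PR's parameters

def bigram_bucket(prev_tok: int, cur_tok: int, buckets: int = BUCKETS) -> int:
    """Deterministically hash a token bigram into one of `buckets` slots.
    The multiplier is an arbitrary odd constant, not from the PR."""
    return ((prev_tok * 1000003) ^ cur_tok) % buckets

def bigram_features(tokens, table: np.ndarray) -> np.ndarray:
    """Per-position hashed-bigram embeddings (position 0 has no bigram)."""
    out = np.zeros((len(tokens), table.shape[1]))
    for t in range(1, len(tokens)):
        out[t] = table[bigram_bucket(tokens[t - 1], tokens[t])]
    return out
```

Hashing keeps the table small relative to vocab² while still giving every bigram a dedicated (if occasionally colliding) slot.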
Partial RoPE
Applies RoPE partially across the model.
parameters: {"train_length":16,"eval_length":64}
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
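With 8 query heads and 4 KV heads as above, each KV head serves two query heads. Mechanically this is just a repeat of the KV tensors along the head axis before attention; a minimal numpy sketch:

```python
import numpy as np

def expand_kv(kv: np.ndarray, n_heads: int) -> np.ndarray:
    """Grouped-query attention: repeat each KV head so it serves
    n_heads // n_kv_heads query heads.
    kv: (n_kv_heads, T, d) -> (n_heads, T, d)."""
    n_kv = kv.shape[0]
    assert n_heads % n_kv == 0
    return np.repeat(kv, n_heads // n_kv, axis=0)
```

Halving the KV heads halves the KV cache and the K/V projection parameters at a small quality cost.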
MLP3x
Uses MLP blocks with 3× expansion and ReLU² activation.
parameters: null
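The MLP3x block above, read as a 3× expansion with ReLU² (squared ReLU, as in the modded-nanogpt lineage), has a very simple forward pass; this numpy sketch assumes that reading and omits biases:

```python
import numpy as np

def mlp3x(x: np.ndarray, w_in: np.ndarray, w_out: np.ndarray) -> np.ndarray:
    """MLP with 3x hidden expansion and ReLU^2 activation.
    x: (T, d), w_in: (d, 3d), w_out: (3d, d)."""
    h = x @ w_in                    # expand to 3*d
    h = np.maximum(h, 0.0) ** 2     # squared ReLU
    return h @ w_out                # project back to d
```

A 3× hidden width instead of the usual 4× trades a little capacity for a smaller artifact, which matters under a size-scored benchmark.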
Initialization
OrthoInit
Orthogonal initialization.
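Orthogonal initialization is conventionally done via the QR decomposition of a Gaussian matrix, with a sign correction from R's diagonal; a numpy sketch (the PR does not specify its exact routine):

```python
import numpy as np

def ortho_init(shape, rng=None):
    """Orthogonal init via QR of a Gaussian matrix.
    Rows are orthonormal when out_dim <= in_dim, columns otherwise."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = rng.standard_normal(shape)
    tall = shape[0] >= shape[1]
    q, r = np.linalg.qr(a if tall else a.T)
    q = q * np.sign(np.diag(r))  # sign-correct for a uniform orthogonal draw
    return q if tall else q.T
```

Orthonormal weight rows/columns preserve activation norms at initialization, which tends to stabilize early training.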
Weight Averaging
SWA
parameters: {"tight":true,"every_steps":50,"first_8_blocks_averaged":true}
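Per the parameters above, SWA snapshots are taken every 50 steps and only the first 8 blocks are averaged (later blocks keep their final weights). The core update is an incremental running mean over snapshots; a minimal pure-python sketch with illustrative names:

```python
def swa_update(avg: dict, weights: dict, n_averaged: int) -> dict:
    """Incremental stochastic weight averaging: fold one new snapshot
    into the running mean. n_averaged = snapshots already in `avg`."""
    return {k: avg[k] + (weights[k] - avg[k]) / (n_averaged + 1) for k in avg}

# Sketch of the per-PR policy: snapshot every 50 steps, average only
# parameters belonging to the first 8 blocks.
def should_snapshot(step: int, every_steps: int = 50) -> bool:
    return step % every_steps == 0
```

"Tight" here presumably means the averaging window covers only late-training snapshots, so the average stays close to the final, QAT-adapted weights.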
Compression
zstd
level: 22
Evaluation
stride-based eval
parameters: {"stride":32}
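Stride-based eval with stride 32 means the model slides its context window forward 32 tokens at a time and scores only the newly covered tokens, so every token (past the very start) is evaluated with substantial left context. A pure-python sketch of the window bookkeeping; the window size argument is illustrative, only the stride comes from the PR:

```python
def stride_eval_windows(n_tokens: int, window: int, stride: int):
    """Return (ctx_start, end, first_scored) spans: each token is scored
    exactly once, with at least window - stride tokens of context
    (except near the start of the sequence)."""
    spans = []
    pos = 0
    while pos < n_tokens:
        ctx_start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((ctx_start, end, pos))
        pos = end
    return spans
```

Compared with scoring disjoint full windows, this costs roughly window/stride more forward passes but removes the artificially short context at window boundaries.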
Test-Time Training
two-phase TTT
parameters: {"phase_1":{"method":"norm-only recalibration","epochs":100,"optimizer":"Adam","learning_rate":0.01,"unfrozen_params":"~22K"},"phase_2":{"method":"selective-freeze block adaptation","epochs":25,"optimizer":"SGD","learning_rate":0.005,"unfrozen_params":"~7.6M"}}
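The two TTT phases differ mainly in which parameters are unfrozen: phase 1 trains only normalization parameters (~22K, Adam, lr 0.01), phase 2 unfreezes later blocks while keeping the SWA-averaged early blocks fixed (~7.6M, SGD, lr 0.005). A sketch of that freezing policy with hypothetical parameter names; the PR specifies only the policy, optimizers, epochs, and learning rates:

```python
def ttt_phase_masks(param_names, n_frozen_blocks: int = 8):
    """Return (phase1, phase2) dicts mapping parameter name -> trainable?"""
    # Phase 1: norm-only recalibration.
    phase1 = {p: "norm" in p for p in param_names}
    # Phase 2: selective freeze -- keep the SWA-averaged early blocks fixed.
    frozen = tuple(f"blocks.{i}." for i in range(n_frozen_blocks))
    phase2 = {p: not p.startswith(frozen) for p in param_names}
    return phase1, phase2
```

Recalibrating norms first is cheap and repairs activation statistics (e.g. those disturbed by quantization) before the heavier block adaptation runs.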
Regularization
layerwise LN scale
parameters: {"ln_scale":true}
weight decay
parameters: {"late_qat":0.04}
Other
other
FlashAttention-3 (FA3) Hopper attention kernels used to speed up training, enabling more optimizer steps within the time budget.
parameters: {"step_time_ms":84.65,"steps":6939}
Novel Contributions
- FA3 Hopper attention for faster training
- Two-phase test-time training with norm-only recalibration followed by selective-freeze block adaptation
- Recalibration of activation distributions damaged by int6 quantization
- Selective freezing to preserve SWA-averaged early blocks while adapting later blocks
- Tight SWA combined with late QAT and pruning