PR #415
closed
Record: 11L XSA4 + Tight SWA + FA3 + Two-Phase TTT (val_bpb=1.1216)
by EthanYangTW
val_bpb
1.1216
Architecture
Transformer
Optimizer
Adam
Artifact Size
15,704,756 bytes
Training Techniques
Quantization
QAT
bits: 6
scope: attention
QAT
bits: 5
scope: MLP
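The two QAT entries above (6-bit for attention weights, 5-bit for MLP weights) can be sketched as symmetric per-tensor fake quantization. This is a minimal numpy illustration of the rounding step only; real QAT also uses a straight-through gradient estimator, which is omitted here, and the function name is illustrative, not from the PR.

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor fake quantization: snap weights to a
    b-bit integer grid, then dequantize back to float."""
    qmax = 2 ** (bits - 1) - 1              # 31 for int6, 15 for int5
    scale = np.abs(w).max() / qmax
    if scale == 0:
        scale = 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

Training with fake-quantized weights lets the model adapt its activation statistics to the quantization grid before the final artifact is actually stored in low precision.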
Architecture
XSA
Uses XSA in the last 4 layers.
parameters: {"layers":4}
SmearGate
Adds SmearGate to the MLP blocks.
parameters: null
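The PR gives no parameters for SmearGate. One common reading of "smearing" is mixing each position with the previous position through a learned per-channel sigmoid gate; the sketch below is that assumed form, not a confirmed description of this submission's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear_gate(x: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Hypothetical SmearGate: add a gated copy of the previous position's
    activations to each position. x: (T, D), gate_logits: (D,)."""
    prev = np.concatenate([np.zeros_like(x[:1]), x[:-1]], axis=0)  # shift right by one
    return x + sigmoid(gate_logits) * prev
```

With the gate driven to zero the block is an identity, so it can be learned as a cheap residual refinement.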
BigramHash
Hashed bigram embedding table that gives the model direct bigram coverage.
parameters: {"buckets":12288}
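A hashed bigram feature like the one above typically hashes each (previous token, current token) pair into a fixed number of buckets (12288 per the PR's parameters) and looks up an extra embedding for that bucket. A hedged pure-python sketch, with an arbitrary hash multiplier not taken from the PR:

```python
import numpy as np

BUCKETS = 12288  # from the PR's parameters

def bigram_bucket(prev_tok: int, cur_tok: int, buckets: int = BUCKETS) -> int:
    """Deterministically hash a token bigram into one of `buckets` slots.
    The multiplier is an arbitrary odd constant, not from the PR."""
    return ((prev_tok * 1000003) ^ cur_tok) % buckets

def bigram_features(tokens, table: np.ndarray) -> np.ndarray:
    """Per-position hashed-bigram embeddings (position 0 has no bigram)."""
    out = np.zeros((len(tokens), table.shape[1]))
    for t in range(1, len(tokens)):
        out[t] = table[bigram_bucket(tokens[t - 1], tokens[t])]
    return out
```

Hashing keeps the table small relative to vocab² while still giving every bigram a dedicated (if occasionally colliding) slot.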
Partial RoPE
Applies RoPE partially across the model.
parameters: {"train_length":16,"eval_length":64}
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
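With 8 query heads and 4 KV heads as above, each KV head serves two query heads. Mechanically this is just a repeat of the KV tensors along the head axis before attention; a minimal numpy sketch:

```python
import numpy as np

def expand_kv(kv: np.ndarray, n_heads: int) -> np.ndarray:
    """Grouped-query attention: repeat each KV head so it serves
    n_heads // n_kv_heads query heads.
    kv: (n_kv_heads, T, d) -> (n_heads, T, d)."""
    n_kv = kv.shape[0]
    assert n_heads % n_kv == 0
    return np.repeat(kv, n_heads // n_kv, axis=0)
```

Halving the KV heads halves the KV cache and the K/V projection parameters at a small quality cost.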
MLP3x
Uses MLP blocks with 3× expansion and ReLU² activation.
parameters: null
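The MLP3x block above, read as a 3× expansion with ReLU² (squared ReLU, as in the modded-nanogpt lineage), has a very simple forward pass; this numpy sketch assumes that reading and omits biases:

```python
import numpy as np

def mlp3x(x: np.ndarray, w_in: np.ndarray, w_out: np.ndarray) -> np.ndarray:
    """MLP with 3x hidden expansion and ReLU^2 activation.
    x: (T, d), w_in: (d, 3d), w_out: (3d, d)."""
    h = x @ w_in                    # expand to 3*d
    h = np.maximum(h, 0.0) ** 2     # squared ReLU
    return h @ w_out                # project back to d
```

A 3× hidden width instead of the usual 4× trades a little capacity for a smaller artifact, which matters under a size-scored benchmark.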
Initialization
OrthoInit
Orthogonal initialization.
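Orthogonal initialization is conventionally done via the QR decomposition of a Gaussian matrix, with a sign correction from R's diagonal; a numpy sketch (the PR does not specify its exact routine):

```python
import numpy as np

def ortho_init(shape, rng=None):
    """Orthogonal init via QR of a Gaussian matrix.
    Rows are orthonormal when out_dim <= in_dim, columns otherwise."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = rng.standard_normal(shape)
    tall = shape[0] >= shape[1]
    q, r = np.linalg.qr(a if tall else a.T)
    q = q * np.sign(np.diag(r))  # sign-correct for a uniform orthogonal draw
    return q if tall else q.T
```

Orthonormal weight rows/columns preserve activation norms at initialization, which tends to stabilize early training.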
Weight Averaging
SWA
parameters: {"tight":true,"every_steps":50,"first_8_blocks_averaged":true}
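Per the parameters above, SWA snapshots are taken every 50 steps and only the first 8 blocks are averaged (later blocks keep their final weights). The core update is an incremental running mean over snapshots; a minimal pure-python sketch with illustrative names:

```python
def swa_update(avg: dict, weights: dict, n_averaged: int) -> dict:
    """Incremental stochastic weight averaging: fold one new snapshot
    into the running mean. n_averaged = snapshots already in `avg`."""
    return {k: avg[k] + (weights[k] - avg[k]) / (n_averaged + 1) for k in avg}

# Sketch of the per-PR policy: snapshot every 50 steps, average only
# parameters belonging to the first 8 blocks.
def should_snapshot(step: int, every_steps: int = 50) -> bool:
    return step % every_steps == 0
```

"Tight" here presumably means the averaging window covers only late-training snapshots, so the average stays close to the final, QAT-adapted weights.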
Compression
zstd
level: 22
Evaluation
stride-based eval
parameters: {"stride":32}
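Stride-based eval with stride 32 means the model slides its context window forward 32 tokens at a time and scores only the newly covered tokens, so every token (past the very start) is evaluated with substantial left context. A pure-python sketch of the window bookkeeping; the window size argument is illustrative, only the stride comes from the PR:

```python
def stride_eval_windows(n_tokens: int, window: int, stride: int):
    """Return (ctx_start, end, first_scored) spans: each token is scored
    exactly once, with at least window - stride tokens of context
    (except near the start of the sequence)."""
    spans = []
    pos = 0
    while pos < n_tokens:
        ctx_start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((ctx_start, end, pos))
        pos = end
    return spans
```

Compared with scoring disjoint full windows, this costs roughly window/stride more forward passes but removes the artificially short context at window boundaries.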
Test-Time Training
two-phase TTT
parameters: {"phase_1":{"method":"norm-only recalibration","epochs":100,"optimizer":"Adam","learning_rate":0.01,"unfrozen_params":"~22K"},"phase_2":{"method":"selective-freeze block adaptation","epochs":25,"optimizer":"SGD","learning_rate":0.005,"unfrozen_params":"~7.6M"}}
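The two TTT phases differ mainly in which parameters are unfrozen: phase 1 trains only normalization parameters (~22K, Adam, lr 0.01), phase 2 unfreezes later blocks while keeping the SWA-averaged early blocks fixed (~7.6M, SGD, lr 0.005). A sketch of that freezing policy with hypothetical parameter names; the PR specifies only the policy, optimizers, epochs, and learning rates:

```python
def ttt_phase_masks(param_names, n_frozen_blocks: int = 8):
    """Return (phase1, phase2) dicts mapping parameter name -> trainable?"""
    # Phase 1: norm-only recalibration.
    phase1 = {p: "norm" in p for p in param_names}
    # Phase 2: selective freeze -- keep the SWA-averaged early blocks fixed.
    frozen = tuple(f"blocks.{i}." for i in range(n_frozen_blocks))
    phase2 = {p: not p.startswith(frozen) for p in param_names}
    return phase1, phase2
```

Recalibrating norms first is cheap and repairs activation statistics (e.g. those disturbed by quantization) before the heavier block adaptation runs.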
Regularization
layerwise LN scale
parameters: {"ln_scale":true}
weight decay
parameters: {"late_qat":0.04}
Other
other
FlashAttention-3 (FA3) Hopper attention kernels used to speed up training, enabling more optimizer steps within the time budget.
parameters: {"step_time_ms":84.65,"steps":6939}
Novel Contributions
- FA3 Hopper attention for faster training
- Two-phase test-time training with norm-only recalibration followed by selective-freeze block adaptation
- Recalibration of activation distributions damaged by int6 quantization
- Selective freezing to preserve SWA-averaged early blocks while adapting later blocks
- Tight SWA combined with late QAT and pruning