PR #410

closed

Record: 11L XSA4 + Tight SWA + FA3 + Two-Phase TTT (val_bpb=1.1216)

by EthanYangTW on GitHub
val_bpb: 1.1216
Architecture: Transformer
Optimizer: Adam
Artifact Size: 15,762,005 bytes

Training Techniques

Quantization
QAT
bits: 6
scope: attention; int5 for MLP layers
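QAT here means the forward pass sees quantized weights while gradients flow in float. A minimal sketch of symmetric fake quantization at the int6/int5 widths listed above (per-tensor scaling is an assumption; the PR does not state its granularity):

```python
def fake_quant(x, bits=6):
    """QAT-style fake quantization (sketch): round values to the nearest
    representable signed-int level, then dequantize, so the model trains
    against the quantization error. bits=6 matches the attention scope
    here; bits=5 would match the MLP layers."""
    qmax = 2 ** (bits - 1) - 1                 # 31 levels each side for int6
    # Per-tensor symmetric scale; `or 1.0` guards the all-zero tensor.
    scale = max(abs(v) for v in x) / qmax or 1.0
    return [round(v / scale) * scale for v in x]
```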
Architecture
XSA
Uses XSA in the last 4 layers of an 11-layer Transformer.
parameters: {"layers":4}
SmearGate
MLP gating mechanism used in 3x MLP blocks.
parameters: null
MLP3x
Three-layer MLP blocks.
parameters: {"layers":3}
Partial RoPE
Applies RoPE partially across dimensions.
parameters: {"dimensions":"16/64"}
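With a 16/64 split, only the first quarter of each head dimension is rotated and the rest passes through untouched. A sketch under the usual adjacent-pair RoPE convention (the pairing scheme and frequency base are assumptions, not stated in the PR):

```python
import math

def partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Partial RoPE (sketch): rotate only the first `rope_dims` entries
    of a head vector by position-dependent angles; dims beyond that are
    left unrotated, matching the 16/64 split reported here."""
    out = list(x)
    for i in range(0, rope_dims, 2):
        theta = pos * base ** (-i / rope_dims)   # per-pair frequency
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s                   # 2-D rotation of the pair
        out[i + 1] = a * s + b * c
    return out
```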
BigramHash
Bigram hashing feature for token pair coverage.
parameters: {"buckets":2048}
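The bigram feature can be as simple as hashing each (previous, current) token pair into a fixed embedding table of 2048 buckets. A minimal sketch (the mixing constants are assumptions, not the PR's):

```python
def bigram_bucket(prev_tok, tok, buckets=2048):
    """Hash a (prev, current) token pair into one of `buckets` slots
    (sketch). The bucket indexes a learned embedding that supplements
    the unigram token embedding with pair coverage."""
    h = (prev_tok * 1000003 + tok) * 2654435761   # multiplicative mix
    return (h ^ (h >> 16)) % buckets               # fold high bits, bucket
```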
KV head count
Uses grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
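With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads, halving KV-cache size. A sketch of the standard consecutive-group mapping (the exact grouping order used here is an assumption):

```python
def kv_head_for(q_head, heads=8, kv_heads=4):
    """Grouped-query attention (sketch): consecutive query heads share
    one KV head, so 8 query heads attend over only 4 K/V projections."""
    return q_head // (heads // kv_heads)
```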
Weight Averaging
SWA
parameters: {"frequency":"tight"}
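SWA maintains a running mean of checkpoints; reading `frequency: "tight"` as averaging at short intervals (an interpretation, not spelled out in the PR), one accumulation step looks like:

```python
def swa_update(avg, weights, n_models):
    """One SWA step (sketch): fold the current weights into the running
    mean of the `n_models` checkpoints averaged so far. 'Tight' SWA
    just calls this at short step intervals."""
    return [(a * n_models + w) / (n_models + 1) for a, w in zip(avg, weights)]
```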
Compression
zstd
level: 22
Evaluation
stride-based eval
parameters: {"stride":32}
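Stride-based eval slides the context window forward 32 tokens at a time and scores only the newly exposed tokens, so most tokens are evaluated with near-full left context. A sketch of the window bookkeeping (the window size of 256 is an assumption; stride=32 is from this PR):

```python
def eval_windows(n_tokens, window=256, stride=32):
    """Return (context_start, score_start, score_end) spans (sketch).
    Each token is scored exactly once, with up to `window - stride`
    tokens of preceding context."""
    spans, start = [], 0
    while start < n_tokens:
        lo = max(0, start + stride - window)        # context start
        spans.append((lo, start, min(start + stride, n_tokens)))
        start += stride
    return spans
```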
Test-Time Training
two-phase TTT
parameters: {"phase_1":{"method":"norm-only recalibration","epochs":100,"optimizer":"Adam","learning_rate":0.01,"trainable_params":"LayerNorm weights, scales, final_norm"},"phase_2":{"method":"selective-freeze block adaptation","epochs":15,"optimizer":"SGD","learning_rate":0.003,"trainable_params":"last 2 transformer blocks, norms, scales, lm_head"}}
Initialization
OrthoInit
Orthogonal initialization used for model weights.
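Orthogonal init draws a random Gaussian matrix and orthonormalizes it; libraries typically do this via QR with sign correction. A dependency-free Gram-Schmidt sketch of the same idea:

```python
import random

def ortho_init(n, seed=0):
    """Orthogonal initialization (sketch): Gram-Schmidt on an n-by-n
    Gaussian matrix, yielding orthonormal rows. Real implementations
    use QR with sign correction, but the resulting property is the
    same: W @ W.T == I."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
    basis = []
    for r in rows:
        for b in basis:                          # subtract projections
            dot = sum(x * y for x, y in zip(r, b))
            r = [x - dot * y for x, y in zip(r, b)]
        norm = sum(x * x for x in r) ** 0.5      # then normalize
        basis.append([x / norm for x in r])
    return basis
```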
Regularization
layerwise LN scale
parameters: {"ln_scale":true}

Novel Contributions

  • Two-phase test-time training combining norm-only recalibration and selective-freeze block adaptation
  • Recalibration of activation distributions damaged by int6 quantization
  • Selective adaptation of the last two transformer blocks while preserving SWA-averaged early layers
  • Tight SWA combined with late QAT and pruning
  • Increased BigramHash bucket count and reduced evaluation stride