PR #442 (closed)

Record: 11L EMA + AdamW TTT 10ep (mean val_bpb=1.1027)

val_bpb: 1.1027
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.75 MB

Training Techniques

Architecture
  • SmearGate: adds SmearGate to the model. parameters: null
  • BigramHash: uses BigramHash for additional token interaction features. parameters: {"vocab_size":2048,"dim":128}
  • RoPE: uses partial rotary positional embeddings. parameters: {"dimensions":16,"base_dimensions":64}
  • MLP3x: uses a 3x-width MLP block. parameters: {"hidden":1536}
  • KV head count: uses grouped-query attention with fewer KV heads than attention heads. parameters: {"heads":8,"kv_heads":4}
  • weight tying: uses tied embeddings. parameters: null
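
The partial-RoPE entry above rotates only 16 of the 64 head dimensions. A minimal NumPy sketch of that idea follows; the function name, the half/half pairing convention, and the base frequency are assumptions, not the PR's implementation.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` of each
    head dimension; the remaining dims pass through unchanged (partial RoPE).

    x: (seq_len, head_dim) activations for one attention head.
    """
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # (half,) rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]        # paired halves to rotate
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

With `rot_dims=16` and `head_dim=64` (matching the recorded parameters), three quarters of each head stays position-agnostic.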
Initialization
  • OrthoInit: orthogonal initialization.
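
Orthogonal initialization is conventionally done via QR decomposition of a Gaussian matrix. A minimal sketch, assuming the standard recipe; the function name, `gain`, and `seed` are illustrative, not the PR's code:

```python
import numpy as np

def ortho_init(rows, cols, gain=1.0, seed=0):
    """Orthogonal init (rows >= cols): QR-decompose a Gaussian matrix,
    keep Q, and sign-correct so columns are uniformly distributed."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)          # reduced QR: q is (rows, cols)
    q *= np.sign(np.diag(r))        # column-wise sign fix
    return gain * q
```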
Weight Averaging
  • EMA: parameters: {"decay":0.997}
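
An EMA of model weights with decay 0.997 amounts to the update below after each training step. This is a minimal dict-of-scalars sketch of the standard rule, not the PR's implementation:

```python
def ema_update(ema_weights, model_weights, decay=0.997):
    """One EMA step per parameter: ema <- decay * ema + (1 - decay) * current."""
    return {k: decay * ema_weights[k] + (1.0 - decay) * model_weights[k]
            for k in ema_weights}
```

At decay 0.997, each step mixes in 0.3% of the current weights, giving an effective averaging horizon of roughly 1/(1-0.997) ≈ 333 steps.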
Optimizer
  • AdamW: weight_decay: 0, momentum: null, other_params: {"learning_rate":0.0005}
Test-Time Training
  • full TTT: parameters: {"learning_rate":0.0005,"epochs":10,"optimizer":"AdamW"}
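
Since the PR's main change is swapping SGD for AdamW in the TTT loop, here is a single-parameter AdamW step for reference. It is a NumPy sketch of the standard decoupled-weight-decay update with lr=5e-4 and weight_decay=0 from the record; the beta/eps defaults are conventional assumptions:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=5e-4, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW step. m/v are the first/second moment buffers, t is the
    1-based step count; weight decay is decoupled (0 here, per the record)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)           # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

Unlike SGD, the per-parameter step size is normalized by the gradient's running second moment, which plausibly matters over only 10 TTT epochs.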
Evaluation
  • sliding window eval: parameters: {"stride":64}
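
Sliding-window evaluation advances the context by the stride and scores only the newest tokens of each window, so every token is scored with long left context. A sketch of the span arithmetic, assuming a window length of 256 (only stride=64 is from the record):

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Yield (start, end, score_from) spans: each window scores only its
    last `stride` tokens; earlier tokens in the window serve as context."""
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)   # left edge of the context window
        end = min(pos + stride, n_tokens)       # right edge (clipped at sequence end)
        score_from = pos                        # tokens [score_from, end) are scored
        yield start, end, score_from
        pos += stride
```

Together the scored spans partition the sequence exactly once, while each token still sees up to `window - stride` tokens of context.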
Quantization
  • int6: bits: 6, scope: mixed
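
A minimal symmetric int6 quantizer, to make the "bits: 6" entry concrete. The record's scope is "mixed", so which tensors get int6 versus other precisions is unspecified; this sketch shows only per-tensor symmetric rounding and is not the PR's scheme:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization: 6 bits give the range [-31, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)  # fits in 6 bits
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```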
Compression
  • zstd: level: 22
Regularization
  • layerwise LN scale: parameters: null

Novel Contributions

  • Replaced SGD with AdamW for test-time training
  • Reduced TTT epochs from 20 to 10 while improving validation BPB
  • Achieved a new record mean val_bpb of 1.1027
  • Reduced TTT runtime from about 260s to about 157s
  • Used the same 11-layer EMA-based setup as PR #398 with only a small optimizer change