PR #390

closed

Record: Sponge Bath — TTT 8ep eval-only improvement (val_bpb: 1.1295)

by newjordan
val_bpb: 1.1295
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.74 MB

Training Techniques

Quantization
int6 QAT
bits: 6
scope: all
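The PR ships no code, but int6 QAT with `scope: all` is typically implemented as fake quantization in the forward pass, with a straight-through estimator for gradients. A minimal sketch of the quantizer, assuming symmetric per-tensor scaling:

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Fake quantization for QAT: snap weights to a 6-bit grid in the
    forward pass while keeping them as floats (a straight-through
    estimator would pass gradients through unchanged in backward)."""
    qmax = 2 ** (bits - 1) - 1                    # 31 for int6
    scale = max(np.max(np.abs(w)) / qmax, 1e-12)  # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                              # dequantized weights
```

With `scope: all`, every weight matrix would pass through this quantizer during training, so the final artifact can be stored at 6 bits per weight.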
Architecture
SmearGate
Uses SmearGate in the MLP stack as part of the base architecture.
parameters: null
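SmearGate is not defined anywhere in the PR; one plausible reading, purely an assumption, is a learned gate that "smears" the previous position's activation into the current one before the MLP:

```python
import numpy as np

def smear_gate(x, gate_logit):
    """Hypothetical SmearGate (assumption; the PR does not define it):
    blend each position with the previous position's activation via a
    learned scalar gate."""
    g = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid gate in (0, 1)
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                          # no previous token at position 0
    return x + g * prev
```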
BigramHash
Uses BigramHash with 2048 buckets as part of the base architecture.
parameters: {"buckets":2048}
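A bigram-hash feature usually hashes each (previous token, current token) pair into a fixed number of buckets that index an extra embedding table; the 2048 bucket count is from the PR, while the hash function and lookup below are assumptions:

```python
import numpy as np

N_BUCKETS = 2048  # from the PR's parameters

def bigram_bucket(prev_tok, tok, n_buckets=N_BUCKETS):
    """Hash a (previous, current) token pair into a bucket index.
    The multiplier is an arbitrary prime, not the record's actual hash."""
    return (prev_tok * 1000003 + tok) % n_buckets

def bigram_features(tokens, table):
    """Look up one extra embedding per position from its bigram bucket."""
    out = np.zeros((len(tokens), table.shape[1]))
    for i in range(1, len(tokens)):
        out[i] = table[bigram_bucket(tokens[i - 1], tokens[i])]
    return out
```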
MLP3x
3x MLP expansion.
parameters: {"expansion":3}
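The 3x expansion means the MLP hidden width is 3 × d_model rather than the conventional 4x. A sketch, with ReLU standing in for whatever activation the record actually uses:

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """MLP block with 3x expansion: hidden width is 3 * d_model.
    ReLU is a placeholder activation (assumption)."""
    h = np.maximum(x @ w_in, 0.0)  # (T, 3 * d_model)
    return h @ w_out               # back to (T, d_model)
```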
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
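With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads, halving the KV cache. A minimal dense sketch (no causal mask, for brevity):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: q has n_heads heads, k/v have n_kv_heads;
    each KV head is shared by n_heads // n_kv_heads query heads."""
    group = n_heads // n_kv_heads
    head_dim = q.shape[2]
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group  # which KV head serves this query head
        scores = q[h] @ k[kv].T / np.sqrt(head_dim)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)  # softmax over keys
        out[h] = w @ v[kv]
    return out
```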
tied embeddings
Input and output embeddings are tied.
parameters: null
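Tying embeddings reuses the input embedding matrix (transposed) as the output head, which matters for a size-constrained artifact since it removes an entire vocab_size × d_model matrix. A sketch:

```python
import numpy as np

class TiedLM:
    """Tied embeddings: the output projection reuses the input embedding
    matrix transposed, so the two layers share one parameter matrix."""
    def __init__(self, vocab_size, d_model, rng):
        self.emb = 0.02 * rng.standard_normal((vocab_size, d_model))

    def embed(self, tokens):
        return self.emb[tokens]     # (T, d_model) input vectors

    def logits(self, hidden):
        return hidden @ self.emb.T  # (T, vocab_size), same weights
```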
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
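Muon applies momentum SGD to each weight matrix and then approximately orthogonalizes the update with a quintic Newton-Schulz iteration before applying it. A sketch using the record's hyperparameters; the Nesterov-style blend and decoupled weight decay follow the common Muon reference implementation and are assumptions about this particular run:

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Quintic Newton-Schulz iteration: pushes the singular values of G
    toward 1, approximately orthogonalizing the update matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315  # tuned quintic coefficients
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(param, grad, buf, lr=0.025, momentum=0.99, weight_decay=0.04):
    """One Muon step with this record's hyperparameters: momentum on the
    gradient, orthogonalize the update, apply with decoupled decay."""
    buf = momentum * buf + grad
    update = newton_schulz5(grad + momentum * buf)  # Nesterov-style blend
    param = param * (1.0 - lr * weight_decay) - lr * update
    return param, buf
```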
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":32}
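Strided sliding-window evaluation scores only the last `stride` tokens of each window, so every scored token sees nearly a full window of context; shrinking the stride from 64 to 32 buys more context per token at more compute. A sketch where stride=32 is the PR's value and the window length is an assumption:

```python
def sliding_window_nll(nll_fn, tokens, window=512, stride=32):
    """Strided evaluation: slide a fixed window over the sequence and
    score only the last `stride` tokens of each step. `nll_fn(chunk, n)`
    stands in for a forward pass returning the summed NLL (nats) of the
    last n tokens of chunk. Dividing total NLL in bits by the byte count
    of the text would give bpb; mean NLL per token is shown here."""
    total, scored = 0.0, 0
    for pos in range(0, len(tokens), stride):
        n_new = min(stride, len(tokens) - pos)  # tokens scored this step
        start = max(0, pos + n_new - window)    # left edge of the context
        total += nll_fn(tokens[start:pos + n_new], n_new)
        scored += n_new
    return total / scored
```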
Test-Time Training
full TTT
parameters: {"epochs":8,"learning_rate":0.002,"momentum":0.9}
Initialization
OrthoInit
Orthogonal initialization.
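Orthogonal initialization is conventionally done by QR-decomposing a Gaussian matrix and keeping the orthonormal factor. A sketch of the standard recipe (the record's exact gain is unknown):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Orthogonal initialization: QR-decompose a Gaussian matrix and keep
    Q, with signs corrected so the result is uniformly distributed over
    orthogonal matrices."""
    rng = rng or np.random.default_rng()
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))  # sign correction for uniformity
    return gain * (q if rows >= cols else q.T)
```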

Novel Contributions

  • Increased test-time training from 3 to 8 epochs
  • Reduced evaluation stride from 64 to 32
  • Pure eval-time improvement with no architecture or training changes
  • Achieved a new record validation bpb of 1.1295