PR #398

open

Non-record: 11L EMA + TTT(20ep,freeze=0) + 15-run ablation study — val_bpb=1.1213 (3-seed)

by felipe-parodi

val_bpb

1.1213

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.53 MB

Training Techniques

Architecture

SmearGate

Adds SmearGate to the model architecture.

parameters: null

BigramHash

Uses a BigramHash embedding/component with vocabulary size 2048 and dimension 128.

parameters: {"vocab_size":2048,"dim":128}
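
To illustrate how a hashed bigram lookup with 2048 buckets might index its table, here is a minimal sketch. The hash function, the multiplier constant, and the handling of the first position are assumptions made for illustration; the summary does not say which hash the PR uses.

```python
VOCAB_SIZE = 2048  # number of hash buckets, from the parameters above
DIM = 128          # width of each bucket's embedding vector, from the parameters above


def bigram_bucket(prev_token: int, cur_token: int, buckets: int = VOCAB_SIZE) -> int:
    """Map a (previous, current) token pair to a bucket index.

    Hypothetical multiplicative hash; the PR's real hash may differ.
    """
    return (prev_token * 1_000_003 + cur_token) % buckets


def bigram_ids(tokens):
    """Return one bucket id per position.

    Position 0 has no predecessor, so token id 0 is used as a stand-in
    (an assumption, not something stated in the summary).
    """
    prev = [0] + list(tokens[:-1])
    return [bigram_bucket(p, c) for p, c in zip(prev, tokens)]
```

Each bucket id would then select one row of a 2048 x 128 embedding table. Distinct bigrams can collide in the same bucket, which is the price paid for keeping the table small.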

Partial RoPE

Applies rotary position embeddings to only part of the dimensions.

parameters: {"dimensions":16}
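
As a rough sketch of what rotating only 16 dimensions means, the function below rotates the first 16 entries of a single head vector and passes the rest through untouched. The pairing convention (entry i with entry i + 8), the frequency base of 10000, and the 64-wide head used in the usage note are assumptions, not details taken from the PR.

```python
import math


def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` entries only.

    Entries at index >= rot_dims are returned unchanged.
    Pairing and base are illustrative assumptions.
    """
    half = rot_dims // 2
    out = list(vec)
    for i in range(half):
        # Each pair rotates at its own frequency, scaled by the token position.
        theta = pos * base ** (-2.0 * i / rot_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out
```

For a hypothetical 64-wide head, `partial_rope(v, pos)` changes entries 0 to 15 and leaves entries 16 to 63 position-independent.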

MLP3x

Uses a 3x-width MLP block.

parameters: {"hidden":1536}

KV head count

Uses grouped-query attention with 8 attention heads and 4 KV heads.

parameters: {"heads":8,"kv_heads":4}
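
With 8 query heads and 4 KV heads, each KV head is shared by two query heads. The sketch below shows the usual contiguous grouping; whether the PR groups heads contiguously is an assumption.

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Return the KV head index that a given query head reads from.

    Assumes contiguous groups of heads // kv_heads query heads per KV head.
    """
    group_size = heads // kv_heads
    return query_head // group_size


mapping = [kv_head_for(h) for h in range(8)]
# mapping == [0, 0, 1, 1, 2, 2, 3, 3]
```

Halving the number of KV heads halves the size of the key and value projections relative to full multi-head attention.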

weight tying

Uses tied embeddings.

parameters: null

Initialization

OrthoInit

Orthogonal initialization.

Regularization

layerwise LN scale

parameters: {"scale":"1/sqrt(layer+1)"}
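
The scale formula can be evaluated directly. The sketch below assumes layers are indexed from 0 and uses the 11 layers named in the PR title; where exactly the factor is applied inside each block is not stated in the summary.

```python
import math


def ln_scale(layer):
    """Scale factor 1/sqrt(layer + 1) for a zero-based layer index (assumed)."""
    return 1.0 / math.sqrt(layer + 1)


NUM_LAYERS = 11  # "11L" in the PR title
scales = [ln_scale(i) for i in range(NUM_LAYERS)]
```

Under this indexing the first layer is unscaled (factor 1.0) and deeper layers are progressively damped.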

Weight Averaging

EMA

parameters: {"decay":0.997}
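
A minimal sketch of an exponential moving average of weights with decay 0.997, using plain lists of floats in place of model tensors. How often the PR applies the update is not stated; one update per optimizer step is assumed here.

```python
def ema_update(ema, params, decay=0.997):
    """Blend the current parameters into the running average.

    ema_new = decay * ema_old + (1 - decay) * params
    """
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


# Toy usage: the average drifts slowly toward the live weights.
ema = [0.0, 0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0, -2.0])
```

With decay 0.997, each step moves the average 0.3% of the way toward the live weights, so it effectively summarizes the last few hundred steps.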

Quantization

mixed int6

bits: 6

scope: all
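
To make the 6-bit setting concrete, here is a sketch of symmetric per-row quantization to the integer range -31 to 31. The per-row scale and the symmetric range are assumptions; the summary says only "mixed int6" and does not describe the grouping or which tensors get which treatment.

```python
def quantize_int6(row):
    """Quantize a list of floats to integers in [-31, 31] with one shared scale."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    q = [max(-31, min(31, round(x / scale))) for x in row]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]


codes, scale = quantize_int6([0.5, -1.0, 0.25, 0.0])
restored = dequantize(codes, scale)
```

Each weight then costs 6 bits before entropy coding, and the round-trip error per weight is at most half of `scale`.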

Compression

zstd

level: 22

Evaluation

sliding window eval

parameters: {"stride":64}
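
The following sketch shows one way a stride-64 sliding window can cover a token stream so that every token is scored exactly once, with most tokens seeing nearly a full window of context. The window length of 2048 is an assumption (the listed eval_length is null, so the training length is borrowed), and the function name is hypothetical.

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Return (start, end, score_start) triples.

    The model sees tokens [start, end) and only the predictions for
    tokens [score_start, end) count toward the metric.
    """
    spans = []
    scored_to = 0
    end = min(window, n_tokens)
    while True:
        start = max(0, end - window)
        spans.append((start, end, scored_to))
        scored_to = end
        if end >= n_tokens:
            break
        end = min(end + stride, n_tokens)
    return spans
```

After the first window, each step scores only the newest 64 tokens, so a smaller stride gives more context per scored token at the cost of more forward passes.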

Test-Time Training

full TTT

parameters: {"epochs":20,"learning_rate":0.008,"momentum":0.9,"freeze_blocks":0}

Optimizer

Muon

weight_decay: 0.04

momentum: 0.99

other_params: {"matrix_lr":0.025,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
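
The momentum warmup parameters suggest a ramp from 0.92 to the final 0.99 over the first 1500 steps. The sketch below assumes the ramp is linear, which the summary does not confirm.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum at a given step, linearly interpolated during warmup (assumed)."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end
```

Halfway through warmup (step 750) this gives a momentum of about 0.955, and from step 1500 onward it stays at 0.99.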

AdamW

weight_decay: 0.04

momentum: null

other_params: {"scalar_lr":0.025,"tied_embed_lr":0.035}

LR Schedule

warmdown

parameters: {"warmdown_steps":3000}
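
A warmdown schedule holds the learning rate constant and then decays it over the final steps. The sketch below assumes a linear decay to zero over the last 3000 steps; the total step count is not given in the summary, so it is a parameter here, and the decay shape is an assumption.

```python
def lr_scale(step, total_steps, warmdown_steps=3000):
    """Multiplier applied to the base learning rate at a given step."""
    warmdown_start = max(total_steps - warmdown_steps, 0)
    if step < warmdown_start:
        return 1.0
    # Linear ramp from 1.0 at warmdown_start down to 0.0 at total_steps.
    return max(total_steps - step, 0) / warmdown_steps
```

For a hypothetical 10000-step run, the multiplier stays at 1.0 until step 7000, passes 0.5 at step 8500, and reaches 0.0 at step 10000.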

Sequence Length

sequence_length

train_length: 2048

eval_length: null

Novel Contributions

EMA(0.997) combined with aggressive 20-epoch test-time training
Unfreezing all blocks during TTT (freeze_blocks=0), which was critical for best performance
15-run ablation study identifying negative results such as late QAT, memory tokens, warmdown=20000, and PPM-C blending
Removal of XSA to save step time and gain additional training steps within the wall-clock budget
Mixed int6 quantization with zstd-22 compression under the 16MB artifact constraint