PR #294
[Non-record] Meta-Learned TTT + Error-Guided Adaptation Analysis (val_bpb=1.1645)
Status: closed
by sseanliu
val_bpb
1.1645
Architecture
Transformer
Optimizer
Muon
Artifact Size
12.7MB
Training Techniques
Quantization
int6
bits: 6
scope: model weights
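A minimal sketch of symmetric int6 weight quantization consistent with the bits/scope above; per-tensor scaling and the round-to-nearest scheme are assumptions, since the PR does not specify them:

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor int6 quantization: map weights into [-31, 31]."""
    qmax = 2 ** (6 - 1) - 1                                  # 31 for 6 bits
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.0], dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)     # close to w, within one quantization step
```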
Architecture
SmearGate
Uses SmearGate in the base recipe.
parameters: null
BigramHash
Includes BigramHash as part of the model recipe.
parameters: null
MLP3x
Uses 3x MLP blocks in the base recipe.
parameters: null
weight tying
Uses tied embeddings / tied weights in the model.
parameters: null
depth recurrence
Recycles 3 unique blocks multiple times to create 12 effective layers.
parameters: {"unique_blocks":3,"effective_layers":12,"repetitions_per_block":4}
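The depth recurrence above (3 unique blocks, 4 repetitions each, 12 effective layers) can be sketched as follows; the recycling order (looping the 3-block stack 4 times rather than repeating each block in place) and the toy residual block are assumptions:

```python
import numpy as np

def make_block(rng, dim=8):
    """Toy residual block: x + tanh(x @ W), standing in for a transformer block."""
    W = rng.standard_normal((dim, dim)) * 0.02
    return lambda x: x + np.tanh(x @ W)

rng = np.random.default_rng(0)
unique_blocks = [make_block(rng) for _ in range(3)]   # only 3 sets of weights
repetitions_per_block = 4                             # 3 x 4 = 12 effective layers

def forward(x):
    applied = 0
    for _ in range(repetitions_per_block):
        for block in unique_blocks:   # cycle through the shared blocks
            x = block(x)
            applied += 1
    return x, applied

x = np.zeros((1, 8))
y, effective_layers = forward(x)
```

Parameter count scales with the 3 unique blocks, while compute scales with the 12 effective layers.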
tied embeddings
Uses FP16 tied embeddings.
parameters: {"vocab_size":1024,"dimension":768}
KV head count
Uses 6 KV heads with 12 attention heads.
parameters: {"heads":12,"kv_heads":6}
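The 12-head / 6-KV-head layout is grouped-query attention: each KV head is shared by 2 query heads. A minimal numpy sketch (head dimension and sequence length are illustrative, not from the PR):

```python
import numpy as np

heads, kv_heads, head_dim, seq = 12, 6, 16, 4
group = heads // kv_heads                  # 2 query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((heads, seq, head_dim))
k = rng.standard_normal((kv_heads, seq, head_dim))
v = rng.standard_normal((kv_heads, seq, head_dim))

# Expand K/V so each group of query heads reuses the same KV head.
k_full = np.repeat(k, group, axis=0)       # (12, seq, head_dim)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
out = weights @ v_full                     # (12, seq, head_dim)
```

Halving the KV heads halves the KV-cache size at inference while keeping 12-way query diversity.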
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
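Muon orthogonalizes the momentum buffer with a Newton-Schulz iteration before applying it. A rough sketch assuming the standard quintic coefficients; the learning rate and momentum values are placeholders (the PR leaves momentum unspecified), and only the 0.04 weight decay comes from the recipe:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315      # standard Muon coefficients
    X = G / (np.linalg.norm(G) + 1e-7)     # normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95, weight_decay=0.04):
    """One Muon update: momentum accumulation, orthogonalized direction,
    decoupled weight decay (0.04 as in this recipe)."""
    buf = momentum * buf + grad
    W = W * (1 - lr * weight_decay) - lr * newton_schulz(buf)
    return W, buf

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16)) * 0.02
grad = rng.standard_normal((16, 16))
W, buf = muon_step(W, grad, np.zeros((16, 16)))
```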
Weight Averaging
SWA
parameters: null
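SWA here is presumably a running equal-weight average of late-training checkpoints; a minimal incremental-mean sketch:

```python
import numpy as np

class SWA:
    """Running equal-weight average of checkpoint weights (stochastic weight averaging)."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, weights: np.ndarray):
        self.n += 1
        if self.avg is None:
            self.avg = weights.astype(np.float64).copy()
        else:
            self.avg += (weights - self.avg) / self.n   # incremental mean
        return self.avg

swa = SWA()
for w in [np.array([1.0, 3.0]), np.array([3.0, 5.0]), np.array([5.0, 7.0])]:
    swa.update(w)
# swa.avg is now the element-wise mean of the three checkpoints: [3.0, 5.0]
```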
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
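Sliding-window evaluation with stride 64 typically scores only the final 64 tokens of each window, so every token is scored exactly once with maximal left context. A sketch assuming a hypothetical window length of 256 (the PR only gives the stride):

```python
def sliding_windows(n_tokens: int, window: int = 256, stride: int = 64):
    """Yield (start, end, score_from) spans: each window scores only the
    tokens in [score_from, end), i.e. at most its final `stride` tokens."""
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)   # left edge of the context
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))         # score tokens [pos, end)
        pos = end
    return spans

spans = sliding_windows(200, window=256, stride=64)
scored = sum(end - score_from for _, end, score_from in spans)
# every one of the 200 tokens is scored exactly once
```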
Test-Time Training
LoRA TTT
parameters: {"rank":4}
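Rank-4 LoRA test-time training adds a trainable low-rank delta to frozen base weights. A sketch in which the scaling factor alpha=8 and the zero initialization of B are assumptions (only rank=4 comes from the PR):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=8.0, rank=4):
    """y = x @ W + (alpha/rank) * (x @ A) @ B: frozen W plus low-rank update."""
    return x @ W + (alpha / rank) * (x @ A) @ B

d_in, d_out, rank = 32, 32, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out)) * 0.02   # frozen base weight
A = rng.standard_normal((d_in, rank)) * 0.01    # trainable at test time
B = np.zeros((rank, d_out))                     # zero-init: delta starts at 0

x = rng.standard_normal((5, d_in))
y = lora_forward(x, W, A, B)
# With B = 0 the adapter contributes nothing, so output equals the base model;
# test-time gradient steps on A and B then specialize the layer to the input.
```

Only `2 * d * rank` parameters per layer are adapted, which keeps TTT cheap.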
Initialization
spectral init
Uses overtone spectral initialization for FP16 tied embeddings.
resid mix
Uses phase-transition residual mix initialization.
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Reptile meta-learning applied during the last 20% of training time on the last 3 blocks' MLPs.
parameters: {"meta_steps":1576,"scope":"last 3 blocks' MLPs","training_fraction":0.2}
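The Reptile outer loop nudges the meta-parameters toward each task's inner-loop-adapted parameters. A toy sketch on quadratic tasks; the inner step count, learning rates, and epsilon are placeholders, not the PR's values (which applied 1576 meta-steps to the last 3 blocks' MLPs):

```python
import numpy as np

def inner_sgd(theta, task_target, steps=5, lr=0.1):
    """Inner loop: SGD on a toy quadratic loss 0.5 * ||theta - target||^2."""
    w = theta.copy()
    for _ in range(steps):
        w -= lr * (w - task_target)
    return w

def reptile(theta, tasks, meta_steps=100, eps=0.1):
    """Reptile: repeatedly adapt to a sampled task, then move the
    meta-parameters a fraction eps toward the adapted parameters."""
    rng = np.random.default_rng(0)
    for _ in range(meta_steps):
        target = tasks[rng.integers(len(tasks))]
        adapted = inner_sgd(theta, target)
        theta = theta + eps * (adapted - theta)   # outer Reptile step
    return theta

tasks = [np.array([1.0, 1.0]), np.array([3.0, 3.0])]
theta = reptile(np.zeros(2), tasks)
# theta drifts toward an initialization that adapts quickly to either task
```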
other
Error-guided test-time adaptation that concentrates adaptation budget on the highest-loss tokens/windows.
parameters: {"top_fraction":0.02}
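Concentrating the adaptation budget on the top 2% of tokens reduces to a top-k selection over per-token losses; a minimal sketch of the selection step:

```python
import numpy as np

def select_hard_tokens(token_losses: np.ndarray, top_fraction: float = 0.02):
    """Return indices of the highest-loss tokens, the targets of the
    error-guided adaptation budget."""
    k = max(1, int(len(token_losses) * top_fraction))
    return np.argsort(token_losses)[-k:]   # indices of the k largest losses

losses = np.array([0.1, 2.5, 0.3, 0.2, 9.0, 0.4, 0.15, 1.2, 0.05, 0.6] * 10)
hard = select_hard_tokens(losses, top_fraction=0.02)   # 2 of 100 tokens
```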
other
U-Net skip connections across encoder and decoder halves.
parameters: null
Novel Contributions
- Reptile meta-learning improves SmearGate models by 0.011 BPB over naive TTT.
- Error-guided TTT was evaluated and yielded a negative result.
- Per-token loss distribution analysis on the full validation set showing the hardest 2.7% of tokens account for about 15% of total loss.
- A 13-layer model outperformed a 10-layer model on 8xH100 despite fewer training steps.
- Uses ALBERT-style weight sharing with 3 unique blocks recycled into 12 effective layers.
- Introduces per-iteration learned scalars to break symmetry between recycled block applications.
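The per-iteration learned scalars can be sketched as one trainable gate per (repetition, block) application, 12 in total, so the shared weights behave differently at each effective layer. The scalar-gated residual form is an assumption:

```python
import numpy as np

def make_block(rng, dim=8):
    """Toy shared block body, standing in for a recycled transformer block."""
    W = rng.standard_normal((dim, dim)) * 0.02
    return lambda x: np.tanh(x @ W)

rng = np.random.default_rng(0)
blocks = [make_block(rng) for _ in range(3)]
reps = 4
# One learned scalar per (repetition, block) application: 12 scalars total,
# breaking the symmetry between the 4 applications of each shared block.
alphas = np.ones((reps, len(blocks)))   # trainable; identity-like at init

def forward(x, alphas):
    for r in range(reps):
        for b, block in enumerate(blocks):
            x = x + alphas[r, b] * block(x)   # scalar-gated residual
    return x

x = rng.standard_normal((1, 8))
y = forward(x, alphas)
```

With all scalars zeroed, every recycled application is skipped and the network reduces to the identity on the residual stream.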