PR #296

open

[Non-record] Meta-Learned TTT + Error-Guided Adaptation Analysis (val_bpb=1.1645)

by sseanliu
val_bpb
1.1645
Architecture
Transformer
Optimizer
Muon
Artifact Size
12.7MB

Training Techniques

Quantization
int6
bits: 6
scope: all
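The int6 setting can be illustrated with symmetric round-to-nearest quantization; the per-tensor scale and the int8 storage container are assumptions, since the card only specifies bits=6 with scope=all.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric 6-bit quantization: map weights onto integers in [-31, 31].
    Per-tensor scaling is an assumption; the card only says bits=6, scope=all."""
    amax = np.abs(w).max()
    scale = amax / 31.0 if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)  # int8 as container
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale
```

With this scheme the round-trip error per weight is bounded by half the scale.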
Architecture
SmearGate
Adds a small gating mechanism to inject local bigram context.
parameters: null
BigramHash
Adds hashed bigram features for local context modeling.
parameters: {"buckets":2048,"dim":128}
MLP3x
Uses a 3x MLP expansion in transformer blocks.
parameters: {"hidden":1536}
KV head count
Uses grouped-query attention with 4 KV heads.
parameters: {"kv_heads":4,"attention_heads":8}
tied embeddings
Uses FP16 tied embeddings.
parameters: null
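The two local-context additions above can be sketched as follows; the exact placement inside the block, the scalar gate, and the hash function are assumptions, with buckets=2048 and dim=128 taken from the card.

```python
import numpy as np

def smear_gate(x, gate_logit=0.0):
    """SmearGate sketch: add a gated copy of the previous position's embedding,
    injecting bigram-like local context. A single scalar gate is an assumption."""
    g = 1.0 / (1.0 + np.exp(-gate_logit))                  # sigmoid gate
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])  # shift right by one
    return x + g * prev

def bigram_hash_features(tokens, table):
    """BigramHash sketch: hash each (prev, cur) token pair into one of
    table.shape[0] buckets and look up a learned feature vector."""
    buckets, dim = table.shape
    feats = np.zeros((len(tokens), dim))
    for t in range(1, len(tokens)):
        feats[t] = table[hash((tokens[t - 1], tokens[t])) % buckets]
    return feats
```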
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings"}
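A minimal sketch of the Muon update with the listed momentum=0.99 and weight_decay=0.04; the learning rate and the quintic Newton-Schulz coefficients are assumptions, and the AdamW path for embeddings is omitted.

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1) with a
    quintic Newton-Schulz iteration, the core of Muon's update."""
    a, b, c = 3.4445, -4.7750, 2.0315   # commonly used coefficients (assumption)
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon step: momentum buffer, orthogonalized direction, decoupled decay.
    lr=0.02 is a placeholder; momentum and weight_decay come from the card."""
    buf = momentum * buf + grad
    w = w * (1.0 - lr * weight_decay) - lr * newton_schulz5(buf)
    return w, buf
```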
Weight Averaging
SWA
parameters: {"checkpoints":3}
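Stochastic weight averaging over the 3 checkpoints reduces, in sketch form, to a uniform parameter average; equal weighting is an assumption.

```python
import numpy as np

def swa_average(checkpoints):
    """SWA sketch: uniform average of k checkpoints (k=3 in this submission).
    Each checkpoint is a name -> array dict with matching shapes."""
    k = len(checkpoints)
    return {name: sum(c[name] for c in checkpoints) / k for name in checkpoints[0]}
```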
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"eval_seq_len":2048}
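The sliding-window evaluation (stride=64, eval_seq_len=2048) can be sketched as scoring only the newest stride tokens of each overlapping window, so every token is evaluated exactly once with near-full left context; `score_fn` stands in for the model and is an assumption.

```python
def sliding_window_eval(tokens, score_fn, seq_len=2048, stride=64):
    """Score a long sequence with overlapping windows: each window carries up to
    seq_len tokens of left context, but only its final n_new tokens are counted.
    score_fn(window, n_new) -> per-token losses in nats (assumption); divide the
    mean by ln(2) and scale by tokens-per-byte to get bits per byte."""
    losses = []
    for i in range(0, len(tokens), stride):
        end = min(i + stride, len(tokens))
        window = tokens[max(0, end - seq_len):end]
        losses.extend(score_fn(window, end - i))
    return sum(losses) / len(losses)
```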
Test-Time Training
LoRA TTT
parameters: {"rank":4,"learning_rate":0.001,"top_frac":0.02}
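LoRA TTT with rank=4 and learning_rate=1e-3 from the card can be sketched as freezing W and taking a few gradient steps on only the low-rank factors before scoring; which layers receive adapters and the init scale are assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank correction B @ A."""
    def __init__(self, W, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                    # frozen (out, in)
        self.A = 0.01 * rng.standard_normal((rank, W.shape[1]))
        self.B = np.zeros((W.shape[0], rank))         # zero init: no-op at start
    def forward(self, x):
        return self.W @ x + self.B @ (self.A @ x)

def ttt_sgd_step(layer, grad_A, grad_B, lr=1e-3):
    """Test-time step: update only the adapter factors, never W."""
    layer.A -= lr * grad_A
    layer.B -= lr * grad_B
```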
Other
other
Reptile meta-learning applied to MLP layers of the last 3 transformer blocks during the final training phase.
parameters: {"outer_step_scale":0.01,"inner_steps":3,"inner_lr":0.1,"meta_steps":1576}
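The Reptile phase above (inner_steps=3, inner_lr=0.1, outer_step_scale=0.01) follows the standard inner-SGD-then-interpolate recipe; the gradient callback and task sampling are assumptions.

```python
import numpy as np

def reptile_step(params, task_batches, inner_loss_grad,
                 inner_steps=3, inner_lr=0.1, outer_step_scale=0.01):
    """One Reptile meta-step (hyperparameters from the card): run a few inner
    SGD steps on a task, then move the meta-parameters a small fraction of the
    way toward the adapted weights. inner_loss_grad(params, batch) -> grad dict
    is an assumed interface."""
    adapted = {k: v.copy() for k, v in params.items()}
    for batch in task_batches[:inner_steps]:
        grads = inner_loss_grad(adapted, batch)
        for k in adapted:
            adapted[k] -= inner_lr * grads[k]
    for k in params:
        params[k] += outer_step_scale * (adapted[k] - params[k])
    return params
```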
Sequence Length
sequence_length
train_length: null
eval_length: 2048

Novel Contributions

  • Reptile meta-learning improves SmearGate models by about 0.011 BPB, outperforming naive TTT.
  • Error-guided TTT on the highest-loss windows is a negative result and does not improve validation loss.
  • With sufficient compute (an 8xH100 node), a deeper 13-layer model can outperform a 10-layer baseline.
  • Per-token loss analysis shows a heavy-tailed distribution where a small fraction of tokens accounts for a large share of total loss.
  • The submission analyzes whether meta-learned initialization can overcome SmearGate/TTT redundancy.
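The heavy-tail claim can be checked with a short function that measures what share of total loss the hardest fraction of tokens carries; frac=0.02 mirrors the top_frac used for error-guided window selection, and the loss values here are illustrative.

```python
import numpy as np

def loss_share_of_top_tokens(per_token_loss, frac=0.02):
    """Fraction of total loss contributed by the hardest `frac` of tokens.
    A value far above `frac` indicates a heavy-tailed loss distribution."""
    losses = np.sort(np.asarray(per_token_loss))[::-1]  # descending
    k = max(1, int(len(losses) * frac))
    return losses[:k].sum() / losses.sum()
```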