PR #296

open

[Non-record] Meta-Learned TTT + Error-Guided Adaptation Analysis (val_bpb=1.1645)

by sseanliu
val_bpb
1.1645
Architecture
Transformer
Optimizer
Muon
Artifact Size
12.7MB

Training Techniques

Quantization
int6
bits: 6
scope: all
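The int6 setting can be illustrated with symmetric round-to-nearest quantization; the per-tensor scale and the int8 storage container are assumptions, since the card only specifies bits=6 with scope=all.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric 6-bit quantization: map weights onto integers in [-31, 31].
    Per-tensor scaling is an assumption; the card only says bits=6, scope=all."""
    amax = np.abs(w).max()
    scale = amax / 31.0 if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)  # int8 as container
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale
```

With this scheme the round-trip error per weight is bounded by half the scale.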
Architecture
SmearGate
Adds a small gating mechanism to inject local bigram context.
parameters: null
BigramHash
Adds hashed bigram features for local context modeling.
parameters: {"buckets":2048,"dim":128}
MLP3x
Uses a 3x MLP expansion in transformer blocks.
parameters: {"hidden":1536}
KV head count
Uses grouped-query attention with 4 KV heads.
parameters: {"kv_heads":4,"attention_heads":8}
tied embeddings
Uses FP16 tied embeddings.
parameters: null
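The two local-context additions above can be sketched as follows; the exact placement inside the block, the scalar gate, and the hash function are assumptions, with buckets=2048 and dim=128 taken from the card.

```python
import numpy as np

def smear_gate(x, gate_logit=0.0):
    """SmearGate sketch: add a gated copy of the previous position's embedding,
    injecting bigram-like local context. A single scalar gate is an assumption."""
    g = 1.0 / (1.0 + np.exp(-gate_logit))                  # sigmoid gate
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])  # shift right by one
    return x + g * prev

def bigram_hash_features(tokens, table):
    """BigramHash sketch: hash each (prev, cur) token pair into one of
    table.shape[0] buckets and look up a learned feature vector."""
    buckets, dim = table.shape
    feats = np.zeros((len(tokens), dim))
    for t in range(1, len(tokens)):
        feats[t] = table[hash((tokens[t - 1], tokens[t])) % buckets]
    return feats
```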
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings"}
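A minimal sketch of the Muon update with the listed momentum=0.99 and weight_decay=0.04; the learning rate and the quintic Newton-Schulz coefficients are assumptions, and the AdamW path for embeddings is omitted.

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1) with a
    quintic Newton-Schulz iteration, the core of Muon's update."""
    a, b, c = 3.4445, -4.7750, 2.0315   # commonly used coefficients (assumption)
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon step: momentum buffer, orthogonalized direction, decoupled decay.
    lr=0.02 is a placeholder; momentum and weight_decay come from the card."""
    buf = momentum * buf + grad
    w = w * (1.0 - lr * weight_decay) - lr * newton_schulz5(buf)
    return w, buf
```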
Weight Averaging
SWA
parameters: {"checkpoints":3}
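Stochastic weight averaging over the 3 checkpoints reduces, in sketch form, to a uniform parameter average; equal weighting is an assumption.

```python
import numpy as np

def swa_average(checkpoints):
    """SWA sketch: uniform average of k checkpoints (k=3 in this submission).
    Each checkpoint is a name -> array dict with matching shapes."""
    k = len(checkpoints)
    return {name: sum(c[name] for c in checkpoints) / k for name in checkpoints[0]}
```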
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"eval_seq_len":2048}
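The sliding-window evaluation (stride=64, eval_seq_len=2048) can be sketched as scoring only the newest stride tokens of each overlapping window, so every token is evaluated exactly once with near-full left context; `score_fn` stands in for the model and is an assumption.

```python
def sliding_window_eval(tokens, score_fn, seq_len=2048, stride=64):
    """Score a long sequence with overlapping windows: each window carries up to
    seq_len tokens of left context, but only its final n_new tokens are counted.
    score_fn(window, n_new) -> per-token losses in nats (assumption); divide the
    mean by ln(2) and scale by tokens-per-byte to get bits per byte."""
    losses = []
    for i in range(0, len(tokens), stride):
        end = min(i + stride, len(tokens))
        window = tokens[max(0, end - seq_len):end]
        losses.extend(score_fn(window, end - i))
    return sum(losses) / len(losses)
```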
Test-Time Training
LoRA TTT
parameters: {"rank":4,"learning_rate":0.001,"top_frac":0.02}
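LoRA TTT with rank=4 and learning_rate=1e-3 from the card can be sketched as freezing W and taking a few gradient steps on only the low-rank factors before scoring; which layers receive adapters and the init scale are assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank correction B @ A."""
    def __init__(self, W, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                    # frozen (out, in)
        self.A = 0.01 * rng.standard_normal((rank, W.shape[1]))
        self.B = np.zeros((W.shape[0], rank))         # zero init: no-op at start
    def forward(self, x):
        return self.W @ x + self.B @ (self.A @ x)

def ttt_sgd_step(layer, grad_A, grad_B, lr=1e-3):
    """Test-time step: update only the adapter factors, never W."""
    layer.A -= lr * grad_A
    layer.B -= lr * grad_B
```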
Other
other
Reptile meta-learning applied to MLP layers of the last 3 transformer blocks during the final training phase.
parameters: {"outer_step_scale":0.01,"inner_steps":3,"inner_lr":0.1,"meta_steps":1576}
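The Reptile phase above (inner_steps=3, inner_lr=0.1, outer_step_scale=0.01) follows the standard inner-SGD-then-interpolate recipe; the gradient callback and task sampling are assumptions.

```python
import numpy as np

def reptile_step(params, task_batches, inner_loss_grad,
                 inner_steps=3, inner_lr=0.1, outer_step_scale=0.01):
    """One Reptile meta-step (hyperparameters from the card): run a few inner
    SGD steps on a task, then move the meta-parameters a small fraction of the
    way toward the adapted weights. inner_loss_grad(params, batch) -> grad dict
    is an assumed interface."""
    adapted = {k: v.copy() for k, v in params.items()}
    for batch in task_batches[:inner_steps]:
        grads = inner_loss_grad(adapted, batch)
        for k in adapted:
            adapted[k] -= inner_lr * grads[k]
    for k in params:
        params[k] += outer_step_scale * (adapted[k] - params[k])
    return params
```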
Sequence Length
sequence_length
train_length: null
eval_length: 2048

Novel Contributions

  • Reptile meta-learning improves SmearGate models by about 0.011 BPB, outperforming naive TTT.
  • Error-guided TTT on the highest-loss windows is a negative result and does not improve validation loss.
  • With sufficient compute (an 8xH100 node), a deeper 13-layer model can outperform a 10-layer baseline.
  • Per-token loss analysis shows a heavy-tailed distribution where a small fraction of tokens accounts for a large share of total loss.
  • The submission analyzes whether meta-learned initialization can overcome SmearGate/TTT redundancy.
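The heavy-tail claim can be checked with a short function that measures what share of total loss the hardest fraction of tokens carries; frac=0.02 mirrors the top_frac used for error-guided window selection, and the loss values here are illustrative.

```python
import numpy as np

def loss_share_of_top_tokens(per_token_loss, frac=0.02):
    """Fraction of total loss contributed by the hardest `frac` of tokens.
    A value far above `frac` indicates a heavy-tailed loss distribution."""
    losses = np.sort(np.asarray(per_token_loss))[::-1]  # descending
    k = max(1, int(len(losses) * frac))
    return losses[:k].sum() / losses.sum()
```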