PR #294
[Non-record] Meta-Learned TTT + Error-Guided Adaptation Analysis (val_bpb=1.1645)
Status: closed
by sseanliu
val_bpb
1.1645
Architecture
Transformer
Optimizer
Muon
Artifact Size
12.7MB
Training Techniques
Quantization
int6
bits: 6
scope: model weights
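A minimal sketch of symmetric int6 weight quantization consistent with the bits/scope above; per-tensor scaling and the round-to-nearest scheme are assumptions, since the PR does not specify them:

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor int6 quantization: map weights into [-31, 31]."""
    qmax = 2 ** (6 - 1) - 1                                  # 31 for 6 bits
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.0], dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)     # close to w, within one quantization step
```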
Architecture
SmearGate
Uses SmearGate in the base recipe.
parameters: null
BigramHash
Includes BigramHash as part of the model recipe.
parameters: null
MLP3x
Uses 3x MLP blocks in the base recipe.
parameters: null
weight tying
Uses tied embeddings / tied weights in the model.
parameters: null
depth recurrence
Recycles 3 unique blocks multiple times to create 12 effective layers.
parameters: {"unique_blocks":3,"effective_layers":12,"repetitions_per_block":4}
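The depth recurrence above (3 unique blocks, 4 repetitions each, 12 effective layers) can be sketched as follows; the recycling order (looping the 3-block stack 4 times rather than repeating each block in place) and the toy residual block are assumptions:

```python
import numpy as np

def make_block(rng, dim=8):
    """Toy residual block: x + tanh(x @ W), standing in for a transformer block."""
    W = rng.standard_normal((dim, dim)) * 0.02
    return lambda x: x + np.tanh(x @ W)

rng = np.random.default_rng(0)
unique_blocks = [make_block(rng) for _ in range(3)]   # only 3 sets of weights
repetitions_per_block = 4                             # 3 x 4 = 12 effective layers

def forward(x):
    applied = 0
    for _ in range(repetitions_per_block):
        for block in unique_blocks:   # cycle through the shared blocks
            x = block(x)
            applied += 1
    return x, applied

x = np.zeros((1, 8))
y, effective_layers = forward(x)
```

Parameter count scales with the 3 unique blocks, while compute scales with the 12 effective layers.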
tied embeddings
Uses FP16 tied embeddings.
parameters: {"vocab_size":1024,"dimension":768}
KV head count
Uses 6 KV heads with 12 attention heads.
parameters: {"heads":12,"kv_heads":6}
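The 12-head / 6-KV-head layout is grouped-query attention: each KV head is shared by 2 query heads. A minimal numpy sketch (head dimension and sequence length are illustrative, not from the PR):

```python
import numpy as np

heads, kv_heads, head_dim, seq = 12, 6, 16, 4
group = heads // kv_heads                  # 2 query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((heads, seq, head_dim))
k = rng.standard_normal((kv_heads, seq, head_dim))
v = rng.standard_normal((kv_heads, seq, head_dim))

# Expand K/V so each group of query heads reuses the same KV head.
k_full = np.repeat(k, group, axis=0)       # (12, seq, head_dim)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
out = weights @ v_full                     # (12, seq, head_dim)
```

Halving the KV heads halves the KV-cache size at inference while keeping 12-way query diversity.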
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
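Muon orthogonalizes the momentum buffer with a Newton-Schulz iteration before applying it. A rough sketch assuming the standard quintic coefficients; the learning rate and momentum values are placeholders (the PR leaves momentum unspecified), and only the 0.04 weight decay comes from the recipe:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315      # standard Muon coefficients
    X = G / (np.linalg.norm(G) + 1e-7)     # normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95, weight_decay=0.04):
    """One Muon update: momentum accumulation, orthogonalized direction,
    decoupled weight decay (0.04 as in this recipe)."""
    buf = momentum * buf + grad
    W = W * (1 - lr * weight_decay) - lr * newton_schulz(buf)
    return W, buf

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16)) * 0.02
grad = rng.standard_normal((16, 16))
W, buf = muon_step(W, grad, np.zeros((16, 16)))
```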
Weight Averaging
SWA
parameters: null
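SWA here is presumably a running equal-weight average of late-training checkpoints; a minimal incremental-mean sketch:

```python
import numpy as np

class SWA:
    """Running equal-weight average of checkpoint weights (stochastic weight averaging)."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, weights: np.ndarray):
        self.n += 1
        if self.avg is None:
            self.avg = weights.astype(np.float64).copy()
        else:
            self.avg += (weights - self.avg) / self.n   # incremental mean
        return self.avg

swa = SWA()
for w in [np.array([1.0, 3.0]), np.array([3.0, 5.0]), np.array([5.0, 7.0])]:
    swa.update(w)
# swa.avg is now the element-wise mean of the three checkpoints: [3.0, 5.0]
```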
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
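Sliding-window evaluation with stride 64 typically scores only the final 64 tokens of each window, so every token is scored exactly once with maximal left context. A sketch assuming a hypothetical window length of 256 (the PR only gives the stride):

```python
def sliding_windows(n_tokens: int, window: int = 256, stride: int = 64):
    """Yield (start, end, score_from) spans: each window scores only the
    tokens in [score_from, end), i.e. at most its final `stride` tokens."""
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)   # left edge of the context
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))         # score tokens [pos, end)
        pos = end
    return spans

spans = sliding_windows(200, window=256, stride=64)
scored = sum(end - score_from for _, end, score_from in spans)
# every one of the 200 tokens is scored exactly once
```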
Test-Time Training
LoRA TTT
parameters: {"rank":4}
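Rank-4 LoRA test-time training adds a trainable low-rank delta to frozen base weights. A sketch in which the scaling factor alpha=8 and the zero initialization of B are assumptions (only rank=4 comes from the PR):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=8.0, rank=4):
    """y = x @ W + (alpha/rank) * (x @ A) @ B: frozen W plus low-rank update."""
    return x @ W + (alpha / rank) * (x @ A) @ B

d_in, d_out, rank = 32, 32, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out)) * 0.02   # frozen base weight
A = rng.standard_normal((d_in, rank)) * 0.01    # trainable at test time
B = np.zeros((rank, d_out))                     # zero-init: delta starts at 0

x = rng.standard_normal((5, d_in))
y = lora_forward(x, W, A, B)
# With B = 0 the adapter contributes nothing, so output equals the base model;
# test-time gradient steps on A and B then specialize the layer to the input.
```

Only `2 * d * rank` parameters per layer are adapted, which keeps TTT cheap.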
Initialization
spectral init
Uses overtone spectral initialization for FP16 tied embeddings.
resid mix
Uses phase-transition residual mix initialization.
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Reptile meta-learning applied during the last 20% of training time on the last 3 blocks' MLPs.
parameters: {"meta_steps":1576,"scope":"last 3 blocks' MLPs","training_fraction":0.2}
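The Reptile outer loop nudges the meta-parameters toward each task's inner-loop-adapted parameters. A toy sketch on quadratic tasks; the inner step count, learning rates, and epsilon are placeholders, not the PR's values (which applied 1576 meta-steps to the last 3 blocks' MLPs):

```python
import numpy as np

def inner_sgd(theta, task_target, steps=5, lr=0.1):
    """Inner loop: SGD on a toy quadratic loss 0.5 * ||theta - target||^2."""
    w = theta.copy()
    for _ in range(steps):
        w -= lr * (w - task_target)
    return w

def reptile(theta, tasks, meta_steps=100, eps=0.1):
    """Reptile: repeatedly adapt to a sampled task, then move the
    meta-parameters a fraction eps toward the adapted parameters."""
    rng = np.random.default_rng(0)
    for _ in range(meta_steps):
        target = tasks[rng.integers(len(tasks))]
        adapted = inner_sgd(theta, target)
        theta = theta + eps * (adapted - theta)   # outer Reptile step
    return theta

tasks = [np.array([1.0, 1.0]), np.array([3.0, 3.0])]
theta = reptile(np.zeros(2), tasks)
# theta drifts toward an initialization that adapts quickly to either task
```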
other
Error-guided test-time adaptation that concentrates adaptation budget on the highest-loss tokens/windows.
parameters: {"top_fraction":0.02}
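Concentrating the adaptation budget on the top 2% of tokens reduces to a top-k selection over per-token losses; a minimal sketch of the selection step:

```python
import numpy as np

def select_hard_tokens(token_losses: np.ndarray, top_fraction: float = 0.02):
    """Return indices of the highest-loss tokens, the targets of the
    error-guided adaptation budget."""
    k = max(1, int(len(token_losses) * top_fraction))
    return np.argsort(token_losses)[-k:]   # indices of the k largest losses

losses = np.array([0.1, 2.5, 0.3, 0.2, 9.0, 0.4, 0.15, 1.2, 0.05, 0.6] * 10)
hard = select_hard_tokens(losses, top_fraction=0.02)   # 2 of 100 tokens
```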
other
U-Net skip connections across encoder and decoder halves.
parameters: null
Novel Contributions
- Reptile meta-learning improves SmearGate models by 0.011 BPB over naive TTT.
- Error-guided TTT was evaluated and yielded a negative result.
- Per-token loss distribution analysis on the full validation set showing the hardest 2.7% of tokens account for about 15% of total loss.
- A 13-layer model outperformed a 10-layer model on 8xH100 despite fewer training steps.
- Uses ALBERT-style weight sharing with 3 unique blocks recycled into 12 effective layers.
- Introduces per-iteration learned scalars to break symmetry between recycled block applications.
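The per-iteration learned scalars can be sketched as one trainable gate per (repetition, block) application, 12 in total, so the shared weights behave differently at each effective layer. The scalar-gated residual form is an assumption:

```python
import numpy as np

def make_block(rng, dim=8):
    """Toy shared block body, standing in for a recycled transformer block."""
    W = rng.standard_normal((dim, dim)) * 0.02
    return lambda x: np.tanh(x @ W)

rng = np.random.default_rng(0)
blocks = [make_block(rng) for _ in range(3)]
reps = 4
# One learned scalar per (repetition, block) application: 12 scalars total,
# breaking the symmetry between the 4 applications of each shared block.
alphas = np.ones((reps, len(blocks)))   # trainable; identity-like at init

def forward(x, alphas):
    for r in range(reps):
        for b, block in enumerate(blocks):
            x = x + alphas[r, b] * block(x)   # scalar-gated residual
    return x

x = rng.standard_normal((1, 8))
y = forward(x, alphas)
```

With all scalars zeroed, every recycled application is skipped and the network reduces to the identity on the residual stream.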