PR #1501

open

[Non Record] Learn to Learn: Position-Conditional Bigram Hashing + Meta-Learning + TTT Ablation

by SPTholeView on GitHub
val_bpb
1.1159
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.08 MB

Training Techniques

Architecture
XSA
Cross-layer shared attention with banked Q/O and KV weights shared across all 11 layers.
parameters: {"layers":11}
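The card gives only a one-line description of XSA; a minimal single-head sketch, assuming the K/V projections are literally the same matrices at every layer while each layer draws its own Q/O from a per-layer bank (all names, sizes, and the single-head simplification are illustrative):

```python
import numpy as np

d_model, n_layers = 64, 11
rng = np.random.default_rng(0)

# Shared across all 11 layers: a single K and a single V projection.
W_k = rng.standard_normal((d_model, d_model)) * 0.02
W_v = rng.standard_normal((d_model, d_model)) * 0.02

# Banked per layer: each layer owns its Q and O projections.
W_q_bank = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_layers)]
W_o_bank = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_layers)]

def xsa_layer(x, layer_idx):
    """Causal single-head attention whose K/V weights are layer-independent."""
    q = x @ W_q_bank[layer_idx]
    k = x @ W_k  # same K projection at every layer
    v = x @ W_v  # same V projection at every layer
    scores = q @ k.T / np.sqrt(d_model)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)  # causal mask
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return (probs @ v) @ W_o_bank[layer_idx]
```

Sharing K/V this way stores the two d×d projections once instead of 11 times, which matters under a 15 MB artifact budget.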
BigramHash
Hash-based bigram embedding table with position-conditional splitting of buckets by word-start vs within-word tokens.
parameters: {"table_shape":"4096x64","split_buckets":2047}
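Only the 4096×64 table and the word-start/within-word bucket split come from the card; the hash mix and the exact partition below are assumptions. A sketch of the position-conditional lookup:

```python
import numpy as np

TABLE_ROWS, EMB_DIM, SPLIT = 4096, 64, 2047  # table_shape and split from the card
rng = np.random.default_rng(0)
table = (rng.standard_normal((TABLE_ROWS, EMB_DIM)) * 0.02).astype(np.float32)

def bigram_bucket(prev_tok, tok, word_start):
    # Cheap multiplicative hash of the (prev, cur) pair (constants illustrative).
    h = (prev_tok * 1000003 + tok * 998244353) & 0xFFFFFFFF
    if word_start:
        return h % SPLIT                       # word-start half of the table
    return SPLIT + h % (TABLE_ROWS - SPLIT)    # within-word half

def bigram_embed(tokens, word_starts):
    out = np.zeros((len(tokens), EMB_DIM), dtype=np.float32)
    for i in range(1, len(tokens)):
        out[i] = table[bigram_bucket(tokens[i - 1], tokens[i], word_starts[i])]
    return out
```

Conditioning the bucket range on the word-start flag means a bigram at a word boundary can never collide with the same bigram inside a word.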
TrigramHash
Adds a trigram lookup that reuses the same hash embedding table, so the feature costs no additional parameters.
parameters: null
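Because the trigram feature hashes into the same table, only the hash input changes; a sketch (hash constants are illustrative, the 4096-row table size is from the bigram entry):

```python
TABLE_ROWS = 4096  # same table the bigram feature indexes into

def trigram_bucket(t0, t1, t2):
    # A third token enters the hash; no new embedding rows are allocated.
    h = (t0 * 2654435761 + t1 * 40503 + t2 * 1000003) & 0xFFFFFFFF
    return h % TABLE_ROWS
```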
RoPE
Rotary position embeddings applied to only a subset of each attention head's dimensions.
parameters: {"dimensions":16,"of_total":64}
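With 16 of 64 dimensions rotated (from the card), the remaining 48 pass through position-independent. A sketch assuming the standard RoPE base of 10000 and that the rotated dimensions are the leading ones:

```python
import numpy as np

HEAD_DIM, ROT_DIM = 64, 16  # per the card: rotate 16 of 64 dims per head

def partial_rope(x):
    """x: (seq, HEAD_DIM). Applies RoPE to the first ROT_DIM dims only."""
    seq = x.shape[0]
    half = ROT_DIM // 2
    freqs = 10000.0 ** (-np.arange(half) / half)       # per-pair frequencies
    angles = np.arange(seq)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:ROT_DIM]           # the paired rotated dims
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, ROT_DIM:]], axis=-1)  # tail is untouched
```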
U-Net skip connections
11-layer U-Net GPT with encoder-decoder skip connections.
parameters: {"layers":11}
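For an 11-layer U-Net GPT, a minimal control-flow sketch; the 5-encoder / 1-bottleneck / 5-decoder split for odd depth and the additive skips are assumptions, since the card only names the technique:

```python
N = 11  # layers, from the card

def unet_forward(x, blocks):
    """blocks: list of N callables. Encoder outputs are stacked, then
    popped and added into the matching decoder block's input."""
    skips = []
    for i, f in enumerate(blocks):
        if i < N // 2:
            x = f(x)
            skips.append(x)          # encoder: save activation
        elif i == N // 2:
            x = f(x)                 # bottleneck, no skip
        else:
            x = f(x + skips.pop())   # decoder: add mirrored encoder output
    return x
```

Popping from the stack pairs encoder block j with decoder block N-1-j, the usual U-Net mirroring.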
GQA
Grouped-query attention with 8 query heads and 4 key-value heads.
parameters: {"q_heads":8,"kv_heads":4}
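With 8 query heads over 4 KV heads (from the card), each KV head serves a group of 2 query heads. A sketch, with head_dim as an illustrative assumption:

```python
import numpy as np

N_Q, N_KV, HEAD_DIM = 8, 4, 16  # 8 q_heads / 4 kv_heads per the card
GROUP = N_Q // N_KV             # 2 query heads share each KV head

def gqa(q, k, v):
    """q: (N_Q, seq, d); k, v: (N_KV, seq, d). Non-causal, for brevity."""
    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, GROUP, axis=0)  # (N_Q, seq, d)
    v = np.repeat(v, GROUP, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v
```

Halving the KV heads halves the K/V projection parameters and the KV cache relative to full multi-head attention.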
weight tying
Input token embeddings are tied to the output head weights.
parameters: null
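Weight tying in one line: the output head is the embedding matrix transposed. A sketch with an illustrative vocabulary size:

```python
import numpy as np

VOCAB, D = 1000, 64  # VOCAB is illustrative; D matches the model width above
rng = np.random.default_rng(0)
emb = (rng.standard_normal((VOCAB, D)) * 0.02).astype(np.float32)

def embed(tokens):
    return emb[tokens]          # input side: row lookup

def logits(hidden):
    return hidden @ emb.T       # output side: same matrix, transposed
```

For an artifact-size-constrained run, tying removes an entire VOCAB×D matrix from the checkpoint.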
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"matrix weights"}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings and scalar parameters"}
Weight Averaging
EMA
parameters: {"decay":0.998}
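The EMA update with the card's decay of 0.998, sketched over a plain parameter dict:

```python
DECAY = 0.998  # from the card

def ema_update(shadow, params, decay=DECAY):
    """After each optimizer step: shadow <- decay*shadow + (1-decay)*params."""
    return {k: decay * shadow[k] + (1 - decay) * params[k] for k in shadow}
```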
SWA
parameters: {"interval":50,"start_phase":"warmdown"}
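SWA here snapshots every 50 steps once the warmdown phase begins (interval and start phase from the card; the concrete warmdown start step is taken from the LR schedule entry and the running-mean form is an assumption):

```python
INTERVAL, WARMDOWN_START = 50, 2200  # interval from the card; 2200 from the LR schedule

class SWA:
    """Running average of weight snapshots taken during warmdown."""
    def __init__(self):
        self.avg, self.n = None, 0

    def maybe_update(self, step, params):
        if step < WARMDOWN_START or step % INTERVAL != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = dict(params)
        else:  # incremental mean over the snapshots seen so far
            self.avg = {k: v + (params[k] - v) / self.n for k, v in self.avg.items()}
```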
Quantization
GPTQ
bits: 6
scope: attention + MLP weights
GPTQ
bits: 8
scope: embeddings
late QAT
bits: 8
scope: model
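GPTQ proper calibrates against activation statistics and is not reproduced here; as a minimal stand-in, this sketch shows plain round-to-nearest quantization onto a symmetric per-row 6-bit grid (only the bit-widths come from the card):

```python
import numpy as np

def fake_quant(w, bits=6):
    """Round w to a symmetric per-output-channel grid and dequantize."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6-bit symmetric
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                 # dequantized weights
```

GPTQ improves on this baseline by choosing rounding directions that minimize layer output error, but the storage format (6-bit grid per channel) is the same.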
Compression
lzma
level: null
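The card does not state the lzma preset; a sketch of packing a (quantized) state dict with Python's stdlib `lzma`:

```python
import lzma
import pickle

def pack(state_dict):
    """Serialize then LZMA-compress the artifact (default preset)."""
    return lzma.compress(pickle.dumps(state_dict))

def unpack(blob):
    return pickle.loads(lzma.decompress(blob))
```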
Evaluation
sliding window eval
parameters: {"chunk_tokens":65536}
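The eval scores the validation stream in 65,536-token chunks (chunk size from the card; the stride/overlap is not given, so this sketch shows the simplest non-overlapping variant, with `loss_fn` standing in for a forward pass that returns summed nats over a chunk):

```python
import math

CHUNK = 65536  # chunk_tokens from the card

def eval_bpb(tokens, bytes_per_token, loss_fn):
    total_nats = 0.0
    for i in range(0, len(tokens), CHUNK):
        total_nats += loss_fn(tokens[i:i + CHUNK])
    # bits per byte = total nats / ln(2) / total bytes
    return total_nats / math.log(2) / (len(tokens) * bytes_per_token)
```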
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","momentum":0.9,"learning_rate":0.004,"epochs":4}
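Score-first TTT means each chunk is scored with the current weights before the model adapts on it, so no token is ever scored after being trained on. A sketch with the card's SGD hyperparameters (`score`/`grad` are hypothetical stand-ins for a loss forward and backward pass):

```python
LR, MOM, EPOCHS = 0.004, 0.9, 4  # from the card

def ttt_eval(chunks, score, grad, params):
    vel = {k: 0.0 for k in params}
    total = 0.0
    for chunk in chunks:
        total += score(params, chunk)   # score first...
        for _ in range(EPOCHS):         # ...then adapt on the same chunk
            g = grad(params, chunk)
            vel = {k: MOM * vel[k] + g[k] for k in params}
            params = {k: params[k] - LR * vel[k] for k in params}
    return total, params
```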
LR Schedule
cosine decay
parameters: {"warmdown_start_step":2200}
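Only the warmdown start step appears in the card; this sketch assumes a constant LR until step 2200 followed by cosine decay to zero, with the total step count and base LR as illustrative placeholders:

```python
import math

WARMDOWN_START, TOTAL_STEPS, BASE_LR = 2200, 3000, 0.02  # last two are assumptions

def lr_at(step):
    if step < WARMDOWN_START:
        return BASE_LR  # flat phase before warmdown
    frac = (step - WARMDOWN_START) / (TOTAL_STEPS - WARMDOWN_START)
    return BASE_LR * 0.5 * (1 + math.cos(math.pi * min(frac, 1.0)))
```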
Other
other
FOMAML-style meta-TTT during training, enabled every 4 steps in the parent experiment and ablated off in exp105a.
parameters: {"every":4,"enabled_in_this_pr":false}
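The meta-TTT inherited from the parent (and ablated off here) is FOMAML-style: take inner adaptation steps on a chunk, then use the gradient at the adapted weights as the outer update direction, without differentiating through the inner trajectory. A sketch with illustrative learning rates and a single inner step:

```python
INNER_LR, OUTER_LR, INNER_STEPS = 0.004, 0.02, 1  # illustrative values

def fomaml_step(params, grad, chunk):
    fast = dict(params)
    for _ in range(INNER_STEPS):                  # inner loop: adapt
        g = grad(fast, chunk)
        fast = {k: fast[k] - INNER_LR * g[k] for k in fast}
    g_outer = grad(fast, chunk)                   # gradient at adapted weights
    # First-order: apply g_outer directly to the slow weights.
    return {k: params[k] - OUTER_LR * g_outer[k] for k in params}
```

Note the mismatch the ablation points at: here the inner and outer gradients come from the same chunk, whereas the score-first TTT above always scores a chunk before adapting on it.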

Novel Contributions

  • Position-conditional bigram hashing that splits hash buckets by word-start vs within-word tokens
  • Trigram lookup added without extra parameters by reusing the same hash table
  • Controlled ablation showing inherited FOMAML meta-TTT contributes only noise-level improvement
  • Analysis that same-batch FOMAML meta-TTT is misaligned with score-first-then-adapt TTT
  • Weight-space/subspace analysis showing Muon can rotate weights while preserving function