PR #1501

open

[Non Record] Learn to Learn: Position-Conditional Bigram Hashing + Meta-Learning + TTT Ablation

by SPTholeView on GitHub
val_bpb
1.1159
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.08 MB

Training Techniques

Architecture
XSA
Cross-layer shared attention with banked Q/O and KV weights shared across all 11 layers.
parameters: {"layers":11}
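The card gives only a one-line description of XSA; a minimal single-head sketch, assuming the K/V projections are literally the same matrices at every layer while each layer draws its own Q/O from a per-layer bank (all names, sizes, and the single-head simplification are illustrative):

```python
import numpy as np

d_model, n_layers = 64, 11
rng = np.random.default_rng(0)

# Shared across all 11 layers: a single K and a single V projection.
W_k = rng.standard_normal((d_model, d_model)) * 0.02
W_v = rng.standard_normal((d_model, d_model)) * 0.02

# Banked per layer: each layer owns its Q and O projections.
W_q_bank = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_layers)]
W_o_bank = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_layers)]

def xsa_layer(x, layer_idx):
    """Causal single-head attention whose K/V weights are layer-independent."""
    q = x @ W_q_bank[layer_idx]
    k = x @ W_k  # same K projection at every layer
    v = x @ W_v  # same V projection at every layer
    scores = q @ k.T / np.sqrt(d_model)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)  # causal mask
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return (probs @ v) @ W_o_bank[layer_idx]
```

Sharing K/V this way stores the two d×d projections once instead of 11 times, which matters under a 15 MB artifact budget.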
BigramHash
Hash-based bigram embedding table with position-conditional splitting of buckets by word-start vs within-word tokens.
parameters: {"table_shape":"4096x64","split_buckets":2047}
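Only the 4096×64 table and the word-start/within-word bucket split come from the card; the hash mix and the exact partition below are assumptions. A sketch of the position-conditional lookup:

```python
import numpy as np

TABLE_ROWS, EMB_DIM, SPLIT = 4096, 64, 2047  # table_shape and split from the card
rng = np.random.default_rng(0)
table = (rng.standard_normal((TABLE_ROWS, EMB_DIM)) * 0.02).astype(np.float32)

def bigram_bucket(prev_tok, tok, word_start):
    # Cheap multiplicative hash of the (prev, cur) pair (constants illustrative).
    h = (prev_tok * 1000003 + tok * 998244353) & 0xFFFFFFFF
    if word_start:
        return h % SPLIT                       # word-start half of the table
    return SPLIT + h % (TABLE_ROWS - SPLIT)    # within-word half

def bigram_embed(tokens, word_starts):
    out = np.zeros((len(tokens), EMB_DIM), dtype=np.float32)
    for i in range(1, len(tokens)):
        out[i] = table[bigram_bucket(tokens[i - 1], tokens[i], word_starts[i])]
    return out
```

Conditioning the bucket range on the word-start flag means a bigram at a word boundary can never collide with the same bigram inside a word.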
TrigramHash
Adds a trigram lookup that reuses the same hash embedding table, so the feature costs no additional parameters.
parameters: null
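Because the trigram feature hashes into the same table, only the hash input changes; a sketch (hash constants are illustrative, the 4096-row table size is from the bigram entry):

```python
TABLE_ROWS = 4096  # same table the bigram feature indexes into

def trigram_bucket(t0, t1, t2):
    # A third token enters the hash; no new embedding rows are allocated.
    h = (t0 * 2654435761 + t1 * 40503 + t2 * 1000003) & 0xFFFFFFFF
    return h % TABLE_ROWS
```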
RoPE
Rotary position embeddings applied to only a subset of each attention head's dimensions.
parameters: {"dimensions":16,"of_total":64}
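With 16 of 64 dimensions rotated (from the card), the remaining 48 pass through position-independent. A sketch assuming the standard RoPE base of 10000 and that the rotated dimensions are the leading ones:

```python
import numpy as np

HEAD_DIM, ROT_DIM = 64, 16  # per the card: rotate 16 of 64 dims per head

def partial_rope(x):
    """x: (seq, HEAD_DIM). Applies RoPE to the first ROT_DIM dims only."""
    seq = x.shape[0]
    half = ROT_DIM // 2
    freqs = 10000.0 ** (-np.arange(half) / half)       # per-pair frequencies
    angles = np.arange(seq)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:ROT_DIM]           # the paired rotated dims
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, ROT_DIM:]], axis=-1)  # tail is untouched
```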
U-Net skip connections
11-layer U-Net GPT with encoder-decoder skip connections.
parameters: {"layers":11}
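For an 11-layer U-Net GPT, a minimal control-flow sketch; the 5-encoder / 1-bottleneck / 5-decoder split for odd depth and the additive skips are assumptions, since the card only names the technique:

```python
N = 11  # layers, from the card

def unet_forward(x, blocks):
    """blocks: list of N callables. Encoder outputs are stacked, then
    popped and added into the matching decoder block's input."""
    skips = []
    for i, f in enumerate(blocks):
        if i < N // 2:
            x = f(x)
            skips.append(x)          # encoder: save activation
        elif i == N // 2:
            x = f(x)                 # bottleneck, no skip
        else:
            x = f(x + skips.pop())   # decoder: add mirrored encoder output
    return x
```

Popping from the stack pairs encoder block j with decoder block N-1-j, the usual U-Net mirroring.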
GQA
Grouped-query attention with 8 query heads and 4 key-value heads.
parameters: {"q_heads":8,"kv_heads":4}
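With 8 query heads over 4 KV heads (from the card), each KV head serves a group of 2 query heads. A sketch, with head_dim as an illustrative assumption:

```python
import numpy as np

N_Q, N_KV, HEAD_DIM = 8, 4, 16  # 8 q_heads / 4 kv_heads per the card
GROUP = N_Q // N_KV             # 2 query heads share each KV head

def gqa(q, k, v):
    """q: (N_Q, seq, d); k, v: (N_KV, seq, d). Non-causal, for brevity."""
    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, GROUP, axis=0)  # (N_Q, seq, d)
    v = np.repeat(v, GROUP, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v
```

Halving the KV heads halves the K/V projection parameters and the KV cache relative to full multi-head attention.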
weight tying
Input token embeddings are tied to the output head weights.
parameters: null
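Weight tying in one line: the output head is the embedding matrix transposed. A sketch with an illustrative vocabulary size:

```python
import numpy as np

VOCAB, D = 1000, 64  # VOCAB is illustrative; D matches the model width above
rng = np.random.default_rng(0)
emb = (rng.standard_normal((VOCAB, D)) * 0.02).astype(np.float32)

def embed(tokens):
    return emb[tokens]          # input side: row lookup

def logits(hidden):
    return hidden @ emb.T       # output side: same matrix, transposed
```

For an artifact-size-constrained run, tying removes an entire VOCAB×D matrix from the checkpoint.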
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"matrix weights"}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings and scalar parameters"}
Weight Averaging
EMA
parameters: {"decay":0.998}
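The EMA update with the card's decay of 0.998, sketched over a plain parameter dict:

```python
DECAY = 0.998  # from the card

def ema_update(shadow, params, decay=DECAY):
    """After each optimizer step: shadow <- decay*shadow + (1-decay)*params."""
    return {k: decay * shadow[k] + (1 - decay) * params[k] for k in shadow}
```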
SWA
parameters: {"interval":50,"start_phase":"warmdown"}
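SWA here snapshots every 50 steps once the warmdown phase begins (interval and start phase from the card; the concrete warmdown start step is taken from the LR schedule entry and the running-mean form is an assumption):

```python
INTERVAL, WARMDOWN_START = 50, 2200  # interval from the card; 2200 from the LR schedule

class SWA:
    """Running average of weight snapshots taken during warmdown."""
    def __init__(self):
        self.avg, self.n = None, 0

    def maybe_update(self, step, params):
        if step < WARMDOWN_START or step % INTERVAL != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = dict(params)
        else:  # incremental mean over the snapshots seen so far
            self.avg = {k: v + (params[k] - v) / self.n for k, v in self.avg.items()}
```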
Quantization
GPTQ
bits: 6
scope: attention + MLP weights
GPTQ
bits: 8
scope: embeddings
late QAT
bits: 8
scope: model
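GPTQ proper calibrates against activation statistics and is not reproduced here; as a minimal stand-in, this sketch shows plain round-to-nearest quantization onto a symmetric per-row 6-bit grid (only the bit-widths come from the card):

```python
import numpy as np

def fake_quant(w, bits=6):
    """Round w to a symmetric per-output-channel grid and dequantize."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6-bit symmetric
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                 # dequantized weights
```

GPTQ improves on this baseline by choosing rounding directions that minimize layer output error, but the storage format (6-bit grid per channel) is the same.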
Compression
lzma
level: null
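The card does not state the lzma preset; a sketch of packing a (quantized) state dict with Python's stdlib `lzma`:

```python
import lzma
import pickle

def pack(state_dict):
    """Serialize then LZMA-compress the artifact (default preset)."""
    return lzma.compress(pickle.dumps(state_dict))

def unpack(blob):
    return pickle.loads(lzma.decompress(blob))
```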
Evaluation
sliding window eval
parameters: {"chunk_tokens":65536}
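The eval scores the validation stream in 65,536-token chunks (chunk size from the card; the stride/overlap is not given, so this sketch shows the simplest non-overlapping variant, with `loss_fn` standing in for a forward pass that returns summed nats over a chunk):

```python
import math

CHUNK = 65536  # chunk_tokens from the card

def eval_bpb(tokens, bytes_per_token, loss_fn):
    total_nats = 0.0
    for i in range(0, len(tokens), CHUNK):
        total_nats += loss_fn(tokens[i:i + CHUNK])
    # bits per byte = total nats / ln(2) / total bytes
    return total_nats / math.log(2) / (len(tokens) * bytes_per_token)
```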
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","momentum":0.9,"learning_rate":0.004,"epochs":4}
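Score-first TTT means each chunk is scored with the current weights before the model adapts on it, so no token is ever scored after being trained on. A sketch with the card's SGD hyperparameters (`score`/`grad` are hypothetical stand-ins for a loss forward and backward pass):

```python
LR, MOM, EPOCHS = 0.004, 0.9, 4  # from the card

def ttt_eval(chunks, score, grad, params):
    vel = {k: 0.0 for k in params}
    total = 0.0
    for chunk in chunks:
        total += score(params, chunk)   # score first...
        for _ in range(EPOCHS):         # ...then adapt on the same chunk
            g = grad(params, chunk)
            vel = {k: MOM * vel[k] + g[k] for k in params}
            params = {k: params[k] - LR * vel[k] for k in params}
    return total, params
```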
LR Schedule
cosine decay
parameters: {"warmdown_start_step":2200}
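Only the warmdown start step appears in the card; this sketch assumes a constant LR until step 2200 followed by cosine decay to zero, with the total step count and base LR as illustrative placeholders:

```python
import math

WARMDOWN_START, TOTAL_STEPS, BASE_LR = 2200, 3000, 0.02  # last two are assumptions

def lr_at(step):
    if step < WARMDOWN_START:
        return BASE_LR  # flat phase before warmdown
    frac = (step - WARMDOWN_START) / (TOTAL_STEPS - WARMDOWN_START)
    return BASE_LR * 0.5 * (1 + math.cos(math.pi * min(frac, 1.0)))
```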
Other
other
FOMAML-style meta-TTT during training, enabled every 4 steps in the parent experiment and ablated off in exp105a.
parameters: {"every":4,"enabled_in_this_pr":false}
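The meta-TTT inherited from the parent (and ablated off here) is FOMAML-style: take inner adaptation steps on a chunk, then use the gradient at the adapted weights as the outer update direction, without differentiating through the inner trajectory. A sketch with illustrative learning rates and a single inner step:

```python
INNER_LR, OUTER_LR, INNER_STEPS = 0.004, 0.02, 1  # illustrative values

def fomaml_step(params, grad, chunk):
    fast = dict(params)
    for _ in range(INNER_STEPS):                  # inner loop: adapt
        g = grad(fast, chunk)
        fast = {k: fast[k] - INNER_LR * g[k] for k in fast}
    g_outer = grad(fast, chunk)                   # gradient at adapted weights
    # First-order: apply g_outer directly to the slow weights.
    return {k: params[k] - OUTER_LR * g_outer[k] for k in params}
```

Note the mismatch the ablation points at: here the inner and outer gradients come from the same chunk, whereas the score-first TTT above always scores a chunk before adapting on it.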

Novel Contributions

  • Position-conditional bigram hashing that splits hash buckets by word-start vs within-word tokens
  • Trigram lookup added without extra parameters by reusing the same hash table
  • Controlled ablation showing inherited FOMAML meta-TTT contributes only noise-level improvement
  • Analysis that same-batch FOMAML meta-TTT is misaligned with score-first-then-adapt TTT
  • Weight-space/subspace analysis showing Muon can rotate weights while preserving function