PR #1310

open

Non-record: Exact Sequence Matching + TTT on PR #549 (1.1177 BPB)

by cadenmcmann
val_bpb
1.1177
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,882,529 bytes

Training Techniques

Evaluation
sliding window eval
parameters: {"stride":64}
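A minimal sketch of how a stride-64 sliding-window evaluation can cover a token stream: each window supplies up to `window` tokens of context, but only the tokens not scored by the previous window contribute to the loss. The window size of 256 is an assumed placeholder; only the stride of 64 comes from the PR.

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Yield (start, end, n_scored) spans so every token is scored exactly once.

    `stride=64` is from the PR; `window=256` is an assumption standing in
    for the model's real context length.
    """
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)   # context begins here
        end = min(pos + stride, n_tokens)       # last token scored this window
        spans.append((start, end, end - pos))   # only the final (end - pos) tokens are scored
        pos = end
    return spans
```

Summing the scored counts over all spans recovers the full token count, so BPB is computed over every token exactly once.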
exact sequence matching
parameters: {"min_order":8,"max_order":12,"lambda":0.15,"blend_cap":0.5}
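A sketch of the exact sequence matching idea: cache next-token counts for every exact context of order 8 to 12 seen in the evaluation stream, then blend the longest matching context's empirical distribution into the model's output. The blend rule `w = min(lambda * match_count, blend_cap)` is an assumption; the PR states the parameters but not the exact formula.

```python
from collections import defaultdict

MIN_ORDER, MAX_ORDER = 8, 12   # context orders tried, from the PR
LAM, BLEND_CAP = 0.15, 0.5     # lambda and blend_cap, from the PR

class SequenceMatcher:
    """Caches exact n-gram contexts and blends their empirical next-token
    distribution into the model's. The weighting w = min(LAM * count,
    BLEND_CAP) is an assumed interpretation, not the PR's exact rule."""

    def __init__(self):
        self.cache = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        # Record next-token counts for every context of order 8..12.
        for order in range(MIN_ORDER, MAX_ORDER + 1):
            for i in range(len(tokens) - order):
                ctx = tuple(tokens[i:i + order])
                self.cache[ctx][tokens[i + order]] += 1

    def blend(self, tokens, model_probs):
        # Prefer the longest exactly-matching suffix of the current context.
        for order in range(MAX_ORDER, MIN_ORDER - 1, -1):
            ctx = tuple(tokens[-order:])
            if len(ctx) == order and ctx in self.cache:
                counts = self.cache[ctx]
                total = sum(counts.values())
                w = min(LAM * total, BLEND_CAP)
                return {t: (1 - w) * model_probs.get(t, 0.0)
                           + w * counts.get(t, 0) / total
                        for t in set(model_probs) | set(counts)}
        return model_probs  # no exact match: fall back to the model alone
```

Because the cached distribution is only mixed in (never substituted outright, thanks to the cap), the model's predictions still dominate when match evidence is thin.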
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"all_blocks_unfrozen":true}
Architecture
LeakyReLU
MLP activation uses LeakyReLU(0.5)^2
parameters: {"negative_slope":0.5}
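Reading the activation description literally, the MLP nonlinearity is a LeakyReLU with negative slope 0.5 followed by squaring. Whether the sign is re-applied after squaring is not stated in the PR, so this sketch takes the plain square:

```python
NEG_SLOPE = 0.5  # negative_slope, from the PR

def leaky_relu_sq(x):
    """LeakyReLU(0.5) then square, read literally from the PR description.

    Note the square makes the negative branch positive (0.25 * x**2);
    a sign-preserving variant is possible but not stated, so this is
    the unadorned interpretation.
    """
    y = x if x >= 0 else NEG_SLOPE * x
    return y * y
```

This is in the family of squared-ReLU activations; the leaky slope keeps a nonzero gradient on the negative side.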
GQA
Uses grouped query attention with fewer KV heads than query heads
parameters: {"heads":8,"kv_heads":4}
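A sketch of grouped query attention with the PR's head counts: 8 query heads share 4 KV heads, so each pair of query heads attends over the same keys and values. Shapes, the absence of a causal mask, and the loop form are simplifications for clarity.

```python
import numpy as np

N_HEADS, N_KV_HEADS = 8, 4        # heads and kv_heads, from the PR
GROUP = N_HEADS // N_KV_HEADS     # query heads sharing one KV head (= 2)

def gqa(q, k, v):
    """GQA sketch: q is (N_HEADS, T, d); k and v are (N_KV_HEADS, T, d).

    Query head h uses KV head h // GROUP, halving the KV cache relative
    to full multi-head attention. Causal masking is omitted here.
    """
    T, d = q.shape[1], q.shape[2]
    out = np.empty_like(q)
    for h in range(N_HEADS):
        kv = h // GROUP                        # shared KV head for this query head
        scores = q[h] @ k[kv].T / np.sqrt(d)   # (T, T) attention logits
        scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn = scores / scores.sum(axis=-1, keepdims=True)
        out[h] = attn @ v[kv]
    return out
```

The practical payoff is the 2x smaller KV cache; query-side capacity is unchanged.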
XSA
XSA attention on the last 4 layers
parameters: {"layers":4}
BigramHash
Bigram hash embedding / vocabulary component
parameters: {"size":1536}
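A sketch of a bigram hash embedding: each (previous, current) token pair is hashed into a 1536-row table whose rows would be added to the ordinary token embedding. Only the table size comes from the PR; the hash mixing constant and the handling of the first position are placeholder assumptions.

```python
TABLE_SIZE = 1536  # size, from the PR

def bigram_hash(prev_tok, tok):
    """Map a (prev, current) token pair to one of TABLE_SIZE embedding rows.

    The multiplicative constant is an arbitrary mixing prime (assumption);
    the PR does not specify the hash function.
    """
    return (prev_tok * 0x9E3779B1 + tok) % TABLE_SIZE

def bigram_ids(tokens):
    # First position has no predecessor; reuse the token itself (assumption).
    return [bigram_hash(tokens[i - 1] if i else tokens[i], tokens[i])
            for i in range(len(tokens))]
```

Hashing keeps the table small (1536 rows regardless of vocabulary size) at the cost of collisions between unrelated bigrams.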
Compression
lzma
level: null
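The artifact size is presumably measured on the lzma-compressed checkpoint bytes; the compression level is listed as null, so the sketch below uses the stdlib default preset.

```python
import lzma

def artifact_size(blob: bytes) -> int:
    """Size metric sketch: lzma-compress the serialized artifact and count
    bytes. The preset is left at the stdlib default since the PR lists the
    level as null."""
    return len(lzma.compress(blob))
```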
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"parameter_banking":true}
Weight Averaging
EMA
parameters: null
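A minimal sketch of EMA weight averaging: maintain a shadow copy of the weights updated as a decayed average after each step, and evaluate with the shadow copy. The decay value below is an assumption, since the PR lists no EMA parameters.

```python
def ema_update(avg, params, decay=0.999):
    """One EMA step over a flat list of weights.

    decay=0.999 is a common default and an assumption here; the PR's EMA
    parameters are unspecified (null).
    """
    return [decay * a + (1 - decay) * p for a, p in zip(avg, params)]
```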
Sequence Length
sequence_length
train_length: null
eval_length: null

Novel Contributions

  • Exact eval-time N-gram sequence matching stacked on top of TTT
  • Demonstration that sequence matching and TTT are complementary
  • Improvement from 1.1195 BPB (TTT alone) to 1.1177 BPB with sequence matching added
  • Use of exact 8-12 token context caching to mix cached next-token predictions into model outputs