PR #573

closed

Record: 11L XSA4 + Multi-Pass Streaming Score-First Legal TTT (3-seed mean val_bpb=1.0523)

by SarimsaljookView on GitHub

val_bpb

1.0523

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.92 MB

Training Techniques

Test-Time Training

score-first multi-pass legal TTT

parameters: {"passes":3,"learning_rate":0.0005,"batch_size":"16 sequences x 2048 tokens","optimizer":"AdamW"}

Architecture

XSA

Exclusive Self-Attention in the last 4 layers

parameters: {"layers":4}

Partial RoPE

Rotary position embeddings applied to a subset of head dimensions

parameters: {"dimensions":16,"total_head_dims":64}

MLP3x

3x MLP expansion

parameters: {"hidden_dim":1536}

tied embeddings

Input and output embeddings are tied

parameters: null

U-Net skip connections

Encoder-decoder skip connections in the transformer stack

parameters: {"encoder_layers":5,"decoder_layers":6}

SmearGate

Custom gating mechanism combined with BigramHash

parameters: null

BigramHash

Bigram hashing module with bucketed representation

parameters: {"buckets":2048,"dim":128}

Quantization

int6 QAT

bits: 6

scope: MLP + attention

STE QAT

bits: 6

scope: late QAT when LR scale < 0.15

Optimizer

Muon

weight_decay: 0.04

momentum: 0.99

other_params: {"lr":0.025}

AdamW

weight_decay: 0

momentum: null

other_params: {"used_for":"TTT and embeddings/scalars"}

Weight Averaging

EMA

parameters: {"decay":0.997}

SWA

parameters: {"frequency_steps":50,"start_condition_lr_scale":0.2}

Compression

zstd

level: 22

Evaluation

sliding window eval

parameters: {"score_first":true,"multi_pass":true}

LR Schedule

cosine decay

parameters: {"to_zero":true,"warmdown_steps":3500}

Regularization

layerwise LN scale

parameters: {"scale":"1/sqrt(layer_idx+1)"}

Other

other

Rotary cache backprop fix by cloning cached cos/sin tensors to avoid inference tensor autograd errors

parameters: null

Novel Contributions

Multi-pass streaming score-first legal test-time training with shifted data orderings
Per-token final score chosen as the minimum NLL across three adaptation trajectories
Rotary cache backprop fix using cloned cached cos/sin tensors to enable TTT through partial RoPE
Legal TTT evaluation that scores tokens before any training on those tokens