PR #397

open

Record: Dynamic Eval + TTT on SOTA Pipeline (val_bpb=1.1364)

by translatingthename

val_bpb

1.1364

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.65 MB

Training Techniques

Architecture

XSA

Exclusive Self Attention applied to the last 4 layers.

parameters: {"layers":4}

Partial RoPE

Rotary positional embeddings applied to only a subset of the dimensions (16, per the parameters below).

parameters: {"dimensions":16}
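As an illustration of what partial rotary embeddings could look like, here is a minimal sketch in plain Python. It assumes that "dimensions": 16 means the first 16 entries of each head vector are rotated and the rest pass through unchanged. The function name `partial_rope`, the "rotate-half" pairing, and the base of 10000 are assumptions for illustration, not details taken from the PR.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest untouched.

    Assumption: entry i is paired with entry i + rotary_dims // 2
    (the "rotate-half" convention). The PR may pair dimensions differently.
    """
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # Each pair gets its own frequency, decreasing with i.
        theta = position * base ** (-2.0 * i / rotary_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out
```

Because each pair is rotated by a plain 2D rotation, the norm of the rotated slice is preserved, and position 0 leaves the vector unchanged.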

SmearGate

Custom gating mechanism used in the model.

parameters: null

BigramHash

Bigram-based hashing component used in the model.

parameters: null

Weight Averaging

EMA

parameters: {"decay":0.997}
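A minimal sketch of an exponential moving average of weights with the listed decay of 0.997. The dictionary-of-floats parameter layout and the name `ema_update` are hypothetical; a real pipeline would apply the same formula to tensors.

```python
def ema_update(ema_params, params, decay=0.997):
    """Blend the current weights into the running average, in place.

    ema <- decay * ema + (1 - decay) * current
    """
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params
```

With decay 0.997, each step moves the average 0.3% of the way toward the current weights, so the average reflects roughly the last few hundred steps.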

Regularization

LN scale

parameters: null

Quantization

QAT

bits: 6

scope: all
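To make the 6-bit setting concrete, the following sketch shows the forward pass of a "fake quantization" step of the kind commonly used in quantization-aware training. It assumes symmetric, per-tensor quantization with a scale derived from the largest absolute weight; the PR does not state the scheme, so treat this as one plausible variant. The name `fake_quantize` is hypothetical.

```python
def fake_quantize(weights, bits=6):
    """Round weights to a signed `bits`-bit grid and map them back to floats.

    Assumption: symmetric per-tensor scaling. During training, a
    straight-through estimator would typically pass gradients through
    the rounding step; only the forward pass is shown here.
    """
    qmax = 2 ** (bits - 1) - 1          # 31 for 6 bits
    qmin = -qmax - 1                    # -32 for 6 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for w in weights:
        q = max(qmin, min(qmax, round(w / scale)))  # integer code
        out.append(q * scale)                        # dequantized value
    return out
```

A 6-bit signed grid has 64 levels, so each dequantized weight is within half a quantization step of its original value.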

Compression

zstd

level: null

Initialization

OrthoInit

Orthogonal initialization strategy.
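For readers unfamiliar with orthogonal initialization, here is a small dependency-free sketch that builds a square matrix with orthonormal rows by applying Gram-Schmidt to random Gaussian vectors. The function name `orthogonal_init` and the seed handling are hypothetical; frameworks usually obtain the same kind of matrix through a QR decomposition, and the PR does not say which variant it uses.

```python
import random

def orthogonal_init(n, seed=0):
    """Return an n x n matrix (list of rows) with orthonormal rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the components along the rows accepted so far.
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:  # skip the rare degenerate draw
            rows.append([a / norm for a in v])
    return rows
```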

Test-Time Training

full TTT

parameters: {"learning_rate":0.002,"epochs":3,"freeze_blocks":2,"momentum":0.9}
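The listed TTT settings (learning rate 0.002, momentum 0.9, two frozen blocks) can be sketched as a single SGD-with-momentum step that skips frozen parameters. This is an assumption-laden illustration: the scalar-per-block parameter layout, the name `ttt_sgd_step`, and the choice of which blocks are frozen are all hypothetical, and the momentum form shown (velocity accumulates raw gradients) is one common convention.

```python
def ttt_sgd_step(params, grads, velocity, lr=0.002, momentum=0.9, frozen=()):
    """Apply one SGD-with-momentum update in place, skipping frozen entries.

    velocity <- momentum * velocity + grad
    param    <- param - lr * velocity
    """
    for name in params:
        if name in frozen:
            continue  # frozen blocks keep their trained weights
        velocity[name] = momentum * velocity[name] + grads[name]
        params[name] -= lr * velocity[name]
    return params
```

Under this reading, "epochs": 3 would mean running such steps over the adaptation data three times before scoring.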

Evaluation

sliding window eval

parameters: {"stride":64,"batch_size":32,"adapt_every_batches":4}
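A sketch of how sliding-window scoring with a stride of 64 might partition a token stream. Each window supplies context, but only tokens not yet scored by an earlier window are counted, so every token is scored exactly once. The window length is not given in the PR and is a free parameter here; the name `sliding_windows` is hypothetical.

```python
def sliding_windows(num_tokens, window, stride=64):
    """Return (start, end, score_from) triples.

    The model would see tokens [start, end) as one window, and only
    tokens [score_from, end) contribute to the loss for that window.
    """
    assert 0 < stride <= window
    spans = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans
```

After the first window, each later window scores at most `stride` new tokens while seeing up to `window - stride` tokens of preceding context.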

Optimizer

SGD

weight_decay: null

momentum: 0

other_params: {"learning_rate":0.001,"rank_local":true}

LR Schedule

warmdown

parameters: {"warmdown_iters":3000,"warmup_steps":1500}
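One way to read the schedule parameters is a learning-rate multiplier that ramps up over the first 1500 steps, holds at 1.0, then decays over the final 3000 steps. The linear shapes, the decay to zero, and the `total_steps` argument are assumptions for illustration; the PR lists only the two step counts.

```python
def lr_scale(step, total_steps, warmup_steps=1500, warmdown_iters=3000):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps              # linear warmup
    warmdown_start = total_steps - warmdown_iters
    if step >= warmdown_start:
        # linear decay to zero over the final warmdown_iters steps
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0                                        # plateau
```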

Novel Contributions

Dynamic evaluation during validation scoring using periodic SGD steps on sliding windows.
Combining dynamic evaluation with TTT on the SOTA pipeline without changing training.
Zero additional artifact cost while improving validation bpb.
Rank-local adaptation during evaluation with batched window scoring.
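The score-then-adapt pattern behind dynamic evaluation can be sketched with a toy model. Here the "model" is just a vector of unigram logits, which is an assumption made to keep the example self-contained; the real pipeline adapts a Transformer. Each batch is scored with the current weights first, and a plain SGD step (momentum 0) is taken every `adapt_every_batches` batches using the most recent batch. The name `dynamic_eval` and the choice to adapt on only the latest batch are hypothetical. The function returns bits per token, not bits per byte.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def dynamic_eval(batches, vocab_size, lr=0.001, adapt_every_batches=4):
    """Score non-empty token batches in order, adapting periodically."""
    logits = [0.0] * vocab_size
    total_bits, total_tokens = 0.0, 0
    for i, batch in enumerate(batches, start=1):
        probs = softmax(logits)
        # Score BEFORE updating, so no token is predicted by weights
        # that have already been trained on it.
        total_bits += sum(-math.log2(probs[t]) for t in batch)
        total_tokens += len(batch)
        if i % adapt_every_batches == 0:
            freq = [0.0] * vocab_size
            for t in batch:
                freq[t] += 1.0 / len(batch)
            # Gradient of mean cross-entropy (in nats) w.r.t. logits
            # is (softmax - empirical frequency).
            logits = [w - lr * (p - f) for w, p, f in zip(logits, probs, freq)]
    return total_bits / total_tokens
```

Since adaptation only changes weights in memory during scoring, nothing is added to the saved artifact, which matches the zero-artifact-cost point above.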