PR #1722

open

Record: Trinity SLOT v3 + Pre-Quant TTT — val_bpb 0.65802 (3-seed mean)

by deborahnelson8788726

val_bpb

0.6580

Architecture

Transformer

Optimizer

AdamW

Artifact Size

15.8MB

Training Techniques

Architecture

MLP3x

3.0x MLP expansion with LeakyReLU activation

parameters: {"hidden_multiplier":3}
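
A minimal sketch of a 3x MLP block with LeakyReLU, in plain Python for illustration. The model width, the negative slope (0.01 here) and the absence of biases are assumptions; the PR summary only states the 3x hidden multiplier and the activation.

```python
def leaky_relu(x, negative_slope=0.01):
    # negative_slope is an assumed value; the summary does not state it.
    return x if x >= 0.0 else negative_slope * x


def matvec(weight_rows, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


def hidden_width(model_dim, hidden_multiplier=3):
    # Width of the MLP hidden layer for a given model width.
    return model_dim * hidden_multiplier


def mlp_forward(x, w_up, w_down, negative_slope=0.01):
    # Up-project to the hidden width, apply LeakyReLU, project back down.
    hidden = [leaky_relu(h, negative_slope) for h in matvec(w_up, x)]
    return matvec(w_down, hidden)
```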

GQA

Uses grouped query attention with 8 attention heads and 4 KV heads

parameters: {"heads":8,"kv_heads":4}
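
With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. The sketch below shows one common way to assign them (contiguous groups); the actual grouping used in the PR is an assumption.

```python
def kv_head_for_query_head(query_head, heads=8, kv_heads=4):
    # Contiguous grouping is assumed: query heads 0-1 share KV head 0, and so on.
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    if not 0 <= query_head < heads:
        raise ValueError("query_head out of range")
    group_size = heads // kv_heads
    return query_head // group_size


mapping = [kv_head_for_query_head(h) for h in range(8)]
```

Under these assumptions `mapping` is `[0, 0, 1, 1, 2, 2, 3, 3]`, so the KV projections and cache are half the size they would be with 8 KV heads.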

Partial RoPE

Applies rotary position embeddings to a subset of head dimensions

parameters: {"dimensions":"16/64"}
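
A sketch of partial RoPE on a single 64-dimensional head vector, rotating 16 dimensions and passing the other 48 through unchanged. Which 16 dimensions are rotated (the first 16 here), the half-split pairing and the base of 10000 are all assumptions.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    # Rotate only the first `rotary_dims` entries; leave the rest untouched.
    if rotary_dims % 2 != 0 or rotary_dims > len(vec):
        raise ValueError("rotary_dims must be even and fit inside the head")
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # Each pair (i, i + half) gets its own rotation frequency.
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```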

XSA

XSA applied on all layers

parameters: {"layers":11}

BigramHash

Bigram hash feature with XOR hashing

parameters: {"dimensions":"3072x112"}
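
One way an XOR-based bigram hash could map a (previous token, current token) pair to a row of a 3072-row embedding table. This reads "3072x112" as 3072 buckets of 112-dimensional embeddings, which is an assumption, and the multiplier constant is hypothetical, not taken from the PR.

```python
def bigram_bucket(prev_token, token, num_buckets=3072, multiplier=0x9E3779B1):
    # Scramble the previous token with a (hypothetical) odd multiplier,
    # XOR in the current token, then fold into the table size.
    if prev_token < 0 or token < 0:
        raise ValueError("token ids must be non-negative")
    mixed = (prev_token * multiplier) ^ token
    return mixed % num_buckets


def bigram_buckets(tokens, num_buckets=3072):
    # Bucket index for every position that has a predecessor.
    return [bigram_bucket(p, t, num_buckets) for p, t in zip(tokens, tokens[1:])]
```

The returned indices would select rows of the bigram embedding table; distinct bigrams can collide in the same bucket by design.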

Value Embeddings

Value embeddings used in later layers

parameters: {"layers":[9,10]}

U-Net skip connections

U-Net style skip connections with SmearGate

parameters: null

SmearGate

SmearGate used in U-Net skip connections

parameters: null

weight tying

Tied embeddings

parameters: null

Regularization

logit softcap

parameters: {"value":30}
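
Logit softcapping is commonly implemented as a scaled tanh, which keeps small logits nearly unchanged and bounds large ones by the cap. A sketch assuming that form with cap 30:

```python
import math


def softcap(logit, cap=30.0):
    # Smoothly squash the logit into the interval [-cap, cap].
    return cap * math.tanh(logit / cap)
```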

Quantization

GPTQ

bits: 6

scope: all

late QAT

bits: 6

scope: all
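
To illustrate what a 6-bit weight grid looks like, here is plain symmetric round-to-nearest quantization with one scale per weight group. This is not GPTQ (which additionally compensates rounding error using second-order statistics) and not the PR's code; the per-group scale is an assumption.

```python
def quantize_6bit(weights):
    # Signed 6-bit codes span [-32, 31]; scale so the largest weight maps to 31.
    qmax = 2 ** (6 - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    # Map integer codes back to approximate float weights.
    return [c * scale for c in codes]
```

Late QAT would apply the same quantize/dequantize round trip in the forward pass near the end of training so the weights settle close to this grid.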

Weight Averaging

EMA + SWA

parameters: {"ema_decay":0.997,"swa_interval_steps":50}
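
A sketch of the two averaging rules with the listed settings: an exponential moving average with decay 0.997 updated every step, and a running mean of snapshots taken every 50 steps. How the PR combines the two averages is not stated, so they are shown independently.

```python
def ema_update(ema, weights, decay=0.997):
    # Exponential moving average of the weights, applied once per step.
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class SwaAverager:
    """Running mean of weight snapshots taken every `interval_steps` steps."""

    def __init__(self, interval_steps=50):
        self.interval_steps = interval_steps
        self.count = 0
        self.mean = None

    def maybe_update(self, step, weights):
        # Only steps that are multiples of the interval contribute a snapshot.
        if step % self.interval_steps != 0:
            return False
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            self.mean = [m + (w - m) / self.count for m, w in zip(self.mean, weights)]
        return True
```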

Optimizer

AdamW

weight_decay: 1e-8

momentum: null

other_params: {"betas":[0.9,0.95],"eps":0.00001}
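
For reference, a scalar AdamW step with the listed betas, eps and weight decay. The learning rate is passed in by the caller because this section does not list one; the decoupled weight decay follows the standard AdamW formulation.

```python
import math


def adamw_step(param, grad, m, v, step, lr,
               betas=(0.9, 0.95), eps=1e-5, weight_decay=1e-8):
    # `step` counts from 1 so the bias correction is well defined.
    beta1, beta2 = betas
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    # Decoupled weight decay: shrink the parameter directly, not via the gradient.
    param = param * (1.0 - lr * weight_decay)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```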

Evaluation

sliding window eval

parameters: {"stride":64}
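
A sketch of how stride-64 sliding-window evaluation could partition a token stream so that every token is scored exactly once with as much left context as the window allows. The window length is a caller-supplied assumption, since the summary lists no evaluation length.

```python
def sliding_windows(num_tokens, window, stride=64):
    # Returns (context_start, score_start, end) triples. The model sees
    # tokens [context_start, end) but only [score_start, end) is scored.
    if window < stride:
        raise ValueError("window must be at least as long as stride")
    spans = []
    scored_upto = 0
    while scored_upto < num_tokens:
        end = min(num_tokens, max(window, scored_upto + stride))
        start = max(0, end - window)
        spans.append((start, scored_upto, end))
        scored_upto = end
    return spans
```

After the first window, each step advances by 64 tokens and scores only those 64 new tokens, so they are always predicted with close to a full window of context.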

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.001,"epochs":1,"chunk_tokens":32768,"freeze_blocks":10}
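
The ordering that "score-first" refers to can be sketched as a loop: each 32768-token chunk is scored before the model trains on it, so no chunk is evaluated by weights that have already seen it. `score_fn` and `train_fn` are hypothetical callables standing in for the real model; block freezing and the learning rate are not modeled here.

```python
def score_first_ttt(tokens, score_fn, train_fn, chunk_tokens=32768, epochs=1):
    # score_fn(chunk) -> loss contribution; train_fn(chunk) adapts the model.
    total_loss = 0.0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        # 1) Score with weights that have not been trained on this chunk.
        total_loss += score_fn(chunk)
        # 2) Only afterwards adapt on the chunk that was just scored.
        for _ in range(epochs):
            train_fn(chunk)
    return total_loss
```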

Other

other

Per-sample SLOT v3 optimization on top of TTT-adapted model using ephemeral delta and logit bias parameters

parameters: {"steps":24,"learning_rate":0.024}
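
A toy illustration of the logit-bias half of this idea only: a per-sample bias vector is created from zero, fitted for 24 steps at learning rate 0.024, used, and then discarded. It assumes plain gradient descent on cross-entropy over positions that are not being scored; the hidden-state delta, the actual optimizer and the actual position masking in the PR are not reproduced.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, targets, bias):
    # Average negative log-likelihood with the bias added to every row.
    losses = []
    for logits, target in zip(logit_rows, targets):
        probs = softmax([l + b for l, b in zip(logits, bias)])
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


def fit_logit_bias(context_logits, context_targets, steps=24, learning_rate=0.024):
    # Ephemeral parameter: starts at zero for every sample.
    vocab = len(context_logits[0])
    bias = [0.0] * vocab
    for _ in range(steps):
        grad = [0.0] * vocab
        for logits, target in zip(context_logits, context_targets):
            probs = softmax([l + b for l, b in zip(logits, bias)])
            for j in range(vocab):
                # Gradient of cross-entropy w.r.t. a logit is prob - one_hot.
                grad[j] += probs[j] - (1.0 if j == target else 0.0)
        n = len(context_logits)
        bias = [b - learning_rate * g / n for b, g in zip(bias, grad)]
    return bias
```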

Compression

lzma

level: 9
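
With Python's standard `lzma` module, level 9 corresponds to `preset=9`. A minimal sketch; the payload layout (what bytes are packed into the artifact) is not specified here and is left to the caller.

```python
import lzma


def compress_artifact(payload: bytes) -> bytes:
    # preset=9 is the highest standard LZMA compression level.
    return lzma.compress(payload, preset=9)


def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```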

Sequence Length

sequence_length

train_length: 32768

eval_length: null

LR Schedule

cosine decay

parameters: {"start_lr":0.024,"end_lr":0.001}
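
A sketch of cosine decay between the listed endpoints. The total number of steps is an assumption supplied by the caller, and the summary does not say which optimization loop this schedule drives.

```python
import math


def cosine_lr(step, total_steps, start_lr=0.024, end_lr=0.001):
    # Half-cosine from start_lr at step 0 down to end_lr at total_steps.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end_lr + 0.5 * (start_lr - end_lr) * (1.0 + math.cos(math.pi * progress))
```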

Novel Contributions

Pre-quant score-first test-time training on already-scored chunks
Per-sample SLOT v3 applied after TTT for additional adaptation
TTT → SLOT cascade on top of the PR #1019 SOTA stack
Three-seed verified record result with low variance
No scored-region SLOT leakage and no target-in-key n-gram cache