PR #1722

open

Record: Trinity SLOT v3 + Pre-Quant TTT — val_bpb 0.65802 (3-seed mean)

by deborahnelson8788726
val_bpb
0.6580
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.8MB

Training Techniques

Architecture
MLP3x
3.0x MLP expansion with LeakyReLU activation
parameters: {"hidden_multiplier":3}
GQA
Uses grouped query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
Partial RoPE
Applies rotary position embeddings to a subset of head dimensions
parameters: {"dimensions":"16/64"}
XSA
XSA applied on all layers
parameters: {"layers":11}
BigramHash
Bigram hash feature with XOR hashing
parameters: {"dimensions":"3072x112"}
Value Embeddings
Value embeddings used in later layers
parameters: {"layers":[9,10]}
U-Net skip connections
U-Net style skip connections with SmearGate
parameters: null
SmearGate
SmearGate used in U-Net skip connections
parameters: null
weight tying
Tied embeddings
parameters: null
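
To make the GQA entry above concrete (8 attention heads, 4 KV heads), here is a minimal sketch of how query heads could be mapped onto shared KV heads. The function name `kv_head_for` and the contiguous grouping are assumptions for illustration; the summary does not state the PR's actual head layout.

```python
def kv_head_for(query_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Return the KV head index shared by a given query head.

    Assumes contiguous grouping (query heads 0..g-1 share KV head 0, and so on).
    This is a common convention, not something the PR summary confirms.
    """
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # 8 // 4 = 2 query heads per KV head
    return query_head // group_size


# Which KV head each of the 8 query heads would read from.
mapping = [kv_head_for(h) for h in range(8)]
```

With these settings each KV head serves two query heads, which halves the number of key and value projections relative to full multi-head attention.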
Regularization
logit softcap
parameters: {"value":30}
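
A logit softcap with value 30 is commonly written as `cap * tanh(logit / cap)`. The sketch below assumes that tanh form; the summary lists only the cap value, so the exact function used in the PR is not confirmed here.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    """Smoothly bound a logit to [-cap, cap] using the tanh form (assumed)."""
    return cap * math.tanh(logit / cap)


# Small logits pass through almost unchanged; large ones saturate near the cap.
capped = [softcap(x) for x in (-1000.0, -1.0, 0.0, 1.0, 1000.0)]
```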
Quantization
GPTQ
bits: 6
scope: all
late QAT
bits: 6
scope: all
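
To show what a 6-bit weight grid looks like, the sketch below uses plain round-to-nearest quantization with one symmetric scale per row. This is deliberately not GPTQ, which additionally compensates rounding error using second-order information. The per-row scale and the helper names are assumptions for illustration.

```python
def quantize_row_int6(row):
    """Symmetric round-to-nearest quantization of one weight row to 6 bits.

    NOTE: this is NOT GPTQ; it only illustrates the 6-bit integer grid.
    Signed 6-bit integers span [-32, 31]; the symmetric part [-31, 31] is used.
    """
    qmax = 31
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


row = [0.50, -0.25, 0.125, 0.0, -0.5]
codes, scale = quantize_row_int6(row)
restored = dequantize_row(codes, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

The worst-case rounding error per weight is half of one quantization step, that is `scale / 2`.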
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_interval_steps":50}
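
The two averaging schemes listed above can be sketched as follows, using the stated EMA decay of 0.997 and an SWA snapshot every 50 steps. How the PR combines the EMA and SWA weights is not stated in this summary, so the sketch keeps them as two independent averages; `ema_update` and `SWAAverager` are hypothetical names.

```python
def ema_update(ema, weights, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class SWAAverager:
    """Uniform running average of snapshots taken every `interval` steps."""

    def __init__(self, interval=50):
        self.interval = interval
        self.count = 0
        self.mean = None

    def maybe_update(self, step, weights):
        if step % self.interval != 0:
            return
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            # Incremental mean: m <- m + (w - m) / n
            self.mean = [m + (w - m) / self.count
                         for m, w in zip(self.mean, weights)]


# Toy run: a single scalar "weight" whose value equals the step number.
ema = [0.0]
swa = SWAAverager(interval=50)
for step in range(1, 201):
    weights = [float(step)]
    ema = ema_update(ema, weights)
    swa.maybe_update(step, weights)
```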
Optimizer
AdamW
weight_decay: 1e-8
momentum: null
other_params: {"betas":[0.9,0.95],"eps":0.00001}
Evaluation
sliding window eval
parameters: {"stride":64}
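
A sliding window evaluation with stride 64 can be sketched as below: after the first window, each new window advances by 64 tokens and only the newly exposed tokens are scored, so every token is scored exactly once with as much left context as the window allows. The `window` length of 256 is a hypothetical value chosen for the toy run, since the summary leaves the evaluation length unspecified.

```python
def sliding_windows(num_tokens, window, stride=64):
    """Yield (start, end, score_from) triples.

    Tokens in [score_from, end) are scored using the context [start, end).
    `window` is a hypothetical context length, not taken from the PR.
    """
    end = min(window, num_tokens)
    yield 0, end, 0  # first window: score everything it covers
    while end < num_tokens:
        new_end = min(end + stride, num_tokens)
        start = max(0, new_end - window)
        yield start, new_end, end  # score only the newly exposed tokens
        end = new_end


spans = list(sliding_windows(num_tokens=1000, window=256, stride=64))
scored = [t for _, end, score_from in spans for t in range(score_from, end)]
```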
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.001,"epochs":1,"chunk_tokens":32768,"freeze_blocks":10}
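
The "score-first" ordering can be sketched as a loop in which each chunk is scored with the current weights before the model is allowed to train on it, so no token contributes to its own score. The callables `score_fn` and `adapt_fn` are hypothetical placeholders for the model's evaluation and one-epoch update. The toy usage relies on a small chunk size rather than the 32768 tokens listed above.

```python
def score_first_ttt(tokens, chunk_tokens, score_fn, adapt_fn):
    """Score each chunk with the current model, then adapt on that chunk.

    score_fn(chunk) -> total loss for the chunk under the current model.
    adapt_fn(chunk) -> updates the model in place.
    Both callables are hypothetical placeholders.
    """
    total_loss = 0.0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        total_loss += score_fn(chunk)  # 1) score before training on the chunk
        adapt_fn(chunk)                # 2) only then adapt on it
    return total_loss


# Toy usage: a "model" that just remembers which tokens it has trained on.
seen = set()
events = []


def toy_score(chunk):
    events.append(("score", chunk[0]))
    return float(sum(1 for t in chunk if t not in seen))  # unseen tokens cost 1


def toy_adapt(chunk):
    events.append(("adapt", chunk[0]))
    seen.update(chunk)


loss = score_first_ttt(list(range(8)), chunk_tokens=4,
                       score_fn=toy_score, adapt_fn=toy_adapt)
```

In the toy run every token is still unseen at the moment it is scored, which is the property the ordering is meant to guarantee.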
Other
other
Per-sample SLOT v3 optimization on top of TTT-adapted model using ephemeral delta and logit bias parameters
parameters: {"steps":24,"learning_rate":0.024}
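
As a toy illustration of the per-sample idea only, the sketch below fits an ephemeral logit bias on the context positions of one sample for 24 steps at learning rate 0.024, after which the bias would be discarded. Several things here are assumptions: plain gradient descent stands in for whatever optimizer the PR uses, fixed base logits stand in for the TTT-adapted model, and the hidden-state delta is omitted entirely.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def mean_ce(base_logits, targets, bias):
    """Mean cross-entropy of targets under base logits plus a shared bias."""
    total = 0.0
    for logits, target in zip(base_logits, targets):
        probs = softmax([l + b for l, b in zip(logits, bias)])
        total -= math.log(probs[target])
    return total / len(targets)


def fit_logit_bias(base_logits, targets, steps=24, lr=0.024):
    """Fit a per-sample logit bias by plain gradient descent (assumed optimizer)."""
    vocab = len(base_logits[0])
    bias = [0.0] * vocab  # ephemeral: created fresh for every sample
    for _ in range(steps):
        grad = [0.0] * vocab
        for logits, target in zip(base_logits, targets):
            probs = softmax([l + b for l, b in zip(logits, bias)])
            for v in range(vocab):
                # d(CE)/d(bias_v) = p_v - 1[v == target]
                grad[v] += probs[v] - (1.0 if v == target else 0.0)
        bias = [b - lr * g / len(targets) for b, g in zip(bias, grad)]
    return bias


# Toy sample: uniform base logits over a 3-token vocabulary.
context_logits = [[0.0, 0.0, 0.0]] * 4
context_targets = [2, 2, 1, 2]

loss_before = mean_ce(context_logits, context_targets, [0.0, 0.0, 0.0])
bias = fit_logit_bias(context_logits, context_targets)
loss_after = mean_ce(context_logits, context_targets, bias)
```

Fitting the bias only on context positions keeps the tokens that are about to be scored out of the optimization, which is the kind of leakage the summary says the PR avoids.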
Compression
lzma
level: 9
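
For reference, LZMA compression at level 9 is available in Python's standard library through the `preset` argument. The payload below is a stand-in; how the PR serializes its quantized weights before compression is not described in this summary.

```python
import lzma


def compress_artifact(raw: bytes) -> bytes:
    """Compress bytes with LZMA at preset 9 (the level listed above)."""
    return lzma.compress(raw, preset=9)


payload = bytes(range(64)) * 1024  # stand-in for serialized model weights
packed = compress_artifact(payload)
unpacked = lzma.decompress(packed)
```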
Sequence Length
sequence_length
train_length: 32768
eval_length: null
LR Schedule
cosine decay
parameters: {"start_lr":0.024,"end_lr":0.001}
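
A cosine decay between the listed endpoints (0.024 down to 0.001) can be written as below. The horizon `total_steps` is a hypothetical value, since the summary does not state how many steps the schedule spans or which phase it applies to.

```python
import math


def cosine_lr(step, total_steps, start_lr=0.024, end_lr=0.001):
    """Cosine decay from start_lr at step 0 to end_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end_lr + 0.5 * (start_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


total_steps = 100  # hypothetical horizon; not stated in the summary
schedule = [cosine_lr(s, total_steps) for s in range(total_steps + 1)]
```

At the halfway point the rate sits at the midpoint of the two endpoints, 0.0125.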

Novel Contributions

  • Pre-quant score-first test-time training on already-scored chunks
  • Per-sample SLOT v3 applied after TTT for additional adaptation
  • TTT → SLOT cascade on top of the PR #1019 SOTA stack
  • Three-seed verified record result with low variance
  • No scored-region SLOT leakage and no target-in-key n-gram cache