PR #1150

open

Legal TTT (SGD, 3-epoch) + SLOT (lr=0.003, steps=5) on PR #549 base -- val_bpb: 1.11512 (3-seed mean, beats merged SOTA 1.1194)

by sahiee-dev

val_bpb

1.1151

Architecture

Transformer

Optimizer

Parallel Muon

Artifact Size

15.95-15.96MB

Training Techniques

Architecture

GQA

Grouped query attention with 8 attention heads and 4 KV heads.

parameters: {"heads":8,"kv_heads":4}
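
To illustrate the 8-head / 4-KV-head layout, here is a minimal NumPy sketch in which each KV head is shared by two query heads. The function name, the head dimension of 16, the sequence length, and the causal mask are assumptions made for the example; the PR's actual attention code is not shown here.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Causal attention. q: (heads, T, d); k, v: (kv_heads, T, d)."""
    heads, T, d = q.shape
    kv_heads = k.shape[0]
    assert heads % kv_heads == 0
    group = heads // kv_heads                # query heads per KV head (8 // 4 = 2)
    # Repeat each KV head so that query head h reads KV head h // group.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)      # (heads, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), 1)    # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                                  # (heads, T, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))   # 8 query heads (head_dim=16 is an assumption)
k = rng.standard_normal((4, 5, 16))   # 4 KV heads
v = rng.standard_normal((4, 5, 16))
out = grouped_query_attention(q, k, v)
```

Only the K and V projections shrink under this scheme; the output keeps one slice per query head.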

Partial RoPE

Rotary position embeddings applied to a subset of dimensions.

parameters: {"rope_dims":16}
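
A rough sketch of partial RoPE with rope_dims=16: the first 16 channels of a head are rotated by position, the rest pass through unchanged. The head dimension of 64, the base of 10000, and the half-split pairing of channels are assumptions for illustration, not details taken from the PR.

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """x: (T, head_dim). Rotate the first rope_dims channels only."""
    T, head_dim = x.shape
    assert rope_dims % 2 == 0 and rope_dims <= head_dim
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)            # (half,)
    angles = np.arange(T)[:, None] * inv_freq[None, :]      # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    # Channels beyond rope_dims carry no positional rotation.
    return np.concatenate([rotated, x[:, rope_dims:]], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 64))   # head_dim=64 is an assumption
out = partial_rope(x)
```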

LeakyReLU

LeakyReLU squared MLP activation.

parameters: {"squared":true,"mlp_mult":2.8}
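
As a sketch of the activation, LeakyReLU squared can be read as applying LeakyReLU and then squaring the result elementwise. The negative slope of 0.5 and the toy dimensions below are assumptions; the source only gives squared=true and mlp_mult=2.8 (hidden width as a multiple of the model width).

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    # negative_slope is an assumed value; the PR summary does not state it.
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

def mlp(x, w_in, w_out):
    """Two-layer MLP: project up, apply LeakyReLU squared, project down."""
    return leaky_relu_squared(x @ w_in) @ w_out

rng = np.random.default_rng(0)
model_dim, hidden_dim = 10, 28        # toy sizes with hidden = 2.8 * model_dim
x = rng.standard_normal((3, model_dim))
y = mlp(x,
        rng.standard_normal((model_dim, hidden_dim)),
        rng.standard_normal((hidden_dim, model_dim)))
```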

BigramHash

Bigram hash embedding module.

parameters: {"vocab":1536,"dim":128}
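
One way to picture a bigram hash embedding: each (previous token, current token) pair is hashed into one of 1536 buckets, and the bucket indexes a 128-dimensional embedding table. Reading "vocab" as the bucket count is an interpretation, and the hash function and the sentinel for the first position are hypothetical.

```python
import numpy as np

BUCKETS = 1536   # "vocab" in the parameters, read here as the bucket count
DIM = 128

def bigram_bucket(prev_token, token, buckets=BUCKETS):
    # Hypothetical multiply-and-xor hash; the PR's hash is not given.
    return ((prev_token * 1000003) ^ token) % buckets

def bigram_hash_embed(tokens, table):
    # Token 0 stands in for "no previous token" at position 0 (assumption).
    prev = [0] + list(tokens[:-1])
    ids = [bigram_bucket(p, t, table.shape[0]) for p, t in zip(prev, tokens)]
    return table[ids]                                   # (T, DIM)

rng = np.random.default_rng(0)
table = rng.standard_normal((BUCKETS, DIM))
tokens = [17, 42, 17, 42, 99]
emb = bigram_hash_embed(tokens, table)
```

Identical bigrams always map to the same row, so `emb[1]` and `emb[3]` (both for the pair 17, 42) are equal.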

VE128

VE128 used at layers 9-10.

parameters: {"layers":[9,10]}

XSA

XSA used in the last 4 layers.

parameters: {"layers":4}

LN Scale

LayerNorm scale modification.

parameters: null

SmearGate

SmearGate component included in the architecture.

parameters: null

U-Net skip connections

U-Net style skip connections in the transformer backbone.

parameters: null
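
A minimal sketch of U-Net style skips in a layer stack, assuming the first half of the layers store their outputs and the second half add them back in reverse order. The function name, the learned per-skip weights, and the pairing order are assumptions; the summary gives no parameters for this component.

```python
def unet_forward(x, encoder_layers, decoder_layers, skip_weights):
    """Run encoder layers, then decoder layers that reuse encoder outputs."""
    skips = []
    for layer in encoder_layers:
        x = layer(x)
        skips.append(x)
    for layer, w in zip(decoder_layers, skip_weights):
        # The last encoder output pairs with the first decoder layer.
        x = x + w * skips.pop()
        x = layer(x)
    return x

# Toy usage with scalar "layers" standing in for transformer blocks.
result = unet_forward(
    1.0,
    encoder_layers=[lambda h: h + 1, lambda h: h * 2],
    decoder_layers=[lambda h: h, lambda h: h],
    skip_weights=[1.0, 1.0],
)
```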

Weight Averaging

EMA + SWA

parameters: {"decay":0.997}
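
For reference, the two averaging rules can be sketched as below: EMA with decay 0.997 updates a running average after every step, while SWA takes a uniform mean over saved checkpoints. How the PR combines the two is not stated, so the functions are shown separately, and plain lists of floats stand in for model weights.

```python
def ema_update(avg, params, decay=0.997):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

def swa_average(checkpoints):
    """Uniform average of several checkpoints (each a flat list of weights)."""
    n = len(checkpoints)
    return [sum(vals) / n for vals in zip(*checkpoints)]

avg = [0.0, 0.0]
for step in range(1000):
    params = [1.0, -2.0]          # stand-in for weights after an optimizer step
    avg = ema_update(avg, params)

swa = swa_average([[1.0, 2.0], [3.0, 4.0]])
```

With decay 0.997 the average has an effective horizon of roughly 1 / (1 - 0.997) ≈ 333 steps.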

Quantization

GPTQ-lite

bits: 6

scope: all
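
To make the 6-bit setting concrete, here is a plain round-to-nearest sketch with one scale per weight row. This is a simplification, not GPTQ: GPTQ additionally compensates rounding error using second-order information, and what "GPTQ-lite" does beyond rounding is not described in this summary. The per-row symmetric scaling is an assumption.

```python
import numpy as np

def quantize_rows(w, bits=6):
    """Symmetric per-row quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_rows(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 32)).astype(np.float32)
q, scale = quantize_rows(w)
w_hat = dequantize_rows(q, scale)
```

The integers are stored in int8 here for convenience; packing them into 6 bits each is left out of the sketch.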

Compression

lzma

level: 6
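
Assuming "level: 6" corresponds to the LZMA preset, compressing a serialized artifact with Python's standard library would look roughly like this. The payload below is a synthetic stand-in for the quantized weights.

```python
import lzma

def compress_artifact(payload: bytes, level: int = 6) -> bytes:
    # preset accepts 0-9; 6 is also the library default.
    return lzma.compress(payload, preset=level)

payload = bytes(range(256)) * 1024      # synthetic stand-in for model bytes
blob = compress_artifact(payload)
restored = lzma.decompress(blob)
```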

Optimizer

Parallel Muon

weight_decay: null

momentum: null

other_params: {"adamw":true}

SGD

weight_decay: null

momentum: null

other_params: {"test_time_training":true,"epochs":3}

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.002,"chunk_tokens":32768,"batch_seqs":32,"epochs":3}
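
The score-first ordering can be sketched as a loop in which every chunk is scored with weights that have not yet trained on it, and only afterwards used for 3 epochs of SGD. The `ToyModel` class and its scalar squared-error loss are hypothetical stand-ins; chunking into 32768 tokens and batching into 32 sequences are omitted.

```python
def score_first_ttt(chunks, score_fn, train_fn, epochs=3):
    """Return the mean per-token loss under score-first test-time training."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        # 1) Score first: the model has never trained on this chunk.
        total_loss += score_fn(chunk)
        total_tokens += len(chunk)
        # 2) Only then adapt on the chunk that was just scored.
        for _ in range(epochs):
            train_fn(chunk)
    return total_loss / total_tokens

class ToyModel:
    """Hypothetical one-parameter model predicting a constant."""
    def __init__(self, lr=0.002):
        self.w, self.lr, self.log = 0.0, lr, []

    def score(self, chunk):
        self.log.append("score")
        return sum((x - self.w) ** 2 for x in chunk)

    def train(self, chunk):
        self.log.append("train")
        grad = sum(2 * (self.w - x) for x in chunk) / len(chunk)
        self.w -= self.lr * grad          # plain SGD step

model = ToyModel()
mean_loss = score_first_ttt([[1.0] * 4, [1.0] * 4], model.score, model.train)
```

Because each reported loss comes from weights that predate the chunk, the adaptation never sees a token before it is scored.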

Other

SLOT

SLOT test-time adaptation using a per-batch residual delta optimized on top of frozen hidden states before the final logits projection.

parameters: {"lr":0.003,"steps":5}
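
A minimal NumPy sketch of the idea as described: hidden states and the output projection stay frozen, and a single delta vector per batch is added to the hidden states and optimized for 5 steps at lr=0.003. Plain gradient descent on cross-entropy is used here as a stand-in; the PR's optimizer, the exact shape of the delta, and which tokens supply the targets are not stated in this summary and are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(hidden, w_out, targets):
    probs = softmax(hidden @ w_out)
    return -np.log(probs[np.arange(len(targets)), targets]).mean()

def slot_adapt(hidden, w_out, targets, lr=0.003, steps=5):
    """Fit one delta (d,) shared by the batch; model weights are untouched."""
    n = len(targets)
    delta = np.zeros(hidden.shape[1])
    for _ in range(steps):
        grad_logits = softmax((hidden + delta) @ w_out)   # (n, V)
        grad_logits[np.arange(n), targets] -= 1.0         # p - onehot
        grad_delta = (grad_logits @ w_out.T).mean(axis=0) # d(mean CE)/d(delta)
        delta -= lr * grad_delta
    return delta

rng = np.random.default_rng(0)
hidden = rng.standard_normal((64, 16))    # frozen final hidden states (toy sizes)
w_out = rng.standard_normal((16, 32))     # frozen logits projection
targets = rng.integers(0, 32, size=64)

loss_before = cross_entropy(hidden, w_out, targets)
delta = slot_adapt(hidden, w_out, targets)
loss_after = cross_entropy(hidden + delta, w_out, targets)
```

Since only `delta` changes, the adaptation is discarded with the batch and the stored artifact is unaffected.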

Regularization

LN scale

parameters: null

Novel Contributions

Adds SLOT test-time adaptation on top of legal score-first TTT.
Uses a per-batch residual delta in hidden space to adapt logits without updating model weights.
Combines legal TTT with SLOT while preserving score-first, left-to-right evaluation constraints.
Achieves a 3-seed mean val_bpb of 1.11512, beating the merged SOTA of 1.1194 by 0.00428 (lower is better).
Keeps artifact size under 16MB and evaluation time under 600s across all seeds.
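
For readers unfamiliar with the metric, and assuming val_bpb denotes bits per byte, the sketch below shows how a summed negative log-likelihood in nats converts to bits per byte, and the gap between the two reported figures. The helper name and the toy inputs are illustrative; per-seed values are not given in this summary and none are invented here.

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood (nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Toy check: 8 bytes, each costing exactly ln(2) nats, is 1 bit per byte.
toy_bpb = bits_per_byte(8 * math.log(2), 8)

# Gap between the reported 3-seed mean and the merged SOTA.
improvement = round(1.1194 - 1.11512, 5)
```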