PR #526
Non-record: 11L + 30-Epoch Legal TTT (BPB 1.14252)
by Christopher-Lee-McClendon
val_bpb
1.1425
Architecture
Transformer
Optimizer
SGD with momentum
Artifact Size
15.48 MB
Training Techniques
Quantization
int6
bits: 6
scope: all
zstd
level: 22
Architecture
depth recurrence
11 logical layers sharing 10 unique BlockCores, reused at different depths with independent normalization
parameters: {"logical_layers":11,"unique_layers":10}
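The depth-recurrence idea above can be sketched as an index mapping: 11 logical depths draw from only 10 stored BlockCores, while every depth keeps its own normalization parameters. Which depth reuses which core is an assumption here (the card only states the 11/10 split):

```python
NUM_LOGICAL = 11   # logical depth of the network (from the card)
NUM_UNIQUE = 10    # unique BlockCores actually stored

def core_index(depth: int) -> int:
    """Map a logical depth to the unique BlockCore it uses.
    Hypothetical choice: the final depth reuses the last core."""
    return min(depth, NUM_UNIQUE - 1)

core_ids = [core_index(d) for d in range(NUM_LOGICAL)]
norm_ids = list(range(NUM_LOGICAL))  # independent norm params, one per logical depth
```

Reusing a core saves one block's worth of parameters while the per-depth norms let the shared weights behave differently at each depth.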
Partial RoPE
Rotary Position Embeddings applied to only 16 of 64 dimensions per head with NTK-aware scaling
parameters: {"dimensions":16,"total_dimensions":64}
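A minimal sketch of the Partial RoPE described above: only the first 16 of 64 dimensions per head are rotated, and the rest pass through unchanged. The NTK base-rescaling formula and `NTK_ALPHA` value are assumptions; the card only says the scaling is NTK-aware:

```python
import math

HEAD_DIM, ROT_DIM = 64, 16   # rotate only 16 of 64 dims per head (from the card)
BASE = 10000.0
NTK_ALPHA = 2.0              # hypothetical NTK scaling factor

def partial_rope(vec, pos):
    """Apply rotary position embedding to the first ROT_DIM dims only."""
    base = BASE * NTK_ALPHA ** (ROT_DIM / (ROT_DIM - 2))  # NTK-aware base rescale (assumed form)
    out = list(vec)
    for i in range(0, ROT_DIM, 2):
        theta = pos * base ** (-i / ROT_DIM)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i], out[i + 1] = x * c - y * s, x * s + y * c
    return out
```

Leaving 48 of 64 dimensions unrotated gives the model position-free channels, which is one common rationale for partial RoPE's better length generalization.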
Value Embeddings
128-dim learned value embeddings added to the value projection on layers 9 and 10
parameters: {"layers":[9,10],"embedding_dim":128}
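The value-embedding technique above can be sketched as a per-token lookup added to the value projection, gated on layer index. The toy embedding table and its values are stand-ins; only the 128-dim size and the layer set {9, 10} come from the card:

```python
VALUE_EMB_LAYERS = {9, 10}   # deep layers only (from the card)
EMB_DIM = 128                # learned embedding width (from the card)

# Hypothetical toy table: token id -> 128-dim learned vector.
value_emb = {tok: [0.01 * tok] * EMB_DIM for tok in range(4)}

def value_projection(v, tok, layer):
    """Add the token's value embedding to the value projection on deep layers."""
    if layer in VALUE_EMB_LAYERS:
        return [a + b for a, b in zip(v, value_emb[tok])]
    return v
```

Injecting token identity directly into deep-layer values gives those layers a path around the residual stream, the "residual bottleneck" the contributions list mentions.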
SmearGate
SwiGLU-style activation with gating
parameters: null
BigramHash
Bigram hash embeddings with 2048 features
parameters: {"features":2048}
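A sketch of the bigram hash embedding above: each (previous token, current token) pair is hashed into one of 2048 embedding rows. The mixing constant is illustrative, not the PR's actual hash; only the feature count comes from the card:

```python
NUM_FEATURES = 2048  # embedding rows (from the card)

def bigram_feature(prev_tok: int, tok: int) -> int:
    """Hash a (prev, current) token pair into one of 2048 rows.
    The multiplier is an arbitrary illustrative mixing constant."""
    return ((prev_tok * 1000003) ^ tok) % NUM_FEATURES

ids = [bigram_feature(a, b) for a, b in [(0, 1), (1, 2), (2, 3)]]
```

Hashing keeps the table small (2048 rows regardless of vocabulary size) at the cost of collisions between unrelated bigrams.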
XSA
Cross-sequence attention applied on last 4 layers
parameters: {"layers":4}
Layer-Norm Scale
Layer-wise scaling of residual outputs by 1/sqrt(layer_idx + 1)
parameters: null
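The layer-norm depth scaling above is a one-line rule, shown here directly from the card's formula:

```python
import math

def residual_scale(layer_idx: int) -> float:
    """Scale a layer's residual output by 1/sqrt(layer_idx + 1),
    so deeper layers contribute progressively less to the stream."""
    return 1.0 / math.sqrt(layer_idx + 1)
```

Damping deeper residuals bounds the growth of the residual stream's variance, which is why the contributions list credits it with stabilizing training under depth recurrence.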
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002}
Weight Averaging
SWA
parameters: {"start_step":4650,"checkpoints":12}
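SWA as configured above reduces to a uniform average of checkpoint weights collected from step 4650 onward. A minimal sketch, treating each checkpoint as a flat parameter vector:

```python
SWA_START, NUM_CKPTS = 4650, 12  # from the card

def swa_average(checkpoints):
    """Uniformly average parameter vectors across the collected checkpoints."""
    n = len(checkpoints)
    return [sum(vals) / n for vals in zip(*checkpoints)]
```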
Compression
zstd
level: 22
Evaluation
stride-based eval
parameters: {"stride":64}
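Stride-based eval with stride 64 means each window scores only its final 64 tokens while re-reading earlier tokens as context, so every token is scored exactly once with substantial left context. The context length here is a hypothetical value; the card only gives the stride:

```python
STRIDE = 64     # tokens scored per window (from the card)
CONTEXT = 512   # hypothetical maximum window length

def eval_windows(n_tokens):
    """Return (window_start, score_start, end) spans: each window may re-read
    up to CONTEXT tokens, but only the final STRIDE tokens are scored."""
    spans = []
    for score_start in range(0, n_tokens, STRIDE):
        end = min(score_start + STRIDE, n_tokens)
        start = max(0, end - CONTEXT)
        spans.append((start, score_start, end))
    return spans
```

A smaller stride costs more forward passes but gives every scored token more context, lowering measured BPB.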
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","momentum":0.9,"learning_rate":0.002,"epochs_per_chunk":30,"chunk_size":32768,"frozen_blocks":2,"trainable_params":19911748}
Regularization
freeze early layers
parameters: {"frozen_blocks":2}
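Freezing the first 2 blocks during TTT amounts to excluding their parameters from the optimizer. A sketch of the selection logic; the `blocks.{i}.` naming scheme is an assumption about the module layout:

```python
FROZEN_BLOCKS = 2  # from the card

def trainable(name: str) -> bool:
    """Keep a parameter trainable during TTT unless it lives in a frozen block.
    Assumes parameters are named 'blocks.{i}.…' (hypothetical layout)."""
    return not any(name.startswith(f"blocks.{i}.") for i in range(FROZEN_BLOCKS))

names = ["blocks.0.attn.w", "blocks.1.mlp.w", "blocks.2.attn.w", "head.w"]
kept = [n for n in names if trainable(n)]
```

Early layers encode low-level features that transfer across chunks; freezing them leaves fewer parameters free to overfit 30 epochs of a single chunk.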
Other
other
Legal score-first TTT protocol: each chunk is scored under torch.inference_mode() before any training on it, preventing gradient leakage into the score
parameters: null
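The legal protocol above can be sketched as a per-chunk loop: score with the current weights first, then run the 30 SGD epochs on that chunk so the adapted weights only ever affect later chunks. `score_fn`/`train_fn` are stand-ins for the inference-mode forward pass and the SGD step:

```python
CHUNK_SIZE, EPOCHS = 32768, 30  # from the card's TTT parameters

def score_first_ttt(chunks, score_fn, train_fn):
    """Legal TTT: every chunk is scored BEFORE the model trains on it,
    so no information from a chunk leaks into its own score."""
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        total_bits += score_fn(chunk)   # score with pre-adaptation weights
        total_tokens += len(chunk)
        for _ in range(EPOCHS):         # then adapt, benefiting later chunks only
            train_fn(chunk)
    return total_bits / total_tokens    # average bits per token over the stream
```

In the actual PR the scoring pass additionally runs under `torch.inference_mode()`, which hard-disables autograd so no gradients can be computed, let alone leak, during scoring.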
Novel Contributions
- Demonstrated large TTT gains by increasing SGD epochs per chunk from 3 to 30 in legal score-first TTT
- Showed SGD with momentum outperforms AdamW for legal TTT due to better convergence with limited steps per chunk
- Introduced depth recurrence with 11 logical layers but only 10 unique BlockCores to save parameters
- Applied Partial RoPE to only 16 of 64 dimensions per head with NTK-aware scaling for better length generalization
- Added 128-dim value embeddings only on deep layers (9 and 10) to bypass residual bottleneck
- Used layer-norm depth scaling to stabilize training under depth recurrence
- Implemented legal score-first TTT protocol strictly enforcing scoring before training with torch.inference_mode()
- Froze the first 2 blocks during TTT to prevent catastrophic overfitting and improve adaptation