PR #461

open

Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB)

by Christopher-Lee-McClendon

val_bpb

1.1446

Architecture

Transformer

Optimizer

Muon

Artifact Size

14.79 MB

Training Techniques

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.002,"epochs_per_chunk":3,"chunk_size":32768,"stride":64,"freeze_blocks":2,"momentum":0.9}

Optimizer

SGD

weight_decay: null

momentum: 0.9

other_params: {"learning_rate":0.002,"epochs_per_chunk":3,"freeze_blocks":2}
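
The score-first ordering can be illustrated with a toy sketch. This is not the PR's code: the one-parameter squared-error "model", the function name `score_first_ttt`, and the data are hypothetical stand-ins, and block freezing is not modeled. Only the ordering (score a chunk, then adapt on it) and the SGD-with-momentum settings come from the parameters listed above.

```python
def score_first_ttt(chunks, w=0.0, lr=0.002, momentum=0.9, epochs_per_chunk=3):
    """Toy score-first test-time training on a one-parameter model.

    The "model" predicts the constant w and the loss is squared error.
    Both are illustrative assumptions, not the PR's transformer.
    """
    velocity = 0.0
    total_loss, total_count = 0.0, 0
    for chunk in chunks:
        # 1) Score: evaluate every token in the chunk with the current weights.
        total_loss += sum((w - x) ** 2 for x in chunk)
        total_count += len(chunk)
        # 2) Adapt: only afterwards, train on the already-scored chunk.
        for _ in range(epochs_per_chunk):
            grad = sum(2 * (w - x) for x in chunk) / len(chunk)
            velocity = momentum * velocity + grad  # SGD with momentum
            w -= lr * velocity
    return total_loss / total_count, w


loss, final_w = score_first_ttt([[1.0, 1.0], [1.0, 1.0]])
print(loss, final_w)
```

Because scoring always precedes the update, the first chunk is evaluated with the unadapted weights, and later chunks benefit only from tokens that were already scored.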

Architecture

depth recurrence

11 logical layers implemented with 10 unique shared BlockCores, reusing one core at two depths with independent normalization.

parameters: {"layers":11,"unique_layers":10}
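
A minimal sketch of how 11 logical layers might map onto 10 unique cores. The summary does not say which core is reused or whether the two uses are adjacent, so `reused_core=4` and the back-to-back placement are assumptions made purely for illustration, as is the function name.

```python
def core_schedule(layers=11, unique_layers=10, reused_core=4):
    """Return, for each logical depth, the index of the shared core it runs.

    reused_core and the adjacent placement are hypothetical choices.
    """
    extra = layers - unique_layers
    schedule = []
    for core in range(unique_layers):
        repeats = 1 + extra if core == reused_core else 1
        schedule.extend([core] * repeats)
    return schedule


schedule = core_schedule()
# Normalization is kept per logical depth, so there is one entry per depth
# even where the core weights are shared.
norm_ids = list(range(len(schedule)))
print(schedule)
print(norm_ids)
```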

Partial RoPE

Rotary embeddings applied to only part of each head dimension, with NTK-aware scaling.

parameters: {"dimensions":16,"total_dimensions":64}
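
The partial rotation can be sketched in plain Python as follows. The pairing of dimension i with dimension i + 8, the base of 10000, and the `ntk_scale` parameter are assumptions; the base adjustment uses the commonly cited NTK-aware form base * scale^(d / (d - 2)), which may differ from what the PR does.

```python
import math


def partial_rope(vec, position, rope_dims=16, base=10000.0, ntk_scale=1.0):
    """Rotate only the first rope_dims entries of one head vector.

    The remaining entries (48 of 64 here) pass through unchanged.
    """
    # NTK-aware scaling as an adjustment to the frequency base (assumed form).
    base = base * ntk_scale ** (rope_dims / (rope_dims - 2))
    half = rope_dims // 2
    out = list(vec)
    for i in range(half):
        theta = position * base ** (-2 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


head = [float(i + 1) for i in range(64)]
rotated = partial_rope(head, position=5)
print(rotated[16:] == head[16:])
```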

Value Embeddings

128-dim learned value embeddings added to value projections on deep layers only.

parameters: {"dimensions":128,"layers":[9,10]}

XSA

Exclusive Self Attention used in the last 4 layers.

parameters: {"last_n_layers":4}

SmearGate

Per-dimension gating mechanism in the MLP/attention stack.

parameters: null

BigramHash

Hashed bigram features added as an architectural component.

parameters: {"features":2048}
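
One way to picture hashed bigram features is a lookup index computed from each (previous token, current token) pair. The multiplicative hash, the start-of-sequence padding token, and the function names below are hypothetical; only the table size of 2048 comes from the parameters above.

```python
def bigram_bucket(prev_token, token, features=2048):
    """Map a token pair to one of `features` buckets (hypothetical hash)."""
    return (prev_token * 1_000_003 + token) % features


def bigram_feature_ids(tokens, features=2048, bos=0):
    """Bucket index per position; the first token is paired with `bos`."""
    prev = [bos] + list(tokens[:-1])
    return [bigram_bucket(p, t, features) for p, t in zip(prev, tokens)]


ids = bigram_feature_ids([5, 9, 5, 9])
print(ids)
```

Each index would then select a row of a learned embedding table whose output is added to the model's input stream.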

MLP3x

MLP expansion factor of 3x with ReLU² activation.

parameters: {"expansion":3}
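
As a small illustration, ReLU² squares the positive part of each pre-activation, and a 3x expansion means the hidden layer is three times the model width. The list-based matrices and the function names below are illustrative only; biases and any gating are omitted.

```python
def relu_squared(x):
    """ReLU followed by squaring: max(x, 0)^2."""
    return max(x, 0.0) ** 2


def mlp_forward(x, w_in, w_out):
    """Two-layer MLP; w_in has 3 * len(x) rows for a 3x expansion."""
    hidden = [relu_squared(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


x = [1, -2]                                                    # model width 2
w_in = [[1, 0], [0, 1], [-1, 0], [0, -1], [1, 1], [1, -1]]     # hidden width 6
w_out = [[1] * 6, [0] * 6]
print(mlp_forward(x, w_in, w_out))
```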

Regularization

layerwise LN scale

parameters: {"formula":"1/sqrt(layer+1)"}
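
The formula gives a per-layer multiplier that shrinks with depth. The sketch below assumes `layer` is zero-indexed and that there are 11 logical layers, as listed under depth recurrence; where exactly the multiplier is applied is not stated in this summary.

```python
import math


def ln_scale(layer):
    """Depth-dependent scale 1/sqrt(layer + 1), with layer counted from 0."""
    return 1.0 / math.sqrt(layer + 1)


scales = [ln_scale(layer) for layer in range(11)]
print([round(s, 3) for s in scales])
```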

Weight Averaging

SWA

parameters: {"checkpoints":12,"start_step":4650}
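
SWA here amounts to a uniform average over the collected checkpoints. In the sketch below each checkpoint is a dict of scalar weights, which is a simplification for illustration; the spacing between the 12 checkpoints after step 4650 is not given in this summary and is not modeled.

```python
def average_checkpoints(checkpoints):
    """Uniformly average a list of {name: value} weight dicts."""
    n = len(checkpoints)
    return {name: sum(c[name] for c in checkpoints) / n for name in checkpoints[0]}


snapshots = [{"w": 1.0, "b": 0.0}, {"w": 2.0, "b": 3.0}, {"w": 3.0, "b": 6.0}]
print(average_checkpoints(snapshots))
```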

Quantization

int6 + zstd

bits: 6

scope: all
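
A rough sketch of symmetric 6-bit quantization for one row of weights. The per-row scale, the signed range [-32, 31], and the round-to-nearest rule are assumptions; the PR's actual scheme (grouping, packing, clipping) is not described in this summary.

```python
def quantize_int6(row):
    """Quantize floats to signed 6-bit integers with one scale per row."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    q = [max(-32, min(31, round(x / scale))) for x in row]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]


row = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int6(row)
print(codes, dequantize(codes, scale))
```

With this rule the reconstruction error per weight is at most half of the scale.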

Compression

zstd

level: 22

Evaluation

sliding window eval

parameters: {"stride":64}
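
Sliding window evaluation with a stride of 64 can be sketched as a window schedule in which each window scores only the tokens that no earlier window has scored. The window length of 1024 and the function name are assumptions for illustration; only the stride comes from the parameters above.

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Return (begin, end, score_from) triples.

    Tokens in [begin, score_from) serve as context only;
    tokens in [score_from, end) are scored by this window.
    """
    windows = []
    scored_until = 0
    begin = 0
    while scored_until < n_tokens:
        end = min(begin + window, n_tokens)
        windows.append((begin, end, scored_until))
        scored_until = end
        begin += stride
    return windows


for begin, end, score_from in sliding_windows(1100):
    print(begin, end, score_from)
```

After the first window, every scored token sees at least `window - stride` tokens of preceding context, at the cost of one forward pass per 64 tokens.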

Novel Contributions

High-yield legal test-time training using SGD with momentum, multiple epochs per chunk, and freezing early blocks
Depth recurrence with 11 logical layers from 10 unique shared BlockCores
Partial RoPE using only 16 of 64 dimensions with NTK-aware scaling
Value embeddings applied only to deep layers
Layer-norm depth scaling using 1/sqrt(layer+1)
Score-first legal TTT where every validation token is scored before any weight update