PR #461

open

Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB)

by Christopher-Lee-McClendon
val_bpb
1.1446
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.79 MB

Training Techniques

Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs_per_chunk":3,"chunk_size":32768,"stride":64,"freeze_blocks":2,"momentum":0.9}
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002,"epochs_per_chunk":3,"freeze_blocks":2}
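
The score-first rule can be illustrated with a minimal sketch. It is not the PR's code: the "model" is a hypothetical unigram softmax over a tiny vocabulary, and `score_first_ttt` is an illustrative name. Only the ordering (score a chunk, then run several SGD-with-momentum epochs on it) and the default hyperparameters come from the listing above; block freezing is not modeled.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score_first_ttt(tokens, vocab_size, chunk_size,
                    lr=0.002, momentum=0.9, epochs_per_chunk=3):
    """Mean NLL (nats/token) when each chunk is scored before it is trained on."""
    logits = [0.0] * vocab_size      # toy stand-in for the model weights (assumption)
    velocity = [0.0] * vocab_size    # SGD momentum buffer
    total_nll = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # 1) Score first: the whole chunk is evaluated with the current weights.
        probs = softmax(logits)
        total_nll -= sum(math.log(probs[t]) for t in chunk)
        # 2) Then adapt: several full-batch epochs on the already-scored chunk.
        freq = [0.0] * vocab_size
        for t in chunk:
            freq[t] += 1.0 / len(chunk)
        for _ in range(epochs_per_chunk):
            probs = softmax(logits)
            for i in range(vocab_size):
                grad = probs[i] - freq[i]          # d(mean NLL) / d(logit_i)
                velocity[i] = momentum * velocity[i] + grad
                logits[i] -= lr * velocity[i]
    return total_nll / len(tokens)
```

Because the first chunk is always scored with the unadapted weights, a stream consisting of a single chunk yields the same loss at any learning rate; gains can only appear on later chunks.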
Architecture
depth recurrence
11 logical layers implemented with 10 unique shared BlockCores, reusing one core at two depths with independent normalization.
parameters: {"layers":11,"unique_layers":10}
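
One way to picture the layer layout is as a schedule that maps each logical layer to a shared core and to its own normalization. The sketch below is an assumption-heavy illustration: the listing does not say which core is reused or at which depths, so `reused_core` and its adjacent placement are hypothetical, as is the function name.

```python
def build_layer_schedule(num_layers=11, unique_cores=10, reused_core=5):
    """Map each logical layer to a (core_index, norm_index) pair."""
    assert num_layers == unique_cores + 1, "sketch handles exactly one reused core"
    assert 0 <= reused_core < unique_cores
    core_order = list(range(unique_cores))
    # Hypothetical placement: the reused core runs at two consecutive depths.
    core_order.insert(reused_core + 1, reused_core)
    # Every logical layer gets its own norm index, even when the core is shared.
    return [(core, layer) for layer, core in enumerate(core_order)]

schedule = build_layer_schedule()
```

Under these assumptions the schedule has 11 entries, 10 distinct core indices, and 11 distinct norm indices.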
Partial RoPE
Rotary embeddings applied to only part of each head dimension, with NTK-aware scaling.
parameters: {"dimensions":16,"total_dimensions":64}
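
A minimal sketch of partial rotary embeddings on a single 64-dim head vector follows. Only the 16-of-64 split comes from the listing. The half-split pairing of dimensions, the base of 10000, and the NTK-aware base adjustment `base * scale ** (d / (d - 2))` are common conventions assumed here, not details confirmed by the PR.

```python
import math

def partial_rope(x, position, rotary_dims=16, base=10000.0, ntk_scale=1.0):
    """Rotate the first `rotary_dims` entries of x; pass the rest through unchanged."""
    if ntk_scale != 1.0:
        # Assumed NTK-aware adjustment: enlarge the base rather than rescaling positions.
        base = base * ntk_scale ** (rotary_dims / (rotary_dims - 2))
    half = rotary_dims // 2
    out = list(x)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]          # assumed pairing: (i, i + half)
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

The remaining 48 dimensions carry no positional signal, and the rotation leaves the norm of the rotated slice unchanged.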
Value Embeddings
128-dim learned value embeddings added to value projections on deep layers only.
parameters: {"dimensions":128,"layers":[9,10]}
XSA
Exclusive Self Attention used in the last 4 layers.
parameters: {"last_n_layers":4}
SmearGate
Per-dimension gating mechanism in the MLP/attention stack.
parameters: null
BigramHash
Hashed bigram features added as an architectural component.
parameters: {"features":2048}
MLP3x
MLP expansion factor of 3x with ReLU² activation.
parameters: {"expansion":3}
Regularization
layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
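
As a worked example of the formula, the snippet below lists the scale for each of the 11 logical layers, assuming `layer` is zero-indexed. Where the PR applies the scale (for example to the norm output) is not stated in the listing, so the sketch only computes the factors.

```python
import math

def ln_scales(num_layers=11):
    """Per-layer scale 1/sqrt(layer + 1), with layer counted from 0 (assumption)."""
    return [1.0 / math.sqrt(layer + 1) for layer in range(num_layers)]

scales = ln_scales()
```

With this indexing the first layer is unscaled (1.0), the fourth layer is scaled by 0.5, and the last layer by roughly 0.30.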
Weight Averaging
SWA
parameters: {"checkpoints":12,"start_step":4650}
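
Weight averaging of this kind can be sketched as a uniform mean over saved checkpoints. The representation below (a dict of flat float lists) and the function name are hypothetical; the listing gives only the checkpoint count and the start step, not the spacing between checkpoints or the weighting.

```python
def average_checkpoints(checkpoints):
    """Uniformly average a list of {name: [floats]} checkpoints (assumed format)."""
    n = len(checkpoints)
    first = checkpoints[0]
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / n
               for i in range(len(first[name]))]
        for name in first
    }

averaged = average_checkpoints([{"w": [0.0, 2.0]}, {"w": [2.0, 4.0]}])
```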
Quantization
int6 + zstd
bits: 6
scope: all
Compression
zstd
level: 22
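
The quantize-then-compress pipeline can be sketched as follows. This is an illustration under stated assumptions, not the PR's scheme: it uses symmetric per-tensor scaling to the range [-31, 31], stores one code per byte instead of bit-packing, and swaps in the standard-library `zlib` for zstd so that the example has no third-party dependency.

```python
import zlib

def quantize_int6(weights):
    """Symmetric 6-bit quantization: integer codes in [-31, 31] plus one float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    codes = [max(-31, min(31, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int6(codes, scale):
    return [c * scale for c in codes]

def compress_codes(codes):
    # Offset codes to be non-negative; zlib is a stand-in for zstd here.
    return zlib.compress(bytes(c + 32 for c in codes), 9)

def decompress_codes(blob):
    return [b - 32 for b in zlib.decompress(blob)]
```

With round-to-nearest, the reconstruction error per weight is at most half the scale, and the compression step is lossless on the codes.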
Evaluation
sliding window eval
parameters: {"stride":64}
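
A sliding-window evaluation with stride 64 can be sketched as a window planner: after the first window, each window advances by the stride and only its newest tokens are scored, so every token is scored exactly once with as much left context as fits. The context length is not given in the listing, so `context_len` is a hypothetical parameter, and scoring the entire first window is an assumed convention.

```python
def sliding_windows(num_tokens, context_len, stride=64):
    """Return (start, end, score_from) tuples: feed tokens[start:end] to the
    model and score only tokens[score_from:end]."""
    windows = []
    scored_upto = 0                      # tokens [0, scored_upto) are already scored
    while scored_upto < num_tokens:
        if scored_upto == 0:
            end = min(context_len, num_tokens)
        else:
            end = min(scored_upto + stride, num_tokens)
        start = max(0, end - context_len)
        windows.append((start, end, scored_upto))
        scored_upto = end
    return windows

plan = sliding_windows(num_tokens=1000, context_len=256, stride=64)
```

A smaller stride gives each scored token more context on average, at the cost of more forward passes.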

Novel Contributions

  • High-yield legal test-time training using SGD with momentum, multiple epochs per chunk, and freezing early blocks
  • Depth recurrence with 11 logical layers from 10 unique shared BlockCores
  • Partial RoPE using only 16 of 64 dimensions with NTK-aware scaling
  • Value embeddings applied only to deep layers
  • Layer-norm depth scaling using 1/sqrt(layer+1)
  • Score-first legal TTT where every validation token is scored before any weight update