PR #526
Non-record: 11L + 30-Epoch Legal TTT (BPB 1.14252)
by Christopher-Lee-McClendon
val_bpb
1.1425
Architecture
Transformer
Optimizer
SGD with momentum
Artifact Size
15.48 MB
Training Techniques
Quantization
int6
bits: 6
scope: all
zstd
level: 22
Architecture
depth recurrence
11 logical layers sharing 10 unique BlockCores, reused at different depths with independent normalization
parameters: {"logical_layers":11,"unique_layers":10}
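The depth-recurrence idea above can be sketched as an index mapping: 11 logical depths draw from only 10 stored BlockCores, while every depth keeps its own normalization parameters. Which depth reuses which core is an assumption here (the card only states the 11/10 split):

```python
NUM_LOGICAL = 11   # logical depth of the network (from the card)
NUM_UNIQUE = 10    # unique BlockCores actually stored

def core_index(depth: int) -> int:
    """Map a logical depth to the unique BlockCore it uses.
    Hypothetical choice: the final depth reuses the last core."""
    return min(depth, NUM_UNIQUE - 1)

core_ids = [core_index(d) for d in range(NUM_LOGICAL)]
norm_ids = list(range(NUM_LOGICAL))  # independent norm params, one per logical depth
```

Reusing a core saves one block's worth of parameters while the per-depth norms let the shared weights behave differently at each depth.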
Partial RoPE
Rotary Position Embeddings applied to only 16 of 64 dimensions per head with NTK-aware scaling
parameters: {"dimensions":16,"total_dimensions":64}
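A minimal sketch of the Partial RoPE described above: only the first 16 of 64 dimensions per head are rotated, and the rest pass through unchanged. The NTK base-rescaling formula and `NTK_ALPHA` value are assumptions; the card only says the scaling is NTK-aware:

```python
import math

HEAD_DIM, ROT_DIM = 64, 16   # rotate only 16 of 64 dims per head (from the card)
BASE = 10000.0
NTK_ALPHA = 2.0              # hypothetical NTK scaling factor

def partial_rope(vec, pos):
    """Apply rotary position embedding to the first ROT_DIM dims only."""
    base = BASE * NTK_ALPHA ** (ROT_DIM / (ROT_DIM - 2))  # NTK-aware base rescale (assumed form)
    out = list(vec)
    for i in range(0, ROT_DIM, 2):
        theta = pos * base ** (-i / ROT_DIM)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i], out[i + 1] = x * c - y * s, x * s + y * c
    return out
```

Leaving 48 of 64 dimensions unrotated gives the model position-free channels, which is one common rationale for partial RoPE's better length generalization.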
Value Embeddings
128-dim learned value embeddings added to the value projection on layers 9 and 10
parameters: {"layers":[9,10],"embedding_dim":128}
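The value-embedding technique above can be sketched as a per-token lookup added to the value projection, gated on layer index. The toy embedding table and its values are stand-ins; only the 128-dim size and the layer set {9, 10} come from the card:

```python
VALUE_EMB_LAYERS = {9, 10}   # deep layers only (from the card)
EMB_DIM = 128                # learned embedding width (from the card)

# Hypothetical toy table: token id -> 128-dim learned vector.
value_emb = {tok: [0.01 * tok] * EMB_DIM for tok in range(4)}

def value_projection(v, tok, layer):
    """Add the token's value embedding to the value projection on deep layers."""
    if layer in VALUE_EMB_LAYERS:
        return [a + b for a, b in zip(v, value_emb[tok])]
    return v
```

Injecting token identity directly into deep-layer values gives those layers a path around the residual stream, the "residual bottleneck" the contributions list mentions.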
SmearGate
SwiGLU-style activation with gating
parameters: null
BigramHash
Bigram hash embeddings with 2048 features
parameters: {"features":2048}
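A sketch of the bigram hash embedding above: each (previous token, current token) pair is hashed into one of 2048 embedding rows. The mixing constant is illustrative, not the PR's actual hash; only the feature count comes from the card:

```python
NUM_FEATURES = 2048  # embedding rows (from the card)

def bigram_feature(prev_tok: int, tok: int) -> int:
    """Hash a (prev, current) token pair into one of 2048 rows.
    The multiplier is an arbitrary illustrative mixing constant."""
    return ((prev_tok * 1000003) ^ tok) % NUM_FEATURES

ids = [bigram_feature(a, b) for a, b in [(0, 1), (1, 2), (2, 3)]]
```

Hashing keeps the table small (2048 rows regardless of vocabulary size) at the cost of collisions between unrelated bigrams.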
XSA
Cross-sequence attention applied on last 4 layers
parameters: {"layers":4}
Layer-Norm Scale
Layer-wise scaling of residual outputs by 1/sqrt(layer_idx + 1)
parameters: null
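The layer-norm depth scaling above is a one-line rule, shown here directly from the card's formula:

```python
import math

def residual_scale(layer_idx: int) -> float:
    """Scale a layer's residual output by 1/sqrt(layer_idx + 1),
    so deeper layers contribute progressively less to the stream."""
    return 1.0 / math.sqrt(layer_idx + 1)
```

Damping deeper residuals bounds the growth of the residual stream's variance, which is why the contributions list credits it with stabilizing training under depth recurrence.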
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002}
Weight Averaging
SWA
parameters: {"start_step":4650,"checkpoints":12}
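SWA as configured above reduces to a uniform average of checkpoint weights collected from step 4650 onward. A minimal sketch, treating each checkpoint as a flat parameter vector:

```python
SWA_START, NUM_CKPTS = 4650, 12  # from the card

def swa_average(checkpoints):
    """Uniformly average parameter vectors across the collected checkpoints."""
    n = len(checkpoints)
    return [sum(vals) / n for vals in zip(*checkpoints)]
```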
Compression
zstd
level: 22
Evaluation
stride-based eval
parameters: {"stride":64}
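Stride-based eval with stride 64 means each window scores only its final 64 tokens while re-reading earlier tokens as context, so every token is scored exactly once with substantial left context. The context length here is a hypothetical value; the card only gives the stride:

```python
STRIDE = 64     # tokens scored per window (from the card)
CONTEXT = 512   # hypothetical maximum window length

def eval_windows(n_tokens):
    """Return (window_start, score_start, end) spans: each window may re-read
    up to CONTEXT tokens, but only the final STRIDE tokens are scored."""
    spans = []
    for score_start in range(0, n_tokens, STRIDE):
        end = min(score_start + STRIDE, n_tokens)
        start = max(0, end - CONTEXT)
        spans.append((start, score_start, end))
    return spans
```

A smaller stride costs more forward passes but gives every scored token more context, lowering measured BPB.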
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","momentum":0.9,"learning_rate":0.002,"epochs_per_chunk":30,"chunk_size":32768,"frozen_blocks":2,"trainable_params":19911748}
Regularization
freeze early layers
parameters: {"frozen_blocks":2}
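Freezing the first 2 blocks during TTT amounts to excluding their parameters from the optimizer. A sketch of the selection logic; the `blocks.{i}.` naming scheme is an assumption about the module layout:

```python
FROZEN_BLOCKS = 2  # from the card

def trainable(name: str) -> bool:
    """Keep a parameter trainable during TTT unless it lives in a frozen block.
    Assumes parameters are named 'blocks.{i}.…' (hypothetical layout)."""
    return not any(name.startswith(f"blocks.{i}.") for i in range(FROZEN_BLOCKS))

names = ["blocks.0.attn.w", "blocks.1.mlp.w", "blocks.2.attn.w", "head.w"]
kept = [n for n in names if trainable(n)]
```

Early layers encode low-level features that transfer across chunks; freezing them leaves fewer parameters free to overfit 30 epochs of a single chunk.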
Other
other
Legal score-first TTT protocol: each chunk is scored under torch.inference_mode() before any training on it, preventing gradient leakage into the score
parameters: null
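The legal protocol above can be sketched as a per-chunk loop: score with the current weights first, then run the 30 SGD epochs on that chunk so the adapted weights only ever affect later chunks. `score_fn`/`train_fn` are stand-ins for the inference-mode forward pass and the SGD step:

```python
CHUNK_SIZE, EPOCHS = 32768, 30  # from the card's TTT parameters

def score_first_ttt(chunks, score_fn, train_fn):
    """Legal TTT: every chunk is scored BEFORE the model trains on it,
    so no information from a chunk leaks into its own score."""
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        total_bits += score_fn(chunk)   # score with pre-adaptation weights
        total_tokens += len(chunk)
        for _ in range(EPOCHS):         # then adapt, benefiting later chunks only
            train_fn(chunk)
    return total_bits / total_tokens    # average bits per token over the stream
```

In the actual PR the scoring pass additionally runs under `torch.inference_mode()`, which hard-disables autograd so no gradients can be computed, let alone leak, during scoring.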
Novel Contributions
- Demonstrated large TTT gains by increasing SGD epochs per chunk from 3 to 30 in legal score-first TTT
- Showed SGD with momentum outperforms AdamW for legal TTT due to better convergence with limited steps per chunk
- Introduced depth recurrence with 11 logical layers but only 10 unique BlockCores to save parameters
- Applied Partial RoPE to only 16 of 64 dimensions per head with NTK-aware scaling for better length generalization
- Added 128-dim value embeddings only on deep layers (9 and 10) to bypass residual bottleneck
- Used layer-norm depth scaling to stabilize training under depth recurrence
- Implemented legal score-first TTT protocol strictly enforcing scoring before training with torch.inference_mode()
- Froze the first 2 blocks during TTT to prevent catastrophic overfitting and improve adaptation