PR #1550

open

Non-record: Pre-Quant AdamW TTT (Compiled) + SP8192 + Depth Recurrence — val_bpb 1.0587 (3-seed mean)

by translatingthename
val_bpb: 1.0587
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.5 MB

Training Techniques

Test-Time Training
full TTT
parameters: {"learning_rate":0.0005,"epochs":6,"freeze_blocks":2,"batch_size":32,"sequence_length":2048,"compiled":true}
Quantization
GPTQ
bits: 6
scope: all
int8
bits: 8
scope: embeddings
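To make the two bit widths concrete, here is a minimal sketch of symmetric round-to-nearest quantization of a single weight row. It is not GPTQ: GPTQ additionally compensates rounding error using second-order (Hessian-based) information, which is omitted here. The helper names `quantize_row` and `dequantize_row` and the per-row scale are illustrative assumptions, not taken from the PR.

```python
def quantize_row(row, bits):
    """Symmetric round-to-nearest quantization of one weight row.

    Plain RTN for illustration only; GPTQ's error compensation is not shown.
    """
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits, 127 for 8 bits
    scale = max(abs(w) for w in row) / qmax or 1.0  # fall back to 1.0 for an all-zero row
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


row = [0.4, -1.0, 0.25, 0.0]
codes6, scale6 = quantize_row(row, bits=6)   # assumed setting for the bulk of the weights
codes8, scale8 = quantize_row(row, bits=8)   # assumed setting for the embeddings
print(codes6, codes8)
```

The reconstruction error of each entry is bounded by half the scale, so the 8-bit embeddings get a roughly four times tighter bound than the 6-bit weights for the same row range.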
Architecture
depth recurrence
Repeats layers 3-5 once to create 14 virtual layers from 11 physical layers.
parameters: {"physical_layers":11,"virtual_layers":14,"repeat_layers":[3,4,5]}
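The layer counts above can be checked with a short sketch that expands the 11 physical layers into a 14-entry execution order. The function name `virtual_layer_order` is hypothetical, and the placement of the second pass (immediately after the first pass over layers 3-5) is an assumption; the summary only states which layers repeat and how many virtual layers result.

```python
def virtual_layer_order(physical_layers, repeat_layers):
    """Return the physical-layer index executed at each virtual depth.

    Assumption: the repeated span runs a second time right after its
    first pass; the PR summary does not state the exact placement.
    """
    order = []
    last_repeated = max(repeat_layers)
    for idx in range(physical_layers):
        order.append(idx)
        if idx == last_repeated:
            # Second pass over the shared span, reusing the same weights.
            order.extend(sorted(repeat_layers))
    return order


order = virtual_layer_order(11, [3, 4, 5])
print(len(order), order)
```

Because the repeated entries point at the same physical layers, the extra depth adds compute but no parameters to the artifact.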
weight tying
Tied input and output embeddings.
parameters: null
Partial RoPE
Uses rotary position embeddings on only part of the head dimensions.
parameters: {"dimensions":"16/64"}
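As an illustration of the 16/64 split, the sketch below rotates only the first 16 entries of a 64-dimensional head vector and passes the remaining 48 through unchanged. The adjacent-pair rotation layout, the base of 10000, and the name `partial_rope` are assumptions; the PR summary gives only the dimension split.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rotary_dims` entries only.

    Assumptions: adjacent entries form a rotation pair, and base = 10000.
    """
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        theta = position * base ** (-i / rotary_dims)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


head = [1.0] * 64
rotated = partial_rope(head, position=3)
```

The untouched 48 dimensions carry no positional signal, so they behave as position-independent content features in the attention dot product.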
XSA
Applies XSA attention across all layers.
parameters: {"layers":11}
SmearGate
Uses SmearGate in the architecture.
parameters: null
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"slope":0.5}
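A scalar sketch of the activation, assuming "LeakyReLU squared" means applying LeakyReLU with slope 0.5 and then squaring the result (the function name and this exact composition are assumptions):

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU followed by squaring; the output is always non-negative."""
    y = x if x >= 0 else slope * x
    return y * y


print(leaky_relu_squared(2.0), leaky_relu_squared(-2.0))
```

Under this reading, negative inputs still contribute, scaled by slope squared (0.25) relative to positive inputs of the same magnitude.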
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
logit softcap
parameters: {"value":30}
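Both regularizers reduce to one-line formulas, sketched below. Two assumptions are made: the layer index in 1/sqrt(layer+1) is zero-based, and the softcap uses the common tanh form, cap * tanh(logit / cap). The function names are illustrative.

```python
import math


def ln_scale(layer):
    """Per-layer normalization scale, 1/sqrt(layer+1), with a zero-based layer index (assumed)."""
    return 1.0 / math.sqrt(layer + 1)


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to [-cap, cap] using tanh (assumed form)."""
    return cap * math.tanh(logit / cap)


print(ln_scale(0), ln_scale(3), softcap(1000.0))
```

Small logits pass through almost unchanged, while very large ones saturate at the cap of 30.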
Weight Averaging
EMA
parameters: {"decay":0.9965}
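A minimal sketch of the EMA update with the listed decay, written over a flat list of weights for clarity (the name `ema_update` and the flat-list representation are assumptions):

```python
def ema_update(ema, weights, decay=0.9965):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


ema = [0.0, 1.0]
ema = ema_update(ema, [1.0, 1.0])
print(ema)
```

With decay 0.9965, each new step contributes 0.35% to the average, which corresponds to an effective averaging horizon of about 1 / (1 - 0.9965) ≈ 286 steps.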
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":4}
AdamW
weight_decay: 0.095
momentum: null
other_params: {"scope":"embeddings","learning_rate":0.03}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"scope":"scalars","learning_rate":0.02}
LR Schedule
warmdown
parameters: {"final_fraction":0.72,"target_lr":0}
cosine decay
parameters: {"final_multiplier":0.1}
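The two schedules can be sketched as learning-rate multipliers. This assumes the warmdown is a linear ramp to the target over the final 72% of steps and that the cosine decay ends at 0.1 times the peak rate; the summary does not say which training phase each schedule applies to, and the function names are hypothetical.

```python
import math


def warmdown_multiplier(step, total_steps, final_fraction=0.72, target=0.0):
    """Hold at 1.0, then decay linearly (assumed shape) to `target` over the final fraction."""
    start = (1.0 - final_fraction) * total_steps
    if step <= start:
        return 1.0
    progress = (step - start) / (total_steps - start)
    return 1.0 + (target - 1.0) * progress


def cosine_multiplier(step, total_steps, final_multiplier=0.1):
    """Cosine decay from 1.0 down to `final_multiplier`."""
    progress = step / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_multiplier + (1.0 - final_multiplier) * cosine


print(warmdown_multiplier(640, 1000), cosine_multiplier(1000, 1000))
```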
Compression
Brotli
level: 11
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048

Novel Contributions

  • Non-record pre-quant AdamW TTT that violates Condition 3 by training on validation tokens before scoring them
  • Compiled TTT with torch.compile for roughly 2x speedup
  • Artifact budget engineering for SP8192, including VE dimension selection to avoid pruning
  • Depth recurrence combined with parallel residuals and XSA in a compact 11-layer Transformer
  • Empirical comparison of illegal pre-quant TTT against the legal score-first TTT boundary
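The difference between the two orderings named in the list above can be shown with a toy stand-in. The sketch below replaces the Transformer and AdamW updates with an add-one-smoothed unigram counter, so it illustrates only the ordering of scoring and adaptation, not the PR's actual TTT procedure. All names (`UnigramModel`, `score_first`, `train_then_score`) are hypothetical.

```python
import math


class UnigramModel:
    """Toy adaptive model: add-one-smoothed token counts (stand-in for TTT)."""

    def __init__(self, vocab_size):
        self.counts = [1] * vocab_size

    def nll(self, tokens):
        total = sum(self.counts)
        return sum(-math.log(self.counts[t] / total) for t in tokens)

    def adapt(self, tokens):
        for t in tokens:
            self.counts[t] += 1


def score_first(chunks, vocab_size):
    """Score each chunk before adapting on it."""
    model = UnigramModel(vocab_size)
    loss = 0.0
    for chunk in chunks:
        loss += model.nll(chunk)   # scored while the chunk is still unseen
        model.adapt(chunk)         # only then used for adaptation
    return loss


def train_then_score(chunks, vocab_size):
    """Adapt on every chunk first, then score the same tokens."""
    model = UnigramModel(vocab_size)
    for chunk in chunks:
        model.adapt(chunk)
    return sum(model.nll(chunk) for chunk in chunks)


chunks = [[0, 0, 0, 0], [0, 0, 0, 0]]
legal = score_first(chunks, vocab_size=2)
leaky = train_then_score(chunks, vocab_size=2)
print(legal, leaky)
```

On this toy data the train-then-score loss is much lower than the score-first loss, purely because the model has already seen the tokens it is evaluated on.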