PR #1406
openRecord: 11L Depth Recurrence + Discriminative Pre-Quant TTT (8xH100) — val_bpb 1.0887 (3-seed mean)
by aamodbhatt
val_bpb
1.0887
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,926,365 bytes
Training Techniques
Architecture
depth recurrence
Blocks 4 and 5 are run twice in the forward pass, increasing effective depth without adding parameters.
parameters: {"layers":11,"recurrent_layers":[4,5],"effective_passes":13}
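The schedule implied by these parameters (11 layers, blocks 4 and 5 revisited once, 13 effective passes) can be sketched with plain callables standing in for transformer blocks; the exact visit order (replaying the recurrent span immediately after its first pass) is an assumption consistent with the numbers above:

```python
def run_with_recurrence(blocks, x, recurrent=(4, 5)):
    """Apply an 11-block stack, revisiting the recurrent blocks once more.

    `blocks` is a list of callables standing in for transformer blocks.
    Assumed visit order: 0-3, then 4,5,4,5, then 6-10 -> 13 passes total,
    with no extra parameters.
    """
    schedule = []
    for i in range(len(blocks)):
        schedule.append(i)
        if i == max(recurrent):      # after the recurrent span, replay it once
            schedule.extend(recurrent)
    for i in schedule:
        x = blocks[i](x)
    return x
```

With identity-plus-one toy blocks, an input of 0 comes out as 13, confirming the 13 effective passes.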
BigramHash
Uses a bigram vocabulary/hash component in the model.
parameters: {"vocab_size":1536}
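A minimal sketch of a bigram hash lookup: only the 1536-slot vocabulary size comes from the record; the multiplicative-xor mixing function is a hypothetical choice.

```python
def bigram_hash_ids(token_ids, vocab_size=1536):
    """Map each consecutive token pair to one of `vocab_size` bigram slots.

    Only vocab_size=1536 is from the record; the prime multiplier and
    xor mix are illustrative, not the record's actual hash.
    """
    PRIME = 1_000_003
    return [((a * PRIME) ^ b) % vocab_size
            for a, b in zip(token_ids, token_ids[1:])]
```

The resulting ids would index a small auxiliary embedding table alongside the regular token embeddings.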
XSA
Applies XSA in the last 4 layers of the model.
parameters: {"last_n_layers":4}
VE128
Adds value residual enhancement with 128-dimensional value embeddings (VE) in layers 9 and 10.
parameters: {"layers":[9,10],"dimension":128}
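One plausible reading of VE128, sketched with toy dimensions: a learned per-token value embedding added to the attention values in layers 9 and 10 only. The additive formulation and token-id lookup are assumptions; the record fixes only the layers and the 128-dimensional size.

```python
def apply_ve(values, ve_table, token_ids, layer, ve_layers=(9, 10)):
    """Add a learned per-token value embedding to attention values in the
    designated layers; all other layers pass values through unchanged.

    In the record the embedding is 128-dimensional; tiny vectors are used
    here for illustration, and the additive form is an assumption.
    """
    if layer not in ve_layers:
        return values
    return [[v + e for v, e in zip(row, ve_table[t])]
            for row, t in zip(values, token_ids)]
```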
LeakyReLU
Uses LeakyReLU^2 activation in the MLP.
parameters: {"slope":0.5}
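A minimal reading of LeakyReLU^2 with slope 0.5, generalizing the common relu^2 MLP activation by squaring the LeakyReLU output; whether the negative branch keeps its sign is not specified in the record, so the plain square here is an assumption:

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """Squared LeakyReLU: LeakyReLU with slope 0.5, then squared.

    Plain squaring (negative branch becomes positive) is one reading of
    "LeakyReLU^2"; a signed variant sign(y)*y*y is equally plausible.
    """
    y = x if x > 0.0 else slope * x
    return y * y
```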
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0005,"epochs":10,"freeze_blocks":0,"cosine_decay":true,"pre_quant":true}
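The cosine-decayed TTT learning rate (base 5e-4, 10 epochs) combined with discriminative per-block scaling can be viewed as a per-(epoch, block) LR table. The cosine decay and base LR come from the parameters above; the linear depth-proportional scaling rule is a hypothetical illustration of `discriminative_lr_scaling`:

```python
import math

def ttt_lr_schedule(base_lr=5e-4, epochs=10, n_blocks=11):
    """Per-(epoch, block) learning rates for pre-quant TTT.

    Cosine decay over epochs is from the record; scaling each block's LR
    linearly with depth (deeper blocks adapt faster) is an assumed
    instance of discriminative LR scaling, not the record's exact rule.
    """
    table = []
    for e in range(epochs):
        decay = 0.5 * (1.0 + math.cos(math.pi * e / max(1, epochs - 1)))
        table.append([base_lr * decay * (b + 1) / n_blocks
                      for b in range(n_blocks)])
    return table
```

With `freeze_blocks=0`, every block receives a nonzero LR in the first epoch; the last epoch decays to zero.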
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: {"pre_quant_adaptation":true,"discriminative_lr_scaling":true}
Quantization
GPTQ-lite
bits: 6
scope: all
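Six bits gives a 64-level grid per weight group. The sketch below shows only the uniform round-trip quantization (a symmetric -31..31 grid with a single scale); GPTQ-lite's Hessian-based error compensation is omitted for brevity:

```python
def quantize_6bit(weights):
    """Symmetric 6-bit round-trip: integer levels in [-31, 31] plus a scale.

    This is only the uniform-grid part of the pipeline; GPTQ's
    column-by-column error compensation is not shown.
    """
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 31.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return [qi * scale for qi in q], scale
```

The dequantized weights differ from the originals by at most about half a grid step.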
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
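One way to combine the two averages with these parameters is to maintain an EMA (decay 0.997) every step and fold a snapshot into a uniform SWA average every 50 steps. Scalar weights stand in for parameter tensors, and folding the EMA (rather than the raw weights) into SWA is an assumption:

```python
class EmaSwaAverager:
    """EMA every step; every `swa_every` steps, fold the current EMA
    into a uniform SWA average. Decay and cadence match the record."""

    def __init__(self, w0, ema_decay=0.997, swa_every=50):
        self.ema = w0
        self.decay = ema_decay
        self.swa_every = swa_every
        self.swa_sum = 0.0
        self.swa_n = 0

    def update(self, w, step):
        self.ema = self.decay * self.ema + (1.0 - self.decay) * w
        if step % self.swa_every == 0:
            self.swa_sum += self.ema
            self.swa_n += 1

    @property
    def swa(self):
        return self.swa_sum / max(1, self.swa_n)
```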
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
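The 1/sqrt(layer+1) rule can be read as a depth-dependent LayerNorm gain initialization (0-indexed layers), shrinking each successive layer's residual contribution; that reading is an assumption:

```python
import math

def ln_init_scales(n_layers=11):
    """LayerNorm gain init of 1/sqrt(layer+1), 0-indexed: layer 0 starts
    at 1.0, layer 3 at 0.5, and so on, monotonically decreasing."""
    return [1.0 / math.sqrt(layer + 1) for layer in range(n_layers)]
```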
weight decay
parameters: {"value":0.04}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
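The usual reading of "warmdown" is a constant LR followed by a linear ramp to zero over the final 3500 steps; the total step count and base LR in the example are placeholders, not from the record:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear decay to 0 over the last `warmdown_steps`
    steps. Only warmdown_steps=3500 comes from the record."""
    if step < total_steps - warmdown_steps:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps
```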
cosine decay
parameters: {"applied_to":"TTT"}
Evaluation
sliding window eval
parameters: {"stride":64}
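Sliding-window eval with stride 64 scores each token exactly once, advancing 64 tokens per window so later tokens keep near-full left context (window 32768 matching eval_length). A span-generator sketch, with toy sizes in the test:

```python
def sliding_eval_spans(n_tokens, window=32768, stride=64):
    """(context_start, score_start, score_end) triples for sliding-window
    evaluation: each window scores `stride` new tokens, conditioning on
    up to `window` tokens of left context, so every token is scored once."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        context_start = max(0, score_end - window)
        spans.append((context_start, score_start, score_end))
        score_start = score_end
    return spans
```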
Sequence Length
sequence_length
train_length: null
eval_length: 32768
Compression
lzma
level: 7
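The compression step maps directly onto the standard-library `lzma` module; `preset=7` matches the record's level:

```python
import lzma

def compress_artifact(data: bytes) -> bytes:
    """Compress the serialized artifact with LZMA at preset 7."""
    return lzma.compress(data, preset=7)
```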
Novel Contributions
- Depth recurrence: blocks 4 and 5 are executed twice for zero-parameter effective depth increase.
- Discriminative pre-quant TTT with per-block learning-rate scaling before GPTQ quantization.
- Muon-style test-time adaptation using Newton-Schulz orthogonalized updates instead of SGD.
- Entropy-adaptive TTT epochs selected per chunk based on chunk NLL.
- Score-first TTT protocol with frozen model at evaluation time.
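The Muon-style contribution above replaces the raw gradient in each test-time update with an approximately orthogonalized one. A minimal sketch using the cubic Newton-Schulz iteration X <- 1.5X - 0.5XX^TX (Muon itself uses a tuned quintic; pure-Python lists stand in for GPU tensors):

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def newton_schulz(g, steps=10):
    """Drive a matrix toward its nearest orthogonal factor via the cubic
    Newton-Schulz iteration, after normalizing so singular values are <= 1
    (crude Frobenius bound). The cubic variant is a simplification of
    Muon's quintic iteration."""
    fro = sum(v * v for row in g for v in row) ** 0.5 or 1.0
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * tv for xv, tv in zip(rx, rt)]
             for rx, rt in zip(x, xxt_x)]
    return x
```

On a diagonal test matrix, the iteration pushes every singular value toward 1, which is exactly the orthogonalization the update needs.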