PR #1406
openRecord: 11L Depth Recurrence + Discriminative Pre-Quant TTT (8xH100) — val_bpb 1.0887 (3-seed mean)
by aamodbhatt
val_bpb
1.0887
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,926,365 bytes
Training Techniques
Architecture
depth recurrence
Blocks 4 and 5 are run twice in the forward pass, increasing effective depth without adding parameters.
parameters: {"layers":11,"recurrent_layers":[4,5],"effective_passes":13}
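The schedule implied by these parameters (11 layers, blocks 4 and 5 revisited once, 13 effective passes) can be sketched with plain callables standing in for transformer blocks; the exact visit order (replaying the recurrent span immediately after its first pass) is an assumption consistent with the numbers above:

```python
def run_with_recurrence(blocks, x, recurrent=(4, 5)):
    """Apply an 11-block stack, revisiting the recurrent blocks once more.

    `blocks` is a list of callables standing in for transformer blocks.
    Assumed visit order: 0-3, then 4,5,4,5, then 6-10 -> 13 passes total,
    with no extra parameters.
    """
    schedule = []
    for i in range(len(blocks)):
        schedule.append(i)
        if i == max(recurrent):      # after the recurrent span, replay it once
            schedule.extend(recurrent)
    for i in schedule:
        x = blocks[i](x)
    return x
```

With identity-plus-one toy blocks, an input of 0 comes out as 13, confirming the 13 effective passes.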
BigramHash
Uses a bigram vocabulary/hash component in the model.
parameters: {"vocab_size":1536}
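A minimal sketch of a bigram hash lookup: only the 1536-slot vocabulary size comes from the record; the multiplicative-xor mixing function is a hypothetical choice.

```python
def bigram_hash_ids(token_ids, vocab_size=1536):
    """Map each consecutive token pair to one of `vocab_size` bigram slots.

    Only vocab_size=1536 is from the record; the prime multiplier and
    xor mix are illustrative, not the record's actual hash.
    """
    PRIME = 1_000_003
    return [((a * PRIME) ^ b) % vocab_size
            for a, b in zip(token_ids, token_ids[1:])]
```

The resulting ids would index a small auxiliary embedding table alongside the regular token embeddings.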
XSA
Applies XSA in the last 4 layers of the model.
parameters: {"last_n_layers":4}
VE128
Adds value residual enhancement with 128-dimensional value embeddings (VE) in layers 9 and 10.
parameters: {"layers":[9,10],"dimension":128}
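One plausible reading of VE128, sketched with toy dimensions: a learned per-token value embedding added to the attention values in layers 9 and 10 only. The additive formulation and token-id lookup are assumptions; the record fixes only the layers and the 128-dimensional size.

```python
def apply_ve(values, ve_table, token_ids, layer, ve_layers=(9, 10)):
    """Add a learned per-token value embedding to attention values in the
    designated layers; all other layers pass values through unchanged.

    In the record the embedding is 128-dimensional; tiny vectors are used
    here for illustration, and the additive form is an assumption.
    """
    if layer not in ve_layers:
        return values
    return [[v + e for v, e in zip(row, ve_table[t])]
            for row, t in zip(values, token_ids)]
```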
LeakyReLU
Uses LeakyReLU^2 activation in the MLP.
parameters: {"slope":0.5}
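A minimal reading of LeakyReLU^2 with slope 0.5, generalizing the common relu^2 MLP activation by squaring the LeakyReLU output; whether the negative branch keeps its sign is not specified in the record, so the plain square here is an assumption:

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """Squared LeakyReLU: LeakyReLU with slope 0.5, then squared.

    Plain squaring (negative branch becomes positive) is one reading of
    "LeakyReLU^2"; a signed variant sign(y)*y*y is equally plausible.
    """
    y = x if x > 0.0 else slope * x
    return y * y
```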
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0005,"epochs":10,"freeze_blocks":0,"cosine_decay":true,"pre_quant":true}
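The cosine-decayed TTT learning rate (base 5e-4, 10 epochs) combined with discriminative per-block scaling can be viewed as a per-(epoch, block) LR table. The cosine decay and base LR come from the parameters above; the linear depth-proportional scaling rule is a hypothetical illustration of `discriminative_lr_scaling`:

```python
import math

def ttt_lr_schedule(base_lr=5e-4, epochs=10, n_blocks=11):
    """Per-(epoch, block) learning rates for pre-quant TTT.

    Cosine decay over epochs is from the record; scaling each block's LR
    linearly with depth (deeper blocks adapt faster) is an assumed
    instance of discriminative LR scaling, not the record's exact rule.
    """
    table = []
    for e in range(epochs):
        decay = 0.5 * (1.0 + math.cos(math.pi * e / max(1, epochs - 1)))
        table.append([base_lr * decay * (b + 1) / n_blocks
                      for b in range(n_blocks)])
    return table
```

With `freeze_blocks=0`, every block receives a nonzero LR in the first epoch; the last epoch decays to zero.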
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: {"pre_quant_adaptation":true,"discriminative_lr_scaling":true}
Quantization
GPTQ-lite
bits: 6
scope: all
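Six bits gives a 64-level grid per weight group. The sketch below shows only the uniform round-trip quantization (a symmetric -31..31 grid with a single scale); GPTQ-lite's Hessian-based error compensation is omitted for brevity:

```python
def quantize_6bit(weights):
    """Symmetric 6-bit round-trip: integer levels in [-31, 31] plus a scale.

    This is only the uniform-grid part of the pipeline; GPTQ's
    column-by-column error compensation is not shown.
    """
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 31.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return [qi * scale for qi in q], scale
```

The dequantized weights differ from the originals by at most about half a grid step.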
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
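One way to combine the two averages with these parameters is to maintain an EMA (decay 0.997) every step and fold a snapshot into a uniform SWA average every 50 steps. Scalar weights stand in for parameter tensors, and folding the EMA (rather than the raw weights) into SWA is an assumption:

```python
class EmaSwaAverager:
    """EMA every step; every `swa_every` steps, fold the current EMA
    into a uniform SWA average. Decay and cadence match the record."""

    def __init__(self, w0, ema_decay=0.997, swa_every=50):
        self.ema = w0
        self.decay = ema_decay
        self.swa_every = swa_every
        self.swa_sum = 0.0
        self.swa_n = 0

    def update(self, w, step):
        self.ema = self.decay * self.ema + (1.0 - self.decay) * w
        if step % self.swa_every == 0:
            self.swa_sum += self.ema
            self.swa_n += 1

    @property
    def swa(self):
        return self.swa_sum / max(1, self.swa_n)
```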
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
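The 1/sqrt(layer+1) rule can be read as a depth-dependent LayerNorm gain initialization (0-indexed layers), shrinking each successive layer's residual contribution; that reading is an assumption:

```python
import math

def ln_init_scales(n_layers=11):
    """LayerNorm gain init of 1/sqrt(layer+1), 0-indexed: layer 0 starts
    at 1.0, layer 3 at 0.5, and so on, monotonically decreasing."""
    return [1.0 / math.sqrt(layer + 1) for layer in range(n_layers)]
```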
weight decay
parameters: {"value":0.04}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
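The usual reading of "warmdown" is a constant LR followed by a linear ramp to zero over the final 3500 steps; the total step count and base LR in the example are placeholders, not from the record:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear decay to 0 over the last `warmdown_steps`
    steps. Only warmdown_steps=3500 comes from the record."""
    if step < total_steps - warmdown_steps:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps
```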
cosine decay
parameters: {"applied_to":"TTT"}
Evaluation
sliding window eval
parameters: {"stride":64}
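Sliding-window eval with stride 64 scores each token exactly once, advancing 64 tokens per window so later tokens keep near-full left context (window 32768 matching eval_length). A span-generator sketch, with toy sizes in the test:

```python
def sliding_eval_spans(n_tokens, window=32768, stride=64):
    """(context_start, score_start, score_end) triples for sliding-window
    evaluation: each window scores `stride` new tokens, conditioning on
    up to `window` tokens of left context, so every token is scored once."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        context_start = max(0, score_end - window)
        spans.append((context_start, score_start, score_end))
        score_start = score_end
    return spans
```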
Sequence Length
sequence_length
train_length: null
eval_length: 32768
Compression
lzma
level: 7
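The compression step maps directly onto the standard-library `lzma` module; `preset=7` matches the record's level:

```python
import lzma

def compress_artifact(data: bytes) -> bytes:
    """Compress the serialized artifact with LZMA at preset 7."""
    return lzma.compress(data, preset=7)
```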
Novel Contributions
- Depth recurrence: blocks 4 and 5 are executed twice for zero-parameter effective depth increase.
- Discriminative pre-quant TTT with per-block learning-rate scaling before GPTQ quantization.
- Muon-style test-time adaptation using Newton-Schulz orthogonalized updates instead of SGD.
- Entropy-adaptive TTT epochs selected per chunk based on chunk NLL.
- Score-first TTT protocol with frozen model at evaluation time.
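The Muon-style contribution above replaces the raw gradient in each test-time update with an approximately orthogonalized one. A minimal sketch using the cubic Newton-Schulz iteration X <- 1.5X - 0.5XX^TX (Muon itself uses a tuned quintic; pure-Python lists stand in for GPU tensors):

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def newton_schulz(g, steps=10):
    """Drive a matrix toward its nearest orthogonal factor via the cubic
    Newton-Schulz iteration, after normalizing so singular values are <= 1
    (crude Frobenius bound). The cubic variant is a simplification of
    Muon's quintic iteration."""
    fro = sum(v * v for row in g for v in row) ** 0.5 or 1.0
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * tv for xv, tv in zip(rx, rt)]
             for rx, rt in zip(x, xxt_x)]
    return x
```

On a diagonal test matrix, the iteration pushes every singular value toward 1, which is exactly the orthogonalization the update needs.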