PR #537
Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds)
by Christopher-Lee-McClendon
val_bpb
1.1387
Architecture
Transformer (11L depth recurrence with 10 unique BlockCores, LeakyReLU(0.5)² MLP)
Optimizer
Muon (hidden/attn) + Adam (embeddings/scalars) for training; SGD with momentum=0.9 for TTT
Artifact Size
15.36 MB
Training Techniques
Architecture
LeakyReLU(0.5)² MLP
Replaced ReLU² with LeakyReLU(0.5)² activation in MLP to preserve negative gradient flow and improve pre-TTT BPB
parameters: null
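A minimal sketch of this activation under one plausible reading (the PR does not include code): apply LeakyReLU with negative_slope 0.5, then square. Unlike ReLU², whose gradient is zero for x < 0, the gradient here is 0.5·x for negative inputs, matching the stated goal of preserving negative gradient flow.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU(0.5) followed by squaring (assumed form of 'LeakyReLU(0.5)²').

    For x >= 0 this equals ReLU² exactly; for x < 0 it gives (0.5*x)**2,
    whose derivative 0.5*x is nonzero, so negative pre-activations still learn.
    """
    y = x if x >= 0 else negative_slope * x
    return y * y
```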
depth recurrence
11 logical layers sharing 10 unique BlockCores for weight-efficient depth
parameters: {"layers":11,"unique_layers":10}
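One hypothetical way to realize 11 logical layers over 10 unique BlockCores (the PR states only the counts, not which core repeats): run the cores in order and reuse one of them for the extra depth step.

```python
def core_schedule(logical_layers: int = 11, unique_cores: int = 10) -> list:
    """Map each logical layer index to a BlockCore index.

    Assumption: cores run in order and the last core is reused for the
    extra depth step(s); the actual recurrence pattern may differ.
    """
    schedule = list(range(unique_cores))
    schedule += [unique_cores - 1] * (logical_layers - unique_cores)
    return schedule
```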
Partial RoPE
Partial rotary positional embeddings with NTK-aware scaling
parameters: {"dimensions":"16/64"}
Value Embeddings
128-dimensional value embeddings applied on layers 9-10 with per-layer scale initialization
parameters: {"dimensions":128,"layers":[9,10]}
SmearGate
Learned token-mixing gate applied to the input embeddings
parameters: null
BigramHash
Bigram hashing with 2048 features and 128-dimensional embeddings
parameters: {"features":2048,"embedding_dim":128}
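A sketch of the bigram-hash idea with a made-up hash function (the PR does not specify the hash): map each (previous token, current token) pair into one of 2048 feature buckets, each backed by a row of a 128-dimensional embedding table.

```python
def bigram_bucket(prev_token: int, token: int, num_features: int = 2048) -> int:
    """Hash a token bigram into a feature bucket.

    The multiplicative constant is illustrative only; any mixing hash that
    distinguishes (a, b) from (b, a) would serve the same role.
    """
    return (prev_token * 1000003 + token) % num_features
```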
XSA
Cross-sequence attention applied in the last 4 layers
parameters: {"layers":4}
U-Net skips
Residual connections across layer pairs
parameters: null
LN Scale
LayerNorm scale with 1/sqrt(layer+1) depth scaling
parameters: null
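The depth scaling reads as initializing each LayerNorm gain to 1/sqrt(layer+1); a one-line sketch, assuming 0-indexed layers:

```python
import math

def ln_scale_init(layer_index: int) -> float:
    """Initial LayerNorm scale for a given (0-indexed) layer: 1/sqrt(layer+1)."""
    return 1.0 / math.sqrt(layer_index + 1)
```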
Quantization
int6 QAT
bits: 6
scope: all model weights
Weight Averaging
SWA
parameters: {"start_step":4650}
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","momentum":0.9,"base_learning_rate":0.002,"per_layer_lr":{"mlp.proj":3,"mlp.fc":0.5},"intra_chunk_cosine_decay":true,"epochs_per_chunk":30,"chunk_size_tokens":32768,"stride":64,"frozen_blocks":2,"trainable_params":19911748,"total_params":24634452}
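The per-layer LR part of the TTT config can be sketched as a hypothetical helper that maps a parameter name to its learning rate (the names mlp.proj / mlp.fc and the multipliers are taken from the parameters above; the exact name-matching logic is an assumption):

```python
def ttt_lr_for(param_name: str, base_lr: float = 0.002) -> float:
    """Per-layer TTT learning rate: mlp.proj at 3x base, mlp.fc at 0.5x, else base."""
    multipliers = {"mlp.proj": 3.0, "mlp.fc": 0.5}
    for key, mult in multipliers.items():
        if key in param_name:
            return base_lr * mult
    return base_lr
```

In a PyTorch-style setup these rates would typically become per-parameter-group `lr` values passed to the SGD optimizer alongside momentum=0.9.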
LR Schedule
cosine decay
parameters: {"intra_chunk":true,"inter_chunk":true,"formula":"0.5 × (1 + cos(π × step / total_steps))"}
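The listed formula, applied both within each chunk's 30 epochs and across chunks, can be sketched as:

```python
import math

def cosine_decay_multiplier(step: int, total_steps: int) -> float:
    """LR multiplier 0.5 * (1 + cos(pi * step / total_steps)):
    1.0 at step 0, 0.5 at the midpoint, 0.0 at the final step."""
    return 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```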
Optimizer
Muon + Adam
weight_decay: null
momentum: null
other_params: {"Muon":"used for hidden and attention parameters","Adam":"used for embeddings and scalar parameters"}
Novel Contributions
- Use of LeakyReLU(0.5)² activation in the MLP, replacing ReLU², improving pre-TTT BPB by ~0.0035
- Application of per-layer learning rates during test-time training (TTT): mlp.proj at 3× the base LR and mlp.fc at 0.5×
- Intra-chunk cosine learning-rate decay across each chunk's 30 TTT epochs
- Integration of the legal score-first TTT protocol with the first 2 blocks frozen and 30 epochs per chunk
- Demonstration that the TTT modifications (per-layer LR and intra-chunk cosine decay) did not improve the TTT gain in this architecture; all of the final BPB improvement came from pre-TTT model changes