PR #537
Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds)
by Christopher-Lee-McClendon
val_bpb
1.1387
Architecture
Transformer (11L depth recurrence with 10 unique BlockCores, LeakyReLU(0.5)² MLP)
Optimizer
Muon (hidden/attn) + Adam (embeddings/scalars) for training; SGD with momentum=0.9 for TTT
Artifact Size
15.36 MB
Training Techniques
Architecture
LeakyReLU(0.5)² MLP
Replaced ReLU² with LeakyReLU(0.5)² activation in MLP to preserve negative gradient flow and improve pre-TTT BPB
parameters: null
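A minimal sketch of this activation under one plausible reading (the PR does not include code): apply LeakyReLU with negative_slope 0.5, then square. Unlike ReLU², whose gradient is zero for x < 0, the gradient here is 0.5·x for negative inputs, matching the stated goal of preserving negative gradient flow.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU(0.5) followed by squaring (assumed form of 'LeakyReLU(0.5)²').

    For x >= 0 this equals ReLU² exactly; for x < 0 it gives (0.5*x)**2,
    whose derivative 0.5*x is nonzero, so negative pre-activations still learn.
    """
    y = x if x >= 0 else negative_slope * x
    return y * y
```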
depth recurrence
11 logical layers sharing 10 unique BlockCores for weight-efficient depth
parameters: {"layers":11,"unique_layers":10}
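One hypothetical way to realize 11 logical layers over 10 unique BlockCores (the PR states only the counts, not which core repeats): run the cores in order and reuse one of them for the extra depth step.

```python
def core_schedule(logical_layers: int = 11, unique_cores: int = 10) -> list:
    """Map each logical layer index to a BlockCore index.

    Assumption: cores run in order and the last core is reused for the
    extra depth step(s); the actual recurrence pattern may differ.
    """
    schedule = list(range(unique_cores))
    schedule += [unique_cores - 1] * (logical_layers - unique_cores)
    return schedule
```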
Partial RoPE
Partial rotary positional embeddings with NTK-aware scaling
parameters: {"dimensions":"16/64"}
Value Embeddings
128-dimensional value embeddings applied on layers 9-10 with per-layer scale initialization
parameters: {"dimensions":128,"layers":[9,10]}
SmearGate
Learned token-mixing gate applied to the input embeddings
parameters: null
BigramHash
Bigram hashing with 2048 features and 128-dimensional embeddings
parameters: {"features":2048,"embedding_dim":128}
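A sketch of the bigram-hash idea with a made-up hash function (the PR does not specify the hash): map each (previous token, current token) pair into one of 2048 feature buckets, each backed by a row of a 128-dimensional embedding table.

```python
def bigram_bucket(prev_token: int, token: int, num_features: int = 2048) -> int:
    """Hash a token bigram into a feature bucket.

    The multiplicative constant is illustrative only; any mixing hash that
    distinguishes (a, b) from (b, a) would serve the same role.
    """
    return (prev_token * 1000003 + token) % num_features
```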
XSA
Cross-sequence attention applied in the last 4 layers
parameters: {"layers":4}
U-Net skips
Residual connections across layer pairs
parameters: null
LN Scale
LayerNorm scale with 1/sqrt(layer+1) depth scaling
parameters: null
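The depth scaling reads as initializing each LayerNorm gain to 1/sqrt(layer+1); a one-line sketch, assuming 0-indexed layers:

```python
import math

def ln_scale_init(layer_index: int) -> float:
    """Initial LayerNorm scale for a given (0-indexed) layer: 1/sqrt(layer+1)."""
    return 1.0 / math.sqrt(layer_index + 1)
```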
Quantization
int6 QAT
bits: 6
scope: all model weights
Weight Averaging
SWA
parameters: {"start_step":4650}
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","momentum":0.9,"base_learning_rate":0.002,"per_layer_lr":{"mlp.proj":3,"mlp.fc":0.5},"intra_chunk_cosine_decay":true,"epochs_per_chunk":30,"chunk_size_tokens":32768,"stride":64,"frozen_blocks":2,"trainable_params":19911748,"total_params":24634452}
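The per-layer LR part of the TTT config can be sketched as a hypothetical helper that maps a parameter name to its learning rate (the names mlp.proj / mlp.fc and the multipliers are taken from the parameters above; the exact name-matching logic is an assumption):

```python
def ttt_lr_for(param_name: str, base_lr: float = 0.002) -> float:
    """Per-layer TTT learning rate: mlp.proj at 3x base, mlp.fc at 0.5x, else base."""
    multipliers = {"mlp.proj": 3.0, "mlp.fc": 0.5}
    for key, mult in multipliers.items():
        if key in param_name:
            return base_lr * mult
    return base_lr
```

In a PyTorch-style setup these rates would typically become per-parameter-group `lr` values passed to the SGD optimizer alongside momentum=0.9.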
LR Schedule
cosine decay
parameters: {"intra_chunk":true,"inter_chunk":true,"formula":"0.5 × (1 + cos(π × step / total_steps))"}
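The listed formula, applied both within each chunk's 30 epochs and across chunks, can be sketched as:

```python
import math

def cosine_decay_multiplier(step: int, total_steps: int) -> float:
    """LR multiplier 0.5 * (1 + cos(pi * step / total_steps)):
    1.0 at step 0, 0.5 at the midpoint, 0.0 at the final step."""
    return 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```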
Optimizer
Muon + Adam
weight_decay: null
momentum: null
other_params: {"Muon":"used for hidden and attention parameters","Adam":"used for embeddings and scalar parameters"}
Novel Contributions
- Use of LeakyReLU(0.5)² activation in the MLP, replacing ReLU², improving pre-TTT BPB by ~0.0035
- Application of per-layer learning rates during test-time training (TTT): mlp.proj at 3× the base LR and mlp.fc at 0.5×
- Intra-chunk cosine learning-rate decay across each chunk's 30 TTT epochs
- Integration of the legal score-first TTT protocol with the first 2 blocks frozen and 30 epochs per chunk
- Demonstration that the TTT modifications (per-layer LR and intra-chunk cosine decay) did not improve the TTT gain in this architecture; all of the final BPB improvement came from pre-TTT model changes