PR #620

open

Record: LeakyReLU(0.5)² + Per-Document LoRA TTT (mean val_bpb=0.9443, 3 seeds)

by robinojw
val_bpb
0.9443
Architecture
U-Net
Optimizer
Muon
Artifact Size
15,430,887 bytes (~14.7 MiB)

Training Techniques

Architecture
LeakyReLU(0.5)²
Single-line activation swap replacing torch.relu(x) with F.leaky_relu(x, 0.5); preserves negative gradient flow and prevents dead neurons in the squared activation
parameters: null
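The swap described above is a one-liner inside the squared activation. A minimal sketch of the before/after (the function names here are illustrative, not from the PR):

```python
import torch
import torch.nn.functional as F

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    # Baseline squared ReLU: gradient is exactly zero for x < 0,
    # so a unit pushed negative can stop learning ("die").
    return torch.relu(x) ** 2

def leaky_relu_squared(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    # The PR's swap: LeakyReLU(0.5)^2 keeps a nonzero gradient for x < 0,
    # preserving negative gradient flow through the square.
    return F.leaky_relu(x, slope) ** 2
```

Note the squared form stays non-negative either way; only the gradient path for negative pre-activations changes.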
SmearGate
Learned token blending via sigmoid gate
parameters: null
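The PR gives no implementation details for SmearGate beyond "learned token blending via sigmoid gate". A plausible minimal sketch, assuming each position blends in its previous token's representation through a learned per-token scalar gate (the gate parameterization is an assumption):

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Sketch: each position mixes in some of the previous token's embedding,
    weighted by a learned sigmoid gate. Causal: only looks backward."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, 1)  # hypothetical per-token scalar gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0] = 0.0                       # no previous token at position 0
        g = torch.sigmoid(self.gate_proj(x))   # (batch, seq, 1), in (0, 1)
        return x + g * prev                    # gated blend with the prior token
```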
BigramHash
Embedding with 2048 buckets and dimension 128
parameters: {"buckets":2048,"dim":128}
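With buckets=2048 and dim=128 given, BigramHash presumably hashes each (previous token, token) pair into a bucket and adds a learned embedding. A sketch under that assumption; the hash constant is arbitrary, not from the PR:

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Sketch: hash each token bigram into one of `buckets` slots and
    look up a learned `dim`-dimensional embedding for it."""

    def __init__(self, buckets: int = 2048, dim: int = 128):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.Embedding(buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no previous token at position 0
        # cheap multiplicative hash of the bigram (constant is an arbitrary prime)
        h = (prev * 1000003 + tokens) % self.buckets
        return self.emb(h)  # (batch, seq, dim)
```

Collisions are expected at 2048 buckets; the point is a cheap bigram-level signal added alongside the usual token embedding.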
depth-scaled residuals
Residual connections scaled by 1/sqrt(layer+1)
parameters: null
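The 1/sqrt(layer+1) residual scaling is simple enough to show directly; this sketch assumes the scale is applied to the sublayer branch before the add:

```python
import math
import torch

def depth_scaled_residual(x: torch.Tensor, sublayer_out: torch.Tensor,
                          layer_idx: int) -> torch.Tensor:
    # Residual branch scaled by 1/sqrt(layer+1): deeper layers contribute
    # progressively less, keeping activation variance roughly stable with depth.
    return x + sublayer_out / math.sqrt(layer_idx + 1)
```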
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"lr":0.02,"momentum_warmup":"0.92→0.99"}
Adam
weight_decay: null
momentum: null
other_params: {"lr_embeddings":0.03,"lr_scalars":0.02}
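The settings above imply three parameter groups: hidden weight matrices under Muon (lr=0.02, wd=0.04), embeddings under Adam (lr=0.03), and scalars under Adam (lr=0.02). A sketch of one way to partition parameters; the name-matching rules are assumptions, not the PR's actual code:

```python
import torch
import torch.nn as nn

def split_param_groups(model: nn.Module):
    """Partition parameters into (matrix, embedding, scalar) groups,
    matching the Muon/Adam split described above (rules are illustrative)."""
    matrix, embed, scalar = [], [], []
    for name, p in model.named_parameters():
        if "emb" in name or "lm_head" in name:
            embed.append(p)    # Adam, lr=0.03
        elif p.ndim >= 2:
            matrix.append(p)   # Muon, lr=0.02, weight_decay=0.04
        else:
            scalar.append(p)   # Adam, lr=0.02 (biases, gains, gates)
    return matrix, embed, scalar
```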
Weight Averaging
SWA
parameters: {"decay":0.999}
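An SWA entry parameterized by a decay coefficient behaves as an exponential moving average of the weights rather than a uniform average. A minimal sketch of the update, assuming a standard EMA form:

```python
import torch

@torch.no_grad()
def ema_update(avg_params, model_params, decay: float = 0.999):
    # Weight averaging with a decay: avg <- decay * avg + (1 - decay) * current.
    # The averaged copy is what gets evaluated / shipped.
    for a, p in zip(avg_params, model_params):
        a.mul_(decay).add_(p, alpha=1 - decay)
```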
Compression
zstd
level: 22
Quantization
int8
bits: 8
scope: per-row
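Per-row int8 means one floating-point scale per weight-matrix row, with the row's values rounded into [-127, 127]; zstd level 22 is then applied to the serialized bytes. A sketch of symmetric per-row quantization (the clamp epsilon is an implementation detail assumed here):

```python
import torch

def quantize_per_row(w: torch.Tensor):
    # Symmetric per-row int8: each row gets its own fp32 scale so that
    # the row's largest-magnitude value maps to +/-127.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    scale = scale.clamp(min=1e-8)  # avoid divide-by-zero on all-zero rows
    q = torch.round(w / scale).to(torch.int8)
    return q, scale

def dequantize_per_row(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```

Per-row scales bound the rounding error by half a step of each row's own range, which is why outlier rows don't degrade the rest of the matrix.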
Regularization
weight decay
parameters: {"value":0.04}
Test-Time Training
Per-document LoRA TTT
parameters: {"rank":8,"epochs":3,"chunk":256,"min_doc_len":512,"learning_rate":0.01,"adapted_layers":"Q, V projections and LM head","fresh_lora_per_document":true}
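Per the parameters above: for each document of at least 512 tokens, fresh rank-8 LoRA adapters are attached to the Q and V projections and the LM head, trained for 3 epochs at lr=0.01 over 256-token chunks. A sketch of the adapter and the backward-looking loop, assuming standard LoRA (the module and loop structure are illustrative, not the PR's code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank delta (standard LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.A, std=0.01)  # B stays zero -> delta starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.t() @ self.B.t()

def adapt_on_document(lora_params, loss_fn, chunks, epochs=3, lr=0.01):
    # Fresh adapters per document (fresh_lora_per_document=true): the caller
    # re-creates LoRA modules for each doc, then runs a few epochs over the
    # document's earlier chunks. Backward-looking: a chunk is only scored
    # after adapting on the text that precedes it.
    opt = torch.optim.SGD(lora_params, lr=lr)
    for _ in range(epochs):
        for chunk in chunks:
            opt.zero_grad()
            loss_fn(chunk).backward()
            opt.step()
```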
Initialization
OrthoInit
LR Schedule
warmdown
parameters: {"final_steps":3000}
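With final_steps=3000, a warmdown schedule typically holds the learning rate constant and then decays it linearly to zero over the last 3000 steps. A sketch under that assumption (linear decay; the PR does not specify the decay shape):

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                final_steps: int = 3000) -> float:
    # Constant lr for most of training, then linear decay to zero
    # over the final `final_steps` steps.
    steps_left = total_steps - step
    if steps_left >= final_steps:
        return base_lr
    return base_lr * steps_left / final_steps
```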
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Use of LeakyReLU(0.5)² activation replacing ReLU to preserve negative gradient flow and prevent dead neurons in the squared activation
  • Per-document backward-looking test-time training (TTT) with rank-8 LoRA adapters on the Q and V projections and the LM head
  • SmearGate learned token blending via sigmoid gate
  • BigramHash embedding with 2048 buckets and 128 dimensions
  • Depth-scaled residuals scaled by 1/sqrt(layer+1)
  • Combination of Muon optimizer with Adam for embeddings and scalars
  • Use of SWA with decay 0.999
  • Artifact quantized with int8 per-row and compressed with zstd-22
  • Known issue: TTT scores only the final epoch; a one-line fix to score on every epoch is proposed