PR #620

open

Record: LeakyReLU(0.5)² + Per-Document LoRA TTT (mean val_bpb=0.9443, 3 seeds)

by robinojw
val_bpb
0.9443
Architecture
U-Net
Optimizer
Muon
Artifact Size
15,430,887 bytes (~14.7 MiB)

Training Techniques

Architecture
LeakyReLU(0.5)²
Single-line activation swap replacing torch.relu(x) with F.leaky_relu(x, 0.5); preserves negative gradient flow and prevents dead neurons in the squared activation
parameters: null
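The swap described above is a one-liner inside the squared activation. A minimal sketch of the before/after (the function names here are illustrative, not from the PR):

```python
import torch
import torch.nn.functional as F

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    # Baseline squared ReLU: gradient is exactly zero for x < 0,
    # so a unit pushed negative can stop learning ("die").
    return torch.relu(x) ** 2

def leaky_relu_squared(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    # The PR's swap: LeakyReLU(0.5)^2 keeps a nonzero gradient for x < 0,
    # preserving negative gradient flow through the square.
    return F.leaky_relu(x, slope) ** 2
```

Note the squared form stays non-negative either way; only the gradient path for negative pre-activations changes.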
SmearGate
Learned token blending via sigmoid gate
parameters: null
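The PR gives no implementation details for SmearGate beyond "learned token blending via sigmoid gate". A plausible minimal sketch, assuming each position blends in its previous token's representation through a learned per-token scalar gate (the gate parameterization is an assumption):

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Sketch: each position mixes in some of the previous token's embedding,
    weighted by a learned sigmoid gate. Causal: only looks backward."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, 1)  # hypothetical per-token scalar gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0] = 0.0                       # no previous token at position 0
        g = torch.sigmoid(self.gate_proj(x))   # (batch, seq, 1), in (0, 1)
        return x + g * prev                    # gated blend with the prior token
```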
BigramHash
Embedding with 2048 buckets and dimension 128
parameters: {"buckets":2048,"dim":128}
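With buckets=2048 and dim=128 given, BigramHash presumably hashes each (previous token, token) pair into a bucket and adds a learned embedding. A sketch under that assumption; the hash constant is arbitrary, not from the PR:

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Sketch: hash each token bigram into one of `buckets` slots and
    look up a learned `dim`-dimensional embedding for it."""

    def __init__(self, buckets: int = 2048, dim: int = 128):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.Embedding(buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no previous token at position 0
        # cheap multiplicative hash of the bigram (constant is an arbitrary prime)
        h = (prev * 1000003 + tokens) % self.buckets
        return self.emb(h)  # (batch, seq, dim)
```

Collisions are expected at 2048 buckets; the point is a cheap bigram-level signal added alongside the usual token embedding.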
depth-scaled residuals
Residual connections scaled by 1/sqrt(layer+1)
parameters: null
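The 1/sqrt(layer+1) residual scaling is simple enough to show directly; this sketch assumes the scale is applied to the sublayer branch before the add:

```python
import math
import torch

def depth_scaled_residual(x: torch.Tensor, sublayer_out: torch.Tensor,
                          layer_idx: int) -> torch.Tensor:
    # Residual branch scaled by 1/sqrt(layer+1): deeper layers contribute
    # progressively less, keeping activation variance roughly stable with depth.
    return x + sublayer_out / math.sqrt(layer_idx + 1)
```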
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"lr":0.02,"momentum_warmup":"0.92→0.99"}
Adam
weight_decay: null
momentum: null
other_params: {"lr_embeddings":0.03,"lr_scalars":0.02}
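The settings above imply three parameter groups: hidden weight matrices under Muon (lr=0.02, wd=0.04), embeddings under Adam (lr=0.03), and scalars under Adam (lr=0.02). A sketch of one way to partition parameters; the name-matching rules are assumptions, not the PR's actual code:

```python
import torch
import torch.nn as nn

def split_param_groups(model: nn.Module):
    """Partition parameters into (matrix, embedding, scalar) groups,
    matching the Muon/Adam split described above (rules are illustrative)."""
    matrix, embed, scalar = [], [], []
    for name, p in model.named_parameters():
        if "emb" in name or "lm_head" in name:
            embed.append(p)    # Adam, lr=0.03
        elif p.ndim >= 2:
            matrix.append(p)   # Muon, lr=0.02, weight_decay=0.04
        else:
            scalar.append(p)   # Adam, lr=0.02 (biases, gains, gates)
    return matrix, embed, scalar
```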
Weight Averaging
SWA
parameters: {"decay":0.999}
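An SWA entry parameterized by a decay coefficient behaves as an exponential moving average of the weights rather than a uniform average. A minimal sketch of the update, assuming a standard EMA form:

```python
import torch

@torch.no_grad()
def ema_update(avg_params, model_params, decay: float = 0.999):
    # Weight averaging with a decay: avg <- decay * avg + (1 - decay) * current.
    # The averaged copy is what gets evaluated / shipped.
    for a, p in zip(avg_params, model_params):
        a.mul_(decay).add_(p, alpha=1 - decay)
```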
Compression
zstd
level: 22
Quantization
int8
bits: 8
scope: per-row
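Per-row int8 means one floating-point scale per weight-matrix row, with the row's values rounded into [-127, 127]; zstd level 22 is then applied to the serialized bytes. A sketch of symmetric per-row quantization (the clamp epsilon is an implementation detail assumed here):

```python
import torch

def quantize_per_row(w: torch.Tensor):
    # Symmetric per-row int8: each row gets its own fp32 scale so that
    # the row's largest-magnitude value maps to +/-127.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    scale = scale.clamp(min=1e-8)  # avoid divide-by-zero on all-zero rows
    q = torch.round(w / scale).to(torch.int8)
    return q, scale

def dequantize_per_row(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```

Per-row scales bound the rounding error by half a step of each row's own range, which is why outlier rows don't degrade the rest of the matrix.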
Regularization
weight decay
parameters: {"value":0.04}
Test-Time Training
Per-document LoRA TTT
parameters: {"rank":8,"epochs":3,"chunk":256,"min_doc_len":512,"learning_rate":0.01,"adapted_layers":"Q, V projections and LM head","fresh_lora_per_document":true}
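Per the parameters above: for each document of at least 512 tokens, fresh rank-8 LoRA adapters are attached to the Q and V projections and the LM head, trained for 3 epochs at lr=0.01 over 256-token chunks. A sketch of the adapter and the backward-looking loop, assuming standard LoRA (the module and loop structure are illustrative, not the PR's code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank delta (standard LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.A, std=0.01)  # B stays zero -> delta starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.t() @ self.B.t()

def adapt_on_document(lora_params, loss_fn, chunks, epochs=3, lr=0.01):
    # Fresh adapters per document (fresh_lora_per_document=true): the caller
    # re-creates LoRA modules for each doc, then runs a few epochs over the
    # document's earlier chunks. Backward-looking: a chunk is only scored
    # after adapting on the text that precedes it.
    opt = torch.optim.SGD(lora_params, lr=lr)
    for _ in range(epochs):
        for chunk in chunks:
            opt.zero_grad()
            loss_fn(chunk).backward()
            opt.step()
```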
Initialization
OrthoInit
LR Schedule
warmdown
parameters: {"final_steps":3000}
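With final_steps=3000, a warmdown schedule typically holds the learning rate constant and then decays it linearly to zero over the last 3000 steps. A sketch under that assumption (linear decay; the PR does not specify the decay shape):

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                final_steps: int = 3000) -> float:
    # Constant lr for most of training, then linear decay to zero
    # over the final `final_steps` steps.
    steps_left = total_steps - step
    if steps_left >= final_steps:
        return base_lr
    return base_lr * steps_left / final_steps
```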
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Use of LeakyReLU(0.5)² activation replacing ReLU to preserve negative gradient flow and prevent dead neurons in the squared activation
  • Per-document backward-looking test-time training (TTT) with rank-8 LoRA adapters on the Q and V projections and the LM head
  • SmearGate learned token blending via sigmoid gate
  • BigramHash embedding with 2048 buckets and 128 dimensions
  • Depth-scaled residuals scaled by 1/sqrt(layer+1)
  • Combination of Muon optimizer with Adam for embeddings and scalars
  • Use of SWA with decay 0.999
  • Artifact quantized with int8 per-row and compressed with zstd-22
  • Known issue: TTT scores only the final epoch; a one-line fix to score on every epoch is proposed