val_bpb: 1.1184
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15,882,595 bytes
Training Techniques
Architecture
LeakyReLU
Uses a squared LeakyReLU (LeakyReLU²) activation in the model.
parameters: {"power":2,"slope":0.5}
BigramHash
Uses a hashed bigram embedding.
parameters: {"vocab_size":1536}
XSA
Uses XSA in the last 4 layers.
parameters: {"last_n_layers":4}
Partial RoPE
Applies RoPE to only a subset of dimensions (16), leaving the rest unrotated.
parameters: {"dimensions":16}
VE128
Uses value residual embeddings/paths with dimension 128.
parameters: {"dimension":128}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50}
Quantization
late QAT
bits: 6
scope: model
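A sketch of late quantization-aware training, assuming fake 6-bit symmetric per-tensor quantization of the weights with a straight-through gradient, switched on only for the final stretch of training; the per-tensor scaling choice is an assumption.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()  # forward uses w_q, backward sees identity
```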
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"warmdown_iters":3500,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
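A sketch of the momentum warmup implied by the hyperparameters above: Muon's momentum ramping from 0.92 to 0.99 over the first 1,500 steps, then holding. Linear interpolation is an assumption.

```python
def muon_momentum(step: int, start: float = 0.92, final: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Assumed linear momentum warmup for Muon, then constant."""
    if step >= warmup_steps:
        return final
    return start + (final - start) * step / warmup_steps
```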
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0025,"epochs":4,"chunk_tokens":32768,"momentum":0.9,"freeze_blocks":0,"batch_seqs":32,"grad_clip":1}
Evaluation
sliding window eval
parameters: {"stride":64}
LR Schedule
cosine decay
parameters: {"ttt":true}
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
LN scale
parameters: {"enabled":true}
Novel Contributions
- Improved Legal TTT submission based on the prior LeakyReLU LegalTTT Parallel Muon run
- Increased legal TTT learning rate from 0.002 to 0.0025
- Increased legal TTT epochs from 3 to 4
- Skipped diagnostic pre-TTT evaluations to keep evaluation under the time limit
- Added eval-only checkpoint loading for faster TTT sweeps
- Combined LeakyReLU² with Parallel Muon, EMA, SWA, and late QAT