PR #1737

open

[Submission] SP8192 FullStack PartialRoPE LeakyReLU - 2026-04-19

by sakthivarshans
val_bpb: 1.0723
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
Partial RoPE
Apply rotary embeddings only to the first portion of head dimensions, leaving the rest unrotated.
parameters: {"dimensions":16,"total_dimensions":64}
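A minimal sketch of the partial-RoPE idea for one 64-dim head, rotating only the first 16 dimensions and passing the rest through untouched. The pairwise (interleaved) rotation convention and the base of 10000 are assumptions; the submission only fixes the 16-of-64 split:

```python
import math

def partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rope_dims` entries of a
    per-head vector x; dimensions beyond rope_dims are left unrotated."""
    out = list(x)
    for i in range(0, rope_dims, 2):
        theta = pos * base ** (-i / rope_dims)  # frequency for this pair
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s      # 2-D rotation of the (i, i+1) pair
        out[i + 1] = a * s + b * c
    return out
```

Since the rotation is norm-preserving, only the relative angle between positions carries information; the unrotated tail lets those dimensions attend position-independently.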
LeakyReLU
Use LeakyReLU squared activation instead of ReLU squared.
parameters: {"negative_slope":0.5}
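Read literally, the activation squares the LeakyReLU output, which makes the negative branch positive as well; whether the submission preserves the sign on the negative side is not stated, so this literal reading is an assumption:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(x)^2: like the common ReLU-squared activation, but the
    negative side leaks with slope 0.5 before squaring
    (sign handling on the negative branch is an assumption)."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```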
depth recurrence
Repeat layers 3-5 three times total to add recurrent depth.
parameters: {"layers":[3,4,5],"repetitions":3}
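The recurrence can be described purely as a layer-execution schedule; this sketch assumes the repeated block runs as a contiguous loop (per-phase or interleaved schedules would differ):

```python
def layer_schedule(n_layers, recur_layers, repetitions):
    """Expand layer indices so `recur_layers` execute `repetitions` times
    total, e.g. layers 3-5 of an 8-layer stack run 3x each."""
    lo, hi = recur_layers[0], recur_layers[-1]
    return (list(range(lo))
            + list(recur_layers) * repetitions
            + list(range(hi + 1, n_layers)))
```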
weight tying
Tie input embeddings and output embeddings.
parameters: null
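Weight tying means the token-embedding matrix doubles as the output-projection matrix, saving one vocab-sized parameter block in the artifact; a toy sketch with plain lists:

```python
def embed(tokens, W):
    """Input embedding: look up one row of W per token id."""
    return [W[t] for t in tokens]

def logits_out(h, W):
    """Output head: project hidden state h against the SAME matrix W,
    producing one logit per vocabulary row."""
    return [sum(hi * wi for hi, wi in zip(h, row)) for row in W]
```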
U-Net skip connections
Use encoder-decoder style skip connections with sigmoid gating.
parameters: null
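One way to realize a sigmoid-gated skip: stash encoder-half activations, then blend each into the matching decoder-half layer through a learned gate. A scalar gate per connection is an assumption here (the submission's gate could be per-channel):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_skip(decoder_h, encoder_h, gate_param):
    """U-Net style skip: add the saved encoder activation into the decoder
    stream, scaled by a learned sigmoid gate in (0, 1)."""
    g = sigmoid(gate_param)
    return [d + g * e for d, e in zip(decoder_h, encoder_h)]
```

At initialization (gate_param = 0) the gate passes half the encoder signal, and training can open or close each connection smoothly.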
parallel residuals
Use GPT-J style parallel attention and MLP residual paths from layer 7 onward.
parameters: {"start_layer":7}
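The difference between the standard sequential block and the GPT-J parallel block, sketched with attention and MLP as opaque callables (normalization omitted for brevity):

```python
def sequential_block(x, attn, mlp):
    """Standard block: the MLP reads the attention-updated stream."""
    x = x + attn(x)
    return x + mlp(x)

def parallel_block(x, attn, mlp):
    """GPT-J style block (from start_layer=7 onward in this submission):
    attention and MLP both read the same input; outputs are summed."""
    return x + attn(x) + mlp(x)
```

The parallel form lets the two sublayers run concurrently and removes one sequential dependency per block, at the cost of the MLP no longer seeing the attention output.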
Quantization
mixed int6/int8
bits: null
scope: attention/MLP int6, embeddings int8
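A symmetric round-to-nearest scheme matching the stated bit widths (6-bit for attention/MLP weights, 8-bit for embeddings); per-tensor rather than per-channel scaling is an assumption:

```python
def quantize_symmetric(w, bits):
    """Symmetric quantization: map floats to signed ints in
    [-(2^(bits-1)), 2^(bits-1) - 1] via a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1          # 31 for int6, 127 for int8
    scale = max(abs(v) for v in w) / qmax
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in w]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```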
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"row_normalized":true,"newton_schulz_steps":5,"mlr":0.022}
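Muon orthogonalizes each 2-D gradient with a Newton-Schulz iteration before applying it. A pure-Python sketch using the classic cubic iteration (the submission reports 5 steps; Muon's tuned quintic coefficients and the row-normalization flag are not reproduced here):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G: normalize by the Frobenius norm, then
    iterate X <- 1.5*X - 0.5*(X Xᵀ)X, driving all singular values toward 1."""
    norm = sum(v * v for row in G for v in row) ** 0.5
    X = [[v / norm for v in row] for row in G]
    for _ in range(steps):
        Xt = [list(r) for r in zip(*X)]
        XXtX = matmul(matmul(X, Xt), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X
```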
Weight Averaging
EMA
parameters: {"decay":0.9965}
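EMA weight averaging keeps a shadow copy of the weights, updated after every optimizer step; the averaged weights, not the raw ones, are used at evaluation time:

```python
def ema_update(ema_w, w, decay=0.9965):
    """Shadow-weight update: ema <- decay * ema + (1 - decay) * w."""
    return [decay * e + (1.0 - decay) * v for e, v in zip(ema_w, w)]
```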
Compression
Brotli
level: 11
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3}
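"Score-first" plausibly means each evaluation chunk is scored with the current weights before any adaptation on that chunk, so the update only benefits later chunks; that reading, and the toy scalar objective, are assumptions. The stated hyperparameters (lr 0.005, momentum 0.9, 3 epochs) are used as defaults:

```python
def process_chunk(w, chunk, loss_fn, grad_fn, lr=0.005, momentum=0.9, epochs=3):
    """Score the chunk BEFORE adapting (no leakage into its own score),
    then run SGD with momentum on it for later chunks' benefit."""
    score = loss_fn(w, chunk)
    v = 0.0
    for _ in range(epochs):
        v = momentum * v - lr * grad_fn(w, chunk)
        w = w + v
    return score, w
```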
LR Schedule
cosine decay
parameters: {"across_chunks":true}
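With across_chunks=true the cosine schedule presumably spans the whole run rather than restarting per chunk; a standard cosine-decay sketch (decaying to a floor of 0 is an assumption):

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps."""
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```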
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
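Logit softcapping squashes logits smoothly into (-value, value) via tanh, here with the reported cap of 30; it is near-identity for small logits and saturates for large ones:

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap): cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)
```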
weight decay
parameters: {"value":0.095}
Sequence Length
sequence_length
train_length: 32000
eval_length: null

Novel Contributions

  • Partial RoPE on the first 25% (16 of 64) of head dimensions
  • Per-layer learnable LayerNorm scaling
  • LeakyReLU(0.5)^2 activation
  • HessianSD gradient clipping
  • Rules-legal score-first test-time training
  • Sigmoid-gated U-Net skip connections
  • Progressive recurrence phases