PR #1737

open

[Submission] SP8192 FullStack PartialRoPE LeakyReLU - 2026-04-19

by sakthivarshans
val_bpb: 1.0723
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
Partial RoPE
Apply rotary embeddings only to the first portion of head dimensions, leaving the rest unrotated.
parameters: {"dimensions":16,"total_dimensions":64}
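A minimal sketch of the partial-RoPE idea for one 64-dim head, rotating only the first 16 dimensions and passing the rest through untouched. The pairwise (interleaved) rotation convention and the base of 10000 are assumptions; the submission only fixes the 16-of-64 split:

```python
import math

def partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rope_dims` entries of a
    per-head vector x; dimensions beyond rope_dims are left unrotated."""
    out = list(x)
    for i in range(0, rope_dims, 2):
        theta = pos * base ** (-i / rope_dims)  # frequency for this pair
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s      # 2-D rotation of the (i, i+1) pair
        out[i + 1] = a * s + b * c
    return out
```

Since the rotation is norm-preserving, only the relative angle between positions carries information; the unrotated tail lets those dimensions attend position-independently.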
LeakyReLU
Use LeakyReLU squared activation instead of ReLU squared.
parameters: {"negative_slope":0.5}
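Read literally, the activation squares the LeakyReLU output, which makes the negative branch positive as well; whether the submission preserves the sign on the negative side is not stated, so this literal reading is an assumption:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(x)^2: like the common ReLU-squared activation, but the
    negative side leaks with slope 0.5 before squaring
    (sign handling on the negative branch is an assumption)."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```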
depth recurrence
Repeat layers 3-5 three times total to add recurrent depth.
parameters: {"layers":[3,4,5],"repetitions":3}
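The recurrence can be described purely as a layer-execution schedule; this sketch assumes the repeated block runs as a contiguous loop (per-phase or interleaved schedules would differ):

```python
def layer_schedule(n_layers, recur_layers, repetitions):
    """Expand layer indices so `recur_layers` execute `repetitions` times
    total, e.g. layers 3-5 of an 8-layer stack run 3x each."""
    lo, hi = recur_layers[0], recur_layers[-1]
    return (list(range(lo))
            + list(recur_layers) * repetitions
            + list(range(hi + 1, n_layers)))
```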
weight tying
Tie input embeddings and output embeddings.
parameters: null
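Weight tying means the token-embedding matrix doubles as the output-projection matrix, saving one vocab-sized parameter block in the artifact; a toy sketch with plain lists:

```python
def embed(tokens, W):
    """Input embedding: look up one row of W per token id."""
    return [W[t] for t in tokens]

def logits_out(h, W):
    """Output head: project hidden state h against the SAME matrix W,
    producing one logit per vocabulary row."""
    return [sum(hi * wi for hi, wi in zip(h, row)) for row in W]
```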
U-Net skip connections
Use encoder-decoder style skip connections with sigmoid gating.
parameters: null
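One way to realize a sigmoid-gated skip: stash encoder-half activations, then blend each into the matching decoder-half layer through a learned gate. A scalar gate per connection is an assumption here (the submission's gate could be per-channel):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_skip(decoder_h, encoder_h, gate_param):
    """U-Net style skip: add the saved encoder activation into the decoder
    stream, scaled by a learned sigmoid gate in (0, 1)."""
    g = sigmoid(gate_param)
    return [d + g * e for d, e in zip(decoder_h, encoder_h)]
```

At initialization (gate_param = 0) the gate passes half the encoder signal, and training can open or close each connection smoothly.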
parallel residuals
Use GPT-J style parallel attention and MLP residual paths from layer 7 onward.
parameters: {"start_layer":7}
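The difference between the standard sequential block and the GPT-J parallel block, sketched with attention and MLP as opaque callables (normalization omitted for brevity):

```python
def sequential_block(x, attn, mlp):
    """Standard block: the MLP reads the attention-updated stream."""
    x = x + attn(x)
    return x + mlp(x)

def parallel_block(x, attn, mlp):
    """GPT-J style block (from start_layer=7 onward in this submission):
    attention and MLP both read the same input; outputs are summed."""
    return x + attn(x) + mlp(x)
```

The parallel form lets the two sublayers run concurrently and removes one sequential dependency per block, at the cost of the MLP no longer seeing the attention output.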
Quantization
mixed int6/int8
bits: null
scope: attention/MLP int6, embeddings int8
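A symmetric round-to-nearest scheme matching the stated bit widths (6-bit for attention/MLP weights, 8-bit for embeddings); per-tensor rather than per-channel scaling is an assumption:

```python
def quantize_symmetric(w, bits):
    """Symmetric quantization: map floats to signed ints in
    [-(2^(bits-1)), 2^(bits-1) - 1] via a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1          # 31 for int6, 127 for int8
    scale = max(abs(v) for v in w) / qmax
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in w]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```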
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"row_normalized":true,"newton_schulz_steps":5,"mlr":0.022}
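Muon orthogonalizes each 2-D gradient with a Newton-Schulz iteration before applying it. A pure-Python sketch using the classic cubic iteration (the submission reports 5 steps; Muon's tuned quintic coefficients and the row-normalization flag are not reproduced here):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G: normalize by the Frobenius norm, then
    iterate X <- 1.5*X - 0.5*(X Xᵀ)X, driving all singular values toward 1."""
    norm = sum(v * v for row in G for v in row) ** 0.5
    X = [[v / norm for v in row] for row in G]
    for _ in range(steps):
        Xt = [list(r) for r in zip(*X)]
        XXtX = matmul(matmul(X, Xt), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X
```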
Weight Averaging
EMA
parameters: {"decay":0.9965}
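EMA weight averaging keeps a shadow copy of the weights, updated after every optimizer step; the averaged weights, not the raw ones, are used at evaluation time:

```python
def ema_update(ema_w, w, decay=0.9965):
    """Shadow-weight update: ema <- decay * ema + (1 - decay) * w."""
    return [decay * e + (1.0 - decay) * v for e, v in zip(ema_w, w)]
```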
Compression
Brotli
level: 11
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3}
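"Score-first" plausibly means each evaluation chunk is scored with the current weights before any adaptation on that chunk, so the update only benefits later chunks; that reading, and the toy scalar objective, are assumptions. The stated hyperparameters (lr 0.005, momentum 0.9, 3 epochs) are used as defaults:

```python
def process_chunk(w, chunk, loss_fn, grad_fn, lr=0.005, momentum=0.9, epochs=3):
    """Score the chunk BEFORE adapting (no leakage into its own score),
    then run SGD with momentum on it for later chunks' benefit."""
    score = loss_fn(w, chunk)
    v = 0.0
    for _ in range(epochs):
        v = momentum * v - lr * grad_fn(w, chunk)
        w = w + v
    return score, w
```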
LR Schedule
cosine decay
parameters: {"across_chunks":true}
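With across_chunks=true the cosine schedule presumably spans the whole run rather than restarting per chunk; a standard cosine-decay sketch (decaying to a floor of 0 is an assumption):

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps."""
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```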
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
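Logit softcapping squashes logits smoothly into (-value, value) via tanh, here with the reported cap of 30; it is near-identity for small logits and saturates for large ones:

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap): cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)
```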
weight decay
parameters: {"value":0.095}
Sequence Length
sequence_length
train_length: 32000
eval_length: null

Novel Contributions

  • Partial RoPE on the first 25% (16 of 64) of head dimensions
  • Per-layer learnable LayerNorm scaling
  • LeakyReLU(0.5)^2 activation
  • HessianSD gradient clipping
  • Rules-legal score-first test-time training
  • Sigmoid-gated U-Net skip connections
  • Progressive recurrence phases