PR #1737 (open)
[Submission] SP8192 FullStack PartialRoPE LeakyReLU - 2026-04-19
by sakthivarshans
val_bpb
1.0723
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB
Training Techniques
Architecture
Partial RoPE
Apply rotary embeddings only to the first portion of head dimensions, leaving the rest unrotated.
parameters: {"dimensions":16,"total_dimensions":64}
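A minimal NumPy sketch of partial RoPE, assuming the `{"dimensions":16,"total_dimensions":64}` parameters mean the first 16 of 64 head dimensions are rotated (function name, base frequency, and shape conventions are illustrative, not taken from the submission):

```python
import numpy as np

def partial_rope(x, n_rot=16, base=10000.0):
    """Rotate only the first n_rot head dims; pass the rest through unrotated.

    x: (seq_len, head_dim) -- a single attention head, for illustration.
    """
    seq, d = x.shape
    x_rot, x_pass = x[:, :n_rot], x[:, n_rot:]
    half = n_rot // 2
    freqs = 1.0 / base ** (np.arange(half) / half)        # per-pair frequencies
    angles = np.arange(seq)[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x_rot[:, :half], x_rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)
```

The unrotated tail keeps position-independent channels available to the head, while the rotated slice still encodes relative position.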
LeakyReLU
Use LeakyReLU squared activation instead of ReLU squared.
parameters: {"negative_slope":0.5}
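One plausible reading of "LeakyReLU squared" with `negative_slope=0.5` is squaring the LeakyReLU output elementwise, by analogy with the ReLU^2 it replaces; the card does not spell out the exact formulation:

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU(x)^2: leaky rectifier followed by an elementwise square."""
    y = np.where(x > 0, x, negative_slope * x)
    return y * y
```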
depth recurrence
Repeat layers 3-5 three times total to add recurrent depth.
parameters: {"layers":[3,4,5],"repetitions":3}
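A hypothetical helper showing what the `{"layers":[3,4,5],"repetitions":3}` schedule implies for the forward pass: layers 3-5 share weights and are executed three times total before the remaining layers run.

```python
def unroll_layers(n_layers=12, recur_start=3, recur_end=5, repetitions=3):
    """Expand the forward-pass layer order so layers recur_start..recur_end
    run `repetitions` times total (weights shared across repeats)."""
    order = []
    i = 0
    while i < n_layers:
        if i == recur_start:
            block = list(range(recur_start, recur_end + 1))
            order.extend(block * repetitions)
            i = recur_end + 1
        else:
            order.append(i)
            i += 1
    return order
```

Extra depth comes at zero parameter cost, since the repeated layers reuse the same weights.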
weight tying
Tie input embeddings and output embeddings.
parameters: null
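Weight tying in a nutshell, as a toy sketch (class and dimensions are illustrative): the output projection reuses the input embedding table, so the vocabulary matrix is stored once.

```python
import numpy as np

class TiedLM:
    def __init__(self, vocab=256, d=64, rng=None):
        rng = rng or np.random.default_rng(0)
        self.emb = rng.normal(0, 0.02, (vocab, d))  # the single shared table

    def embed(self, ids):
        return self.emb[ids]

    def logits(self, h):
        return h @ self.emb.T  # output head reuses the same matrix, transposed
```

For a competition scored on artifact size, this roughly halves the embedding footprint.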
U-Net skip connections
Use encoder-decoder style skip connections with sigmoid gating.
parameters: null
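A minimal sketch of a sigmoid-gated skip, assuming the gate is computed from the decoder-side activation (the gate parameterization is an assumption; the card gives no details):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_skip(decoder_h, encoder_h, gate_w):
    """Mix a stored encoder-side activation into a decoder-side layer
    through a learned per-channel sigmoid gate."""
    g = sigmoid(decoder_h @ gate_w)   # gate values in (0, 1)
    return decoder_h + g * encoder_h
```

The gate lets the model learn how much of each early-layer signal to reinject, rather than adding the skip unconditionally.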
parallel residuals
Use GPT-J-style parallel attention and MLP residual paths in later layers (from layer 7 onward).
parameters: {"start_layer":7}
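The difference between the two block orderings, with LayerNorms omitted for brevity (`attn` and `mlp` stand in for the sublayers):

```python
def parallel_block(x, attn, mlp):
    """GPT-J-style: attention and MLP both read the layer input; sums combine."""
    return x + attn(x) + mlp(x)

def sequential_block(x, attn, mlp):
    """Standard ordering for comparison: the MLP sees the post-attention state."""
    x = x + attn(x)
    return x + mlp(x)
```

Parallel blocks let attention and MLP run concurrently and are a common latency/quality trade in later layers.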
Quantization
mixed int6/int8
bits: null
scope: attention/MLP int6, embeddings int8
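A generic symmetric per-tensor quantizer illustrating the int6/int8 split (the submission's actual scheme, scaling granularity, and packing are not specified in the card):

```python
import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-wide integers."""
    qmax = 2 ** (bits - 1) - 1                  # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Round-trip error is bounded by half a quantization step, which is why the noisier int6 grid is reserved for attention/MLP weights while embeddings keep int8.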
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"row_normalized":true,"newton_schulz_steps":5,"mlr":0.022}
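Muon's core step orthogonalizes each gradient matrix with a Newton-Schulz iteration; `newton_schulz_steps: 5` matches the usual step count. A sketch using the quintic coefficients from the public Muon reference implementation (the `mlr` and `row_normalized` settings above are submission-specific and not modeled here):

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the Muon reference impl
    X = G / (np.linalg.norm(G) + eps)   # normalize so singular values are < 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

After a few steps the singular values are driven toward 1, so the update direction is (approximately) the nearest semi-orthogonal matrix to the gradient.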
Weight Averaging
EMA
parameters: {"decay":0.9965}
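A minimal EMA tracker at `decay=0.9965` (dict-of-arrays interface is illustrative): shadow weights follow the training weights with exponential smoothing, and the shadow copy is what gets evaluated.

```python
import numpy as np

class EMA:
    def __init__(self, params, decay=0.9965):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1 - d) * v
```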
Compression
Brotli
level: 11
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3}
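"Score-first" presumably means each evaluation chunk is scored with the current weights before the model adapts on it, so no chunk is ever scored by a model that has already trained on it. A toy sketch under that assumption, using plain SGD with momentum at the stated hyperparameters (the real loss and gradient functions are stand-ins):

```python
import numpy as np

def score_first_ttt(chunks, score_fn, grad_fn, params,
                    lr=0.005, momentum=0.9, epochs=3):
    """Score each chunk BEFORE adapting on it, then carry updates forward."""
    vel = {k: np.zeros_like(v) for k, v in params.items()}
    scores = []
    for chunk in chunks:
        scores.append(score_fn(params, chunk))  # evaluate first (the legal part)
        for _ in range(epochs):                 # then adapt on the scored chunk
            g = grad_fn(params, chunk)
            for k in params:
                vel[k] = momentum * vel[k] + g[k]
                params[k] = params[k] - lr * vel[k]
    return scores
```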
LR Schedule
cosine decay
parameters: {"across_chunks":true}
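A standard cosine-decay schedule; `across_chunks: true` suggests `total_steps` spans the full run rather than resetting per chunk (that interpretation is an assumption):

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr to min_lr over the whole run."""
    t = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```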
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
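Logit softcapping with `value: 30` follows the usual tanh form: logits are squashed smoothly into (-30, 30) instead of hard-clipped, which keeps gradients nonzero everywhere.

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * np.tanh(logits / cap)
```

Near zero the map is almost the identity, so typical logits are barely perturbed while outliers are tamed.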
weight decay
parameters: {"value":0.095}
Sequence Length
sequence_length
train_length: 32000
eval_length: null
Novel Contributions
- Partial RoPE on the first 25% (16 of 64) of head dimensions
- Per-layer learnable LayerNorm scaling
- LeakyReLU(0.5)^2 activation
- HessianSD gradient clipping
- Legal score-first test-time training
- Sigmoid-gated U-Net skip connections
- Progressive recurrence phases