PR #1520
openSP8192 + Gated Attention + NorMuon + Norm-PCT-Dropout + Legal TTT — val_bpb 1.0824
by taka6745
val_bpb: 1.0824
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 16,051,190 bytes
Training Techniques
Architecture
Gated Attention
Per-head learnable sigmoid gate on attention outputs to suppress noisy or redundant heads.
parameters: null
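A minimal sketch of the gating idea, assuming one learnable scalar logit per head (the PR may instead gate per channel):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_heads(head_outputs, gate_logits):
    """Scale each attention head's output by a learned sigmoid gate.

    head_outputs: one output vector (list of floats) per head.
    gate_logits:  one learnable scalar per head (hypothetical
                  granularity; the PR only says "per-head").
    """
    return [
        [sigmoid(g) * v for v in head]
        for head, g in zip(head_outputs, gate_logits)
    ]
```

A gate logit of 0 halves a head's contribution; a strongly negative logit effectively silences a noisy or redundant head.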
depth recurrence
Layers 3-5 are looped multiple times, expanding the 11 physical layers into 17 virtual layers.
parameters: {"virtual_layers":17,"physical_layers":11}
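One schedule consistent with the reported parameters, as a sketch (the exact loop block and loop count are assumptions inferred from 11 physical vs. 17 virtual layers):

```python
def layer_schedule(physical_layers=11, loop_start=3, loop_end=5, loops=3):
    """Map 17 virtual layers onto 11 physical ones: run layers 0-2 once,
    loop the block of layers 3-5 three times, then run layers 6-10 once.
    Running the looped block 3x instead of 1x adds 6 passes: 11 + 6 = 17.
    """
    schedule = list(range(loop_start))                         # 0..2 once
    schedule += list(range(loop_start, loop_end + 1)) * loops  # 3..5 looped
    schedule += list(range(loop_end + 1, physical_layers))     # 6..10 once
    return schedule
```

At forward time the model would apply `blocks[i]` for each `i` in this schedule, reusing the looped blocks' weights.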
weight tying
The input embedding and output unembedding matrices share weights.
parameters: null
LeakyReLU
MLP activation uses LeakyReLU squared.
parameters: {"negative_slope":0.5}
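A sketch of one plausible reading of "LeakyReLU squared" (apply LeakyReLU with the reported slope, then square, in the spirit of the ReLU² activation; whether the sign is preserved after squaring is not stated in the PR):

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """Squared LeakyReLU: y = LeakyReLU(x) with slope 0.5 on the
    negative side, then y*y (a ReLU^2-style variant)."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```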
Partial RoPE
Rotary position embeddings are applied to only a subset of each head's dimensions (16 of 64).
parameters: {"dimensions":16,"base_dimensions":64}
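A minimal sketch of partial RoPE, assuming the first 16 of the 64 head dimensions are rotated (which 16 dims, and the frequency base, are assumptions; the PR only reports the counts):

```python
import math

def partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Rotate the first `rope_dims` entries of a head vector in pairs
    by position-dependent angles; the remaining dims pass through."""
    out = list(x)
    for i in range(0, rope_dims, 2):
        theta = pos / (base ** (i / rope_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```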
Optimizer
NorMuon
weight_decay: null
momentum: null
other_params: {"post_ns_row_normalization":true}
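A sketch of the post-Newton-Schulz row normalization step, assuming each row of the orthogonalized update is rescaled to unit L2 norm (the actual PR may also restore a global scale factor):

```python
def row_normalize(update, eps=1e-8):
    """NorMuon-style step: after Newton-Schulz orthogonalization of the
    update matrix, normalize each row to (approximately) unit L2 norm."""
    out = []
    for row in update:
        norm = sum(v * v for v in row) ** 0.5
        out.append([v / (norm + eps) for v in row])
    return out
```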
Parallel Muon
weight_decay: null
momentum: null
other_params: {"batched_newton_schulz":true}
Regularization
dropout
parameters: {"type":"Norm-PCT-Dropout","top_l2_norm_row_fraction":0.01,"target":"FFN intermediate activations"}
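A sketch of the Norm-PCT-Dropout idea as described: zero the top 1% of rows of the FFN intermediate activations ranked by L2 norm (zeroing rather than rescaling, and the tie-breaking, are assumptions):

```python
def norm_pct_dropout(acts, top_fraction=0.01):
    """Drop (zero out) the `top_fraction` of activation rows with the
    largest L2 norms; all other rows pass through unchanged."""
    norms = [sum(v * v for v in row) ** 0.5 for row in acts]
    k = max(1, int(len(acts) * top_fraction))
    threshold = min(sorted(norms, reverse=True)[:k])
    return [
        [0.0] * len(row) if n >= threshold else list(row)
        for row, n in zip(acts, norms)
    ]
```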
logit softcap
parameters: {"value":30}
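Logit softcapping with the reported value of 30, using the standard tanh form (the tanh form itself is an assumption; the PR only reports the cap value):

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the open interval (-cap, cap):
    cap * tanh(logit / cap). Near zero this is approximately the
    identity; large logits saturate toward +/- cap."""
    return cap * math.tanh(logit / cap)
```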
dropout
parameters: {"type":"skip gates","description":"sigmoid-gated U-Net skip connections"}
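A sketch of a sigmoid-gated U-Net skip connection, assuming the skipped encoder activations are scaled by a learned scalar gate and added back (elementwise addition is an assumption; concatenation is another common U-Net choice):

```python
import math

def gated_skip(decoder_x, encoder_x, gate_logit):
    """Add the skipped activations into the decoder stream, scaled by
    a learned sigmoid gate; a very negative logit disables the skip."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [d + g * e for d, e in zip(decoder_x, encoder_x)]
```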
layerwise LN scale
parameters: null
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices (embeddings quantized separately as int8)
int8
bits: 8
scope: embeddings
Evaluation
sliding window eval
parameters: {"window":8192}
Test-Time Training
score-first TTT
parameters: {"chunk_size":32000,"epochs":3}
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
LR Schedule
warmdown
parameters: {"warmdown_fraction":0.72}
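A sketch of the warmdown schedule, assuming the learning rate is held constant and then decayed linearly to zero over the final 72% of training (linear decay to zero is an assumption; only the warmdown fraction is reported):

```python
def lr_at(step, total_steps, base_lr, warmdown_fraction=0.72):
    """Hold base_lr for the first (1 - 0.72) = 28% of steps, then decay
    linearly to zero over the remaining 72% of training."""
    warmdown_start = int(total_steps * (1.0 - warmdown_fraction))
    if step < warmdown_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmdown_start)
```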
Novel Contributions
- Gated Attention
- NorMuon (post-NS row normalization)
- Norm-PCT-Dropout
- Parallel Muon (batched Newton-Schulz)
- Legal score-first TTT on SP8192
- Improved quantization efficiency relative to the prior SOTA