PR #1520
openSP8192 + Gated Attention + NorMuon + Norm-PCT-Dropout + Legal TTT — val_bpb 1.0824
by taka6745
val_bpb: 1.0824
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 16,051,190 bytes
Training Techniques
Architecture
Gated Attention
Per-head learnable sigmoid gate on attention outputs to suppress noisy or redundant heads.
parameters: null
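A minimal sketch of the gating idea, assuming one learnable scalar logit per head (the PR may instead gate per channel):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_heads(head_outputs, gate_logits):
    """Scale each attention head's output by a learned sigmoid gate.

    head_outputs: one output vector (list of floats) per head.
    gate_logits:  one learnable scalar per head (hypothetical
                  granularity; the PR only says "per-head").
    """
    return [
        [sigmoid(g) * v for v in head]
        for head, g in zip(head_outputs, gate_logits)
    ]
```

A gate logit of 0 halves a head's contribution; a strongly negative logit effectively silences a noisy or redundant head.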
depth recurrence
Layers 3-5 are looped multiple times, expanding the 11 physical layers into 17 virtual layers.
parameters: {"virtual_layers":17,"physical_layers":11}
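One schedule consistent with the reported parameters, as a sketch (the exact loop block and loop count are assumptions inferred from 11 physical vs. 17 virtual layers):

```python
def layer_schedule(physical_layers=11, loop_start=3, loop_end=5, loops=3):
    """Map 17 virtual layers onto 11 physical ones: run layers 0-2 once,
    loop the block of layers 3-5 three times, then run layers 6-10 once.
    Running the looped block 3x instead of 1x adds 6 passes: 11 + 6 = 17.
    """
    schedule = list(range(loop_start))                         # 0..2 once
    schedule += list(range(loop_start, loop_end + 1)) * loops  # 3..5 looped
    schedule += list(range(loop_end + 1, physical_layers))     # 6..10 once
    return schedule
```

At forward time the model would apply `blocks[i]` for each `i` in this schedule, reusing the looped blocks' weights.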
weight tying
The input embedding and output unembedding matrices share weights.
parameters: null
LeakyReLU
MLP activation uses LeakyReLU squared.
parameters: {"negative_slope":0.5}
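A sketch of one plausible reading of "LeakyReLU squared" (apply LeakyReLU with the reported slope, then square, in the spirit of the ReLU² activation; whether the sign is preserved after squaring is not stated in the PR):

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """Squared LeakyReLU: y = LeakyReLU(x) with slope 0.5 on the
    negative side, then y*y (a ReLU^2-style variant)."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```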
Partial RoPE
Rotary position embeddings are applied to only a subset of each head's dimensions (16 of 64).
parameters: {"dimensions":16,"base_dimensions":64}
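A minimal sketch of partial RoPE, assuming the first 16 of the 64 head dimensions are rotated (which 16 dims, and the frequency base, are assumptions; the PR only reports the counts):

```python
import math

def partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Rotate the first `rope_dims` entries of a head vector in pairs
    by position-dependent angles; the remaining dims pass through."""
    out = list(x)
    for i in range(0, rope_dims, 2):
        theta = pos / (base ** (i / rope_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```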
Optimizer
NorMuon
weight_decay: null
momentum: null
other_params: {"post_ns_row_normalization":true}
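A sketch of the post-Newton-Schulz row normalization step, assuming each row of the orthogonalized update is rescaled to unit L2 norm (the actual PR may also restore a global scale factor):

```python
def row_normalize(update, eps=1e-8):
    """NorMuon-style step: after Newton-Schulz orthogonalization of the
    update matrix, normalize each row to (approximately) unit L2 norm."""
    out = []
    for row in update:
        norm = sum(v * v for v in row) ** 0.5
        out.append([v / (norm + eps) for v in row])
    return out
```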
Parallel Muon
weight_decay: null
momentum: null
other_params: {"batched_newton_schulz":true}
Regularization
dropout
parameters: {"type":"Norm-PCT-Dropout","top_l2_norm_row_fraction":0.01,"target":"FFN intermediate activations"}
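A sketch of the Norm-PCT-Dropout idea as described: zero the top 1% of rows of the FFN intermediate activations ranked by L2 norm (zeroing rather than rescaling, and the tie-breaking, are assumptions):

```python
def norm_pct_dropout(acts, top_fraction=0.01):
    """Drop (zero out) the `top_fraction` of activation rows with the
    largest L2 norms; all other rows pass through unchanged."""
    norms = [sum(v * v for v in row) ** 0.5 for row in acts]
    k = max(1, int(len(acts) * top_fraction))
    threshold = min(sorted(norms, reverse=True)[:k])
    return [
        [0.0] * len(row) if n >= threshold else list(row)
        for row, n in zip(acts, norms)
    ]
```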
logit softcap
parameters: {"value":30}
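Logit softcapping with the reported value of 30, using the standard tanh form (the tanh form itself is an assumption; the PR only reports the cap value):

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the open interval (-cap, cap):
    cap * tanh(logit / cap). Near zero this is approximately the
    identity; large logits saturate toward +/- cap."""
    return cap * math.tanh(logit / cap)
```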
dropout
parameters: {"type":"skip gates","description":"sigmoid-gated U-Net skip connections"}
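A sketch of a sigmoid-gated U-Net skip connection, assuming the skipped encoder activations are scaled by a learned scalar gate and added back (elementwise addition is an assumption; concatenation is another common U-Net choice):

```python
import math

def gated_skip(decoder_x, encoder_x, gate_logit):
    """Add the skipped activations into the decoder stream, scaled by
    a learned sigmoid gate; a very negative logit disables the skip."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [d + g * e for d, e in zip(decoder_x, encoder_x)]
```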
layerwise LN scale
parameters: null
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices (embeddings quantized separately as int8)
int8
bits: 8
scope: embeddings
Evaluation
sliding window eval
parameters: {"window":8192}
Test-Time Training
score-first TTT
parameters: {"chunk_size":32000,"epochs":3}
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
LR Schedule
warmdown
parameters: {"warmdown_fraction":0.72}
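A sketch of the warmdown schedule, assuming the learning rate is held constant and then decayed linearly to zero over the final 72% of training (linear decay to zero is an assumption; only the warmdown fraction is reported):

```python
def lr_at(step, total_steps, base_lr, warmdown_fraction=0.72):
    """Hold base_lr for the first (1 - 0.72) = 28% of steps, then decay
    linearly to zero over the remaining 72% of training."""
    warmdown_start = int(total_steps * (1.0 - warmdown_fraction))
    if step < warmdown_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmdown_start)
```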
Novel Contributions
- Gated Attention
- NorMuon (post-NS row normalization)
- Norm-PCT-Dropout
- Parallel Muon (batched Newton-Schulz)
- Legal score-first TTT on SP8192
- Improved quantization efficiency relative to the prior SOTA