PR #1720
Notable: SP8192 + 3-Layer Recurrence + Parallel Residuals - 5-Seed Quantization Reference and SDClip Ablations
by kiyoaki
val_bpb
1.0818
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,989,432 bytes
Training Techniques
Architecture
depth recurrence
3-layer recurrence loops layers 3-5, yielding 17 virtual layers from 11 physical layers.
parameters: {"layers":3,"virtual_layers":17,"physical_layers":11}
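The virtual-layer count follows from running the 3-layer block three times: 11 + (3 - 1) x 3 = 17. A minimal sketch of how such a schedule could be laid out (the 0-indexed loop position and the helper name are illustrative assumptions, not the PR's actual code):

```python
def virtual_layer_schedule(n_physical=11, loop_start=3, loop_len=3, n_loops=3):
    """Return the sequence of physical layer indices executed per forward pass.

    Hypothetical sketch: layers [loop_start, loop_start + loop_len) form the
    recurrent block and are applied n_loops times, so 11 physical layers
    expand to 11 + (n_loops - 1) * loop_len = 17 virtual layers.
    """
    loop_block = list(range(loop_start, loop_start + loop_len))
    return (
        list(range(0, loop_start))                         # layers before the loop
        + loop_block * n_loops                             # recurrent block, repeated
        + list(range(loop_start + loop_len, n_physical))   # layers after the loop
    )
```

The forward pass would then iterate over this schedule, reusing the same weights each time a physical index repeats.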
parallel residuals
GPT-J-style parallel residuals in which attention and MLP read the same pre-residual input, applied from layer 7 onward.
parameters: {"start_layer":7}
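The difference from a standard pre-norm block can be sketched in a few lines (normalization omitted for brevity; the function names are illustrative):

```python
def sequential_block(x, attn, mlp):
    # Standard block: the MLP sees the attention-updated residual stream.
    x = x + attn(x)
    x = x + mlp(x)
    return x

def parallel_block(x, attn, mlp):
    # GPT-J-style parallel residual: attention and MLP both read the same
    # pre-residual input, and their outputs are summed in one residual add.
    return x + attn(x) + mlp(x)
```

In the parallel form the two sub-blocks can be computed concurrently, at the cost of the MLP no longer conditioning on the attention output within the same layer; per this record, the switch happens only from layer 7 on.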
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
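A NumPy sketch of rotating only the first 16 of 64 head dimensions (the half-split pairing and the base frequency are common conventions, assumed here rather than taken from the PR):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` of the head dimension,
    leaving the remaining dims untouched (the "16/64" split in this record).

    x:   (seq, head_dim) array;  pos: (seq,) integer positions.
    """
    d = rot_dims // 2
    inv_freq = base ** (-np.arange(d) / d)        # (d,) per-pair frequencies
    angles = pos[:, None] * inv_freq[None, :]     # (seq, d)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :d], x[..., d:rot_dims]       # paired halves to rotate
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```

Dimensions beyond the rotated subset pass through unchanged, which is the point of the partial variant: positional signal on a few dims, content-only channels on the rest.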
LeakyReLU
Uses a squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
weight tying
Tied input and output embeddings.
parameters: null
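Weight tying reuses the input-embedding matrix as the output projection, which matters under a tight artifact-size cap since it halves the embedding parameter count. A minimal NumPy sketch (class and method names are illustrative):

```python
import numpy as np

class TiedLM:
    """Minimal weight-tying sketch: the output head reuses the input
    embedding matrix (transposed), so it is stored only once."""

    def __init__(self, vocab, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.normal(size=(vocab, dim))

    def encode(self, ids):
        return self.embed[ids]           # (n, dim) token embeddings

    def logits(self, h):
        return h @ self.embed.T          # tied head: same matrix, transposed
```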
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"matrix_lr":0.026,"warmdown_frac":0.75}
Weight Averaging
EMA
parameters: {"decay":0.9965}
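The EMA keeps a shadow copy of the weights updated each step with decay 0.9965; the shadow copy is what gets evaluated and shipped. A per-step update sketch:

```python
def ema_update(ema_params, params, decay=0.9965):
    """One EMA step over flat parameter lists: the shadow weights move a
    fraction (1 - decay) toward the current training weights."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```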
LR Schedule
warmdown
parameters: {"warmdown_frac":0.75}
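A sketch of the schedule, assuming warmdown_frac = 0.75 means the final 75% of training linearly decays the LR to zero after a constant phase (the exact interpretation is an assumption):

```python
def lr_at(step, total_steps, base_lr, warmdown_frac=0.75):
    """Constant LR, then linear decay to zero over the final
    warmdown_frac of training."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return base_lr
    remaining = (total_steps - step) / (total_steps - warmdown_start)
    return base_lr * remaining
```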
Quantization
GPTQ
bits: 6
scope: MLP and attention matrices
mixed int6/int8
bits: null
scope: int6 for attention and MLP, int8 for embeddings
GPTQ
bits: 8
scope: token embeddings
Evaluation
sliding window eval
parameters: {"no_ttt":true}
Compression
Brotli
level: 11
Novel Contributions
- 5-seed near-frontier reference for the SP8192 / 3-layer recurrence / parallel residual stack
- Empirical observation that post-quantization val_bpb is lower than pre-quantization across all 5 seeds
- Documented SDClip ablation sweep showing quality/size tradeoffs and a negative-result clipping variant
- Reference implementation under the 16 MB decimal artifact cap with all five seeds
- Canonical SP8192 UTF-8 byte-accounting validation with no TTT