PR #1437 (open)

Record: SP8192 + Parallel Residuals + 3-Layer Recurrence + Legal N-gram Tilt — val_bpb 1.07800 (3-seed mean)

by dexhunter
val_bpb: 1.0780
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,993,733 bytes

Training Techniques

Quantization
GPTQ
bits: 6
scope: all weights
int8
bits: 8
scope: embeddings
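The int8-on-embeddings entry above can be illustrated with a minimal symmetric per-row quantizer; the submission's exact GPTQ/int8 recipe is not shown in the PR, so this is only a sketch of the general technique.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-row int8 quantization (sketch; not the PR's exact recipe)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)   # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # reconstruct float weights; error per element is at most scale / 2
    return q.astype(np.float32) * scale
```

Per-row scales keep the rounding error proportional to each row's magnitude, which matters for embedding tables whose rows vary widely in norm.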
Architecture
parallel residuals
GPT-J-style parallel attention and MLP on layers 7-10 instead of sequential residual blocks.
parameters: {"start_layer":7,"end_layer":10}
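The difference between the two block layouts can be sketched as follows. Norm layers are omitted and full self-attention is replaced by a causal-mean stub (`attn_stub`) to keep the example short; both simplifications are assumptions, not the submission's code.

```python
import numpy as np

def mlp(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2

def attn_stub(x, wv):
    # causal mean over value projections: a tiny stand-in for full
    # self-attention, just to show where attention sits in each block
    v = x @ wv
    return np.cumsum(v, axis=0) / np.arange(1, len(x) + 1)[:, None]

def sequential_block(x, p):
    # standard residual block: attention first, then MLP on the updated stream
    x = x + attn_stub(x, p["wv"])
    return x + mlp(x, p["w1"], p["w2"])

def parallel_block(x, p):
    # GPT-J-style: attention and MLP both read the same input and their
    # outputs are summed into one shared residual stream
    return x + attn_stub(x, p["wv"]) + mlp(x, p["w1"], p["w2"])
```

The parallel form lets the attention and MLP matmuls run concurrently, at the cost of the MLP no longer seeing the attention output within the same layer.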
depth recurrence
Loops layers 3-5 twice to create additional virtual depth.
parameters: {"loop_start":3,"loop_end":5}
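The looping schedule implied by `{"loop_start":3,"loop_end":5}` can be sketched as a forward pass that replays a contiguous span of blocks with shared weights, buying extra "virtual" depth at no parameter cost:

```python
def run_with_recurrence(x, blocks, loop_start=3, loop_end=5, loops=2):
    """Apply `blocks` in order, running blocks[loop_start..loop_end]
    `loops` times with shared weights (sketch of the PR's parameters)."""
    i = 0
    while i < len(blocks):
        if i == loop_start:
            for _ in range(loops):
                for blk in blocks[loop_start:loop_end + 1]:
                    x = blk(x)
            i = loop_end + 1
        else:
            x = blocks[i](x)
            i += 1
    return x
```

With 8 blocks this visits layers in the order 0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7: eleven block applications from eight layers' worth of parameters.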
weight tying
Ties the input token embedding matrix to the output (unembedding) projection.
parameters: null
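Weight tying stores a single matrix and reuses it at both ends of the network, which is a meaningful saving here given the 16 MB artifact limit. A minimal sketch (vocabulary and width chosen to match the SP8192 name and are otherwise assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, d_model = 8192, 64
emb = rng.standard_normal((vocab, d_model)) * 0.02  # the single shared matrix

def embed(token_ids):
    return emb[token_ids]

def unembed(hidden):
    # weight tying: the output head reuses the embedding matrix transposed,
    # so no separate unembedding parameters are stored
    return hidden @ emb.T
```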
Partial RoPE
Uses partial rotary position embeddings.
parameters: {"dimensions":16,"total_dimensions":64}
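Partial RoPE with `{"dimensions":16,"total_dimensions":64}` rotates only the first 16 of 64 feature dimensions per position and passes the rest through unrotated. A sketch (the half/half pairing convention and base frequency are assumptions):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` dimensions of each position's feature
    vector; the remaining dimensions stay position-independent."""
    T = x.shape[0]
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(T)[:, None] * inv_freq            # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```

Because the transform is a pure rotation, it preserves the norm of the rotated slice, and position 0 (zero angle) is left exactly unchanged.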
LeakyReLU
Uses a squared LeakyReLU activation.
parameters: {"negative_slope":0.5}
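The PR only names the activation, so the exact composition is an assumption; one plausible reading squares the LeakyReLU output's magnitude while keeping its sign, so the negative branch stays negative:

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    # one plausible reading of "LeakyReLU squared": apply LeakyReLU, then
    # square the magnitude while preserving the sign (an assumption; a
    # plain elementwise square would discard the leaky negative branch)
    y = np.where(x >= 0, x, negative_slope * x)
    return y * np.abs(y)
```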
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3}
Evaluation
sliding window eval
parameters: null
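Sliding-window evaluation scores a long sequence with a fixed context window, advancing by a stride and counting loss only on tokens not yet scored, so most tokens keep long left context. A sketch, with `nll_fn(window_ids, n_targets)` standing in for the model (window and stride values are illustrative, not from the PR):

```python
def sliding_window_eval(nll_fn, ids, window=8, stride=4):
    """Average per-token NLL over `ids` using overlapping windows.
    `nll_fn` returns the summed NLL of the last `n_targets` tokens
    of the window it is given."""
    total, counted, prev_end = 0.0, 0, 0
    for begin in range(0, len(ids), stride):
        end = min(begin + window, len(ids))
        n_targets = end - prev_end       # only score tokens not yet counted
        total += nll_fn(ids[begin:end], n_targets)
        counted += n_targets
        prev_end = end
        if end == len(ids):
            break
    return total / counted
```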
Other
other
Eval-time causal n-gram tilt using a strict-prefix cache and one-token exponential boost over the full vocabulary.
parameters: {"enabled":true}
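The mechanism can be sketched as follows: an n-gram cache built strictly from already-seen text (hence "legal"/causal), consulted at each step to add a constant to the logits of tokens that previously followed the current prefix. Adding a constant to a logit multiplies that token's probability by e**boost, which matches the "one-token exponential boost" wording; the concrete `n`, boost value, and cache layout below are assumptions.

```python
from collections import defaultdict
import numpy as np

def update_cache(cache, context, n=3):
    # record the n-gram ending at the newest token, so the cache stays
    # strictly behind the position being predicted (no leakage)
    if len(context) >= n:
        cache[tuple(context[-n:-1])].add(context[-1])

def ngram_tilt_logits(logits, context, cache, n=3, boost=1.0):
    """Add `boost` to the logits of every token that previously followed
    the current (n-1)-token prefix (sketch of the eval-time tilt)."""
    prefix = tuple(context[-(n - 1):])
    tilted = logits.copy()
    for tok in cache.get(prefix, ()):
        tilted[tok] += boost
    return tilted
```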
Optimizer
Muon
weight_decay: 0.085
momentum: null
other_params: null
Regularization
weight decay
parameters: {"value":0.085}

Novel Contributions

  • Parallel residuals on layers 7-10
  • 3-layer depth recurrence extending the prior 2-layer recurrence
  • Legal causal n-gram tilt at evaluation time
  • Stacking of parallel residuals, recurrence, and n-gram tilt on top of the SP8192 baseline
  • Self-extracting LZMA mini wrapper to fit within the 16 MB artifact limit
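The last contribution, a self-extracting LZMA wrapper, can be sketched as a tiny stub that carries the compressed artifact inline and reconstitutes it when run. The submission's actual wrapper layout is not shown in the PR; this is a minimal illustration of the idea.

```python
import base64
import lzma

def write_self_extractor(payload: bytes, out_path: str):
    """Emit a small Python stub embedding the LZMA-compressed payload;
    executing the stub rebuilds the original bytes as `payload`."""
    blob = base64.b85encode(lzma.compress(payload, preset=9)).decode("ascii")
    stub = (
        "import base64, lzma\n"
        f"DATA = {blob!r}\n"
        "payload = lzma.decompress(base64.b85decode(DATA))\n"
    )
    with open(out_path, "w") as f:
        f.write(stub)
```

Base85 keeps the text overhead at 25% over the raw compressed size, which is what makes squeezing under a hard byte limit like 16 MB practical.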