PR #1437 (open)

Record: SP8192 + Parallel Residuals + 3-Layer Recurrence + Legal N-gram Tilt — val_bpb 1.07800 (3-seed mean)

by dexhunter
val_bpb: 1.0780
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,993,733 bytes

Training Techniques

Quantization
GPTQ
bits: 6
scope: all weights
int8
bits: 8
scope: embeddings
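The int8-on-embeddings entry above can be illustrated with a minimal symmetric per-row quantizer; the submission's exact GPTQ/int8 recipe is not shown in the PR, so this is only a sketch of the general technique.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-row int8 quantization (sketch; not the PR's exact recipe)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)   # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # reconstruct float weights; error per element is at most scale / 2
    return q.astype(np.float32) * scale
```

Per-row scales keep the rounding error proportional to each row's magnitude, which matters for embedding tables whose rows vary widely in norm.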
Architecture
parallel residuals
GPT-J-style parallel attention and MLP on layers 7-10 instead of sequential residual blocks.
parameters: {"start_layer":7,"end_layer":10}
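The difference between the two block layouts can be sketched as follows. Norm layers are omitted and full self-attention is replaced by a causal-mean stub (`attn_stub`) to keep the example short; both simplifications are assumptions, not the submission's code.

```python
import numpy as np

def mlp(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2

def attn_stub(x, wv):
    # causal mean over value projections: a tiny stand-in for full
    # self-attention, just to show where attention sits in each block
    v = x @ wv
    return np.cumsum(v, axis=0) / np.arange(1, len(x) + 1)[:, None]

def sequential_block(x, p):
    # standard residual block: attention first, then MLP on the updated stream
    x = x + attn_stub(x, p["wv"])
    return x + mlp(x, p["w1"], p["w2"])

def parallel_block(x, p):
    # GPT-J-style: attention and MLP both read the same input and their
    # outputs are summed into one shared residual stream
    return x + attn_stub(x, p["wv"]) + mlp(x, p["w1"], p["w2"])
```

The parallel form lets the attention and MLP matmuls run concurrently, at the cost of the MLP no longer seeing the attention output within the same layer.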
depth recurrence
Loops layers 3-5 twice to create additional virtual depth.
parameters: {"loop_start":3,"loop_end":5}
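The looping schedule implied by `{"loop_start":3,"loop_end":5}` can be sketched as a forward pass that replays a contiguous span of blocks with shared weights, buying extra "virtual" depth at no parameter cost:

```python
def run_with_recurrence(x, blocks, loop_start=3, loop_end=5, loops=2):
    """Apply `blocks` in order, running blocks[loop_start..loop_end]
    `loops` times with shared weights (sketch of the PR's parameters)."""
    i = 0
    while i < len(blocks):
        if i == loop_start:
            for _ in range(loops):
                for blk in blocks[loop_start:loop_end + 1]:
                    x = blk(x)
            i = loop_end + 1
        else:
            x = blocks[i](x)
            i += 1
    return x
```

With 8 blocks this visits layers in the order 0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7: eleven block applications from eight layers' worth of parameters.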
weight tying
Ties the input token embedding matrix to the output (unembedding) projection.
parameters: null
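Weight tying stores a single matrix and reuses it at both ends of the network, which is a meaningful saving here given the 16 MB artifact limit. A minimal sketch (vocabulary and width chosen to match the SP8192 name and are otherwise assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, d_model = 8192, 64
emb = rng.standard_normal((vocab, d_model)) * 0.02  # the single shared matrix

def embed(token_ids):
    return emb[token_ids]

def unembed(hidden):
    # weight tying: the output head reuses the embedding matrix transposed,
    # so no separate unembedding parameters are stored
    return hidden @ emb.T
```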
Partial RoPE
Uses partial rotary position embeddings.
parameters: {"dimensions":16,"total_dimensions":64}
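Partial RoPE with `{"dimensions":16,"total_dimensions":64}` rotates only the first 16 of 64 feature dimensions per position and passes the rest through unrotated. A sketch (the half/half pairing convention and base frequency are assumptions):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` dimensions of each position's feature
    vector; the remaining dimensions stay position-independent."""
    T = x.shape[0]
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(T)[:, None] * inv_freq            # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```

Because the transform is a pure rotation, it preserves the norm of the rotated slice, and position 0 (zero angle) is left exactly unchanged.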
LeakyReLU
Uses a squared LeakyReLU activation.
parameters: {"negative_slope":0.5}
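The PR only names the activation, so the exact composition is an assumption; one plausible reading squares the LeakyReLU output's magnitude while keeping its sign, so the negative branch stays negative:

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    # one plausible reading of "LeakyReLU squared": apply LeakyReLU, then
    # square the magnitude while preserving the sign (an assumption; a
    # plain elementwise square would discard the leaky negative branch)
    y = np.where(x >= 0, x, negative_slope * x)
    return y * np.abs(y)
```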
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3}
Evaluation
sliding window eval
parameters: null
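Sliding-window evaluation scores a long sequence with a fixed context window, advancing by a stride and counting loss only on tokens not yet scored, so most tokens keep long left context. A sketch, with `nll_fn(window_ids, n_targets)` standing in for the model (window and stride values are illustrative, not from the PR):

```python
def sliding_window_eval(nll_fn, ids, window=8, stride=4):
    """Average per-token NLL over `ids` using overlapping windows.
    `nll_fn` returns the summed NLL of the last `n_targets` tokens
    of the window it is given."""
    total, counted, prev_end = 0.0, 0, 0
    for begin in range(0, len(ids), stride):
        end = min(begin + window, len(ids))
        n_targets = end - prev_end       # only score tokens not yet counted
        total += nll_fn(ids[begin:end], n_targets)
        counted += n_targets
        prev_end = end
        if end == len(ids):
            break
    return total / counted
```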
Other
other
Eval-time causal n-gram tilt using a strict-prefix cache and one-token exponential boost over the full vocabulary.
parameters: {"enabled":true}
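The mechanism can be sketched as follows: an n-gram cache built strictly from already-seen text (hence "legal"/causal), consulted at each step to add a constant to the logits of tokens that previously followed the current prefix. Adding a constant to a logit multiplies that token's probability by e**boost, which matches the "one-token exponential boost" wording; the concrete `n`, boost value, and cache layout below are assumptions.

```python
from collections import defaultdict
import numpy as np

def update_cache(cache, context, n=3):
    # record the n-gram ending at the newest token, so the cache stays
    # strictly behind the position being predicted (no leakage)
    if len(context) >= n:
        cache[tuple(context[-n:-1])].add(context[-1])

def ngram_tilt_logits(logits, context, cache, n=3, boost=1.0):
    """Add `boost` to the logits of every token that previously followed
    the current (n-1)-token prefix (sketch of the eval-time tilt)."""
    prefix = tuple(context[-(n - 1):])
    tilted = logits.copy()
    for tok in cache.get(prefix, ()):
        tilted[tok] += boost
    return tilted
```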
Optimizer
Muon
weight_decay: 0.085
momentum: null
other_params: null
Regularization
weight decay
parameters: {"value":0.085}

Novel Contributions

  • Parallel residuals on layers 7-10
  • 3-layer depth recurrence extending the prior 2-layer recurrence
  • Legal causal n-gram tilt at evaluation time
  • Stacking of parallel residuals, recurrence, and n-gram tilt on top of the SP8192 baseline
  • Self-extracting LZMA mini wrapper to fit within the 16 MB artifact limit
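The last contribution, a self-extracting LZMA wrapper, can be sketched as a tiny stub that carries the compressed artifact inline and reconstitutes it when run. The submission's actual wrapper layout is not shown in the PR; this is a minimal illustration of the idea.

```python
import base64
import lzma

def write_self_extractor(payload: bytes, out_path: str):
    """Emit a small Python stub embedding the LZMA-compressed payload;
    executing the stub rebuilds the original bytes as `payload`."""
    blob = base64.b85encode(lzma.compress(payload, preset=9)).decode("ascii")
    stub = (
        "import base64, lzma\n"
        f"DATA = {blob!r}\n"
        "payload = lzma.decompress(base64.b85decode(DATA))\n"
    )
    with open(out_path, "w") as f:
        f.write(stub)
```

Base85 keeps the text overhead at 25% over the raw compressed size, which is what makes squeezing under a hard byte limit like 16 MB practical.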