PR #1776
openRecord: SP8192 ParResid 3LayerLoop QK5.25 LegalTTT — 1.08083 BPB
by anmarhindi
val_bpb
1.0808
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.97 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 8
scope: embeddings
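GPTQ itself chooses roundings using second-order (Hessian) information; as a simplified illustration of the mixed-precision idea only (round-to-nearest symmetric quantization, not the actual GPTQ solver), 6-bit matrices and 8-bit embeddings might be quantized like this:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Round-to-nearest symmetric quantization. Illustration only:
    GPTQ additionally uses Hessian information to pick roundings."""
    qmax = 2 ** (bits - 1) - 1           # 31 for 6-bit, 127 for 8-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Matrices at 6 bits, embeddings at 8 bits, mirroring the record's config.
w = np.random.randn(4, 4).astype(np.float32)
q6, s6 = quantize_symmetric(w, bits=6)
q8, s8 = quantize_symmetric(w, bits=8)
err6 = np.abs(dequantize(q6, s6) - w).max()
err8 = np.abs(dequantize(q8, s8) - w).max()
```

Rounding error is bounded by half a quantization step, so the 8-bit copy reconstructs more tightly than the 6-bit one.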
Architecture
depth recurrence
Loops layers 3-5 twice, with the recurrence activated partway through training (activate_frac 0.35).
parameters: {"layers":[3,4,5],"loops":2,"activate_frac":0.35}
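The depth-recurrence idea can be sketched as a forward pass that applies the looped block of layers more than once; this is a minimal illustration (toy layers, hypothetical names), not the PR's implementation:

```python
def forward(x, layers, loop_ids=(3, 4, 5), loops=2, recurrence_on=True):
    """Depth recurrence: looped layers are each applied `loops` times.
    Under activate_frac=0.35, recurrence_on would presumably flip to
    True only after 35% of training (an interpretation of the config)."""
    for i, layer in enumerate(layers):
        reps = loops if (recurrence_on and i in loop_ids) else 1
        for _ in range(reps):
            x = layer(x)
    return x

layers = [lambda x, k=k: x + k for k in range(8)]  # toy "layers"
out = forward(0, layers)  # layers 3-5 each run twice
```

Running layers 3-5 twice adds depth at inference time without adding parameters, which matters under the artifact-size cap.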
parallel residuals
Attention and MLP share the same pre-residual input in later layers.
parameters: {"start_layer":7}
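The difference from a standard sequential block is which input the MLP sees; a minimal sketch (toy sub-modules, not the PR's code):

```python
import numpy as np

def sequential_block(x, attn, mlp):
    # Standard ordering: the MLP sees the post-attention residual.
    x = x + attn(x)
    return x + mlp(x)

def parallel_block(x, attn, mlp):
    # Parallel residuals (from start_layer=7 onward in this record):
    # attention and MLP both read the same pre-residual input.
    return x + attn(x) + mlp(x)

attn = lambda x: 2 * x
mlp = lambda x: x + 1
x = np.array([1.0])
seq = sequential_block(x, attn, mlp)
par = parallel_block(x, attn, mlp)
```

Parallel residuals let the two sub-layer computations run concurrently and slightly change the function class, usually at negligible quality cost in later layers.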
GQA
Uses grouped-query attention via the FA3/SDPA backend with enable_gqa.
parameters: {"kv_heads":4,"heads":8}
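With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads. PyTorch's SDPA `enable_gqa` flag handles the grouping internally; a numpy sketch that materializes the repetition instead (illustrative shapes, not the PR's code):

```python
import numpy as np

def gqa(q, k, v, kv_heads):
    """Grouped-query attention by repeating each KV head across its
    query group. Shapes: q (H, T, d); k, v (H_kv, T, d)."""
    group = q.shape[0] // kv_heads
    k = np.repeat(k, group, axis=0)              # (H, T, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)             # softmax over keys
    return w @ v

q = np.random.randn(8, 4, 16)
k = np.random.randn(4, 4, 16)
v = np.random.randn(4, 4, 16)
out = gqa(q, k, v, kv_heads=4)
```

Halving the KV heads halves the KV-cache and the K/V parameter count, which again helps under the size cap.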
Partial RoPE
Applies rotary position embeddings to a subset of head dimensions.
parameters: {"head_dims":"16/64"}
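Per the `16/64` setting, only the first 16 of 64 head dimensions are rotated and the rest pass through untouched. A single-vector sketch (the 10000 base and dim layout are conventional assumptions, not stated in the record):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` of a head vector
    (16 of 64 here); remaining dims are passed through unrotated."""
    d = rot_dims // 2
    freqs = 1.0 / (base ** (np.arange(d) / d))
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:d], x[d:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rot, x[rot_dims:]])

x = np.random.randn(64)
y = partial_rope(x, pos=3)
```

Rotation is norm-preserving on the rotated slice, and position 0 is the identity, so the unrotated dims act as position-agnostic channels.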
LeakyReLU
Uses a squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
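One plausible reading of "LeakyReLU squared" with slope 0.5 is squaring the LeakyReLU output, in the spirit of the squared-ReLU activations common in speedrun stacks; the exact form here is an assumption:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU: one plausible reading of the record's
    'LeakyReLU squared' MLP activation (exact form assumed)."""
    leaky = np.where(x > 0, x, slope * x)
    return leaky ** 2

y = leaky_relu_sq(np.array([-2.0, 0.0, 3.0]))
```

Note that squaring makes the negative branch non-negative, so unlike plain LeakyReLU this variant is not sign-preserving.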
weight tying
Tied input and output embeddings.
parameters: null
Regularization
logit softcap
parameters: {"value":30}
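Logit softcapping with value 30 smoothly bounds the output logits to (-30, 30) while staying near-identity for small logits, in the style popularized by Gemma 2:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Soft-cap logits into (-cap, cap) via a scaled tanh."""
    return cap * np.tanh(logits / cap)

x = np.array([-1000.0, 0.0, 15.0, 1000.0])
y = softcap(x)
```

Large logits saturate toward ±30, while moderate ones (e.g. 15) are only slightly shrunk, which stabilizes the cross-entropy loss without distorting typical predictions much.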
layerwise LN scale
parameters: null
weight decay
parameters: {"muon_wd":0.095,"embed_wd":0.095}
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.005,"epochs_per_chunk":3,"freeze_first_blocks":9,"gradient_clip":1}
Weight Averaging
EMA
parameters: {"decay":0.9965}
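The EMA keeps a shadow copy of the weights updated with decay 0.9965; evaluation then uses the averaged copy rather than the live weights. A minimal sketch:

```python
def ema_update(avg, params, decay=0.9965):
    """One EMA step over a flat list of weights; the averaged copy
    (not the live weights) is what gets evaluated."""
    return [decay * a + (1 - decay) * p for a, p in zip(avg, params)]

# Toy run: the average drifts toward a constant target weight of 1.0.
avg = [0.0]
for _ in range(1000):
    avg = ema_update(avg, [1.0])
```

With decay d, the average forgets old weights on a timescale of roughly 1/(1-d) steps (about 286 here), smoothing out late-training noise.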
LR Schedule
cosine decay
parameters: {"warmdown_frac":0.72}
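One interpretation of warmdown_frac 0.72 is a constant learning rate for the first 28% of training followed by cosine decay to zero over the final 72%; the exact schedule shape is an assumption:

```python
import math

def lr_at(step, total_steps, base_lr=0.005, warmdown_frac=0.72):
    """Constant LR, then cosine decay over the final warmdown_frac of
    training (one reading of the record's schedule)."""
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return base_lr
    t = (step - start) / max(1, total_steps - start)
    return base_lr * 0.5 * (1 + math.cos(math.pi * t))

lrs = [lr_at(s, 1000) for s in range(1000)]
```

The schedule is non-increasing: flat at 0.005, then a smooth cosine ramp down to near zero at the final step.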
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs_per_chunk":3,"chunk_size_tokens":32000}
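"Score-first" means each chunk is scored with the current weights before the model trains on it, so no chunk's loss ever reflects having trained on that same chunk. A toy sketch with illustrative function names (not the PR's code):

```python
def score_first_ttt(chunks, score, train_step, epochs_per_chunk=3):
    """Score each chunk BEFORE adapting on it, keeping the reported
    loss causal with respect to test-time training."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))      # evaluate first...
        for _ in range(epochs_per_chunk):
            train_step(chunk)            # ...then adapt
    return losses

# Toy "model": a single scalar weight chased toward each chunk's value.
state = {"w": 0.0}
score = lambda c: abs(c - state["w"])
def train_step(c):
    state["w"] += 0.5 * (c - state["w"])

losses = score_first_ttt([1.0, 1.0, 1.0], score, train_step)
```

Later chunks benefit from adaptation on earlier ones, so losses fall chunk over chunk while the first chunk is scored entirely cold.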
Evaluation
sliding window eval
parameters: {"causal":true}
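In a causal sliding-window evaluation, each window scores only its trailing stride of tokens and treats the rest as context, so every token is scored exactly once with bounded context. The window and stride values below are illustrative; the record only specifies causal=True:

```python
def sliding_windows(n_tokens, window=1024, stride=512):
    """Return (context_start, score_start, score_end) spans so each
    token is scored once with up to `window` tokens of causal context."""
    spans = []
    start = 0
    while start < n_tokens:
        ctx_start = max(0, start - (window - stride))
        spans.append((ctx_start, start, min(start + stride, n_tokens)))
        start += stride
    return spans

spans = sliding_windows(2000)
```

The scored spans tile the sequence with no gaps or overlaps, which is what makes the resulting BPB comparable across submissions.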
Compression
Brotli
level: 11
Novel Contributions
- Independent re-port of the SP8192 + prior SOTA stack with a FA3/SDPA backend switch for broader hardware support
- 3-layer depth recurrence over layers 3-5
- Parallel residuals in later layers
- QK gain scaling at 5.25
- Legal score-first test-time training under the competition rules
- Mixed GPTQ quantization with int6 matrices and int8 embeddings fitting under 16 MB without pruning