PR #1338
closed
Record: SP2048 + 3-Layer Recurrence + SWA + BigramHash + Legal TTT — val_bpb 1.0955 (3-seed mean)
by bigbag
val_bpb: 1.0955
Architecture: Transformer
Optimizer: SGD
Artifact Size: ~15.49 MB
Training Techniques
Architecture
BigramHash
Adds hashed bigram embeddings as an n-gram side channel feeding into the output logits.
parameters: {"vocab":2048,"dim":128}
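A minimal sketch of how such a side channel could work: hash each trailing (prev, cur) token pair into a bucket, look up a learned embedding, and project it to a bias over the output logits. The bucket count, mixing constant, and initialization below are illustrative assumptions; only vocab=2048 and dim=128 come from the PR.

```python
import numpy as np

VOCAB = 2048      # logit dimension, matching the SP2048 vocabulary (from the PR)
DIM = 128         # bigram embedding width (from the PR parameters)
N_BUCKETS = 4096  # hash-table size: hypothetical, not stated in the PR

rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, size=(N_BUCKETS, DIM))  # hashed bigram embeddings
out_proj = rng.normal(0.0, 0.02, size=(DIM, VOCAB))          # maps side channel to logits

def bigram_hash(prev_tok: int, cur_tok: int) -> int:
    # Mix the (prev, cur) pair into a bucket; the mixing constant is illustrative.
    return (prev_tok * 1000003 + cur_tok) % N_BUCKETS

def bigram_logit_bias(tokens):
    # Bias added to the model's next-token logits at the last position,
    # looked up from the trailing bigram.
    emb = bigram_table[bigram_hash(tokens[-2], tokens[-1])]
    return emb @ out_proj  # shape (VOCAB,)

bias = bigram_logit_bias([17, 42, 99])
```

In training, `bigram_table` and `out_proj` would be learned jointly with the transformer, so frequent local n-gram statistics need not be memorized by the attention layers.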
Depth recurrence
3-layer depth recurrence applied to layers 3, 4, and 5.
parameters: {"layers":[3,4,5],"start_step":3000}
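One plausible reading, sketched below: layers 3–5 are treated as a group whose weights are reused across extra forward passes. The loop count, model depth, and the stand-in block are assumptions; the PR only fixes the layer indices and that the recurrence switches on at step 3000.

```python
import numpy as np

RECUR_BLOCK = [3, 4, 5]  # layers shared by the recurrence (from the PR parameters)
N_LOOPS = 2              # passes through the block: hypothetical count
N_LAYERS = 8             # illustrative model depth
D = 16

rng = np.random.default_rng(0)
W = [rng.normal(0.0, 0.1, size=(D, D)) for _ in range(N_LAYERS)]

def layer(x, w):
    # Stand-in for a full transformer block (residual + nonlinearity).
    return x + np.tanh(x @ w)

def forward(x):
    # Layers 0..2 run once; layers 3-5 are applied as a group N_LOOPS
    # times, reusing their weights (depth recurrence). The PR enables
    # this only after training step 3000 ("start_step": 3000).
    for i in range(RECUR_BLOCK[0]):
        x = layer(x, W[i])
    for _ in range(N_LOOPS):
        for i in RECUR_BLOCK:
            x = layer(x, W[i])
    for i in range(RECUR_BLOCK[-1] + 1, N_LAYERS):
        x = layer(x, W[i])
    return x

y = forward(rng.normal(size=(1, D)))
```

Weight reuse adds effective depth at inference without growing the parameter count, which matters for the artifact-size budget.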
Parallel residuals
Uses parallel residual connections starting from layer 7.
parameters: {"start_layer":7}
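A sketch of the parallel-residual layout versus the usual sequential one, assuming PaLM-style parallel blocks; the stand-in sublayers and depth are illustrative, and only start_layer=7 comes from the PR.

```python
import numpy as np

START_LAYER = 7  # parallel residuals begin here (from the PR parameters)
N_LAYERS = 10    # illustrative depth
D = 16

rng = np.random.default_rng(0)
Wa = [rng.normal(0.0, 0.1, size=(D, D)) for _ in range(N_LAYERS)]  # attention stand-ins
Wm = [rng.normal(0.0, 0.1, size=(D, D)) for _ in range(N_LAYERS)]  # MLP stand-ins

def block(x, wa, wm, parallel):
    if parallel:
        # Parallel residual: attention and MLP both read the same input,
        # and their outputs are summed into one residual update.
        return x + np.tanh(x @ wa) + np.tanh(x @ wm)
    # Sequential (GPT-style): the MLP reads the attention output.
    h = x + np.tanh(x @ wa)
    return h + np.tanh(h @ wm)

x = rng.normal(size=(1, D))
for i in range(N_LAYERS):
    x = block(x, Wa[i], Wm[i], parallel=(i >= START_LAYER))
```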
SP2048 vocabulary
Uses a 2048-token SentencePiece BPE vocabulary.
parameters: {"vocab_size":2048}
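To illustrate what a BPE vocabulary like SP2048's encodes, here is a toy merge loop in the spirit of SentencePiece's `bpe` model type: repeatedly fuse the most frequent adjacent symbol pair. The corpus, merge count, and `</w>` end-of-word marker are illustrative; the actual vocabulary is trained with SentencePiece at vocab_size=2048, not with this sketch.

```python
from collections import Counter

def bpe_merges(corpus, n_merges):
    # Toy byte-pair-encoding trainer: each word starts as characters plus
    # an end-of-word marker; each round merges the most frequent pair.
    words = [list(w) + ["</w>"] for w in corpus]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words

merges, words = bpe_merges(["low", "low", "lower", "newest", "newest"], 4)
```

A small 2048-token vocabulary trades longer token sequences for a much smaller embedding matrix, which helps the compressed artifact size.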
Weight Averaging
SWA
parameters: {"start_frac":0.75}
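A minimal sketch of SWA with the PR's start_frac: over the final 25% of steps, a running average of the weights is accumulated and used as the final model. The step count and stand-in update rule are illustrative.

```python
import numpy as np

START_FRAC = 0.75   # SWA begins after 75% of training (from the PR parameters)
TOTAL_STEPS = 200   # illustrative

rng = np.random.default_rng(0)
w = rng.normal(size=4)
swa_w = np.zeros_like(w)
n_avg = 0

for step in range(TOTAL_STEPS):
    w = w - 0.1 * w  # stand-in for one optimizer update
    if step >= START_FRAC * TOTAL_STEPS:
        # Running average of the weights over the SWA window.
        swa_w = (swa_w * n_avg + w) / (n_avg + 1)
        n_avg += 1
```

Averaging late-training iterates tends to land the final weights nearer the center of the loss basin than any single iterate.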
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3}
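A sketch of what "score-first" plausibly means here: each evaluation chunk is scored with the current weights, and only afterwards is the model adapted on that chunk, so no chunk ever benefits from having already been seen. The quadratic stand-in loss and chunk shapes are assumptions; lr=0.002 and epochs=3 come from the PR, and the update is plain SGD as the contributions list states.

```python
import numpy as np

LR = 0.002   # from the PR parameters
EPOCHS = 3   # from the PR parameters

rng = np.random.default_rng(0)
w = rng.normal(size=(4,))

def loss_and_grad(w, chunk):
    # Stand-in quadratic loss; a real run would use the LM's NLL on the chunk.
    diff = w - chunk
    return float(diff @ diff), 2 * diff

total = 0.0
for chunk in rng.normal(size=(5, 4)):
    loss, _ = loss_and_grad(w, chunk)
    total += loss                  # score first with the current weights...
    for _ in range(EPOCHS):        # ...then adapt on the chunk with SGD
        _, g = loss_and_grad(w, chunk)
        w = w - LR * g
```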
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","qk_gain":5,"matrix_lr":0.022}
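For context, a sketch of the standard Muon update for matrix parameters: a momentum-buffered gradient is approximately orthogonalized by a quintic Newton–Schulz iteration before being applied. The "MuonEq-R" variant and qk_gain=5 named above are submission-specific and not modeled here; the momentum coefficient is an assumption since the PR lists it as null.

```python
import numpy as np

MATRIX_LR = 0.022     # from the PR parameters
WEIGHT_DECAY = 0.095  # from the PR
BETA = 0.95           # momentum coefficient: assumed, not stated in the PR

def newton_schulz(G, steps=5):
    # Quintic Newton-Schulz iteration that approximately orthogonalizes G,
    # as in the standard Muon optimizer.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(W, grad, buf):
    buf = BETA * buf + grad                   # momentum accumulation
    update = newton_schulz(buf)
    W = W * (1.0 - MATRIX_LR * WEIGHT_DECAY)  # decoupled weight decay
    return W - MATRIX_LR * update, buf

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W2, buf = muon_step(W, rng.normal(size=(4, 4)), np.zeros((4, 4)))
```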
Quantization
GPTQ
bits: 6
scope: full model
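To show what 6-bit weight quantization buys, here is a deliberately simplified per-row round-to-nearest sketch. Real GPTQ goes further, using Hessian-based error compensation to minimize layer output error rather than per-weight error; the matrix shapes below are illustrative.

```python
import numpy as np

BITS = 6  # from the PR ("GPTQ", bits: 6)

def quantize_int6_rtn(W):
    # Per-row symmetric round-to-nearest int6: a simplified stand-in for
    # GPTQ, which additionally compensates quantization error using
    # second-order (Hessian) information.
    qmax = 2 ** (BITS - 1) - 1                      # 31 for 6 bits
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
q, scale = quantize_int6_rtn(W)
W_hat = q * scale                                   # dequantized weights
max_err = np.abs(W - W_hat).max()
```

Six-bit storage cuts weight bytes by roughly 5.3x versus float32 before the Brotli pass listed below.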
Compression
Brotli
level: null
Novel Contributions
- First SP2048 submission combining SWA, BigramHash, 3-layer depth recurrence, and legal TTT
- 3-layer depth recurrence over layers 3, 4, and 5
- BigramHash embeddings with a 2048-token vocabulary
- Stochastic Weight Averaging starting from fraction 0.75
- Legal score-first test-time training with SGD
- Full GPTQ int6 plus Brotli artifact compression