PR #1339 (open)

Record: SP2048 + 3-Layer Recurrence + SWA + BigramHash + Legal TTT — val_bpb 1.0955 (3-seed mean)

val_bpb: 1.0955
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.49 MB

Training Techniques

Architecture
BigramHash
Adds a bigram hash embedding side channel to the logits.
parameters: {"vocab":2048,"dim":128}
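A minimal sketch of the bigram hash side channel, with assumptions: the hash function, bucket count, and the collapse of the record's 128-d embedding into a scalar per-bucket logit offset are all hypothetical simplifications, not the submission's actual implementation.

```python
import hashlib

NUM_BUCKETS = 2048           # "vocab": 2048 from the record
TABLE = [0.0] * NUM_BUCKETS  # learned per-bucket logit offsets (zero-init here)

def bigram_bucket(prev_tok: int, cand_tok: int) -> int:
    # Deterministically hash the (previous, candidate) token pair into a bucket.
    key = f"{prev_tok}:{cand_tok}".encode()
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big") % NUM_BUCKETS

def add_bigram_channel(prev_tok: int, logits: list) -> list:
    # Add the hashed bigram offset to each candidate next-token logit.
    return [l + TABLE[bigram_bucket(prev_tok, v)] for v, l in enumerate(logits)]
```

In the record's configuration ("dim": 128), each bucket would instead hold a 128-d embedding projected into the logits rather than a scalar offset.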
depth recurrence
Uses 3-layer depth recurrence across layers 3, 4, and 5.
parameters: {"layers":[3,4,5],"start_step":3000}
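One hypothetical reading of "3-layer depth recurrence": blocks 3-5 are looped in the forward pass. The loop count and weight sharing are assumptions (the record states only the layer indices), and the sketch ignores the training schedule implied by "start_step": 3000 (recurrence switched on after that step).

```python
def forward(x, blocks, recur_span=(3, 6), n_loops=2):
    # Run the block stack; blocks recur_span[0]..recur_span[1]-1 (layers
    # 3, 4, 5 in the record) are looped n_loops times instead of once.
    i = 0
    while i < len(blocks):
        if i == recur_span[0]:
            for _ in range(n_loops):
                for j in range(recur_span[0], recur_span[1]):
                    x = blocks[j](x)
            i = recur_span[1]
        else:
            x = blocks[i](x)
            i += 1
    return x
```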
weight tying
Weight tying between the token embedding and the output head is implied by the canonical model family; the submission does not state it explicitly.
parameters: {"vocab_size":2048}
parallel residuals
Applies parallel residual connections starting from layer 7.
parameters: {"start_layer":7}
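A toy sketch of the parallel-residual block shape, shown next to the conventional sequential block for contrast. The norm/attention/MLP callables here are stand-ins; in the submission only layers from index 7 onward would use the parallel form.

```python
def parallel_block(x, attn, mlp, norm):
    # Parallel residual: attention and MLP both read the same normed input,
    # and their outputs are summed into the residual stream in one step.
    h = norm(x)
    return x + attn(h) + mlp(h)

def sequential_block(x, attn, mlp, norm):
    # Conventional block for comparison: the MLP sees the post-attention state.
    x = x + attn(norm(x))
    return x + mlp(norm(x))
```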
SP2048 vocabulary
Uses a 2048-token SentencePiece BPE vocabulary.
parameters: {"vocab_size":2048}
Weight Averaging
SWA
parameters: {"start_frac":0.75}
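A minimal sketch of what "start_frac": 0.75 means for SWA: average the checkpoints saved in the final quarter of training. Checkpoints are represented here as flat parameter lists; the cadence of checkpointing is an assumption.

```python
def swa_average(checkpoints, start_frac=0.75):
    # Average parameter vectors over checkpoints saved after start_frac
    # of training (the record's start_frac is 0.75).
    start = int(len(checkpoints) * start_frac)
    tail = checkpoints[start:]
    return [sum(ws) / len(tail) for ws in zip(*tail)]
```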
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3}
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"matrix_lr":0.022,"variant":"MuonEq-R","qk_gain":5}
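For context, the core of a Muon-style update is approximate orthogonalization of the (momentum-accumulated) gradient matrix via a Newton-Schulz iteration. This sketch uses the classic cubic iteration in pure Python; the actual optimizer uses a tuned polynomial, momentum, and the per-matrix learning rate (matrix_lr=0.022), and the "MuonEq-R" variant and qk_gain=5 are submission-specific details not reproduced here.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def newton_schulz(G, steps=5):
    # Approximately orthogonalize G via the cubic Newton-Schulz iteration
    # X <- 1.5 X - 0.5 (X X^T) X, after scaling G to unit Frobenius norm.
    norm = sum(g * g for row in G for g in row) ** 0.5
    X = [[g / norm for g in row] for row in G]
    for _ in range(steps):
        B = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * b for x, b in zip(rx, rb)] for rx, rb in zip(X, B)]
    return X
```

On a diagonal matrix the iteration drives each singular value toward 1, which is the intended effect: the update direction keeps the gradient's subspaces but equalizes their scales.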
Quantization
GPTQ
bits: 6
scope: full model
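For a sense of the 6-bit grid involved, here is plain round-to-nearest symmetric quantization. This is a deliberate simplification: GPTQ proper additionally compensates rounding error column-by-column using second-order information, but it targets the same signed 6-bit range.

```python
QMAX = 2**5 - 1   # signed 6-bit integer range is [-32, 31]

def quant_int6(w, scale):
    # Round-to-nearest symmetric 6-bit quantization of a single weight.
    # (GPTQ's error-compensation step is omitted; the grid is the same.)
    return max(-QMAX - 1, min(QMAX, round(w / scale)))

def dequant(q, scale):
    # Map the stored integer back to a float weight.
    return q * scale
```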
Compression
Brotli
level: null

Novel Contributions

  • First SP2048 submission combining SWA, BigramHash, 3-layer depth recurrence, and legal TTT
  • 3-layer depth recurrence over layers 3, 4, and 5
  • BigramHash embeddings with a 2048-token vocabulary
  • Stochastic Weight Averaging from fraction 0.75
  • Legal score-first test-time training
  • Full GPTQ int6 quantization with Brotli compression