PR #1977

open

SP8192 + PolarExpressNS + MIN_LR + LQER Asym Rank-4 | val_bpb=1.07302 (3-seed mean)

by sahiee-dev

val_bpb

1.0730

Architecture

Transformer

Optimizer

Muon

Artifact Size

15,953,488 bytes

Training Techniques

Architecture

SP8192

Uses the SP8192 tokenizer base.

parameters: null

SmearGate

Adds SmearGate and AttnOutGate with width 24.

parameters: {"width":24}

depth recurrence

Implements a 3-layer depth recurrence mechanism.

parameters: {"layers":3}
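
A minimal sketch of depth recurrence, assuming it means that a block of 3 layers is applied more than once with the same weights, so effective depth grows without adding parameters. The toy scalar "layers", the `prefix`/`suffix` split, and `num_loops` are hypothetical; the PR summary only states `layers: 3`.

```python
from typing import Callable, List

Layer = Callable[[float], float]

def make_layer(weight: float, calls: List[str], name: str) -> Layer:
    """Toy residual 'layer' that logs its name each time it runs."""
    def layer(x: float) -> float:
        calls.append(name)
        return x + weight * x
    return layer

def forward(x: float, prefix: List[Layer], recurrent: List[Layer],
            suffix: List[Layer], num_loops: int) -> float:
    for layer in prefix:
        x = layer(x)
    # The recurrent block is re-applied with the same weights on every loop.
    for _ in range(num_loops):
        for layer in recurrent:
            x = layer(x)
    for layer in suffix:
        x = layer(x)
    return x

calls: List[str] = []
prefix = [make_layer(0.1, calls, "p0")]
recurrent = [make_layer(0.1, calls, f"r{i}") for i in range(3)]  # 3 shared layers
suffix = [make_layer(0.1, calls, "s0")]
y = forward(1.0, prefix, recurrent, suffix, num_loops=2)
```

Here 5 distinct layers produce 8 layer applications, which is the trade the technique makes: more compute per token for the same artifact size.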

weight tying

Uses tied input and output embeddings, if implied by the base stack.

parameters: null

Gated Attention

Includes AttnOutGate as part of the attention/output gating stack.

parameters: {"width":24}

PolarExpressNS

Uses Polar Express Newton-Schulz coefficients.

parameters: null
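
For context, Muon orthogonalizes each update matrix with a few Newton-Schulz iterations. The sketch below uses the classic cubic coefficients (1.5, -0.5) as a stand-in; Polar Express replaces these with a tuned per-iteration coefficient schedule, which is not reproduced here because the PR summary does not list the values. The function name and the step count are assumptions.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=12, coeffs=(1.5, -0.5)):
    """Approximate the orthogonal polar factor of g by a matrix polynomial iteration."""
    a, b = coeffs
    # Dividing by the Frobenius norm keeps every singular value at or below 1,
    # inside the convergence region of the cubic iteration.
    x = g / (np.linalg.norm(g) + 1e-12)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # work on the wide orientation so x @ x.T is the small Gram matrix
    for _ in range(steps):
        x = a * x + b * (x @ x.T) @ x
    return x.T if transposed else x

g = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0]])
q = newton_schulz_orthogonalize(g)
q_tall = newton_schulz_orthogonalize(g.T)
```

With fixed cubic coefficients, small singular values grow by only about 1.5x per step, which is why tuned coefficient schedules aim to reach the same result in fewer iterations.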

LQER

Uses asymmetric rank-4 LQER with top-K=3.

parameters: {"rank":4,"top_k":3,"asymmetric":true}
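
The core idea behind LQER (low-rank quantization error reconstruction) can be sketched as: quantize a weight matrix, then store a rank-r approximation of the quantization error alongside it. The example below uses rank 4 to match the entry above. The simple symmetric `fake_quantize` helper is a hypothetical stand-in, and the asymmetric variant and the top-K=3 selection are not modeled, since the summary does not define them.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Hypothetical per-row symmetric quantizer; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero on all-zero rows
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def lqer_factors(w, w_q, rank=4):
    """Best rank-`rank` approximation (truncated SVD) of the quantization error."""
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # shape (rows, rank)
    b = vt[:rank]                # shape (rank, cols)
    return a, b

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 48))
w_q = fake_quantize(w)
a, b = lqer_factors(w, w_q, rank=4)

err_plain = np.linalg.norm(w - w_q)
err_lqer = np.linalg.norm(w - (w_q + a @ b))  # error after adding the low-rank correction
```

The correction costs `rank * (rows + cols)` extra values per matrix, so a small rank keeps the artifact-size overhead modest.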

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"symmetric_row_col_normalization":true}

LR Schedule

warmdown

parameters: {"min_lr":0.1}
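
A minimal sketch of a warmdown schedule with a floor, assuming `min_lr` is a fraction of the peak learning rate and that the warmdown is linear. The function name, the `total_steps`/`warmdown_steps` arguments, and the linear shape are assumptions; the summary only gives `min_lr: 0.1`.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_steps: int,
                  min_lr: float = 0.1) -> float:
    """Multiplier applied to the peak learning rate at a given step."""
    if warmdown_steps <= 0:
        return 1.0
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0  # constant phase before the warmdown begins
    progress = min(1.0, (step - warmdown_start) / warmdown_steps)
    # Interpolate from 1.0 down to min_lr instead of down to 0.
    return 1.0 - (1.0 - min_lr) * progress

schedule = [lr_multiplier(s, total_steps=1000, warmdown_steps=400)
            for s in (0, 600, 800, 1000)]
```

With these numbers the multiplier stays at 1.0 until step 600 and ends at the 0.1 floor rather than at zero, so the final steps still make non-trivial updates.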

Test-Time Training

score-first TTT

parameters: null
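
Score-first test-time training can be sketched as a loop in which each chunk of evaluation data is scored before the model adapts on it, so the reported loss never comes from weights that have already seen those tokens. The `ToyBitModel` below is a hypothetical stand-in for a language model (a single adaptive Bernoulli parameter), used only to show the ordering.

```python
import math

class ToyBitModel:
    """Hypothetical stand-in for an LM: predicts bits from smoothed counts."""
    def __init__(self):
        self.ones = 1.0
        self.total = 2.0

    def nll_bits(self, chunk):
        p = self.ones / self.total
        return sum(-math.log2(p if bit else 1.0 - p) for bit in chunk)

    def adapt(self, chunk):
        self.ones += sum(chunk)
        self.total += len(chunk)

def score_first_ttt(model, chunks):
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        total_bits += model.nll_bits(chunk)  # 1) score with weights that have not seen the chunk
        total_tokens += len(chunk)
        model.adapt(chunk)                   # 2) only then adapt on it
    return total_bits / total_tokens

chunks = [[1, 1, 1, 0], [1, 1, 1, 1], [1, 0, 1, 1]]
bits_per_token = score_first_ttt(ToyBitModel(), chunks)
```

The first chunk is always scored by the unadapted model, and later chunks benefit only from data that was already scored.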

LoRA TTT

parameters: null
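
As a rough illustration of LoRA-style test-time training, the sketch below freezes a base weight `w` and updates only a low-rank adapter `b @ a` with one gradient step. Everything here is an assumption for illustration: a single linear layer, squared-error loss, plain SGD, and a step size chosen to make the toy deterministic. The summary gives no LoRA parameters.

```python
import numpy as np

def lora_forward(w, a, b, x):
    return (w + b @ a) @ x  # base weight plus low-rank update

def loss(w, a, b, x, target):
    r = lora_forward(w, a, b, x) - target
    return 0.5 * float(r @ r)

def lora_sgd_step(w, a, b, x, target, lr):
    """One SGD step on the adapter only; w is never modified."""
    residual = lora_forward(w, a, b, x) - target
    grad_b = np.outer(residual, a @ x)      # dL/db
    grad_a = np.outer(b.T @ residual, x)    # dL/da (zero while b is zero)
    return a - lr * grad_a, b - lr * grad_b

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 8, 2
w = rng.standard_normal((d_out, d_in))        # frozen base weight
a = rng.standard_normal((rank, d_in)) * 0.1   # down-projection
b = np.zeros((d_out, rank))                   # up-projection, zero init
x = rng.standard_normal(d_in)
target = rng.standard_normal(d_out)

w_before = w.copy()
loss_before = loss(w, a, b, x, target)
lr = 0.5 / float((a @ x) @ (a @ x))           # toy step size for this example
a, b = lora_sgd_step(w, a, b, x, target, lr)
loss_after = loss(w, a, b, x, target)
```

Because `b` starts at zero, the adapted model is identical to the base model before the first step, and only `rank * (d_in + d_out)` values change during adaptation.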

Novel Contributions

SP8192 tokenizer base
SmearGate + AttnOutGate width 24
LoRA TTT improvements
Phased TTT
Polar Express Newton-Schulz coefficients
MIN_LR=0.10 warmdown floor
LQER asymmetric rank-4 with top-K=3
3-layer depth recurrence
Muon optimizer with symmetric row/column normalization