val_bpb: 1.0786
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,979,228 bytes

Training Techniques
Quantization: mixed int6/int8 via GPTQ (scope: all; bits: null)
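
A minimal sketch of the mixed-width quantization step is below. It is not GPTQ itself (GPTQ compensates quantization error with second-order statistics); it only shows plain round-to-nearest quantization with a per-tensor choice of int6 or int8. The width rule and all helper names are assumptions for illustration.

```python
# Minimal sketch of mixed int6/int8 round-to-nearest quantization.
# NOT GPTQ: no Hessian-based error compensation, only the mixed-width packing.
# The per-tensor width rule (int8 for embeddings/output, int6 elsewhere) is an
# assumption, not taken from the submission.
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric per-channel round-to-nearest quantization of a 2D weight."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6, 127 for int8
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

def quantize_state_dict(state_dict, int8_keys=("embed", "lm_head")):
    """Quantize every 2D weight; named tensors stay int8, the rest go to int6."""
    out = {}
    for name, w in state_dict.items():
        if w.ndim != 2:
            out[name] = w                            # leave norms/biases in full precision
            continue
        bits = 8 if any(k in name for k in int8_keys) else 6
        q, scale = quantize_symmetric(w, bits)
        out[name] = {"q": q, "scale": scale, "bits": bits}
    return out
```
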
Architecture: depth recurrence (layers: 3)
Uses 3-layer depth recurrence in the model stack.
Architecture: GQA (query_heads: 8, kv_heads: 4)
Uses 8 query heads and 4 KV heads.
Weight Averaging: EMA (decay: 0.9965)
Optimizer: Muon (weight_decay: 0.095, momentum: 0.9, mlp_weight_decay: 0.115)
Compression: Brotli (level: null)
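
A minimal sketch of Brotli-compressing the serialized artifact. The reported level is null, so `quality=11` (Brotli's maximum) is an assumption, and the pickle serialization is illustrative rather than the submission's actual container format.

```python
# Minimal sketch of Brotli compression of the quantized artifact.
import pickle
import brotli   # pip install brotli

def compress_artifact(quantized_state: dict, path: str, quality: int = 11) -> int:
    raw = pickle.dumps(quantized_state, protocol=pickle.HIGHEST_PROTOCOL)
    blob = brotli.compress(raw, quality=quality)      # quality=11 is an assumption
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)            # compare against the 15,979,228-byte artifact size

def load_artifact(path: str) -> dict:
    with open(path, "rb") as f:
        return pickle.loads(brotli.decompress(f.read()))
```
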
Test-Time Training: score-first TTT (enabled: 1, learning_rate: 0.005, epochs: 4, momentum: 0.9, chunk_tokens: 32768)
LR Schedule: warmdown (warmdown_frac: 0.72)
Regularization: weight decay (muon_wd: 0.095, muon_wd_mlp: 0.115)
Other
Uses a legal score-first TTT setup with a wallclock reserve and a rerun plan for valid logs (max_wallclock_seconds: 590, gptq_reserve_seconds: 0.5).
Novel Contributions
- SP8192 cached data and tokenizer
- 11-layer 512d Transformer with 8 query heads and 4 KV heads
- 3-layer depth recurrence
- parallel residual lanes
- QK gain 5.25
- Muon-style optimizer with split weight decay
- EMA weight averaging (decay 0.9965)
- GPTQ int6/int8 compression plus Brotli
- legal score-first TTT
- wallclock-reserved rerun plan