PR #1889 (open)

Add draft Arman SP8192 credit-blocked record

by thearmankarapetyan
val_bpb: 1.0786
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,979,228 bytes

Training Techniques

Quantization: mixed int6/int8
  bits: null
  scope: all
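
As a rough illustration of what a mixed int6/int8 scheme does (the card lists GPTQ compression but none of its internals), here is a minimal symmetric fake-quantization sketch; which layers get 6 vs. 8 bits is an assumption, not taken from the PR:

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Round weights onto a signed `bits`-bit grid, then dequantize to float."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6, 127 for int8
    scale = w.abs().max().clamp(min=1e-8) / qmax    # symmetric per-tensor scale
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

w = torch.randn(512, 512)
w6 = fake_quantize(w, bits=6)   # more aggressive; smaller packed artifact
w8 = fake_quantize(w, bits=8)   # gentler; e.g. for quantization-sensitive layers
```
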
Architecture: depth recurrence
  Uses 3-layer depth recurrence in the model stack.
  parameters: {"layers": 3}
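
A minimal sketch of depth recurrence; whether {"layers": 3} means a 3-layer segment that is iterated or three passes over a shared segment is not spelled out on the card, so the sketch assumes a weight-tied 3-layer segment run multiple times, with stand-in linear blocks:

```python
import torch
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    """Run the same (weight-tied) segment of blocks several times in depth."""
    def __init__(self, d: int = 512, segment_layers: int = 3, passes: int = 3):
        super().__init__()
        # stand-in blocks; the real model uses full Transformer blocks
        self.segment = nn.ModuleList(nn.Linear(d, d) for _ in range(segment_layers))
        self.passes = passes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.passes):           # reuse the same weights each pass
            for block in self.segment:
                x = x + torch.relu(block(x))   # residual keeps the loop stable
        return x

out = DepthRecurrentStack()(torch.randn(2, 16, 512))
```
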
GQA
  Uses 8 query heads and 4 KV heads.
  parameters: {"query_heads": 8, "kv_heads": 4}
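
A sketch of the 8-query/4-KV head split, using the standard trick of repeating each KV head across its query group; tensor sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def gqa(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Grouped-query attention: 8 query heads share 4 KV heads (2 queries per KV)."""
    groups = q.shape[1] // k.shape[1]        # 8 // 4 = 2
    k = k.repeat_interleave(groups, dim=1)   # broadcast KV heads to match queries
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

B, T, d = 2, 16, 64
out = gqa(torch.randn(B, 8, T, d), torch.randn(B, 4, T, d), torch.randn(B, 4, T, d))
```
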
Weight Averaging: EMA
  parameters: {"decay": 0.9965}
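
The EMA update with the record's decay, as a minimal sketch; how often it runs (every step vs. every few steps) is not stated on the card:

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                  # stand-in for the trained model
ema_model = copy.deepcopy(model).eval()      # frozen shadow copy

@torch.no_grad()
def ema_update(ema: nn.Module, live: nn.Module, decay: float = 0.9965):
    """ema <- decay * ema + (1 - decay) * live, parameter by parameter."""
    for p_ema, p in zip(ema.parameters(), live.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

ema_update(ema_model, model)                 # e.g. after each optimizer step
```
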
Optimizer: Muon
  weight_decay: 0.095
  momentum: 0.9
  other_params: {"mlp_weight_decay": 0.115}
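
The split weight decay amounts to two optimizer parameter groups. In the sketch below, torch.optim.SGD stands in for Muon purely so the snippet runs; the grouping, not the update rule, is the point, and the learning rate is illustrative since the card does not list one:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d: int = 512):
        super().__init__()
        self.attn = nn.Linear(d, d)   # stand-in for the attention matrices
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

model = Block()
mlp_params = [p for n, p in model.named_parameters() if n.startswith("mlp")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("mlp")]

opt = torch.optim.SGD(
    [
        {"params": other_params, "weight_decay": 0.095},   # base weight decay
        {"params": mlp_params, "weight_decay": 0.115},     # heavier decay on MLP
    ],
    lr=0.02,
    momentum=0.9,
)
```
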
Compression: Brotli
  level: null
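
Since level is null, the sketch below assumes the Python brotli package's default quality (11, the maximum); the payload is a stand-in for the serialized artifact:

```python
import brotli

payload = b"\x00weights" * 100_000            # stand-in for the packed weights
compressed = brotli.compress(payload)         # quality defaults to 11
assert brotli.decompress(compressed) == payload
print(f"{len(payload)} -> {len(compressed)} bytes")
```
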
Test-Time Training: score-first TTT
  parameters: {"enabled": 1, "learning_rate": 0.005, "epochs": 4, "momentum": 0.9, "chunk_tokens": 32768}
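
"Score-first" means each chunk is scored with the current weights before the model adapts on it, so no token influences its own score. A toy sketch with the record's TTT hyperparameters; the model, data, and loss are stand-ins:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(64, 64)                             # toy stand-in model
chunks = [torch.randn(32, 64) for _ in range(3)]      # real chunks: 32768 tokens each

def chunk_loss(m: nn.Module, x: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(m(x), x)                        # stand-in for the LM loss

opt = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

scores = []
for chunk in chunks:
    with torch.no_grad():                 # 1) score first, weights untouched
        scores.append(chunk_loss(model, chunk).item())
    for _ in range(4):                    # 2) then adapt on the chunk (epochs: 4)
        opt.zero_grad()
        chunk_loss(model, chunk).backward()
        opt.step()

print(sum(scores) / len(scores))          # metric aggregates pre-update scores only
```
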
LR Schedule: warmdown
  parameters: {"warmdown_frac": 0.72}
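
A sketch of a warmdown schedule with the record's fraction, assuming the common shape of a constant LR followed by a linear decay to zero over the final 72% of steps (the decay shape is not stated on the card):

```python
def lr_scale(step: int, total_steps: int, warmdown_frac: float = 0.72) -> float:
    """1.0 for the first (1 - warmdown_frac) of training, then linear decay to 0."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))

assert lr_scale(0, 1000) == 1.0        # flat phase
assert lr_scale(1000, 1000) == 0.0     # fully decayed at the end
```
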
Regularization: weight decay
  parameters: {"muon_wd": 0.095, "muon_wd_mlp": 0.115}

Other
  Uses a legal score-first TTT setup with a wallclock reserve and a rerun plan for valid logs.
  parameters: {"max_wallclock_seconds": 590, "gptq_reserve_seconds": 0.5}
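
One plausible reading of the wallclock reserve, sketched below: stop scheduling new work once the remaining budget falls under the reserved tail, keeping those 0.5 s for the final GPTQ/packing step. This is a guess at the mechanism, not the PR's code:

```python
import time

MAX_WALLCLOCK_SECONDS = 590.0    # hard budget from the record
GPTQ_RESERVE_SECONDS = 0.5       # tail held back for the final packing pass
_start = time.monotonic()

def budget_left() -> float:
    """Seconds remaining before the run must yield to the reserved tail."""
    return MAX_WALLCLOCK_SECONDS - GPTQ_RESERVE_SECONDS - (time.monotonic() - _start)

def can_run(step_cost_estimate: float) -> bool:
    """Schedule another unit of work only if it fits before the reserve."""
    return budget_left() > step_cost_estimate
```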

Novel Contributions

  • SP8192 cached data and tokenizer
  • 11-layer 512d Transformer with 8 query heads and 4 KV heads
  • 3-layer depth recurrence
  • parallel residual lanes
  • QK gain 5.25
  • Muon-style optimizer with split weight decay
  • EMA
  • GPTQ int6/int8 compression plus Brotli
  • legal score-first TTT
  • wallclock-reserved rerun plan