PR #1980
open
Non-record: final-day SP8192 reproduction and mHC-lite local probe
by Kbediako
val_bpb
1.0738
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,929,546 bytes
Training Techniques
Test-Time Training
score-first TTT
parameters: null
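A minimal sketch of the score-first ordering, assuming it means each validation chunk is scored with the current weights before the model adapts on that chunk, so no token is predicted by weights that have already seen it. The PR's transformer and gradient updates are not shown in this summary; a Laplace-smoothed unigram counter stands in for the model here, and `score_first_ttt`, `chunks`, and `vocab_size` are hypothetical names.

```python
import math
from collections import Counter


def score_first_ttt(chunks, vocab_size):
    """Return average bits per token under a score-then-adapt loop.

    Toy stand-in: the "model" is a Laplace-smoothed unigram counter,
    not the transformer from the PR.
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    n_tokens = 0
    for chunk in chunks:
        # 1) Score the chunk with the state from BEFORE this chunk.
        for tok in chunk:
            p = (counts[tok] + 1) / (seen + vocab_size)
            total_bits -= math.log2(p)
            n_tokens += 1
        # 2) Only then adapt on the chunk that was just scored.
        counts.update(chunk)
        seen += len(chunk)
    return total_bits / n_tokens
```

Swapping steps 1 and 2 would let the model train on tokens before predicting them, which is the leak the score-first ordering avoids.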
Architecture
attention sink
Adds a learned per-head sigmoid scale for the first value vector in the causal sequence.
parameters: null
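One possible reading of the attention sink description, sketched for a single head in plain Python: the value vector at position 0 is multiplied by the sigmoid of a learned scalar, and causal attention then runs as usual. Where the PR applies the scale and how the scalar is initialized are not stated here; `sink_logit` and `causal_attention_with_sink` are hypothetical names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def causal_attention_with_sink(q, k, v, sink_logit):
    """Single-head causal attention over lists of vectors (T x d).

    Assumption: the sink scales only the first value vector by
    sigmoid(sink_logit); each head would hold its own sink_logit.
    """
    d = len(q[0])
    scale = sigmoid(sink_logit)
    # Scale the first value vector; leave all other positions unchanged.
    v = [[scale * x for x in v[0]]] + [list(row) for row in v[1:]]
    out = []
    for t in range(len(q)):
        # Causal mask: position t attends to positions 0..t only.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(
            [sum(weights[s] / z * v[s][j] for s in range(t + 1)) for j in range(d)]
        )
    return out
```

With a strongly negative `sink_logit`, position 0 still absorbs attention mass but contributes almost nothing to the output.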
mHC-lite
Softmax residual mixing for block residual/input paths.
parameters: {"enabled":true}
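A rough sketch of softmax residual mixing, assuming learned logits are passed through a softmax to produce convex weights over the paths being combined (for example the residual stream and a block's input or output). Which paths the PR mixes and how many logits each block carries are not given in this summary; `mix_paths` and its arguments are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def mix_paths(paths, logits):
    """Convex combination of equally sized vectors.

    The weights are softmax(logits), so they are positive and sum to 1.
    """
    weights = softmax(logits)
    dim = len(paths[0])
    return [sum(w * p[i] for w, p in zip(weights, paths)) for i in range(dim)]
```

Because the weights sum to 1, the mixed activation stays within the range spanned by its inputs, unlike an unconstrained learned scale.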
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"warmup_start":0.92,"warmup_steps":1500}
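The momentum settings above are consistent with a linear momentum warmup from 0.92 to 0.97 over the first 1500 steps. The sketch below assumes linear interpolation; the exact schedule shape in the PR is not stated, and `muon_momentum` is a hypothetical helper.

```python
def muon_momentum(step, warmup_start=0.92, final=0.97, warmup_steps=1500):
    """Momentum at a given step, assuming a linear ramp then a constant."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * warmup_start + frac * final
```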
Regularization
logit softcap
parameters: {"value":15}
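Logit softcapping is commonly written as `cap * tanh(logit / cap)`, which keeps small logits nearly unchanged and bounds large ones by the cap. A scalar sketch with the cap of 15 listed above, assuming the PR uses this tanh form:

```python
import math


def softcap(logit, cap=15.0):
    """Squash a logit smoothly so its magnitude never exceeds cap."""
    return cap * math.tanh(logit / cap)
```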
LR Schedule
warmdown
parameters: {"warmdown_steps":150}
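A warmdown of 150 steps can be sketched as a learning-rate multiplier that stays at 1 and then decays over the last 150 steps. The linear shape, the decay to zero, and the `total_steps` argument are assumptions; the summary gives only `warmdown_steps`.

```python
def lr_multiplier(step, total_steps, warmdown_steps=150):
    """Multiplier on the base learning rate, assuming linear warmdown to 0."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    # Linear decay from 1 at warmdown_start to 0 at total_steps.
    return max(0.0, (total_steps - step) / warmdown_steps)
```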
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Novel Contributions
- Single-seed full-validation reproduction of the SP8192 phased-TTT lineage with QK-Gain 5.25
- mHC-lite softmax residual mixing
- Causal per-head attention sink
- Local RTX 5080 smoke-train/full-validation probe with three seeds
- Final-day non-record documentation of two 8xH100 attempts that exited before scoring