PR #1069
closed
Non-record: 1.1190 BPB — Independent PR #549 Reproduction (10min 8×H100)
by manfromnowhere143
val_bpb
1.1190
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,948,863 bytes
Training Techniques
Architecture
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"slope":0.5}
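As a rough illustration of the activation named above, here is a minimal sketch. It assumes "LeakyReLU squared" means applying a leaky ReLU with the listed slope of 0.5 and then squaring the result. The PR's actual code may differ (for example by preserving the sign), and the function name is hypothetical.

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """Leaky ReLU followed by squaring (assumed reading of 'LeakyReLU squared')."""
    # Negative inputs are scaled by `slope` instead of being zeroed out.
    y = x if x >= 0.0 else slope * x
    # Squaring makes the output non-negative for every input.
    return y * y


if __name__ == "__main__":
    for v in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(v, leaky_relu_squared(v))
```

Under this reading a negative input still contributes a gradient, unlike plain ReLU squared, where it would be exactly zero.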
XSA
Uses XSA4 attention/sequence mechanism.
parameters: null
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"partial":"16/64"}
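The "16/64" setting can be read as rotating 16 of the 64 dimensions in each head and leaving the other 48 untouched. The sketch below works under that assumption. The adjacent-pair layout, the base of 10000, and the function name are illustrative choices, not details taken from the PR.

```python
import math


def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` entries of `vec` in adjacent pairs.

    Entries from index `rot_dims` onward pass through unchanged.
    Pairing and frequency schedule are assumptions for illustration.
    """
    out = list(vec)
    for i in range(0, rot_dims, 2):
        # Each pair gets its own rotation frequency.
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


if __name__ == "__main__":
    head = [1.0] * 64
    rotated = partial_rope(head, pos=3)
    print(rotated[:4], rotated[16:20])
```

Only the rotated slice carries position information. The remaining dimensions behave like position-free attention features.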
SmearGate
Adds SmearGate to the model.
parameters: null
BigramHash
Adds bigram hash embeddings/features.
parameters: null
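A common way to build bigram hash features is to hash each (previous token, current token) pair into a fixed number of buckets and use the bucket id to index a small embedding table. The sketch below shows only the bucketing step. The hash function, bucket count, and start-of-sequence id are hypothetical, and the PR's scheme may differ.

```python
def bigram_bucket(prev_token: int, token: int, num_buckets: int = 4096) -> int:
    """Map a token pair to a bucket id (hypothetical multiplicative hash)."""
    return (prev_token * 1000003 + token) % num_buckets


def bigram_bucket_ids(tokens, num_buckets: int = 4096, bos: int = 0):
    """Return one bucket id per position, pairing each token with its predecessor."""
    ids = []
    prev = bos  # assumed placeholder for the position before the first token
    for t in tokens:
        ids.append(bigram_bucket(prev, t, num_buckets))
        prev = t
    return ids


if __name__ == "__main__":
    print(bigram_bucket_ids([5, 7, 5, 7]))
```

Identical bigrams always land in the same bucket, while distinct bigrams may collide. That collision rate is the trade-off for a small table.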
VE128
Uses value embeddings / value residual style features.
parameters: null
Regularization
LN scale
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
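For reference, an exponential moving average of weights with decay 0.997 can be sketched as follows. Weights are shown as a flat list of floats for simplicity, and the function name is hypothetical.

```python
def ema_update(avg, weights, decay: float = 0.997):
    """Blend the running average toward the current weights."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]


if __name__ == "__main__":
    avg = [0.0, 0.0]
    for _ in range(1000):
        # With constant weights, the average converges toward them.
        avg = ema_update(avg, [1.0, -1.0])
    print(avg)
```

A decay of 0.997 corresponds to an averaging horizon of roughly 1 / (1 - 0.997) ≈ 333 steps.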
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Quantization
GPTQ-lite
bits: 6
scope: weights
GPTQ-lite
bits: 8
scope: weights
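To show what the two bit widths mean in practice, here is a plain round-to-nearest symmetric quantizer with one scale per row. This is not GPTQ and is not claimed to be the PR's "GPTQ-lite" procedure, whose details are not given here. It only illustrates the integer range available at 6 versus 8 bits.

```python
def quantize_row(row, bits: int):
    """Symmetric round-to-nearest quantization with a single per-row scale."""
    qmax = 2 ** (bits - 1) - 1          # 31 for 6 bits, 127 for 8 bits
    largest = max(abs(v) for v in row)
    scale = largest / qmax if largest > 0.0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale


def dequantize_row(q, scale: float):
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]


if __name__ == "__main__":
    row = [0.5, -1.0, 0.25, 0.013]
    for bits in (6, 8):
        q, scale = quantize_row(row, bits)
        print(bits, q, dequantize_row(q, scale))
```

In this scheme the worst-case rounding error per weight is half the scale, so moving from 8 to 6 bits makes it roughly four times larger.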
Test-Time Training
score-first TTT
parameters: {"steps":3,"learning_rate":0.0001}
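One reading of "score-first" is that each evaluation chunk is scored with the current weights before the model takes any gradient steps on it, so no token is scored by a model that has already trained on it. The sketch below uses a toy one-parameter model to show that ordering. Plain SGD, the chunking, and all function names are assumptions for illustration only.

```python
def chunk_loss(param: float, chunk) -> float:
    """Toy loss: squared error between a single parameter and each value."""
    return sum((param - x) ** 2 for x in chunk)


def chunk_grad(param: float, chunk) -> float:
    """Gradient of chunk_loss with respect to param."""
    return sum(2.0 * (param - x) for x in chunk)


def score_first_ttt(chunks, param: float, steps: int = 3, lr: float = 1e-4):
    """Score each chunk first, then adapt on it for `steps` SGD updates."""
    total = 0.0
    for chunk in chunks:
        total += chunk_loss(param, chunk)       # score with pre-update weights
        for _ in range(steps):                  # only then train on the chunk
            param -= lr * chunk_grad(param, chunk)
    return total, param


if __name__ == "__main__":
    total, param = score_first_ttt([[1.0, 1.0], [1.0, 1.0]], param=0.0)
    print(total, param)
```

Later chunks benefit from adaptation on earlier ones, while the reported loss for each chunk never reflects updates made on that same chunk.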
Novel Contributions
- Independent reproduction and slight improvement of PR #549's stack
- 11-layer 512-d model with LeakyReLU², XSA4, Partial RoPE, LN Scale, EMA, Parallel Muon, GPTQ-lite, SmearGate, BigramHash, ValueEmbedding, and score-first TTT
- Achieved 1.1190 BPB under standard competition constraints
- Reported 7,166 steps in 600 seconds on 8×H100 SXM
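The reported figures can be sanity-checked with a little arithmetic. The sketch below derives throughput from the 7,166 steps in 600 seconds and includes the standard conversion from summed negative log-likelihood (in nats) to bits per byte. The helper names and the example inputs to `bits_per_byte` are hypothetical and are not measurements from this run.

```python
import math


def throughput(steps: int, seconds: float):
    """Return (steps per second, milliseconds per step)."""
    return steps / seconds, 1000.0 * seconds / steps


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / math.log(2) / total_bytes


if __name__ == "__main__":
    sps, ms = throughput(7166, 600.0)
    print(f"{sps:.2f} steps/s, {ms:.1f} ms/step")
    # Hypothetical inputs: 1000 bytes of text scored at 775.6 nats in total.
    print(f"{bits_per_byte(775.6, 1000):.4f} BPB")
```

The reported run works out to about 11.9 steps per second, or roughly 84 ms per step across the 8 GPUs.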