PR #1776
openRecord: SP8192 ParResid 3LayerLoop QK5.25 LegalTTT — 1.08083 BPB
by anmarhindi
val_bpb
1.0808
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.97 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 8
scope: embeddings
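GPTQ itself chooses roundings using second-order (Hessian) information; as a simplified illustration of the mixed-precision idea only (round-to-nearest symmetric quantization, not the actual GPTQ solver), 6-bit matrices and 8-bit embeddings might be quantized like this:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Round-to-nearest symmetric quantization. Illustration only:
    GPTQ additionally uses Hessian information to pick roundings."""
    qmax = 2 ** (bits - 1) - 1           # 31 for 6-bit, 127 for 8-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Matrices at 6 bits, embeddings at 8 bits, mirroring the record's config.
w = np.random.randn(4, 4).astype(np.float32)
q6, s6 = quantize_symmetric(w, bits=6)
q8, s8 = quantize_symmetric(w, bits=8)
err6 = np.abs(dequantize(q6, s6) - w).max()
err8 = np.abs(dequantize(q8, s8) - w).max()
```

Rounding error is bounded by half a quantization step, so the 8-bit copy reconstructs more tightly than the 6-bit one.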
Architecture
depth recurrence
Loops layers 3-5 twice, with the recurrence activated partway through training (activate_frac 0.35).
parameters: {"layers":[3,4,5],"loops":2,"activate_frac":0.35}
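The depth-recurrence idea can be sketched as a forward pass that applies the looped block of layers more than once; this is a minimal illustration (toy layers, hypothetical names), not the PR's implementation:

```python
def forward(x, layers, loop_ids=(3, 4, 5), loops=2, recurrence_on=True):
    """Depth recurrence: looped layers are each applied `loops` times.
    Under activate_frac=0.35, recurrence_on would presumably flip to
    True only after 35% of training (an interpretation of the config)."""
    for i, layer in enumerate(layers):
        reps = loops if (recurrence_on and i in loop_ids) else 1
        for _ in range(reps):
            x = layer(x)
    return x

layers = [lambda x, k=k: x + k for k in range(8)]  # toy "layers"
out = forward(0, layers)  # layers 3-5 each run twice
```

Running layers 3-5 twice adds depth at inference time without adding parameters, which matters under the artifact-size cap.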
parallel residuals
Attention and MLP share the same pre-residual input in later layers.
parameters: {"start_layer":7}
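The difference from a standard sequential block is which input the MLP sees; a minimal sketch (toy sub-modules, not the PR's code):

```python
import numpy as np

def sequential_block(x, attn, mlp):
    # Standard ordering: the MLP sees the post-attention residual.
    x = x + attn(x)
    return x + mlp(x)

def parallel_block(x, attn, mlp):
    # Parallel residuals (from start_layer=7 onward in this record):
    # attention and MLP both read the same pre-residual input.
    return x + attn(x) + mlp(x)

attn = lambda x: 2 * x
mlp = lambda x: x + 1
x = np.array([1.0])
seq = sequential_block(x, attn, mlp)
par = parallel_block(x, attn, mlp)
```

Parallel residuals let the two sub-layer computations run concurrently and slightly change the function class, usually at negligible quality cost in later layers.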
GQA
Uses grouped-query attention via the FA3/SDPA backend with enable_gqa.
parameters: {"kv_heads":4,"heads":8}
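With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads. PyTorch's SDPA `enable_gqa` flag handles the grouping internally; a numpy sketch that materializes the repetition instead (illustrative shapes, not the PR's code):

```python
import numpy as np

def gqa(q, k, v, kv_heads):
    """Grouped-query attention by repeating each KV head across its
    query group. Shapes: q (H, T, d); k, v (H_kv, T, d)."""
    group = q.shape[0] // kv_heads
    k = np.repeat(k, group, axis=0)              # (H, T, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)             # softmax over keys
    return w @ v

q = np.random.randn(8, 4, 16)
k = np.random.randn(4, 4, 16)
v = np.random.randn(4, 4, 16)
out = gqa(q, k, v, kv_heads=4)
```

Halving the KV heads halves the KV-cache and the K/V parameter count, which again helps under the size cap.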
Partial RoPE
Applies rotary position embeddings to a subset of head dimensions.
parameters: {"head_dims":"16/64"}
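Per the `16/64` setting, only the first 16 of 64 head dimensions are rotated and the rest pass through untouched. A single-vector sketch (the 10000 base and dim layout are conventional assumptions, not stated in the record):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` of a head vector
    (16 of 64 here); remaining dims are passed through unrotated."""
    d = rot_dims // 2
    freqs = 1.0 / (base ** (np.arange(d) / d))
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:d], x[d:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rot, x[rot_dims:]])

x = np.random.randn(64)
y = partial_rope(x, pos=3)
```

Rotation is norm-preserving on the rotated slice, and position 0 is the identity, so the unrotated dims act as position-agnostic channels.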
LeakyReLU
Uses a squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
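One plausible reading of "LeakyReLU squared" with slope 0.5 is squaring the LeakyReLU output, in the spirit of the squared-ReLU activations common in speedrun stacks; the exact form here is an assumption:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU: one plausible reading of the record's
    'LeakyReLU squared' MLP activation (exact form assumed)."""
    leaky = np.where(x > 0, x, slope * x)
    return leaky ** 2

y = leaky_relu_sq(np.array([-2.0, 0.0, 3.0]))
```

Note that squaring makes the negative branch non-negative, so unlike plain LeakyReLU this variant is not sign-preserving.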
weight tying
Tied input and output embeddings.
parameters: null
Regularization
logit softcap
parameters: {"value":30}
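Logit softcapping with value 30 smoothly bounds the output logits to (-30, 30) while staying near-identity for small logits, in the style popularized by Gemma 2:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Soft-cap logits into (-cap, cap) via a scaled tanh."""
    return cap * np.tanh(logits / cap)

x = np.array([-1000.0, 0.0, 15.0, 1000.0])
y = softcap(x)
```

Large logits saturate toward ±30, while moderate ones (e.g. 15) are only slightly shrunk, which stabilizes the cross-entropy loss without distorting typical predictions much.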
layerwise LN scale
parameters: null
weight decay
parameters: {"muon_wd":0.095,"embed_wd":0.095}
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.005,"epochs_per_chunk":3,"freeze_first_blocks":9,"gradient_clip":1}
Weight Averaging
EMA
parameters: {"decay":0.9965}
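The EMA keeps a shadow copy of the weights updated with decay 0.9965; evaluation then uses the averaged copy rather than the live weights. A minimal sketch:

```python
def ema_update(avg, params, decay=0.9965):
    """One EMA step over a flat list of weights; the averaged copy
    (not the live weights) is what gets evaluated."""
    return [decay * a + (1 - decay) * p for a, p in zip(avg, params)]

# Toy run: the average drifts toward a constant target weight of 1.0.
avg = [0.0]
for _ in range(1000):
    avg = ema_update(avg, [1.0])
```

With decay d, the average forgets old weights on a timescale of roughly 1/(1-d) steps (about 286 here), smoothing out late-training noise.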
LR Schedule
cosine decay
parameters: {"warmdown_frac":0.72}
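One interpretation of warmdown_frac 0.72 is a constant learning rate for the first 28% of training followed by cosine decay to zero over the final 72%; the exact schedule shape is an assumption:

```python
import math

def lr_at(step, total_steps, base_lr=0.005, warmdown_frac=0.72):
    """Constant LR, then cosine decay over the final warmdown_frac of
    training (one reading of the record's schedule)."""
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return base_lr
    t = (step - start) / max(1, total_steps - start)
    return base_lr * 0.5 * (1 + math.cos(math.pi * t))

lrs = [lr_at(s, 1000) for s in range(1000)]
```

The schedule is non-increasing: flat at 0.005, then a smooth cosine ramp down to near zero at the final step.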
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs_per_chunk":3,"chunk_size_tokens":32000}
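"Score-first" means each chunk is scored with the current weights before the model trains on it, so no chunk's loss ever reflects having trained on that same chunk. A toy sketch with illustrative function names (not the PR's code):

```python
def score_first_ttt(chunks, score, train_step, epochs_per_chunk=3):
    """Score each chunk BEFORE adapting on it, keeping the reported
    loss causal with respect to test-time training."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))      # evaluate first...
        for _ in range(epochs_per_chunk):
            train_step(chunk)            # ...then adapt
    return losses

# Toy "model": a single scalar weight chased toward each chunk's value.
state = {"w": 0.0}
score = lambda c: abs(c - state["w"])
def train_step(c):
    state["w"] += 0.5 * (c - state["w"])

losses = score_first_ttt([1.0, 1.0, 1.0], score, train_step)
```

Later chunks benefit from adaptation on earlier ones, so losses fall chunk over chunk while the first chunk is scored entirely cold.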
Evaluation
sliding window eval
parameters: {"causal":true}
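In a causal sliding-window evaluation, each window scores only its trailing stride of tokens and treats the rest as context, so every token is scored exactly once with bounded context. The window and stride values below are illustrative; the record only specifies causal=True:

```python
def sliding_windows(n_tokens, window=1024, stride=512):
    """Return (context_start, score_start, score_end) spans so each
    token is scored once with up to `window` tokens of causal context."""
    spans = []
    start = 0
    while start < n_tokens:
        ctx_start = max(0, start - (window - stride))
        spans.append((ctx_start, start, min(start + stride, n_tokens)))
        start += stride
    return spans

spans = sliding_windows(2000)
```

The scored spans tile the sequence with no gaps or overlaps, which is what makes the resulting BPB comparable across submissions.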
Compression
Brotli
level: 11
Novel Contributions
- Independent re-port of the SP8192 + prior SOTA stack with a FA3/SDPA backend switch for broader hardware support
- 3-layer depth recurrence over layers 3-5
- Parallel residuals in later layers
- QK gain scaling at 5.25
- Legal score-first test-time training under the competition rules
- Mixed GPTQ quantization with int6 matrices and int8 embeddings fitting under 16 MB without pruning