PR #1959
openRecord: SP8192 + 3-layer recurrence + byte-PPM mixer — val_bpb 0.99621 (3-seed mean)
by remg1997
val_bpb
0.9962
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.997 MB
Training Techniques
Architecture
depth recurrence
3-layer recurrence stack with parallel residuals in the Transformer backbone: the same three blocks are looped over repeatedly, giving extra effective depth without extra weights.
parameters: {"layers":3}
weight tying
Tied token embeddings.
parameters: null
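Weight tying in its usual form, as a sketch (class and dimension names are hypothetical):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Minimal weight-tying sketch: the output projection shares its weight
    matrix with the token embedding, so logits = h @ E^T."""
    def __init__(self, vocab_size: int = 256, d: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.lm_head = nn.Linear(d, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # one tensor, two roles

    def forward(self, tokens):
        return self.lm_head(self.embed(tokens))  # toy: transformer body omitted
```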
Partial RoPE
Partial rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
LeakyReLU
Squared LeakyReLU activation in the MLP (see the MLP sketch below the MLP4x entry).
parameters: {"squared":true}
GQA
Grouped-query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
MLP4x
MLP width multiplier of 4x in the Transformer block.
parameters: {"multiplier":4}
Quantization
GPTQ
bits: 6
scope: all attention/MLP matrices
GPTQ
bits: 8
scope: token embeddings
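To make the bit-widths concrete, here is a simplified per-channel round-to-nearest quantizer. This is a stand-in, not GPTQ itself: GPTQ additionally minimizes layerwise reconstruction error with Hessian-based weight updates, which is omitted here:

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-output-channel round-to-nearest quantization of a 2-D
    weight matrix; returns the dequantized weights as used at eval time."""
    qmax = 2 ** (bits - 1) - 1                               # e.g. 31 for 6-bit
    scale = w.abs().amax(dim=1, keepdim=True) / qmax + 1e-12  # one scale per row
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale

# Per the record: bits=6 for attention/MLP matrices, bits=8 for token embeddings.
```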
Regularization
layerwise LN scale
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.9965}
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
LR Schedule
warmdown
parameters: {"final_fraction":0.72}
Evaluation
sliding window eval
parameters: {"stride":64}
Other
other
Causal byte-level PPM-D order-4 mixer applied at evaluation time with an outcome-independent adaptive lambda gate.
parameters: {"order":4,"eval_time_only":true}
Test-Time Training
score-first TTT
parameters: {"enabled_for_diagnostics":true,"disabled_in_submission":true}
Novel Contributions
- Combines the SP8192 3-layer recurrence stack with a corrected causal byte-level PPM mixer.
- Uses an outcome-independent adaptive gate for mixing neural byte probabilities with PPM probabilities.
- Reports a 3-seed mean val_bpb of 0.99621, with per-seed logs and compliance verification.
- Disables TTT in the submitted artifact to meet the eval-time budget while preserving the PPM mixer result.