PR #1933

open

Record: SP8192 + yahya010 NN base + byte-PPM mixer — val_bpb 0.99145 …

by deborahnelson8788726
val_bpb: 0.9915
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15.90-15.91 MB

Training Techniques

Quantization
  GPTQ (bits: 6, scope: all)
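For intuition about what a 6-bit weight grid costs in fidelity, here is a minimal sketch of symmetric round-to-nearest quantization of one weight row. This is a simpler stand-in for GPTQ, which additionally applies error-compensating updates across columns; the function names are illustrative, not from this PR's code.

```python
def quantize_6bit(w):
    """Symmetric round-to-nearest quantization of one weight row to
    6-bit signed integers in [-32, 31].  A stand-in for GPTQ's grid;
    GPTQ's Hessian-based error compensation is omitted here."""
    qmax = 31
    scale = max(abs(x) for x in w) / qmax or 1.0  # avoid zero scale
    q = [max(-32, min(31, round(x / scale))) for x in w]
    return q, scale

def dequantize_6bit(q, scale):
    """Map 6-bit integers back to floats."""
    return [x * scale for x in q]
```

Round-trip error is bounded by half a quantization step (`scale / 2`), which is the fidelity budget the GPTQ pass works within.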
Compression
  brotli (level: null)
Evaluation
  sliding window eval (parameters: null)
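A minimal sketch of sliding-window bits-per-byte evaluation, assuming a scoring callback that returns per-position negative log-probabilities in bits for a window of bytes (the callback name and signature are illustrative). Each window re-reads overlapping context but only newly covered positions contribute to the total, and bpb is total bits divided by total bytes.

```python
def sliding_window_bpb(seq, nll_bits_fn, window=8192, stride=4096):
    """Score a long byte sequence with overlapping windows.

    nll_bits_fn(window_bytes) -> list of per-position NLLs in bits,
    one per byte of the window.  Positions already scored by the
    previous window are skipped, so every byte is counted once.
    """
    total_bits = 0.0
    scored = 0
    start = 0
    while scored < len(seq):
        end = min(start + window, len(seq))
        nlls = nll_bits_fn(seq[start:end])
        new = end - scored              # positions not yet scored
        total_bits += sum(nlls[-new:])  # score only the new tail
        scored = end
        start = max(0, end - (window - stride))  # keep context overlap
    return total_bits / len(seq)
```

With `window == stride` this degenerates to disjoint chunks; the overlap is what gives late positions in each chunk full-length context.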
Test-Time Training
  score-first TTT (parameters: {"enabled": false})
  LoRA TTT (parameters: {"phased": true})
Architecture
  depth recurrence (parameters: {"layers": 3}): 3-layer recurrence / parallel residuals lineage inherited from the base stack.
  weight tying (parameters: null): SP-vocab / tied-embedding style lineage inherited from the base stack.
  RoPE (parameters: null): partial RoPE used in the inherited base stack.
Regularization
  LN scale (parameters: null)
Initialization
  QK-Gain: QK-Gain 5.25 initialization used in the inherited base stack.
Other
  Byte-level PPM-D adaptive mixture applied at evaluation time via an outcome-independent adaptive-λ gate. (parameters: {"order": 4, "lambda_high": 0.9, "lambda_low": 0.05, "threshold": 0.9})
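A sketch of how an outcome-independent adaptive-λ gate of this shape could combine the two byte distributions, using the listed parameter values as defaults. The gate looks only at the PPM model's own confidence (its maximum probability), never at the true next byte, which is what makes it legal at evaluation time. The function name is illustrative; the PPM-D model itself (order 4) is assumed to be supplied externally.

```python
def adaptive_lambda_mix(p_nn, p_ppm,
                        lambda_high=0.9, lambda_low=0.05, threshold=0.9):
    """Mix NN and PPM next-byte distributions with an adaptive gate.

    If the PPM distribution is confident (max prob >= threshold), weight
    it heavily (lambda_high); otherwise let the NN dominate (lambda_low).
    The gate is outcome-independent: it never peeks at the true byte.
    """
    lam = lambda_high if max(p_ppm) >= threshold else lambda_low
    return [lam * q + (1.0 - lam) * p for p, q in zip(p_nn, p_ppm)]
```

Since both inputs are probability distributions and the output is a convex combination, the mixture is itself a valid distribution.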
Sequence Length
  train_length: 8192
  eval_length: 8192

Novel Contributions

  • Composes yahya010's stronger SP8192 Transformer base with OE-GOD's byte-level PPM-D mixer
  • Applies the byte-PPM mixture function verbatim during sliding-window evaluation, after the distributed all-reduce
  • Uses outcome-independent adaptive-λ gating for the PPM mixture
  • Disables phased TTT at runtime because the PPM mixer is reported to outperform it on this stack
  • Reports a 3-seed mean validation score of 0.99145 bpb
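The ordering claimed above (mix after the all-reduce) can be sketched with a toy single-process simulation, where the all-reduce is plain summation of each rank's partial logits. All names here are illustrative; a real run would use a collective such as torch.distributed.all_reduce, after which every rank holds identical logits and therefore computes an identical mixed distribution.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def eval_step(logit_shards, p_ppm, lam=0.9):
    """One evaluation position: sum partial logits across 'ranks'
    (standing in for a distributed all-reduce), then apply the
    byte-PPM mixture to the reduced distribution.  Mixing after the
    reduction keeps every rank's mixed distribution identical."""
    reduced = [sum(col) for col in zip(*logit_shards)]  # all-reduce (sum)
    p_nn = softmax(reduced)
    return [lam * q + (1.0 - lam) * p for p, q in zip(p_nn, p_ppm)]
```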