PR #1933

open

Record: SP8192 + yahya010 NN base + byte-PPM mixer — val_bpb 0.99145 …

by deborahnelson8788726
val_bpb: 0.9915
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15.90-15.91 MB

Training Techniques

Quantization
  GPTQ (bits: 6, scope: all)
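For intuition about what a 6-bit weight grid costs in fidelity, here is a minimal sketch of symmetric round-to-nearest quantization of one weight row. This is a simpler stand-in for GPTQ, which additionally applies error-compensating updates across columns; the function names are illustrative, not from this PR's code.

```python
def quantize_6bit(w):
    """Symmetric round-to-nearest quantization of one weight row to
    6-bit signed integers in [-32, 31].  A stand-in for GPTQ's grid;
    GPTQ's Hessian-based error compensation is omitted here."""
    qmax = 31
    scale = max(abs(x) for x in w) / qmax or 1.0  # avoid zero scale
    q = [max(-32, min(31, round(x / scale))) for x in w]
    return q, scale

def dequantize_6bit(q, scale):
    """Map 6-bit integers back to floats."""
    return [x * scale for x in q]
```

Round-trip error is bounded by half a quantization step (`scale / 2`), which is the fidelity budget the GPTQ pass works within.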
Compression
  brotli (level: null)
Evaluation
  sliding window eval (parameters: null)
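A minimal sketch of sliding-window bits-per-byte evaluation, assuming a scoring callback that returns per-position negative log-probabilities in bits for a window of bytes (the callback name and signature are illustrative). Each window re-reads overlapping context but only newly covered positions contribute to the total, and bpb is total bits divided by total bytes.

```python
def sliding_window_bpb(seq, nll_bits_fn, window=8192, stride=4096):
    """Score a long byte sequence with overlapping windows.

    nll_bits_fn(window_bytes) -> list of per-position NLLs in bits,
    one per byte of the window.  Positions already scored by the
    previous window are skipped, so every byte is counted once.
    """
    total_bits = 0.0
    scored = 0
    start = 0
    while scored < len(seq):
        end = min(start + window, len(seq))
        nlls = nll_bits_fn(seq[start:end])
        new = end - scored              # positions not yet scored
        total_bits += sum(nlls[-new:])  # score only the new tail
        scored = end
        start = max(0, end - (window - stride))  # keep context overlap
    return total_bits / len(seq)
```

With `window == stride` this degenerates to disjoint chunks; the overlap is what gives late positions in each chunk full-length context.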
Test-Time Training
  score-first TTT (parameters: {"enabled": false})
  LoRA TTT (parameters: {"phased": true})
Architecture
  depth recurrence (parameters: {"layers": 3}): 3-layer recurrence / parallel residuals lineage inherited from the base stack.
  weight tying (parameters: null): SP-vocab / tied-embedding style lineage inherited from the base stack.
  RoPE (parameters: null): partial RoPE used in the inherited base stack.
Regularization
  LN scale (parameters: null)
Initialization
  QK-Gain: QK-Gain 5.25 initialization used in the inherited base stack.
Other
  Byte-level PPM-D adaptive mixture applied at evaluation time via an outcome-independent adaptive-λ gate. (parameters: {"order": 4, "lambda_high": 0.9, "lambda_low": 0.05, "threshold": 0.9})
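A sketch of how an outcome-independent adaptive-λ gate of this shape could combine the two byte distributions, using the listed parameter values as defaults. The gate looks only at the PPM model's own confidence (its maximum probability), never at the true next byte, which is what makes it legal at evaluation time. The function name is illustrative; the PPM-D model itself (order 4) is assumed to be supplied externally.

```python
def adaptive_lambda_mix(p_nn, p_ppm,
                        lambda_high=0.9, lambda_low=0.05, threshold=0.9):
    """Mix NN and PPM next-byte distributions with an adaptive gate.

    If the PPM distribution is confident (max prob >= threshold), weight
    it heavily (lambda_high); otherwise let the NN dominate (lambda_low).
    The gate is outcome-independent: it never peeks at the true byte.
    """
    lam = lambda_high if max(p_ppm) >= threshold else lambda_low
    return [lam * q + (1.0 - lam) * p for p, q in zip(p_nn, p_ppm)]
```

Since both inputs are probability distributions and the output is a convex combination, the mixture is itself a valid distribution.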
Sequence Length
  train_length: 8192
  eval_length: 8192

Novel Contributions

  • Composes yahya010's stronger SP8192 Transformer base with OE-GOD's byte-level PPM-D mixer
  • Applies the byte-PPM mixture function verbatim during sliding-window evaluation, after the distributed all-reduce
  • Uses outcome-independent adaptive-λ gating for the PPM mixture
  • Disables phased TTT at runtime because the PPM mixer is reported to outperform it on this stack
  • Reports a 3-seed mean validation score of 0.99145 bpb
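The ordering claimed above (mix after the all-reduce) can be sketched with a toy single-process simulation, where the all-reduce is plain summation of each rank's partial logits. All names here are illustrative; a real run would use a collective such as torch.distributed.all_reduce, after which every rank holds identical logits and therefore computes an identical mixed distribution.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def eval_step(logit_shards, p_ppm, lam=0.9):
    """One evaluation position: sum partial logits across 'ranks'
    (standing in for a distributed all-reduce), then apply the
    byte-PPM mixture to the reduced distribution.  Mixing after the
    reduction keeps every rank's mixed distribution identical."""
    reduced = [sum(col) for col in zip(*logit_shards)]  # all-reduce (sum)
    p_nn = softmax(reduced)
    return [lam * q + (1.0 - lam) * p for p, q in zip(p_nn, p_ppm)]
```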