PR #1959
openRecord: SP8192 + 3-layer recurrence + byte-PPM mixer — val_bpb 0.99621 (3-seed mean)
by remg1997
val_bpb
0.9962
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.997 MB
Training Techniques
Architecture
depth recurrence
3-layer recurrence stack with parallel residuals in the Transformer backbone: the same three blocks are looped over repeatedly, giving extra effective depth without extra weights.
parameters: {"layers":3}
weight tying
Tied token embeddings.
parameters: null
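Weight tying in its usual form, as a sketch (class and dimension names are hypothetical):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Minimal weight-tying sketch: the output projection shares its weight
    matrix with the token embedding, so logits = h @ E^T."""
    def __init__(self, vocab_size: int = 256, d: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.lm_head = nn.Linear(d, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # one tensor, two roles

    def forward(self, tokens):
        return self.lm_head(self.embed(tokens))  # toy: transformer body omitted
```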
Partial RoPE
Partial rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
LeakyReLU
Squared LeakyReLU activation in the MLP (see the MLP sketch below the MLP4x entry).
parameters: {"squared":true}
GQA
Grouped-query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
MLP4x
MLP width multiplier of 4x in the Transformer block.
parameters: {"multiplier":4}
Quantization
GPTQ
bits: 6
scope: all attention/MLP matrices
GPTQ
bits: 8
scope: token embeddings
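To make the bit-widths concrete, here is a simplified per-channel round-to-nearest quantizer. This is a stand-in, not GPTQ itself: GPTQ additionally minimizes layerwise reconstruction error with Hessian-based weight updates, which is omitted here:

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-output-channel round-to-nearest quantization of a 2-D
    weight matrix; returns the dequantized weights as used at eval time."""
    qmax = 2 ** (bits - 1) - 1                               # e.g. 31 for 6-bit
    scale = w.abs().amax(dim=1, keepdim=True) / qmax + 1e-12  # one scale per row
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale

# Per the record: bits=6 for attention/MLP matrices, bits=8 for token embeddings.
```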
Regularization
layerwise LN scale
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.9965}
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
LR Schedule
warmdown
parameters: {"final_fraction":0.72}
Evaluation
sliding window eval
parameters: {"stride":64}
Other
other
Causal byte-level PPM-D order-4 mixer applied at evaluation time with an outcome-independent adaptive lambda gate.
parameters: {"order":4,"eval_time_only":true}
Test-Time Training
score-first TTT
parameters: {"enabled_for_diagnostics":true,"disabled_in_submission":true}
Novel Contributions
- Combines the SP8192 3-layer recurrence stack with a corrected causal byte-level PPM mixer.
- Uses an outcome-independent adaptive gate for mixing neural byte probabilities with PPM probabilities.
- Reports a 3-seed mean val_bpb of 0.99621, with per-seed logs and compliance verification.
- Disables TTT in the submitted artifact to meet the eval-time budget while preserving the PPM mixer result.