PR #2098

open

Record: PR #1873 base + tuned PPM gate (T=0.7/H=0.99/L=0.3) — val_bpb 0.80051 (3-seed mean)

by joshuaswanson

val_bpb

0.8005

Architecture

Transformer

Optimizer

—

Artifact Size

<16MB

Training Techniques

Quantization

GPTQ

bits: 6

scope: attention/MLP; embeddings int8
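To make the bit widths concrete, here is a minimal sketch of symmetric round-to-nearest quantization at 6 bits. This is not GPTQ: GPTQ additionally compensates for rounding error as it quantizes, and that step is omitted here. The function names and the single per-tensor scale are assumptions for illustration, not details taken from the PR.

```python
def quantize_rtn(weights, bits=6):
    """Symmetric round-to-nearest quantization (a simple baseline, NOT GPTQ)."""
    qmax = 2 ** (bits - 1) - 1              # 31 for 6 bits, 127 for int8
    # One scale for the whole list (assumption); fall back to 1.0 for all-zero input.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_rtn(weights, bits=6)
restored = dequantize(codes, scale)
```

Calling the same function with `bits=8` gives the int8 range that the scope line lists for the embeddings.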

Architecture

weight tying

Tied token embeddings

parameters: null

Partial RoPE

Partial rotary positional embeddings

parameters: {"dimensions":"16/64"}
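A minimal sketch of partial rotary embeddings, assuming "16/64" means that 16 of the 64 dimensions in each attention head are rotated and the remaining 48 pass through unchanged. The pairing of dimensions (first half with second half of the rotated slice) and the frequency base of 10000 are assumptions; the PR summary does not state them.

```python
import math


def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` entries of one head vector by position `pos`.

    Entries from index `rot_dims` onward are returned untouched.
    """
    assert rot_dims % 2 == 0 and rot_dims <= len(vec)
    out = list(vec)
    half = rot_dims // 2
    for i in range(half):
        theta = pos * base ** (-2.0 * i / rot_dims)   # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]                  # assumed "rotate-half" pairing
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


head = [float(i) for i in range(64)]
rotated = partial_rope(head, pos=5)
```

Because each pair undergoes a plain 2D rotation, the vector norm is preserved, and position 0 leaves the vector unchanged.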

depth recurrence

Encoder/decoder layer recurrence with repeated layer loops

parameters: {"encoder":[0,1,2,3,4,5,3,4],"decoder":[5,3,4,5,6,7,8,9,10]}
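The index lists above can be read as an execution schedule in which a layer index that appears more than once reuses the same weights. The sketch below, with hypothetical stand-in layers that only record their index, counts how often each layer runs. Any skip connections between the encoder and decoder halves are not modelled, since the summary does not describe them.

```python
from collections import Counter

ENCODER = [0, 1, 2, 3, 4, 5, 3, 4]
DECODER = [5, 3, 4, 5, 6, 7, 8, 9, 10]


def run_schedule(x, layers, schedule):
    """Apply layers in schedule order; repeated indices reuse the same layer."""
    for idx in schedule:
        x = layers[idx](x)
    return x


# Hypothetical stand-in layers: each one appends its own index to a trace list.
layers = [lambda trace, i=i: trace + [i] for i in range(11)]

trace = run_schedule(run_schedule([], layers, ENCODER), layers, DECODER)
uses = Counter(trace)
```

Under this reading, 11 distinct layers are applied 17 times in total, with layers 3, 4, and 5 each running three times.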

LeakyReLU

LeakyReLU activation; the inherited stack also mentions a squared variant

parameters: {"slope":0.5}
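A small sketch of the activation with the listed slope of 0.5. The squared variant is assumed here to mean squaring the LeakyReLU output; whether the inherited stack preserves the sign of negative inputs is not stated, so treat that part as illustrative only.

```python
def leaky_relu(x, slope=0.5):
    """LeakyReLU: identity for x >= 0, scaled by `slope` for x < 0."""
    return x if x >= 0.0 else slope * x


def leaky_relu_squared(x, slope=0.5):
    """Assumed squared variant: square the LeakyReLU output (sign is lost)."""
    y = leaky_relu(x, slope)
    return y * y


samples = [-2.0, -1.0, 0.0, 3.0]
plain = [leaky_relu(x) for x in samples]
squared = [leaky_relu_squared(x) for x in samples]
```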

Regularization

layerwise LN scale

parameters: null

Test-Time Training

TTT

parameters: {"learning_rate":0.008,"epochs":4}

Other

other

Causal byte-level PPM-D mixture with tuned confidence gate over NN and PPM log-probabilities

parameters: {"PPM_C":0.7,"PPM_LHI":0.99,"PPM_LLO":0.3,"PPM_ORDER":5}
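One plausible reading of the gate, sketched below: PPM_C is a threshold on the PPM model's confidence, and the weight placed on the neural network is PPM_LLO when PPM is confident and PPM_LHI otherwise. This interpretation, the function names, and the definition of `ppm_conf` are assumptions; the summary lists only the parameter values. The T/H/L values in the title presumably refer to the same three settings.

```python
import math

PPM_C, PPM_LHI, PPM_LLO = 0.7, 0.99, 0.3


def gated_mix_logprob(nn_logp, ppm_logp, ppm_conf):
    """Natural-log probability of the observed byte under the gated mixture.

    nn_logp, ppm_logp: log-probabilities each model assigned to the byte.
    ppm_conf: PPM confidence computed from context only, before the byte is
              seen, so the gate stays causal (assumed definition).
    """
    lam = PPM_LLO if ppm_conf >= PPM_C else PPM_LHI   # weight on the NN (assumed)
    a = math.log(lam) + nn_logp
    b = math.log1p(-lam) + ppm_logp
    m = max(a, b)                                     # log-sum-exp for stability
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def bits_per_byte(logps):
    """Convert per-byte natural-log probabilities to mean bits per byte."""
    return -sum(logps) / (len(logps) * math.log(2))


confident = gated_mix_logprob(math.log(0.1), math.log(0.9), ppm_conf=0.9)
unsure = gated_mix_logprob(math.log(0.1), math.log(0.9), ppm_conf=0.2)
```

With these settings, a confident PPM prediction dominates the mixture (0.3 × 0.1 + 0.7 × 0.9 = 0.66), while an unconfident one is almost ignored (0.99 × 0.1 + 0.01 × 0.9 = 0.108).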

Offline sweep of PPM gate hyperparameters on dumped NN distribution
Improved causal PPM-D gate settings: PPM_C=0.7, PPM_LHI=0.99, PPM_LLO=0.3
Direct lineage from PR #1873 with byte-identical training pipeline and only runtime hyperparameter changes
3-seed mean val_bpb improved from 0.82006 to 0.80051 (a reduction of 0.01955; lower is better)
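Because only runtime hyperparameters change, a sweep like the one described can run offline over dumped per-byte values without retraining. The sketch below is a plain grid search under the same assumed gate semantics (threshold on PPM confidence, NN weight of LLO when PPM is confident and LHI otherwise). The record layout, the grids, and the toy numbers are all hypothetical, not the PR's data.

```python
import itertools
import math


def mix_logp(nn_logp, ppm_logp, ppm_conf, c, lhi, llo):
    """Log-probability of the observed byte under the gated mixture (assumed form)."""
    lam = llo if ppm_conf >= c else lhi               # weight on the NN
    return math.log(lam * math.exp(nn_logp) + (1.0 - lam) * math.exp(ppm_logp))


def sweep(records, c_grid, lhi_grid, llo_grid):
    """Return (bpb, c, lhi, llo) for the grid point with the lowest bits per byte."""
    best = None
    for c, lhi, llo in itertools.product(c_grid, lhi_grid, llo_grid):
        total = sum(mix_logp(n, p, conf, c, lhi, llo) for n, p, conf in records)
        bpb = -total / (len(records) * math.log(2))
        if best is None or bpb < best[0]:
            best = (bpb, c, lhi, llo)
    return best


# Made-up dumped records: (nn_logp, ppm_logp, ppm_conf) for each byte.
records = [
    (math.log(0.2), math.log(0.9), 0.95),
    (math.log(0.6), math.log(0.1), 0.40),
    (math.log(0.5), math.log(0.8), 0.75),
]
best = sweep(records, c_grid=[0.7, 0.9], lhi_grid=[0.9, 0.99], llo_grid=[0.05, 0.3])
```

The winning settings on these three toy records carry no meaning. The point is the structure: the dumped log-probabilities are fixed inputs and only the gate parameters vary.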