PR #1858
openRecord: Score-First TTT + PPM-D Byte Mixture — mix_bpb 0.9946 (3-seed mean)
by G3sparky
val_bpb
0.9946
Architecture
Transformer
Optimizer
SGD
Artifact Size
15,997,375 bytes
Training Techniques
Test-Time Training
score-first TTT
parameters: {"epochs_per_chunk":3,"learning_rate":0.005,"momentum":0.9}
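A minimal sketch of the score-first ordering, using a toy adaptive byte-unigram as a stand-in for the real network (the model, chunk size, and gradient details are illustrative assumptions; only the 3 epochs per chunk, lr 0.005, and momentum 0.9 come from the parameters above). The point is the ordering: each chunk is scored with the current weights *before* any update sees it, so the bit count stays legal.

```python
import numpy as np

def score_first_ttt(data, chunk_size=512, epochs_per_chunk=3, lr=0.005, momentum=0.9):
    """Toy score-first TTT loop: score each chunk, then adapt on it."""
    logits = np.zeros(256)            # stand-in model: learned byte unigram
    vel = np.zeros(256)               # SGD momentum buffer
    total_bits = 0.0
    for start in range(0, len(data), chunk_size):
        chunk = np.frombuffer(data[start:start + chunk_size], dtype=np.uint8)
        p = np.exp(logits - logits.max()); p /= p.sum()
        total_bits += -np.log2(p[chunk]).sum()     # 1) score FIRST ...
        counts = np.bincount(chunk, minlength=256)
        for _ in range(epochs_per_chunk):          # 2) ... then 3 SGD epochs
            p = np.exp(logits - logits.max()); p /= p.sum()
            grad = p - counts / len(chunk)         # mean cross-entropy gradient
            vel = momentum * vel - lr * grad
            logits = logits + vel
    return total_bits / len(data)                  # bits per byte
```

Because the first chunk is always scored with the untrained model, a single-chunk input costs exactly 8 bpb here; adaptation only pays off from the second chunk on.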
Evaluation
sliding window eval
parameters: null
Quantization
GPTQ
bits: 6
scope: attention/MLP; int8 embeddings
Architecture
depth recurrence
Layers 3-5 are looped as a recurrent block
parameters: {"layers":[3,4,5],"num_loops":2,"activated_at_frac":0.35}
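Assuming the recurrence runs layers 3-5 as one block repeated `num_loops` times once training has passed 35% of its steps (the exact unrolling is not spelled out above, so this is one plausible reading), the forward control flow might look like:

```python
def forward_with_recurrence(x, layers, block=(3, 5), num_loops=2,
                            train_frac=1.0, activated_at_frac=0.35):
    """Run a layer stack, looping layers block[0]..block[1] `num_loops`
    times once training progress reaches `activated_at_frac`."""
    lo, hi = block
    reps = num_loops if train_frac >= activated_at_frac else 1
    for layer in layers[:lo]:          # prologue layers, run once
        x = layer(x)
    for _ in range(reps):              # recurrent block, run `reps` times
        for layer in layers[lo:hi + 1]:
            x = layer(x)
    for layer in layers[hi + 1:]:      # epilogue layers, run once
        x = layer(x)
    return x
```

With a 7-layer stack this visits layers in the order 0,1,2,3,4,5,3,4,5,6 after activation, and 0..6 once before it.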
weight tying
Tied embeddings
parameters: null
Partial RoPE
Partial rotary positional embeddings
parameters: {"dimensions":"16/64"}
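"16/64" means only the first 16 of each head's 64 dimensions get rotary embeddings; the remaining 48 pass through untouched. A sketch assuming the common first-half/second-half pairing convention (the PR does not state which pairing it uses):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """x: (T, d). Apply RoPE to the first `rot_dims` dims only."""
    T, d = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), inv_freq)          # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]       # paired rotary channels
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)  # rest unrotated
```

Position 0 is a zero-angle rotation, so it comes back unchanged, and dims 16-63 are identical to the input at every position.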
LeakyReLU
Leaky ReLU activation used in MLP
parameters: {"slope":0.5}
XSA
XSA applied to all layers
parameters: null
GQA
Grouped-query attention with 4 KV heads
parameters: {"kv_heads":4}
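In GQA each of the 4 KV heads serves a contiguous group of query heads, so the KV cache shrinks by the group factor. A minimal causal-attention sketch (head counts and dims here are illustrative; the PR only fixes kv_heads=4):

```python
import numpy as np

def gqa(q, k, v):
    """Causal grouped-query attention.
    q: (T, n_q, d); k, v: (T, n_kv, d) with n_kv dividing n_q."""
    T, n_q, d = q.shape
    n_kv = k.shape[1]
    group = n_q // n_kv                        # query heads per KV head
    k = np.repeat(k, group, axis=1)            # share each KV head across its group
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal: no future positions
    scores = np.where(mask[None], -1e30, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('hts,shd->thd', w, v)
```

At position 0 the causal mask leaves only the token itself, so the output there is exactly the (group-repeated) value vector.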
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
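Logit softcapping is commonly implemented as the tanh form z -> cap * tanh(z / cap), which is near-identity for small logits and smoothly bounded in (-cap, cap); with the value 30 above:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits: ~identity near 0, saturates at +/- cap."""
    return cap * np.tanh(np.asarray(logits, dtype=np.float64) / cap)
```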
Optimizer
SGD
weight_decay: 0.095
momentum: 0.9
other_params: {"muon_variant":"MuonEq-R","newton_schulz_steps":5}
Weight Averaging
EMA
parameters: {"decay":0.9965}
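The EMA weights are a running exponential average of the training weights, one update per step with decay 0.9965:

```python
def ema_update(ema, params, decay=0.9965):
    """One EMA step per parameter tensor: ema <- decay*ema + (1-decay)*params."""
    return {name: decay * e + (1.0 - decay) * params[name]
            for name, e in ema.items()}
```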
LR Schedule
cosine decay
parameters: {"warmdown_frac":0.72}
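The schedule shape implied by `warmdown_frac` is not fully specified above; one common reading (assumed here) is a flat phase followed by a cosine decay to zero over the final 72% of training:

```python
import math

def lr_at(step, total_steps, base_lr=0.005, warmdown_frac=0.72):
    """Assumed shape: constant base_lr, then cosine warmdown to 0
    over the last `warmdown_frac` of the run."""
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return base_lr
    t = (step - start) / max(1, total_steps - start)
    return base_lr * 0.5 * (1 + math.cos(math.pi * t))
```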
Compression
lzma
level: null
brotli
level: 11
Other
other
PPM-D byte mixture with binary-lambda gate for eval-time probability mixing
parameters: {"order":5,"confidence_threshold":0.9,"lambda_high":0.05,"lambda_low":0.9}
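One reading of the binary-lambda gate (an assumption; the parameter semantics are not spelled out above) is that lambda weights the neural model: when PPM-D is confident, i.e. its max symbol probability reaches the 0.9 threshold, the neural weight drops to 0.05, otherwise it stays at 0.9. A sketch of just the mixing step, with the PPM-D model itself abstracted away:

```python
import numpy as np

def gated_mix(p_neural, p_ppm, confidence_threshold=0.9,
              lambda_high=0.05, lambda_low=0.9):
    """Binary-lambda mixture of two next-byte distributions.
    Assumed gate: lambda is the neural weight; lean on PPM when it's confident."""
    lam = lambda_high if p_ppm.max() >= confidence_threshold else lambda_low
    mixed = lam * p_neural + (1.0 - lam) * p_ppm
    return mixed / mixed.sum()        # renormalize against fp drift
```

Since both inputs are distributions, the convex combination is itself a distribution; the gate just flips which model dominates per position.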
Novel Contributions
- Legal score-first TTT with 3 SGD epochs per chunk
- PPM-D byte mixture with score-before-update ordering
- Binary-lambda gate for mixing neural and PPM-D probabilities
- Self-extracting LZMA-compressed code wrapper
- Brotli-11 model compression