PR #688 (open)

Record: 5-expert Hedge Mixer + TTT (3-seed mean val_bpb=1.0745)

val_bpb: 1.0745
Architecture: Transformer
Optimizer: AdamW
Artifact Size: <15.5 MB

Training Techniques

Quantization: GPTQ
parameters: {"bits":5,"scope":"all"}
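GPTQ itself performs Hessian-based, column-by-column error correction during calibration; as a minimal sketch of only the 5-bit symmetric quantize/dequantize step it builds on (function names are hypothetical, per-tensor scaling is an assumption):

```python
import numpy as np

def quantize_int5(w: np.ndarray):
    """Symmetric per-tensor 5-bit quantization: integer levels in [-15, 15].

    Illustrative only -- GPTQ additionally reorders and error-corrects
    columns against a calibration Hessian, which is omitted here.
    """
    scale = np.abs(w).max() / 15.0
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize_int5(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the 5-bit integers back to floats."""
    return q.astype(np.float32) * scale

w = np.array([0.3, -1.5, 0.02, 1.5], dtype=np.float32)
q, s = quantize_int5(w)
w_hat = dequantize_int5(q, s)
```

The int5 values are then packed and handed to zstd, which compresses well because only 31 distinct byte values occur.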
Architecture: BigramHash. Hashed bigram table used as one expert in the 5-expert context mixer.
parameters: {"size":6144,"dim":128}
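A minimal sketch of what a hashed bigram expert can look like, assuming a count-based table with add-one smoothing and the "built only from already-scored tokens" discipline listed under Novel Contributions (the toy vocab size and hash multiplier are assumptions; only the bucket count matches the parameters):

```python
import numpy as np

VOCAB = 128          # toy vocabulary; the real run's vocab is an assumption
TABLE_SIZE = 6144    # hash buckets, matching the stated parameters

class HashedBigram:
    """Incremental hashed bigram counts: bucket = hash(prev_token).

    Counts are updated only after a token has been scored, so the
    expert never conditions on a token before predicting it.
    """
    def __init__(self):
        self.counts = np.ones((TABLE_SIZE, VOCAB))  # add-one smoothing

    def bucket(self, prev_tok: int) -> int:
        return (prev_tok * 2654435761) % TABLE_SIZE  # Knuth-style hash

    def log_probs(self, prev_tok: int) -> np.ndarray:
        row = self.counts[self.bucket(prev_tok)]
        return np.log(row / row.sum())

    def update(self, prev_tok: int, tok: int) -> None:
        self.counts[self.bucket(prev_tok), tok] += 1.0

bg = HashedBigram()
lp = bg.log_probs(prev_tok=5)   # predict first...
bg.update(prev_tok=5, tok=7)    # ...then record the observed token
```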
Architecture: XSA. Applied across all layers.
parameters: {"layers":11,"window_size":8}
Architecture: Partial RoPE. Rotary positional embeddings applied to 16 of the 64 head dimensions.
parameters: {"dimensions":"16/64"}
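A sketch of partial RoPE under the "16/64" reading: only the first 16 of 64 head dimensions are rotated, the rest pass through unchanged (the base of 10000 and the rotate-half layout are assumptions):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16) -> np.ndarray:
    """Apply rotary embeddings to the first `rot_dims` of each head dim.

    x: (seq, head_dim) with head_dim=64; the remaining 48 dims pass
    through unrotated -- the "16/64" in the parameters above.
    """
    seq, head_dim = x.shape
    half = rot_dims // 2
    pos = np.arange(seq)[:, None]                       # (seq, 1)
    inv_freq = 1.0 / (10000.0 ** (np.arange(half) / half))
    ang = pos * inv_freq                                # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)

x = np.random.randn(10, 64)
y = partial_rope(x)
```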
Architecture: MLP3x. Three-layer MLP with squared LeakyReLU activation (negative slope 0.5).
parameters: {"activation":"LeakyReLU(0.5)^2"}
Architecture: VE128. Enabled in later layers.
parameters: {"layers":[9,10]}
Weight Averaging: EMA
parameters: {"decay":0.997}
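With decay 0.997, the EMA update amounts to, per parameter tensor (a minimal dict-of-tensors sketch; names are hypothetical):

```python
def ema_update(avg: dict, params: dict, decay: float = 0.997) -> dict:
    """In-place EMA of weights: avg <- decay * avg + (1 - decay) * params."""
    for name in params:
        avg[name] = decay * avg[name] + (1.0 - decay) * params[name]
    return avg

# Call once per optimizer step; evaluate with `avg` instead of `params`.
avg = {"w": 0.0}
ema_update(avg, {"w": 1.0})
```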
Compression: zstd
parameters: {"level":22}
Evaluation: sliding window eval
parameters: {"stride":32,"seq_len":2048}
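With seq_len 2048 and stride 32, each window scores only its last 32 tokens, giving every scored token up to 2016 tokens of left context, and each token is scored exactly once. The index bookkeeping can be sketched as (function name is hypothetical):

```python
def sliding_windows(n_tokens: int, seq_len: int = 2048, stride: int = 32):
    """Return (start, end, score_from) triples covering all tokens once.

    Tokens [score_from, end) are scored in each window; tokens
    [start, score_from) serve only as context.
    """
    windows = []
    scored = 0
    while scored < n_tokens:
        end = min(scored + stride, n_tokens)
        start = max(0, end - seq_len)
        windows.append((start, end, scored))
        scored = end
    return windows

ws = sliding_windows(100, seq_len=16, stride=4)
```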
Test-Time Training: score-first TTT
parameters: {"learning_rate":0.0001,"chunk_tokens":131072,"epochs":3,"polyak_decay":0.998,"frozen_blocks":9}
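"Score-first" means each chunk contributes to the reported loss before the model trains on it, so evaluation never sees weights already adapted to that chunk. A toy sketch of the loop, with a scalar model standing in for the transformer and Polyak averaging of the adapted weight (all names and the toy lr/decay are hypothetical; the real run uses lr=1e-4, decay=0.998, 131072-token chunks, and freezes the first 9 blocks):

```python
class ToyTTT:
    """Toy score-first TTT: the 'model' is one scalar predicting chunk means."""
    def __init__(self, w=0.0, lr=0.5, polyak_decay=0.5):
        self.w = w          # fast (trained) weight
        self.w_avg = w      # Polyak-averaged weight used for scoring
        self.lr = lr
        self.decay = polyak_decay

    def score(self, chunk):
        """Mean squared error under the averaged weight."""
        return sum((x - self.w_avg) ** 2 for x in chunk) / len(chunk)

    def adapt(self, chunk, epochs=3):
        """SGD on the chunk, tracking a Polyak average of the weight."""
        for _ in range(epochs):
            grad = sum(2.0 * (self.w - x) for x in chunk) / len(chunk)
            self.w -= self.lr * grad
            self.w_avg = self.decay * self.w_avg + (1 - self.decay) * self.w

def run_ttt(chunks, model):
    losses = []
    for chunk in chunks:
        losses.append(model.score(chunk))  # score FIRST...
        model.adapt(chunk)                 # ...then train on the same chunk
    return losses

losses = run_ttt([[1.0, 1.0]] * 3, ToyTTT())
```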
Sequence Length: train_length 131072, eval_length 2048
LR Schedule: cosine decay
parameters: {"adaptive_lr_max_mult":3}
Regularization: layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
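The stated formula damps deeper layers progressively; over the 11 layers it evaluates to:

```python
import math

# Per-layer LN scale 1/sqrt(layer+1): layer 0 -> 1.0, layer 3 -> 0.5, ...
ln_scales = [1.0 / math.sqrt(layer + 1) for layer in range(11)]
```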
Other: 5-expert Hedge/multiplicative-weights logistic context mixer blending neural, unigram, bigram, trigram, and entropy experts in log-probability space.
parameters: {"eta":0.1}
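A minimal sketch of one Hedge step with eta=0.1, using two toy experts over a three-token vocabulary; the real mixer runs five experts, GPU-vectorized across positions, and the exact mixing rule (weighted probability average vs. logistic mixing) is an assumption here:

```python
import numpy as np

def hedge_mix(expert_logps, weights, target, eta=0.1):
    """One Hedge/multiplicative-weights step over expert log-probs.

    expert_logps: (n_experts, vocab) per-expert log-probabilities for
    the next token; weights: (n_experts,) current mixture weights
    summing to 1. Returns mixed log-probs and updated weights, using
    each expert's log-loss on the realised `target` token.
    """
    # Mixture prediction: weighted average of expert probabilities.
    mixed = np.log(weights @ np.exp(expert_logps))
    # Hedge update: down-weight each expert by exp(-eta * its log-loss).
    losses = -expert_logps[:, target]
    weights = weights * np.exp(-eta * losses)
    weights /= weights.sum()
    return mixed, weights

# Two toy experts: one confident in token 0, one uniform.
e1 = np.log(np.array([0.8, 0.1, 0.1]))
e2 = np.log(np.full(3, 1.0 / 3.0))
w = np.full(2, 0.5)
mixed, w = hedge_mix(np.stack([e1, e2]), w, target=0)
```

After observing token 0, the confident expert's weight rises above the uniform expert's, which is exactly the regret-minimizing behaviour Hedge provides.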

Novel Contributions

  • 5-expert Hedge-based logistic context mixer
  • Online GPU-vectorized context mixing in log-probability space
  • Incremental n-gram tables built only from already-scored tokens
  • Score-first test-time training pipeline
  • GPTQ-calibrated model with int5 quantization and zstd compression