val_bpb: 1.1207
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.97 MB
Training Techniques
Architecture
- weight tying: Tied input and output embeddings. parameters: null
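Weight tying reuses a single matrix as both the input embedding and the output projection, so the unembedding layer adds no parameters to the artifact. A minimal pure-Python sketch (class and method names are illustrative, not from the submission):

```python
class TiedEmbedding:
    """One shared table serves as input embedding and output head."""

    def __init__(self, vocab_size, dim):
        # rows are token embeddings; values here are dummy initializers
        self.table = [[0.01 * (t + d) for d in range(dim)]
                      for t in range(vocab_size)]

    def embed(self, token_id):
        # input side: look up the row for the token
        return self.table[token_id]

    def logits(self, hidden):
        # output side: project with the transpose of the same table,
        # so logits[t] = <hidden, embedding of token t>
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.table]
```

Because the table is shared, any storage or quantization savings on the embedding apply to the output head for free.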
- BigramHash: Adds a complementary bigram transition-statistics channel. parameters: {"buckets":4096,"dimensions":64}
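One plausible reading of BigramHash, given the listed parameters, is a hashed lookup: each (previous, current) token pair is hashed into one of 4096 buckets, each holding a learned 64-dimensional vector that is fed to the model alongside the usual embeddings. The hashing constant and feature wiring below are assumptions, not the submission's code:

```python
BUCKETS, DIM = 4096, 64  # from the listed parameters

def bigram_bucket(prev_id, cur_id, buckets=BUCKETS):
    # hash the (previous, current) token pair into one of `buckets` slots;
    # the multiplicative constant is an arbitrary illustrative choice
    return (prev_id * 1000003 + cur_id) % buckets

def bigram_features(token_ids, table):
    # table: BUCKETS x DIM learned vectors; returns one DIM-vector per
    # position after the first, giving a parallel bigram-statistics channel
    return [table[bigram_bucket(token_ids[i - 1], token_ids[i])]
            for i in range(1, len(token_ids))]
```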
- GQA: Uses grouped-query attention with fewer key/value heads than query heads. parameters: {"heads":8,"kv_heads":4}
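With 8 query heads and 4 KV heads, grouped-query attention maps each consecutive pair of query heads to one shared KV head, halving the KV projection parameters and cache. A sketch of the head mapping (the grouping order is the standard convention, assumed here):

```python
N_HEADS, N_KV_HEADS = 8, 4  # from the listed parameters

def kv_head_for(q_head, n_heads=N_HEADS, n_kv_heads=N_KV_HEADS):
    # consecutive query heads share one KV head: with 8 query heads and
    # 4 KV heads, each KV head serves a group of 2 query heads
    assert n_heads % n_kv_heads == 0
    group_size = n_heads // n_kv_heads
    return q_head // group_size
```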
- LeakyReLU: Uses squared LeakyReLU as the MLP activation. parameters: {"slope":0.5}
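One plausible reading of "LeakyReLU squared" is squaring the LeakyReLU output, in the spirit of the squared-ReLU activation; whether the submission re-applies the sign on the negative branch is unknown, so this is an assumption:

```python
SLOPE = 0.5  # from the listed parameters

def leaky_relu_squared(x, slope=SLOPE):
    # square the LeakyReLU output; note squaring folds the negative
    # branch back to positive values (some variants re-apply the sign)
    y = x if x >= 0.0 else slope * x
    return y * y
```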
- Partial RoPE: Applies rotary position embeddings to a subset of head dimensions. parameters: {"dims":"16/64"}
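"16/64" reads as rotating only the first 16 of 64 head dimensions and passing the rest through unchanged. A sketch, assuming the conventional pairwise rotation with the usual base of 10000 (the base is an assumption):

```python
import math

ROT_DIMS, HEAD_DIM = 16, 64  # "16/64": rotate 16 of 64 head dimensions

def partial_rope(vec, pos, rot_dims=ROT_DIMS, base=10000.0):
    # rotate consecutive pairs in the first `rot_dims` dimensions by a
    # position-dependent angle; remaining dimensions are untouched
    out = list(vec)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```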
- SmearGate: Uses a position-mixing gate. parameters: null
- U-Net skip connections: Adds encoder-decoder-style skip connections between early and late layers. parameters: null
Regularization
- LN scale: parameters: {"scale":"1/sqrt(layer+1)"}
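The listed scale suggests down-weighting each layer's contribution by a depth-dependent factor; exactly where the factor is applied (LayerNorm gain initialization versus the residual branch) is not stated, so that is an assumption. The factor itself is just:

```python
import math

def ln_scale(layer_idx):
    # deeper layers get a smaller scale: layer 0 -> 1.0, layer 3 -> 0.5,
    # damping later-layer contributions as 1/sqrt(layer+1)
    return 1.0 / math.sqrt(layer_idx + 1)
```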
- magnitude pruning: parameters: {"values":"±1","selective":true}
Weight Averaging
- EMA + Tight SWA: parameters: {"ema_decay":0.997,"swa_interval":50}
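The listed parameters suggest two averaging passes: a per-step exponential moving average with decay 0.997, plus a uniform ("tight") stochastic weight average over checkpoints sampled every 50 steps. How the two averages are combined at the end is not stated; the sketch below just maintains both (names are illustrative):

```python
EMA_DECAY, SWA_INTERVAL = 0.997, 50  # from the listed parameters

def ema_update(avg, weights, decay=EMA_DECAY):
    # exponential moving average of the weights, updated every step
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]

class TightSWA:
    """Uniform average over checkpoints sampled every `interval` steps."""

    def __init__(self, interval=SWA_INTERVAL):
        self.interval, self.count, self.avg = interval, 0, None

    def maybe_collect(self, step, weights):
        if step % self.interval != 0:
            return
        self.count += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # incremental running mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.count
                        for a, w in zip(self.avg, weights)]
```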
Quantization
- GPTQ: bits: 6, scope: MLP/attention body
- STE QAT: bits: 6, scope: parameter banks
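Quantization-aware training with a straight-through estimator fake-quantizes weights in the forward pass while letting gradients bypass the rounding in the backward pass. The forward side can be sketched in plain Python; the clipping range and symmetric-uniform grid below are assumptions, and the STE itself (identity backward through the rounding) only exists inside an autograd framework, so it is noted in comments:

```python
BITS = 6  # from the listed parameters

def fake_quantize(x, bits=BITS, max_abs=1.0):
    # uniform symmetric quantization: clip to [-max_abs, max_abs], then
    # snap to one of 2^(bits-1)-1 levels per sign (31 levels at 6 bits);
    # under STE, backward treats this whole function as the identity so
    # gradients flow through the non-differentiable rounding
    levels = 2 ** (bits - 1) - 1
    clipped = max(-max_abs, min(max_abs, x))
    return round(clipped / max_abs * levels) / levels * max_abs
```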
Compression
- lzma: level: 9
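The serialized artifact is compressed with LZMA at the maximum standard preset; Python's stdlib exposes this directly (the submission's exact serialization and any custom filter chain are unknown):

```python
import lzma

def compress_artifact(blob: bytes) -> bytes:
    # preset 9 is the highest standard compression level; it trades
    # compression time for the smallest artifact size
    return lzma.compress(blob, preset=9)
```

`lzma.decompress` recovers the original bytes losslessly.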
Optimizer
- Parallel Muon: weight_decay: null, momentum: null, other_params: {"parameter_banking":true}
Evaluation
- sliding window eval: parameters: {"stride":64,"seq_len":2048}
- sliding window eval: parameters: {"stride":16}
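Sliding-window evaluation scores a long token stream by advancing a fixed-length context window by `stride` tokens at a time, scoring only the tokens not covered by an earlier window; a smaller stride (16 vs. 64) gives each scored token more context at higher compute cost. A sketch of the window bookkeeping (the exact edge handling is an assumption):

```python
def sliding_window_spans(n_tokens, seq_len=2048, stride=64):
    # each span is (context_start, context_end, score_from): the window
    # sees tokens [context_start, context_end) but only scores the tokens
    # in [score_from, context_end) that no earlier window scored, so every
    # token is scored exactly once with near-maximal left context
    spans, scored_to, start = [], 0, 0
    while scored_to < n_tokens:
        end = min(start + seq_len, n_tokens)
        spans.append((start, end, scored_to))
        scored_to = end
        start += stride
    return spans
```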
Test-Time Training
- score-first TTT: parameters: {"stride":64}
- score-first TTT: parameters: {"stride":16}
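"Score-first" test-time training reads as: for each evaluation chunk, score it with the current weights before taking a training step on it, so the model never scores text it has already adapted to. The control flow can be sketched generically (the callables stand in for the model's loss and optimizer step, which are not specified here):

```python
def score_first_ttt(chunks, score_fn, update_fn):
    # score each chunk BEFORE adapting on it: evaluation stays honest
    # because no chunk is scored after the model has trained on it
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss, n_tokens = score_fn(chunk)   # score with current weights
        total_loss += loss * n_tokens
        total_tokens += n_tokens
        update_fn(chunk)                   # then one training step on it
    return total_loss / total_tokens
```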
LR Schedule
- warmdown: parameters: {"warmdown_steps":3500}
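A warmdown schedule holds the learning rate constant and then decays it to zero over the final `warmdown_steps` steps; the linear decay shape below is an assumption (the listing gives only the step count):

```python
WARMDOWN_STEPS = 3500  # from the listed parameters

def lr_at(step, total_steps, base_lr, warmdown_steps=WARMDOWN_STEPS):
    # constant LR for most of training, then a linear "warmdown" to zero
    # over the final warmdown_steps steps
    flat_steps = total_steps - warmdown_steps
    if step < flat_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```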
Sequence Length
- sequence_length: train_length: 2048, eval_length: 2048
Novel Contributions
- Increasing the BPE vocabulary from 1,024 to 8,192 tokens as an entropy-optimized scaling variable.
- Using mutual-information spectrum analysis of FineWeb to guide vocabulary sizing.
- Rebalancing parameters from MLP capacity into a much larger embedding table while keeping the same overall Transformer shape.
- Showing that most per-layer techniques become neutral or negative at V=8192, with BigramHash as the main complementary exception.
- Demonstrating that quantization precision is the main binding constraint, with int7 improving BPB but exceeding the artifact budget.
- Analyzing vocabulary-size and sequence-length substitutability as a joint scaling effect.