PR #1760 (open)
Non-record: SP8192 + dim=464 + Pre-Quantization TTT + Brotli (1.1863 BPB)
by BrandtChristian
val_bpb: 1.1863
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.92 MB
Training Techniques
Architecture
BigramHash
Uses a bigram hash component in the model stack.
parameters: {"size":1536}
XSA
Applies XSA in the last 4 layers of the stack.
parameters: {"last_n_layers":4}
depth recurrence
Reapplies selected layers in a loop (depth recurrence with shared weights).
parameters: {"layers":[3,4,5],"loops":2}
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5,"mlp_multiplier":3}
parallel residuals
Introduces parallel residual connections starting from a later layer.
parameters: {"start_layer":7}
Quantization
QAT
bits: 6
scope: all layers
int8
bits: 8
scope: embeddings
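The PR quantizes all layers to INT6 via QAT and embeddings to INT8. A minimal sketch of the symmetric per-tensor fake-quantization used in a QAT forward pass (the exact scheme, e.g. per-channel scales, is not specified in the PR):

```python
def fake_quant(w, bits=6):
    # Symmetric fake quantization: round each weight to one of
    # 2^(bits-1) - 1 levels per sign in the forward pass. In QAT, a
    # straight-through estimator passes gradients through the rounding.
    levels = 2 ** (bits - 1) - 1          # 31 for INT6, 127 for INT8
    max_abs = max(abs(v) for v in w)
    if max_abs == 0.0:
        return list(w)
    scale = max_abs / levels
    return [round(v / scale) * scale for v in w]
```

Training against the quantized forward pass lets the weights adapt to the INT6 grid before the final export, which is what keeps the 15.92 MB artifact's BPB close to the float model's.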
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"row_normalize":true,"momentum_warmup_start":0.92,"momentum_warmup_steps":500}
Adam
weight_decay: 0.04
momentum: null
other_params: null
Compression
brotli
level: null
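The contributions list pairs Brotli with a byte-shuffle transform. A sketch of the shuffle, assuming a stride of 2 (e.g. for 16-bit weight words; the actual stride is not given). Grouping same-significance bytes together yields more uniform streams that entropy coders compress better:

```python
def byte_shuffle(data: bytes, stride: int = 2) -> bytes:
    # Regroup bytes by their position within each stride-byte word, so
    # e.g. all high bytes of 16-bit values become contiguous.
    assert len(data) % stride == 0
    return b"".join(data[i::stride] for i in range(stride))

def byte_unshuffle(data: bytes, stride: int = 2) -> bytes:
    # Inverse transform, applied when loading the artifact.
    chunk = len(data) // stride
    out = bytearray(len(data))
    for i in range(stride):
        out[i::stride] = data[i * chunk:(i + 1) * chunk]
    return bytes(out)

# The artifact would then be compressed roughly as
#   brotli.compress(byte_shuffle(weight_bytes))
# using the third-party `brotli` package (compression level unspecified).
```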
Test-Time Training
full TTT
parameters: {"epochs":7,"learning_rate":0.0005,"pre_quantization":true}
score-first TTT
parameters: {"epochs":3,"learning_rate":0.005}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
Novel Contributions
- Pre-quantization TTT on the full validation set before INT6 quantization
- Scaling-law exploration showing improved roundtrip BPB with additional pre-quantization TTT epochs
- Brotli plus byte-shuffle artifact compression
- SP8192-based architecture with BigramHash, XSA, depth recurrence, and parallel residuals
- INT6 QAT for all layers with INT8 embeddings
- EMA + SWA weight averaging