PR #849
openRecord: 11L Int5 + 6-Expert HedgeMixer + LeakyReLU(0.9)^2 + TTT (val_bpb=1.1105)
by dttdrv
val_bpb: 1.1105
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.95 MB
Training Techniques
Quantization
GPTQ
bits: 5
scope: all blocks
Architecture
MLP3x
3.5x MLP with LeakyReLU(0.9)^2 activation
parameters: {"hidden":1792}
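One plausible reading of the LeakyReLU(0.9)^2 activation (the exact composition — leak first, then square — and the 0.9 negative slope placement are assumptions, not confirmed by the record):

```python
def leaky_relu(x, slope=0.9):
    # LeakyReLU with negative slope 0.9
    return x if x >= 0.0 else slope * x

def act(x):
    # squared LeakyReLU: apply the leak, then square
    y = leaky_relu(x)
    return y * y
```

Squaring a leaky ReLU keeps the activation smooth near zero while still passing (attenuated) signal for negative inputs.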
tied embeddings
Input and output embeddings are tied
parameters: null
GatedAttention
Per-head learned scalar gate in attention
parameters: null
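A minimal sketch of the per-head scalar gate described above (the sigmoid parameterization is an assumption; the record only states that each head has a learned scalar gate):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def apply_head_gates(head_outputs, gate_logits):
    # head_outputs: one output vector per attention head
    # gate_logits: one learned scalar per head; sigmoid keeps gates in (0, 1)
    return [[sigmoid(g) * v for v in h]
            for h, g in zip(head_outputs, gate_logits)]
```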
ValueResidual
Per-block learned x0 injection / residual value path
parameters: null
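The x0 injection above can be sketched as a learned convex mix between a block's own value projection and the first block's (the per-block scalar `lam` and the exact mixing form are assumptions):

```python
def value_residual(v, v0, lam):
    # lam: learned per-block scalar; v0: value projection from the first block (x0 path)
    return [lam * a + (1.0 - lam) * b for a, b in zip(v, v0)]
```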
XSA
XSA applied on all 11 layers
parameters: {"layers":11}
SmearGate
Additional gating mechanism used in the model
parameters: null
BigramHash
Hashed bigram feature embedding
parameters: {"size":8192,"dim":128}
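A hashed bigram lookup with the stated table size might look like the following (the hash mix itself is purely illustrative — the record does not specify it):

```python
def bigram_bucket(prev_byte, cur_byte, table_size=8192):
    # mix the byte pair into one of table_size buckets
    # (2654435761 is Knuth's multiplicative-hash constant; the mix is an assumption)
    h = ((prev_byte << 8) | cur_byte) * 2654435761 & 0xFFFFFFFF
    return h % table_size

def bigram_feature(prev_byte, cur_byte, table):
    # table: table_size rows of dim-128 embeddings, added to the token embedding
    return table[bigram_bucket(prev_byte, cur_byte)]
```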
ValueEmbedding
Value embedding used on later layers
parameters: {"dim":128,"layers":[9,10]}
Partial RoPE
Rotary positional embeddings applied partially
parameters: {"dimensions":"16/64"}
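With "16/64" dimensions, only the first 16 of each 64-dim head are rotated and the rest pass through unchanged. A sketch (frequency base 10000 is an assumption):

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # rotate only the first rot_dims of the 64 head dims; pass the rest through
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```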
LN Scale
LayerNorm scaling modification
parameters: null
U-Net encoder-decoder
U-Net style encoder-decoder with learned skip weights
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"adamw_weight_decay":0.04,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"momentum_schedule":"0.92->0.99","warmup_steps":1500}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"warmup_steps":1500}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"type":"Tight SWA","scale_threshold":0.2,"every_n_steps":50}
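The EMA half of the weight averaging is a standard exponential moving average of parameters, sketched below with the record's decay of 0.997:

```python
def ema_update(avg, weights, decay=0.997):
    # exponential moving average of model parameters, decay 0.997 as in the record
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]
```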
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
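One way to realize stride-64 sliding-window evaluation at context 2048 (the convention that each window scores only its final `stride` tokens is an assumption, but it is the usual one, ensuring every token is scored exactly once with near-full context):

```python
def sliding_window_spans(n_tokens, context=2048, stride=64):
    # each window sees up to `context` tokens but scores only its last `stride`
    # (the first window scores everything it covers)
    spans = []
    pos = 0  # first not-yet-scored token
    while pos < n_tokens:
        end = min(pos + (context if pos == 0 else stride), n_tokens)
        start = max(0, end - context)
        spans.append((start, end, pos))  # score tokens [pos, end)
        pos = end
    return spans
```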
Test-Time Training
score-first TTT
parameters: {"epochs":4,"optimizer":"AdamW","learning_rate":0.0005,"freeze_blocks":2,"byte_weighted":true,"polyak_averaging":0.998,"adaptive_cosine_lr":true}
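The adaptation loop of score-first TTT can be sketched as follows; `grad_fn` stands in for the byte-weighted loss gradient, and the plain-SGD update is a simplification (the record uses AdamW with an adaptive cosine LR):

```python
def ttt_adapt(params, grad_fn, lr=5e-4, epochs=4, polyak=0.998):
    # score-first: the test stream is scored with frozen weights before this
    # adaptation runs; only the weight update + Polyak averaging is shown
    avg = list(params)
    for _ in range(epochs):
        g = grad_fn(params)
        params = [p - lr * gi for p, gi in zip(params, g)]
        # Polyak-averaged copy of the adapted weights (decay 0.998 per the record)
        avg = [polyak * a + (1.0 - polyak) * p for a, p in zip(avg, params)]
    return params, avg
```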
Initialization
OrthoInit
Used with muP initialization scheme
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmup_steps":1500,"warmdown_iters":3500}
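The warmup/warmdown parameters suggest a trapezoidal schedule; a sketch (`total_steps` and the linear ramps are assumptions — only the 1500/3500 step counts come from the record):

```python
def lr_at(step, peak_lr, total_steps, warmup_steps=1500, warmdown_iters=3500):
    # trapezoid: linear warmup, constant plateau, linear warmdown to zero
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step > total_steps - warmdown_iters:
        return peak_lr * (total_steps - step) / warmdown_iters
    return peak_lr
```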
Regularization
weight decay
parameters: {"value":0.04}
CROWN-Q
parameters: null
Late QAT soft-round STE
parameters: {"quantization_bits":5,"scale_threshold":0.5}
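The forward path of QAT fake-quantization to a signed 5-bit grid looks roughly like this (a hard round is shown for simplicity — the record's "soft-round" relaxes it during training, and gradients bypass the rounding via the straight-through estimator; the scale value is illustrative):

```python
def fake_quant(w, bits=5, scale=0.1):
    # round to the signed 5-bit grid and clip; during training, gradients
    # pass straight through the rounding (STE)
    q = round(w / scale)
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # -16 .. 15
    q = max(lo, min(hi, q))
    return q * scale
```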
Other
other
6-expert Hedge context mixer combining neural, unigram, bigram, trigram, 4-gram, and entropy experts
parameters: {"experts":6}
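The 6-expert mixer presumably follows the classic Hedge (multiplicative-weights) scheme: mix the experts' next-byte distributions by weight, then upweight experts that assigned the observed byte high probability. A sketch (the update rule and `eta` are assumptions):

```python
import math

def hedge_mix(expert_probs, weights):
    # combine per-expert next-byte distributions with normalized weights
    total = sum(weights)
    k = len(expert_probs[0])
    return [sum((w / total) * p[i] for w, p in zip(weights, expert_probs))
            for i in range(k)]

def hedge_update(weights, expert_probs, observed, eta=1.0):
    # multiplicative-weights update: experts that gave the observed byte
    # higher probability gain weight; eta is an assumed learning rate
    return [w * max(p[observed], 1e-9) ** eta
            for w, p in zip(weights, expert_probs)]
```

Here the six experts would be the neural model plus unigram, bigram, trigram, 4-gram, and entropy predictors named above.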
Novel Contributions
- 6-expert HedgeMixer context mixer
- LeakyReLU(0.9)^2 activation
- GatedAttention with per-head learned scalar gates
- ValueResidual and XSA across all 11 layers
- Partial RoPE with 16/64 dimensions
- BigramHash and ValueEmbedding features
- Late QAT soft-round with CROWN-Q regularization
- GPTQ int5 quantization with 3% pruning and zstd-22 compression
- Score-first TTT with byte-weighted loss and Polyak averaging