PR #1057
open11L MLP2x + LeakyReLU² + Legal TTT (val_bpb=1.2201, 3-seed mean, std=0.0015)
by Programmerryoki
val_bpb: 1.2201
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.0 MB
Training Techniques
Architecture
LeakyReLU
The 2x-expansion MLP block uses a squared LeakyReLU(0.5) activation.
parameters: {"mlp_mult":2,"negative_slope":0.5,"squared":true}
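A minimal sketch of the activation, assuming "LeakyReLU(0.5) squared" means squaring the LeakyReLU output (so negatives keep gradient signal before squaring); the function name and scalar formulation are illustrative, not the PR's actual code:

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Squared LeakyReLU: apply LeakyReLU, then square.

    Output is always non-negative; negative inputs are scaled by the
    slope first, so they still carry gradient (unlike plain ReLU^2).
    In the PR this would sit inside the 2x MLP (d_model -> 2*d_model -> d_model).
    """
    y = x if x > 0 else negative_slope * x
    return y * y
```

Compared with the ReLU² activation used elsewhere, the leaky slope keeps the pre-square branch nonzero for negative inputs.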
BigramHash
Bigram hash embedding with 4096 buckets.
parameters: {"buckets":4096}
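A hedged sketch of the bucket lookup, assuming the bigram hash maps each (previous token, current token) pair to one of 4096 embedding rows that augment the token embedding; the mixing constants and function name are illustrative, not the PR's actual hash:

```python
def bigram_bucket(prev_token: int, token: int, num_buckets: int = 4096) -> int:
    """Hash a (prev, current) token pair into a fixed number of buckets.

    The resulting index selects one of `num_buckets` learned embedding
    rows (a cheap stand-in for a full V*V bigram table).
    """
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF  # illustrative mixing
    h ^= h >> 16
    return h % num_buckets
```

Collisions are expected and acceptable: 4096 buckets is far smaller than the number of distinct bigrams, and the table adds only num_buckets × d_model parameters.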
SmearGate
SmearGate enabled in the architecture.
parameters: null
U-Net skip connections
U-Net-style skip connections enabled.
parameters: null
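A minimal sketch of U-Net-style pairing in a block stack, assuming activations from the first half of the layers are merged (here by plain addition) into the mirrored layers of the second half; the merge rule and function name are assumptions, not the PR's implementation:

```python
def unet_forward(x, blocks):
    """Run `blocks` with U-Net pairing: the output of block i in the
    first half is skip-connected into block (n-1-i) in the second half.
    `x` is a list of floats standing in for an activation tensor."""
    n = len(blocks)
    saved = []
    for i, block in enumerate(blocks):
        if i >= n // 2 and saved:
            skip = saved.pop()                      # matching early activation
            x = [a + b for a, b in zip(x, skip)]    # assumed merge: plain add
        x = block(x)
        if i < n // 2:
            saved.append(x)
    return x
```

With the PR's 11 layers, the first 5 activations would be saved and consumed by layers 6-10, leaving the middle layer unpaired.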
XSA
XSA applied in the last 4 layers.
parameters: {"layers":4}
weight tying
Input and output embeddings are tied.
parameters: null
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
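The card only gives the formula; a direct reading, with where the scale is applied being an assumption:

```python
import math

def ln_scale(layer_index: int) -> float:
    """Per-layer LayerNorm scale 1/sqrt(layer+1): layer 0 -> 1.0,
    layer 3 -> 0.5, damping the contribution of deeper layers."""
    return 1.0 / math.sqrt(layer_index + 1)
```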
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Weight Averaging
EMA
parameters: {"decay":0.997}
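A minimal EMA step over a parameter dict, assuming the standard formulation avg ← decay·avg + (1−decay)·current with the PR's decay of 0.997; the dict representation is illustrative:

```python
def ema_update(avg: dict, current: dict, decay: float = 0.997) -> dict:
    """One exponential-moving-average step over named parameters.

    The averaged weights, not the live training weights, are what get
    exported for evaluation.
    """
    return {k: decay * avg[k] + (1.0 - decay) * current[k] for k in avg}
```

With decay 0.997 the average has an effective horizon of roughly 1/(1−0.997) ≈ 333 steps.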
Quantization
STE QAT
bits: 6
scope: all
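A sketch of the forward half of 6-bit STE QAT, assuming symmetric per-tensor fake quantization; the clipping range and scalar form are illustrative. The straight-through part lives in the backward pass (gradients skip the rounding), which this forward-only sketch can only note in a comment:

```python
def fake_quantize(x: float, bits: int = 6, max_abs: float = 1.0) -> float:
    """Symmetric fake quantization to a (2^(bits-1) - 1)-level grid.

    During QAT the forward pass uses these snapped values, while the
    backward pass treats round() as identity (the straight-through
    estimator), so the weights learn to live on the 6-bit grid.
    """
    levels = (1 << (bits - 1)) - 1          # 31 positive levels for 6 bits
    step = max_abs / levels
    clipped = max(-max_abs, min(max_abs, x))
    return round(clipped / step) * step
```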
GPTQ-lite
bits: 6
scope: all
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
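A sketch of how stride-64 sliding-window evaluation could enumerate its spans, assuming each window re-reads up to 2048 tokens of context but only the final `stride` tokens of each non-initial window are newly scored; the tuple layout and function name are assumptions:

```python
def sliding_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Yield (start, end, score_from): tokens in [score_from, end) are
    scored by this window; tokens before score_from are context only."""
    start = 0
    while True:
        end = min(start + window, n_tokens)
        score_from = 0 if start == 0 else end - stride
        yield start, end, score_from
        if end == n_tokens:
            break
        start += stride
```

The small stride buys each scored token close to a full window of left context, at the cost of roughly window/stride forward passes per token position.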
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.002,"momentum":0.9,"epochs_per_chunk":7,"chunk_size":32768,"all_blocks_unfrozen":true}
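A toy scalar sketch of the score-first ordering that makes TTT "legal": each chunk is scored with the current weights before any gradient steps are taken on it, so no chunk's score ever depends on its own data. The one-parameter model and `loss_and_grad` callback are illustrative; only the SGD(momentum=0.9), lr=0.002, 7-epochs-per-chunk recipe comes from the card:

```python
def score_first_ttt(chunks, loss_and_grad, w, lr=0.002, mu=0.9, epochs=7):
    """Score-first test-time training over a stream of chunks.

    For each chunk: record its loss under the *current* weights, then
    run `epochs` SGD-with-momentum passes on that chunk before moving on.
    """
    scores, v = [], 0.0
    for chunk in chunks:
        loss, _ = loss_and_grad(w, chunk)
        scores.append(loss)                 # score BEFORE adapting
        for _ in range(epochs):             # then adapt on the same chunk
            _, g = loss_and_grad(w, chunk)
            v = mu * v + g
            w = w - lr * v
    return scores, w
```

In the PR this loop would run over 32768-token chunks with all blocks unfrozen; later chunks benefit from adaptation to earlier ones without any chunk seeing its own data before scoring.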
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"adamw":true,"lr":0.025}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
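A sketch of the schedule, assuming "warmdown" means a constant LR followed by a linear decay to zero over the final 3500 steps (the common modded-nanogpt convention); the base LR of 0.025 is taken from the optimizer section above, and the function name is illustrative:

```python
def warmdown_lr(step: int, total_steps: int,
                base_lr: float = 0.025, warmdown_steps: int = 3500) -> float:
    """Constant LR, then linear decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```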
Novel Contributions
- LeakyReLU(0.5) squared MLP activation
- Legal score-first TTT with 7 epochs per chunk
- Combination of BigramHash, SmearGate, U-Net skips, and XSA in a compact 11-layer model
- Int6 QAT plus GPTQ-lite compression to fit under the 16 MB artifact limit