PR #317 (open)

Record: 11L XSA4 + EMA + TTT + Int6 MLP3x (val_bpb=1.1442)

by chris-buckley
val_bpb: 1.1442
Architecture: Transformer
Optimizer: Muon/AdamW
Artifact Size: under 16 MB

Training Techniques

Architecture
XSA
XSA applied to the last 4 layers
parameters: {"layers":4}
MLP3x
3x MLP width
parameters: null
SmearGate
Uses SmearGate in the model stack
parameters: null
BigramHash
Uses BigramHash auxiliary component with vocabulary size 2048
parameters: {"vocab_size":2048}
KV head count
Uses 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
Quantization
int6 mixed
bits: 6
scope: all
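The record lists int6 mixed quantization over all weights but does not spell out the scheme. A minimal sketch, assuming symmetric per-tensor quantization (scale chosen so the largest weight maps to the int6 extreme; the actual mixed-precision grouping is not stated):

```python
def quantize_int6(weights):
    # Symmetric per-tensor int6: map weights into the signed 6-bit
    # range [-32, 31] using a single scale (illustrative assumption;
    # the PR's exact int6-mixed scheme is not specified in the record).
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 31.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    # Recover approximate float weights from int6 codes.
    return [v * scale for v in q]

w = [0.5, -0.25, 0.031, -1.0]
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)
```

TTT later runs on the dequantized checkpoint, i.e. on `w_hat`, not on the int6 codes.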
Weight Averaging
EMA
parameters: null
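EMA keeps a shadow copy of the weights updated as an exponential moving average after each optimizer step. A minimal sketch (the decay value below is an assumption; the record gives no EMA parameters):

```python
def ema_update(ema, params, decay=0.999):
    # Exponential moving average of parameters, updated in place.
    # decay=0.999 is an assumed default; the record does not state it.
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]

# Usage: shadow weights start as a copy of the model weights.
params = {"w": 1.0}
ema = dict(params)
for step in range(3):
    params["w"] += 1.0                      # stand-in for an optimizer step
    ema_update(ema, params, decay=0.5)      # large decay gap for illustration
```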
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"adamw_used":true}
AdamW
weight_decay: 0.04
momentum: null
other_params: null
Initialization
OrthoInit
Orthogonal initialization with muP-style output scaling
Evaluation
stride-based sliding window eval
parameters: {"stride":64}
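In stride-based sliding-window evaluation, the model is run on overlapping windows but loss is taken only on the last `stride` tokens of each window (the first window scores everything), so every scored token sees a long left context. A sketch of the window planning, assuming an eval context equal to the train length of 2048 (the record leaves eval_length null):

```python
def sliding_window_spans(n_tokens, context=2048, stride=64):
    # Return (window_start, window_end, score_start) triples: each
    # window covers up to `context` tokens, and only positions
    # [score_start, window_end) contribute to the loss, so every
    # token is scored exactly once with maximal left context.
    end = min(context, n_tokens)
    spans = [(0, end, 0)]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - context), new_end, end))
        end = new_end
    return spans

# Small numbers for illustration: 12 tokens, context 8, stride 2.
spans = sliding_window_spans(12, context=8, stride=2)
```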
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"momentum":0.9,"freeze_blocks":2}
Compression
zstd
level: 22
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
fixed learning rates
parameters: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}

Novel Contributions

  • Adds full-model SGD test-time training on the dequantized checkpoint
  • Uses EMA instead of SWA in the winning public training stack
  • Applies XSA to the last 4 layers
  • Uses stride-64 evaluation
  • Tunes learning rates upward for matrix, scalar, and tied embedding parameters
  • Includes compatibility fallbacks: FlashAttention-3 (FA3) to SDPA, and a manual KV-head repeat for GQA
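The GQA fallback in the last bullet can be sketched as a plain repeat: with 8 query heads and 4 KV heads (per the record), each KV head is duplicated so an attention kernel that expects one KV head per query head still works. Minimal illustration with labels standing in for per-head tensors:

```python
def repeat_kv_heads(kv, n_heads=8, n_kv_heads=4):
    # Manual GQA fallback: repeat each KV head n_rep times so the
    # K/V layout matches the 8 query heads when the kernel has no
    # native grouped-query support. `kv` holds one entry per KV head
    # (labels here; real code would repeat along the head dimension).
    assert n_heads % n_kv_heads == 0 and len(kv) == n_kv_heads
    n_rep = n_heads // n_kv_heads
    return [head for head in kv for _ in range(n_rep)]

kv_heads = ["kv0", "kv1", "kv2", "kv3"]
expanded = repeat_kv_heads(kv_heads)
# -> ["kv0", "kv0", "kv1", "kv1", "kv2", "kv2", "kv3", "kv3"]
```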