PR #267 (open)

Record: val_bpb 1.14020 [tested 3x on 8xH100]

by andrewgcodes
val_bpb: 1.1374
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,516,237 bytes

Training Techniques

Quantization
  • int5 (bits: 5; scope: all weights)
  • fp16 (bits: 16; scope: tied embeddings and last-layer key projections)
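A minimal sketch of 5-bit quantization, assuming a symmetric per-tensor scheme with levels in [-15, 15]; the PR does not state whether scales are per-tensor or per-channel.

```python
import numpy as np

def quantize_int5(w):
    """Symmetric 5-bit quantization: map floats to integers in [-15, 15]
    with a single per-tensor scale (scheme is an assumption)."""
    scale = max(float(np.abs(w).max()) / 15.0, 1e-12)
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize_int5(q, scale):
    """Recover approximate float weights from int5 codes."""
    return q.astype(np.float32) * scale
```

The round-trip error is bounded by half the scale, which is what makes 5 bits workable for most weight tensors while the sensitive ones stay in fp16.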
Architecture
  • XSA: exclusive self-attention applied to the last 3 layers by subtracting the self-value projection from the attention output; parameters: {"layers":3}
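One plausible reading of XSA, sketched under the assumption that "subtracting the self-value projection" means removing each token's own value contribution (the diagonal attention term) from the softmax output; the PR does not spell out the exact formulation.

```python
import numpy as np

def exclusive_self_attention(q, k, v):
    """Causal attention whose output drops each token's own value:
    out_i = sum_j a_ij v_j - a_ii v_i  (interpretation, not confirmed)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask
    a = np.exp(scores - scores.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    return a @ v - np.diag(a)[:, None] * v  # remove self-value contribution
```

Under this reading the first token, which can only attend to itself, produces a zero output, so each position is forced to aggregate information from its context rather than its own value.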
  • SmearGate: uses SmearGate in the architecture
  • tied embeddings: input and output embeddings are tied
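Weight tying in a nutshell: one matrix serves as both the input embedding table and the output projection, halving the embedding parameter count (which matters under an artifact-size limit). Shapes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 256, 64                      # illustrative sizes
E = rng.standard_normal((vocab, d_model)) * 0.02  # single shared matrix

def embed(token_ids):
    return E[token_ids]        # input lookup uses E

def lm_logits(hidden):
    return hidden @ E.T        # output head reuses the same E (tied)
```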
  • KV head count: grouped-query attention with 4 KV heads; parameters: {"kv_heads":4,"heads":8}
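With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads, shrinking the KV projection weights. A minimal sketch of the head expansion (tensor sizes are illustrative):

```python
import numpy as np

heads, kv_heads, d_head, T = 8, 4, 16, 10   # T, d_head are illustrative
group = heads // kv_heads                   # 2 query heads per KV head

rng = np.random.default_rng(0)
k = rng.standard_normal((kv_heads, T, d_head))
# Expand 4 KV heads to serve 8 query heads by repeating each one `group` times
k_full = np.repeat(k, group, axis=0)        # shape (heads, T, d_head)
```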
  • MLP3x: MLP uses 3x expansion; parameters: {"hidden_size":1536}
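A sketch of the 3x-expansion MLP. The hidden size of 1536 implies a model width of 512 (an inference, not stated in the PR), and the ReLU² activation is an assumption borrowed from nanoGPT-style speedruns.

```python
import numpy as np

d_model = 512                 # assumed: 1536 / 3
hidden = 3 * d_model          # 1536, matching the recorded parameter
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, hidden)) * d_model ** -0.5
W2 = rng.standard_normal((hidden, d_model)) * hidden ** -0.5

def mlp(x):
    h = np.maximum(x @ W1, 0.0) ** 2   # ReLU^2; actual activation unspecified
    return h @ W2
```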
Optimizer
  • Muon (weight_decay: 0.08; momentum: 0.99; matrix_lr: 0.02)
  • AdamW (used for embeddings/scalars)
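Muon applies momentum and then approximately orthogonalizes each 2-D weight's update via a Newton-Schulz iteration, while AdamW handles embeddings and scalars. A sketch using the quintic coefficients from the public Muon implementation; the PR's exact variant may differ.

```python
import numpy as np

def newton_schulz(g, steps=5):
    """Approximately orthogonalize g (quintic Newton-Schulz; coefficients
    from the public Muon implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    if g.shape[0] > g.shape[1]:
        return newton_schulz(g.T, steps).T   # work on the wide orientation
    x = g / (np.linalg.norm(g) + 1e-7)       # normalize spectral-ish scale
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x

def muon_step(w, g, buf, lr=0.02, momentum=0.99, weight_decay=0.08):
    """One Muon update for a 2-D weight matrix, with decoupled weight decay."""
    buf = momentum * buf + g
    w = w * (1.0 - lr * weight_decay) - lr * newton_schulz(buf)
    return w, buf
```

After a few iterations the update's singular values cluster near 1, which is the point: every direction of the gradient gets a similarly sized step.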
Weight Averaging
  • EMA (decay: 0.997)
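The EMA update itself is one line per parameter: after each optimizer step, the averaged copy moves 0.3% of the way toward the current weights, and evaluation uses the averaged copy.

```python
def ema_update(avg, params, decay=0.997):
    """Exponential moving average of weights, applied after each step."""
    return {name: decay * avg[name] + (1.0 - decay) * params[name]
            for name in avg}
```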
Evaluation
  • sliding window eval (stride: 64)
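Sliding-window evaluation scores only the last `stride` tokens of each window, so every token is scored with a long left context instead of a context that resets at chunk boundaries. A sketch of the window generator, assuming the 2048-token train length as the context size:

```python
def sliding_windows(n_tokens, context=2048, stride=64):
    """Yield (window_start, window_end, score_start): tokens in
    [score_start, window_end) are scored with up to `context` tokens
    of left context. stride=64 matches the recorded parameter."""
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        yield max(0, end - context), end, pos
        pos = end
```

The cost is roughly context/stride forward passes per token's worth of text, which is the usual trade for the lower bpb.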
Test-Time Training
  • score-first TTT (epochs_per_chunk: 12; chunks: 64; learning_rate: 0.004; momentum: 0.9)
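The control flow that keeps this causal: each chunk is scored with the current weights before any gradient step touches it, so the model only ever adapts on tokens that have already contributed to the reported loss. A sketch with `score` and `train` left abstract:

```python
def score_first_ttt(chunks, score, train, epochs_per_chunk=12):
    """Causal test-time training: evaluate each chunk first, then run
    `epochs_per_chunk` training passes on it before the next chunk."""
    total = 0.0
    for chunk in chunks:
        total += score(chunk)           # score BEFORE any update on chunk
        for _ in range(epochs_per_chunk):
            train(chunk)                # adapt only on already-scored tokens
    return total
```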
Initialization
  • OrthoInit: orthogonal initialization with scaled output projections
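A sketch of orthogonal init via QR; the output-projection gain of 1/sqrt(2·n_layers) is a common residual-scaling choice, not a rule the PR states.

```python
import numpy as np

def ortho_init(shape, gain=1.0, rng=None):
    """Orthogonal initialization: QR of a Gaussian matrix, sign-fixed so the
    decomposition is unique, then scaled by `gain`."""
    rng = np.random.default_rng(rng)
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))        # fix QR's sign ambiguity
    return gain * q

n_layers = 12                        # hypothetical depth for the example
W_out = ortho_init((512, 512), gain=(2 * n_layers) ** -0.5)
```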
Sequence Length
  • train_length: 2048 (eval_length not specified)
LR Schedule
  • warmdown (warmdown_iters: 3000; warmup_steps: 20)
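The schedule is trapezoidal: 20 steps of linear warmup, a flat plateau, then a linear "warmdown" to zero over the final 3000 iterations. A sketch (the total step count is an assumption, not given in the PR):

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=3000):
    """Multiplier on the base learning rate at `step` (0-indexed)."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps            # linear warmup
    if step >= total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)  # warmdown
    return 1.0                                      # flat plateau
```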
Regularization
  • weight decay (value: 0.08)
  • magnitude pruning (sparsity: 3%)
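Magnitude pruning at 3% zeros the smallest 3% of weights by absolute value; the zeros then compress very well under zstd. A minimal global (per-tensor) version; whether the PR prunes per-tensor or globally is not stated.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.03):
    """Zero the smallest-magnitude `sparsity` fraction of entries in w."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0   # ties may prune slightly more than k
    return out
```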
Compression
  • zstd (level not specified)
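Before zstd can act on the int5 weights, the 5-bit codes have to be packed into bytes. A sketch of little-endian 5-bit packing (the PR's actual serialization format is not described); the packed stream would then go through a zstd compressor such as the `zstandard` package.

```python
def pack_int5(values):
    """Pack signed 5-bit ints (range [-16, 15]) into bytes, LSB-first."""
    bits, nbits, out = 0, 0, bytearray()
    for v in values:
        bits |= (v & 0x1F) << nbits
        nbits += 5
        while nbits >= 8:
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:
        out.append(bits & 0xFF)      # flush remaining bits
    return bytes(out)

def unpack_int5(data, n):
    """Recover the first n signed 5-bit ints from a packed stream."""
    vals, bits, nbits = [], 0, 0
    for b in data:
        bits |= b << nbits
        nbits += 8
        while nbits >= 5 and len(vals) < n:
            u = bits & 0x1F
            vals.append(u - 32 if u >= 16 else u)  # sign-extend 5 bits
            bits >>= 5
            nbits -= 5
    return vals
```

Packing alone gives 5/16 of the fp16 size; zstd then exploits the pruning-induced zeros on top of that.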

Novel Contributions

  • Causal test-time training that evaluates each chunk first and trains only on already-scored tokens
  • Int5 quantization applied to all weight categories to fit the model under the artifact size limit
  • EMA weight averaging (decay 0.997) for an improved final model
  • Exclusive self-attention applied to the last 3 layers
  • Orthogonal initialization with scaled output projections
  • Sliding-window evaluation with stride 64
  • Post-quantization roundtrip using int5 + zstd