val_bpb: 1.0854
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,438,323 B
Training Techniques
Architecture
GQA
11-layer Transformer with 8 attention heads and 4 KV heads
parameters: {"layers":11,"dim":512,"heads":8,"kv_heads":4}
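Grouped-query attention shares each KV head across a group of query heads; with the reported 8 query heads and 4 KV heads, every KV head serves 2 query heads. A minimal sketch of the head-to-group mapping (the mapping convention, query heads grouped contiguously, is the common one but is an assumption here):

```python
# Grouped-query attention (GQA): map each query head to its shared KV head.
# Head counts follow the reported config: 8 query heads, 4 KV heads.
N_Q_HEADS = 8
N_KV_HEADS = 4
GROUP_SIZE = N_Q_HEADS // N_KV_HEADS  # 2 query heads share one KV head

def kv_head_for(q_head: int) -> int:
    """Index of the KV head that query head `q_head` attends with."""
    return q_head // GROUP_SIZE

# Query heads 0-1 share KV head 0, heads 2-3 share KV head 1, and so on.
mapping = {q: kv_head_for(q) for q in range(N_Q_HEADS)}
```

Halving the KV heads halves the KV-cache size at matched quality cost, which matters for a small-artifact setting like this one.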
weight tying
Tied input and output embeddings
parameters: null
RoPE
Rotary positional embeddings with 16 dimensions and base 10000
parameters: {"dimensions":16,"base":10000}
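With head dim 64 (512/8) and only 16 rotary dimensions, this is partial RoPE: the first 16 entries of each head are rotated and the rest pass through (applying it to the *first* dims is an assumption; the pairing below is the standard interleaved form). A sketch on a single head vector:

```python
import math

def rope_rotate(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` entries of
    vector `x` at sequence position `pos`; remaining dims pass through.
    Pair (x[2i], x[2i+1]) is rotated by angle pos * base**(-2i/rot_dims).
    rot_dims=16 and base=10000 follow the reported parameters."""
    out = list(x)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Since each step is a 2-D rotation, the transform preserves vector norm and encodes position purely in phase.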
LeakyReLU-squared
LeakyReLU-squared custom activation
parameters: null
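The source only names this activation. By analogy with the ReLU² activation used in fast-training Transformer recipes, one plausible form is a sign-preserving square of LeakyReLU; both that form and the 0.01 slope below are assumptions:

```python
def leaky_relu_squared(x: float, slope: float = 0.01) -> float:
    """Sign-preserving square of LeakyReLU: y = l * |l| with l = LeakyReLU(x).
    The exact functional form and the 0.01 slope are assumptions; the
    source names the activation without a formula."""
    l = x if x > 0 else slope * x
    return l * abs(l)
```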
XSA
XSA attention used on all 11 layers with Flash Attention 3
parameters: {"layers":11}
BigramHash
Bigram hash table side channel
parameters: {"dimensions":128,"vocab":2048}
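The side channel hashes each (previous token, current token) bigram into a 2048-row table of 128-dim vectors that can be added to the residual stream alongside the token embedding. The source specifies only the table size and width; the hash below is an illustrative multiplicative mix, not the actual scheme:

```python
def bigram_slot(prev_tok: int, tok: int, table_size: int = 2048) -> int:
    """Hash a (prev_tok, tok) bigram to a row of the side-channel table.
    The mixing constant is illustrative; the source specifies only the
    table size (2048) and embedding width (128)."""
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF  # simple multiplicative mix
    return h % table_size
```

The looked-up 128-dim row gives the model cheap bigram statistics without spending transformer capacity on them.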
VE128
Value embeddings on layers 9-10
parameters: {"layers":[9,10],"dimensions":128}
Regularization
logit softcap
parameters: {"value":30}
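Given the reported value 30, the conventional softcap form is cap * tanh(z / cap): near-identity for small logits, smoothly bounded to (-30, 30) for large ones. Assuming that standard form:

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    """Smoothly bound a logit to (-cap, cap): cap * tanh(logit / cap).
    Near-identity for |logit| << cap; cap=30 is the reported parameter."""
    return cap * math.tanh(logit / cap)
```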
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_iterations":7,"muoneq_r":true}
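Muon orthogonalizes each momentum matrix with Newton-Schulz iterations before applying it as an update; this run uses 7 iterations rather than the usual 5. The sketch below uses the classical cubic iteration X <- 1.5X - 0.5·XXᵀX on a Frobenius-normalized input; production Muon uses a tuned quintic polynomial, so this is illustrative of the principle only:

```python
def matmul(A, B):
    """Naive list-of-lists matrix product (illustration only)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz_orthogonalize(G, iterations=7):
    """Approximate the orthogonal polar factor of G via Newton-Schulz
    (cubic variant). Muon proper uses a tuned quintic polynomial and
    runs on GPU tensors; iterations=7 matches the reported setting."""
    # Normalize by the Frobenius norm so all singular values start in (0, 1].
    fro = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / fro for x in row] for row in G]
    for _ in range(iterations):
        XXt = matmul(X, transpose(X))
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, matmul(XXt, X))]
    return X
```

Each iteration pushes every singular value toward 1, so the returned matrix is close to orthogonal while keeping G's singular vectors.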
Weight Averaging
SWA
parameters: null
EMA
parameters: null
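Both averaging schemes are listed without parameters. A minimal sketch of the two update rules, on a flat weight vector (the 0.999 EMA decay is an assumed placeholder; SWA's running mean needs no hyperparameter):

```python
def ema_update(avg, weights, decay=0.999):
    """EMA step: avg <- decay * avg + (1 - decay) * weights.
    The 0.999 decay is an assumption; the source reports no value."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]

def swa_update(avg, weights, n_models):
    """SWA step: running mean over the n_models checkpoints averaged so far."""
    return [(a * n_models + w) / (n_models + 1) for a, w in zip(avg, weights)]
```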
Quantization
late QAT
bits: 6
scope: weights
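"Late" QAT enables fake quantization near the end of training: the forward pass sees weights snapped to the int6 grid while full-precision master weights keep receiving gradients. A sketch of the quantize-dequantize step; per-tensor absmax scaling is an assumption (the source states only bits=6 and scope=weights):

```python
def fake_quant_int6(w, levels=31):
    """Fake int6 quantization: snap each weight to a symmetric 6-bit grid
    (integers -31..31) and dequantize. Per-tensor absmax scaling is an
    assumption; the source specifies only bits=6 and weights-only scope."""
    scale = (max(abs(x) for x in w) / levels) or 1.0  # guard all-zero tensors
    q = [max(-levels, min(levels, round(x / scale))) for x in w]
    return [qi * scale for qi in q]
```

Training through the quantization grid lets the final artifact store 6-bit weights with little post-quantization loss spike.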
Compression
brotli
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
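With stride 64, consecutive windows overlap heavily and only the final 64 tokens of each window are newly scored, so every token gets long left context but is scored exactly once. A sketch of the window bookkeeping (the 512 context length is an assumption; the stride is reported):

```python
def sliding_windows(n_tokens, context=512, stride=64):
    """Yield (start, end, score_from) spans for sliding-window evaluation:
    each window covers tokens [start, end) and only positions
    [score_from, end) are newly scored. context=512 is an assumption;
    stride=64 is the reported parameter."""
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context)
        spans.append((start, end, pos))
        pos = end
    return spans
```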
Test-Time Training
score-first TTT
parameters: {"steps":32,"learning_rate":0.04}
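"Score-first" here plausibly means each window is scored with the current weights *before* the 32 adaptation steps on that window, so no token's score benefits from having been trained on. A toy sketch of that ordering; the one-parameter squared-error "model" is a stand-in, while steps=32 and lr=0.04 are the reported settings:

```python
def ttt_eval(windows, w0=0.0, steps=32, lr=0.04):
    """Score-first test-time training: score each window with the current
    weight, then adapt on it before the next window. The scalar
    squared-error model is a stand-in; steps=32 and lr=0.04 are reported."""
    w, scores = w0, []
    for target in windows:
        scores.append((w - target) ** 2)   # score BEFORE adapting
        for _ in range(steps):             # then adapt on this window
            w -= lr * 2.0 * (w - target)   # gradient of (w - target)^2
    return scores, w
```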
Novel Contributions
- MuonEq-R row-normalized momentum before Newton-Schulz orthogonalization
- Per-head QK gain initialized to 5.0
- More Newton-Schulz backend iterations (7 instead of 5)
- SLOT32 test-time adaptation with lr=0.04
- Sliding-window evaluation with SLOT adaptation
- Late QAT with fake int6 quantization
- Bigram hash side channel and value embeddings
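The source does not define MuonEq-R beyond "row-normalized momentum before Newton-Schulz orthogonalization". One plausible reading, sketched here, scales each row of the momentum matrix to unit L2 norm before the orthogonalization step, so no row dominates the polar factor:

```python
def row_normalize(M, eps=1e-8):
    """Scale each row of momentum matrix M to unit L2 norm. This is one
    plausible reading of 'row-normalized momentum' (MuonEq-R); the source
    gives no formula, and eps is an assumed numerical guard."""
    out = []
    for row in M:
        norm = sum(x * x for x in row) ** 0.5
        out.append([x / (norm + eps) for x in row])
    return out
```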