PR #274 (open)
[Record] Stride-32 + Warmdown/Muon Tuning on SOTA #1: mean val_bpb=1.1403
by haikosys
val_bpb: 1.1403
Architecture: Transformer
Optimizer: Muon
Artifact Size: under 16 MB
Training Techniques
Quantization
mixed int6/int8
bits: 6
scope: MLP, attention, tied embeddings
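The bits: 6 entry suggests symmetric 6-bit quantization for most tensors, with int8 presumably reserved for the more sensitive ones. A minimal sketch of per-tensor symmetric quantization, under that assumption (the actual scale granularity and the int6-vs-int8 selection rule are not specified in the record; all names here are illustrative):

```python
def quantize_symmetric(weights, bits=6):
    """Symmetric per-tensor quantization: map floats onto signed
    integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1] with one scale."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.8, -0.31, 0.02, -0.77]
q, scale = quantize_symmetric(weights, bits=6)
recovered = dequantize(q, scale)
```

The reconstruction error is bounded by half a quantization step (scale / 2) per weight.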
Architecture
SmearGate
Uses SmearGate in the base architecture.
parameters: null
BigramHash
Adds BigramHash embedding component.
parameters: {"size":10240,"dim":128}
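The parameters suggest a 10240-row, 128-dim table indexed by a hash of each (previous token, current token) pair, used alongside the regular token embedding. A sketch under that assumption (the actual hash function and how the looked-up vector is combined with the token embedding are not specified in the record):

```python
import random

SIZE, DIM = 10240, 128        # from parameters {"size":10240,"dim":128}

rng = random.Random(0)
bigram_table = [[rng.gauss(0, 0.02) for _ in range(DIM)]
                for _ in range(SIZE)]

def bigram_hash(prev_tok, tok, size=SIZE):
    # Simple multiplicative hash of the ordered pair; illustrative only,
    # the real hash function is not given in the record.
    return (prev_tok * 1000003 + tok) % size

def bigram_embedding(tokens):
    """One 128-dim vector per position, keyed by the hashed bigram
    (tokens[i-1], tokens[i]); position 0 pairs with a dummy token 0."""
    out = []
    for i, tok in enumerate(tokens):
        prev_tok = tokens[i - 1] if i > 0 else 0
        out.append(bigram_table[bigram_hash(prev_tok, tok)])
    return out

embs = bigram_embedding([5, 17, 17, 3])
```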
MLP3x
Uses a 3x expanded MLP hidden size.
parameters: {"hidden_size":1536}
tied embeddings
Uses FP16 tied embeddings.
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
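The record tunes Muon's momentum down to 0.95. Muon accumulates gradients in a momentum buffer and then orthogonalizes the buffered update (via a Newton-Schulz iteration) before applying it; the sketch below shows only the momentum accumulation, with the orthogonalization step omitted, and the Nesterov-style lookahead is an assumption:

```python
def muon_momentum_step(grad, buf, momentum=0.95, nesterov=True):
    """One momentum accumulation step on flat lists of floats. In Muon
    the resulting update would then be orthogonalized before use."""
    new_buf = [momentum * b + g for b, g in zip(buf, grad)]
    if nesterov:
        # Lookahead: blend the fresh gradient back into the buffer.
        update = [g + momentum * nb for g, nb in zip(grad, new_buf)]
    else:
        update = new_buf
    return update, new_buf

upd, buf = muon_momentum_step([1.0], [0.0], momentum=0.95)
```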
Weight Averaging
SWA
parameters: {"every":50,"start":"40%"}
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":32}
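A stride of 32 means each evaluation window scores only its final 32 tokens, with the rest of the window serving as overlapping context; every token is still scored exactly once. A sketch of the span bookkeeping (the window length of 2048 is an assumption borrowed from the training sequence length, since eval_length is null):

```python
def eval_spans(n_tokens, window=2048, stride=32):
    """Spans (ctx_start, end, score_from): each window scores only its
    last `stride` tokens; tokens before `score_from` are context."""
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, end, pos))
        pos += stride
    return spans

spans = eval_spans(100)
```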
Test-Time Training
LoRA TTT
parameters: {"rank":8}
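Rank-8 LoRA adapters are trained at evaluation time. The usual LoRA construction initializes B to zero so the adapter contributes nothing before any test-time steps have run; a sketch of that setup (which weight matrices are adapted, and the alpha scaling, are not specified in the record):

```python
import random

def lora_init(d_in, d_out, rank=8, seed=0):
    """LoRA adapter pair: A is small random, B starts at zero, so
    W + B @ A == W until test-time training updates the adapter."""
    rng = random.Random(seed)
    A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return A, B

def lora_delta(A, B):
    """The low-rank update B @ A, shape (d_out, d_in)."""
    rank, d_in = len(A), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(rank))
             for j in range(d_in)] for i in range(len(B))]

A, B = lora_init(16, 16, rank=8)
delta = lora_delta(A, B)
```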
Initialization
OrthoInit
Orthogonal initialization.
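A sketch of orthogonal initialization via Gram-Schmidt on Gaussian vectors. Real implementations typically use a QR decomposition (e.g. torch.nn.init.orthogonal_); this pure-Python version is for illustration only:

```python
import random

def orthogonal_init(n, seed=0):
    """Square orthogonal matrix: draw Gaussian rows, then project each
    against the previous rows (Gram-Schmidt) and normalize."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0, 1) for _ in range(n)]
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:                 # skip (near-)dependent draws
            rows.append([a / norm for a in v])
    return rows

Q = orthogonal_init(4)
```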
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":5000}
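With warmdown_iters=5000, the learning rate presumably holds at its base value and then decays linearly to zero over the final 5000 iterations, as in the usual warmup-stable-decay ("trapezoidal") schedule. A sketch under that assumption (total_iters and base_lr are illustrative):

```python
def lr_at(step, total_iters, base_lr, warmdown_iters=5000):
    """Hold base_lr, then decay linearly to zero over the final
    warmdown_iters steps."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```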
Regularization
weight decay
parameters: {"weight_decay":0.04}
Novel Contributions
- Stride-32 sliding window evaluation with 2x context overlap
- Warmdown tuning extended to 5000 iterations
- Muon momentum tuning from 0.99 to 0.95
- Reduced training batch tokens to 524288
- LoRA test-time training with rank-8 adapters during evaluation
- Per-document adapter reset and score-then-train ordering to preserve causality
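The last contribution pins down the ordering of the test-time-training loop: adapters are reset at every document boundary, and each chunk is scored before the model trains on it, so no token's loss is ever computed by a model that has already seen that token. A sketch of the loop (the callbacks here are logging stubs; the real ones would run the model forward pass and a rank-8 LoRA update step):

```python
def evaluate_with_ttt(documents, score_fn, train_fn, reset_fn):
    """Per-document TTT: reset adapter state at each document boundary,
    and score every chunk BEFORE adapting on it (causality preserved)."""
    losses = []
    for doc in documents:
        reset_fn()                          # fresh LoRA state per document
        for chunk in doc:
            losses.append(score_fn(chunk))  # score first...
            train_fn(chunk)                 # ...then train on the same chunk
    return losses

# Logging stubs to make the ordering visible:
log = []
evaluate_with_ttt(
    [["a1", "a2"], ["b1"]],
    score_fn=lambda c: log.append(("score", c)) or 0.0,
    train_fn=lambda c: log.append(("train", c)),
    reset_fn=lambda: log.append(("reset", None)),
)
```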