PR #1366

open

Non-record: EMA+SWA Tight Averaging with Fused TTT LoRA + Sliding Window (1.1371 BPB)

by yunoshev
val_bpb
1.1371
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.88 MB

Training Techniques

Weight Averaging
EMA + SWA
parameters: {"decay":0.997,"tight_averaging":true,"collect_from":"EMA state","qgrid_lambda":false}
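A minimal sketch of what this averaging scheme could look like, on toy scalar weights: an EMA (decay 0.997) is updated every step, and the SWA average is collected from the EMA state rather than from the raw weights. The collection interval `swa_every` is an assumption, not stated in the submission.

```python
def ema_swa(weights, decay=0.997, swa_every=10):
    """EMA + SWA 'tight averaging' sketch on toy scalar weights.

    weights: per-step raw parameter values.
    The SWA average is collected from the EMA state (per the
    submission), not from the raw weights.
    """
    ema = weights[0]
    swa_sum, swa_n = 0.0, 0
    for step, w in enumerate(weights, start=1):
        ema = decay * ema + (1.0 - decay) * w   # EMA update each step
        if step % swa_every == 0:               # collect SWA from EMA state
            swa_sum += ema
            swa_n += 1
    return swa_sum / max(swa_n, 1)
```

Note that `qgrid_lambda` is disabled here, so the EMA state is averaged as-is rather than being snapped to the quantization grid before SWA collection.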
Test-Time Training
LoRA TTT
parameters: {"rank":8,"fused":true}
Evaluation
sliding window eval
parameters: {"stride":256}
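A sketch of the sliding-window index computation, assuming a window size of 1024 (the window length is not stated in the submission; only the stride of 256 is). Each window scores only its last `stride` tokens, so every token is scored exactly once with maximal left context; in the fused setup, the rank-8 TTT LoRA update would run on each window in the same pass before scoring.

```python
def sliding_windows(n_tokens, window=1024, stride=256):
    """Yield (start, end, score_from) spans for sliding-window eval.

    Each window covers tokens [start, end) but only scores tokens
    [score_from, end), i.e. its last `stride` tokens, so the scored
    spans tile the sequence without overlap.
    """
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))  # score tokens [pos, end)
        pos = end
    return spans
```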
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_start_momentum":0.92,"warmup_steps":1500,"warmdown_iters":3500}
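A sketch of the momentum warmup implied by these parameters: momentum ramps from 0.92 to 0.99 over the first 1500 steps. The linear ramp shape is an assumption; the submission only gives the endpoints and the step count.

```python
def muon_momentum(step, warmup_steps=1500, start=0.92, final=0.99):
    """Momentum warmup for Muon: linear ramp from `start` to `final`
    over the first `warmup_steps` steps, then constant (ramp shape
    is an assumption)."""
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + frac * (final - start)
```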
Quantization
QAT
bits: null
scope: MLP int5, attention int6
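A sketch of the fake-quantization step used in QAT, shown on a single scalar weight: values are rounded to a signed integer grid and dequantized, so the forward pass sees quantized weights. Per the scope above, MLP weights would use `bits=5` and attention weights `bits=6`; the symmetric scheme and the `scale` values are assumptions.

```python
def fake_quant(x, bits, scale):
    """Symmetric fake quantization (QAT sketch): round to a signed
    int grid with `bits` bits, clip, and dequantize. MLP weights
    would use bits=5, attention weights bits=6 per the submission."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / scale)))
    return q * scale
```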
GPTQ
bits: null
scope: full Hessians for all tensors
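A toy sketch of the GPTQ Hessian accumulation this setting refers to: the layer Hessian is approximated as H ≈ 2 Σ x xᵀ over calibration inputs x. "Full Hessians for all tensors" means this accumulation runs for every weight matrix (including attention projections and the MLP down-projection) rather than using a diagonal approximation; the list-of-lists representation here is purely illustrative.

```python
def accumulate_hessian(H, x):
    """Accumulate one calibration input into the GPTQ layer Hessian
    H += 2 * x x^T, with H as a d x d list-of-lists (toy sketch)."""
    d = len(x)
    for i in range(d):
        for j in range(d):
            H[i][j] += 2.0 * x[i] * x[j]
    return H
```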
Architecture
BigramHash
Bigram hash embedding component
parameters: {"size":4096}
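A sketch of how a bigram hash embedding lookup of this size could work: each (previous token, current token) pair is hashed into one of 4096 embedding rows. The mixing constants below are illustrative assumptions, not the submission's actual hash.

```python
def bigram_hash(prev_tok, tok, table_size=4096):
    """Map a (previous token, current token) bigram to one of
    `table_size` embedding rows via a cheap hash (constants are
    illustrative, not the submission's actual hash)."""
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF
    h ^= h >> 13
    return h % table_size
```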
VE128
Value embedding / value expansion setting
parameters: {"dimensions":128}
XSA
Uses XSA in the last layers
parameters: {"layers":4}
Partial RoPE
Partial rotary positional embeddings
parameters: {"percent":25}
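A sketch of partial RoPE at 25%: only the first quarter of each head's dimensions receives the rotary transform, and the rest pass through unrotated. Which dims are rotated and the frequency base are assumptions.

```python
import math

def partial_rope(vec, pos, percent=25, base=10000.0):
    """Apply rotary position embeddings to only the first `percent`
    of the head dimensions; remaining dims pass through unchanged."""
    d = len(vec)
    rot = int(d * percent / 100) // 2 * 2   # rotated dims, even count
    out = list(vec)
    for i in range(0, rot, 2):
        theta = pos / (base ** (i / rot))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s               # 2D rotation of each pair
        out[i + 1] = x * s + y * c
    return out
```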
MLP3x
3x MLP expansion
parameters: null
Sequence Length
sequence_length
train_length: null
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3500}
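A sketch of this schedule: the learning rate stays constant and then decays to zero over the final 3500 iterations. The linear decay shape is an assumption; only `warmdown_iters` is given.

```python
def lr_warmdown(step, total_steps, warmdown_iters=3500, base_lr=1.0):
    """Constant LR, then a linear warmdown to zero over the final
    `warmdown_iters` steps (linear shape is an assumption)."""
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return base_lr
    return base_lr * remaining / warmdown_iters
```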

Novel Contributions

  • EMA + SWA tight averaging with SWA collected from EMA state
  • Disabling qgrid_lambda to avoid snapping EMA weights to the quantization grid
  • Fused TTT LoRA with sliding window evaluation in a single pass
  • Muon optimizer momentum tuning with warmup and warmdown
  • Full GPTQ Hessians for all tensors including attention projection and MLP down-projection