PR #1016

open

11L VRL + Parallel Muon + Legal TTT v2 (val_bpb=1.1269, non-record)

val_bpb
1.1269
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.8 MB

Training Techniques

Architecture
Value Residual
The value (V) output of layer 0 is blended into the attention of subsequent layers via learned sigmoid gates.
parameters: {"layers":10}
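A minimal sketch of the gating, assuming a learned per-layer scalar gate logit (the PR does not spell out the gate's exact shape):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def blend_value(v_layer, v0, gate_logit):
    # alpha in (0, 1): how much of the current layer's V to keep,
    # with the remainder taken from the layer-0 V output
    alpha = sigmoid(gate_logit)
    return alpha * v_layer + (1.0 - alpha) * v0

# gate_logit = 0.0 gives alpha = 0.5: an equal blend of the two V outputs
blended = blend_value(np.ones(4), np.zeros(4), 0.0)
```

Because the gate is a sigmoid, the blend is always a convex combination of the layer-0 and current-layer values, so it can smoothly interpolate between the two during training.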
BigramHash
The bigram hash embedding dimension is increased to 3072 to improve val_bpb.
parameters: {"dimensions":3072}
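A sketch of how a hashed bigram embedding lookup works; the bucket count and the multiplicative hash are stand-ins (only the 3072 embedding width comes from the PR):

```python
import numpy as np

TABLE_ROWS = 65536   # number of hash buckets (hypothetical; not stated in the PR)
EMBED_DIM = 3072     # embedding width, per the PR parameters

table = np.zeros((TABLE_ROWS, EMBED_DIM))

def bigram_row(tok_prev, tok_cur):
    # simple multiplicative hash as a stand-in for the real mixing function
    return (tok_prev * 1000003 + tok_cur) % TABLE_ROWS

vec = table[bigram_row(17, 42)]   # hashed embedding for the bigram (17, 42)
```

Hashing keeps the table size fixed regardless of vocabulary size; distinct bigrams may collide into the same row, which the model learns around.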
Weight Tying
Input and output embedding weights are tied.
parameters: null
LeakyReLU
LeakyReLU squared activation is used in the MLP.
parameters: {"slope":0.5}
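A sketch of the activation, assuming the LeakyReLU output is squared directly (the PR does not say whether the sign is preserved; slope 0.5 is from its parameters):

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    # LeakyReLU with the PR's slope of 0.5, then squared
    y = np.where(x >= 0, x, slope * x)
    return y * y

out = leaky_relu_squared(np.array([2.0, -2.0]))
```

With slope 0.5, a negative input of -2.0 first becomes -1.0 and then 1.0 after squaring, so negative inputs still contribute (attenuated) signal rather than being zeroed as in plain ReLU-squared.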
GQA
Grouped query attention with 4 KV heads.
parameters: {"kv_heads":4}
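A sketch of the KV-head sharing in GQA; only kv_heads=4 comes from the PR, the query head count and dimensions here are hypothetical:

```python
import numpy as np

N_Q_HEADS = 16    # query head count (hypothetical; the PR only fixes kv_heads)
N_KV_HEADS = 4    # per the PR parameters

def expand_kv(kv):
    # repeat each KV head so a contiguous group of query heads shares it
    group = N_Q_HEADS // N_KV_HEADS
    return np.repeat(kv, group, axis=0)

k = np.random.randn(N_KV_HEADS, 8, 16)   # (kv_heads, seq, head_dim)
k_full = expand_kv(k)                    # (16, 8, 16): 4 query heads per KV head
```

Shrinking the KV heads from 16 to 4 cuts the KV cache by 4x while queries keep their full head count, which is the usual GQA trade-off.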
Weight Averaging
Tight stochastic weight averaging (SWA) over late-training snapshots, used instead of EMA when snapshots are available.
parameters: null
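A minimal sketch of the averaging step, assuming parameters are held as name-to-tensor dicts ("tight" is read here as a uniform average over a short window of snapshots near the end of training; the PR does not give the window size):

```python
def tight_swa(snapshots):
    # uniform average of the same-named parameter across all snapshots
    n = len(snapshots)
    return {name: sum(s[name] for s in snapshots) / n for name in snapshots[0]}

avg = tight_swa([{"w": 1.0}, {"w": 3.0}])
```

Unlike EMA, which weights recent steps exponentially and must run online, this averages stored snapshots with equal weight after the fact.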
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"full_length_windows_only":true,"fixed_scoring_offset":true}
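A sketch of window selection under the two eval flags; taking the fixed scoring offset to be window - stride (so only new tokens are scored after the first window) is an assumption, as the PR only names the flags:

```python
def eval_spans(n_tokens, window, stride):
    # returns (window_start, score_from_offset_within_window) pairs;
    # only full-length windows are emitted (full_length_windows_only), and
    # after the first window scoring starts at one fixed in-window offset
    # (fixed_scoring_offset), here assumed to be window - stride
    offset = window - stride
    spans, start, first = [], 0, True
    while start + window <= n_tokens:
        spans.append((start, 0 if first else offset))
        first = False
        start += stride
    return spans

spans = eval_spans(10, window=4, stride=2)
```

Dropping partial windows and pinning the scoring offset makes every scored token see the same amount of left context, which removes a source of eval noise.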
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"epochs":3,"chunk_size":32000}
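A sketch of the score-first loop with the PR's hyperparameters (lr 0.002, momentum 0.9, 3 epochs, 32000-token chunks); the `model.score` / `model.train_step` interface and the DummyModel are hypothetical stand-ins:

```python
def score_first_ttt(model, chunks, epochs=3, lr=0.002, momentum=0.9):
    # each chunk is scored with the current weights BEFORE the model
    # trains on it, so no chunk is ever scored after being fit
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss, n_tokens = model.score(chunk)   # score first
        total_loss += loss * n_tokens
        total_tokens += n_tokens
        for _ in range(epochs):               # then adapt on the same chunk
            model.train_step(chunk, lr=lr, momentum=momentum)
    return total_loss / total_tokens

class DummyModel:
    # stand-in so the loop runs end to end; a real model would update weights
    def __init__(self):
        self.updates = 0
    def score(self, chunk):
        return 1.0, len(chunk)
    def train_step(self, chunk, lr, momentum):
        self.updates += 1

model = DummyModel()
mean_loss = score_first_ttt(model, [[0] * 32000, [0] * 32000])
```

The score-then-train order is what keeps the reported val_bpb honest: the model still benefits from adapting to earlier chunks when scoring later ones, but never from fitting the exact tokens being scored.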
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
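A sketch of the schedule, assuming a constant LR followed by a linear ramp to zero over the final 3500 steps (the PR gives only warmdown_steps, not the decay shape):

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    # constant LR, then a linear decay to zero over the last warmdown_steps
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps

lrs = [lr_at(s, 10000, 0.01) for s in (0, 8250, 10000)]
```

At step 8250 of 10000, 1750 steps remain of the 3500-step warmdown, so the LR is at half its base value.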
Regularization
LN scale
LayerNorm output is scaled by 1/sqrt(L+1), where L is the layer index.
parameters: {"scale":"1/sqrt(L+1)"}
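A one-liner for the scale factor, assuming a 0-based layer index L (so layer 0 gets scale 1):

```python
import math

def ln_scale(layer_index):
    # fixed LayerNorm output scale of 1/sqrt(L+1) for layer index L (0-based)
    return 1.0 / math.sqrt(layer_index + 1)

scales = [ln_scale(i) for i in range(4)]
```

The scale shrinks with depth, damping the contribution of later layers to the residual stream, which is the regularizing effect the entry refers to.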

Novel Contributions

  • Value Residual Learning (VRL) with learned sigmoid gates
  • BigramHash size doubled to 3072
  • Tight SWA used instead of EMA when snapshots are available
  • zstd-22 artifact compression
  • Sliding window evaluation bug fix
  • TTT enabled by default with all blocks unfrozen
  • Dropped full GPTQ in favor of GPTQ-lite