PR #1232 (open)

feat: Non-record 11L PR940 Stack (no n-gram in use) + 20k Steps + Legal TTT (1.0929 BPB)

by Christopher-Lee-McClendon
val_bpb: 1.0929
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.47 MB / 14.64 MB

Training Techniques

Architecture
Gated Attention
Attention mechanism with gating; QK gain set at initialization.
parameters: {"qk_gain_init":1.5}
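The card gives only qk_gain_init=1.5, not the mechanism itself. As one plausible reading, here is a minimal sketch where Q and K are RMS-normalized per head and scaled by a learnable gain initialized to that value (the gate on the attention output is not shown; all names are hypothetical):

```python
import numpy as np

def qk_norm_with_gain(q, k, gain_init=1.5):
    """RMS-normalize Q and K per head, then apply a learnable gain.

    gain_init mirrors the PR's qk_gain_init=1.5; in a real model the
    gain would be a trained parameter, and the gating itself (omitted
    here) would modulate the attention output.
    """
    def rms_norm(x):
        return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + 1e-8)
    gain = np.full(q.shape[-1], gain_init)  # trained parameter in practice
    return rms_norm(q) * gain, rms_norm(k) * gain

q, k = np.random.randn(4, 8), np.random.randn(4, 8)
qn, kn = qk_norm_with_gain(q, k)
```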
Value Residual
Adds value residual connections to the model.
parameters: null
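Value residual typically means routing the first layer's value vectors into every later layer. A minimal sketch, with a placeholder mixing weight since the PR lists no parameters:

```python
import numpy as np

def value_residual(v_layer, v_first, alpha=0.5):
    """Mix this layer's value vectors with the first layer's.

    alpha=0.5 is a placeholder, not from the PR; a real model would
    typically learn the mixing weight per layer.
    """
    return alpha * v_layer + (1.0 - alpha) * v_first
```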
XSA
Applies XSA to all transformer layers.
parameters: {"layers":11}
BigramHash
Uses hashed bigram embeddings.
parameters: {"buckets":4096,"dim":128}
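With buckets=4096 and dim=128, a hashed bigram embedding maps each (previous token, current token) pair to one of 4096 learned vectors. A sketch under those parameters (the hash multiplier is an arbitrary odd constant, not from the PR):

```python
import numpy as np

BUCKETS, DIM = 4096, 128  # from the PR's parameters

def bigram_hash_embed(token_ids, table, buckets=BUCKETS):
    """Look up a hashed-bigram embedding for each position.

    Each (previous, current) token pair is hashed into one of `buckets`
    rows of `table`; position 0 pairs with a padding id of 0.
    """
    prev = np.concatenate([[0], token_ids[:-1]])
    idx = (prev * 1000003 + token_ids) % buckets
    return table[idx]

table = np.random.randn(BUCKETS, DIM)
emb = bigram_hash_embed(np.array([5, 17, 5, 17]), table)
```

The same bigram always hashes to the same bucket, so repeated pairs share an embedding row.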
GQA
Grouped query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
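With 8 query heads and 4 KV heads, each KV head is shared by 2 consecutive query heads. The standard KV-expansion step can be sketched as:

```python
import numpy as np

def repeat_kv(kv, query_heads=8, kv_heads=4):
    """Expand KV heads to match query heads for grouped-query attention.

    kv: (kv_heads, seq, head_dim). Head counts are the PR's values.
    """
    group = query_heads // kv_heads
    return np.repeat(kv, group, axis=0)

k = np.random.randn(4, 16, 32)
k_expanded = repeat_kv(k)
```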
LeakyReLU
Uses a squared LeakyReLU (LeakyReLU²) activation in the MLP.
parameters: {"slope":0.5}
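One plausible reading of "LeakyReLU squared" with the PR's slope=0.5, squaring both branches (the PR does not say whether the negative branch keeps its sign):

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with slope=0.5 (the PR's value), then squared.

    Sign handling on the negative branch is an assumption.
    """
    y = np.where(x > 0, x, slope * x)
    return y * y
```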
SmearGate
Includes SmearGate in the architecture.
parameters: null
weight tying
Tied input and output embeddings.
parameters: null
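Weight tying means the output head reuses the input embedding matrix, so the model carries no separate unembedding parameters. A toy-sized sketch:

```python
import numpy as np

# Tied embeddings: logits are a dot product against the same matrix
# used for input embedding lookup (vocab/dim are toy values).
vocab, dim = 100, 16
embed = np.random.randn(vocab, dim)

def tied_logits(hidden):
    """Project hidden states to vocab logits with the tied embedding."""
    return hidden @ embed.T
```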
Regularization
LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
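The formula 1/sqrt(layer+1) sets a per-layer norm scale that shrinks with depth. A direct transcription (zero-based layer indexing is an assumption):

```python
import math

def ln_scale_init(layer):
    """Per-layer norm-scale initialization from the PR's formula
    1/sqrt(layer+1); deeper layers start with a smaller scale."""
    return 1.0 / math.sqrt(layer + 1)
```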
logit softcap
parameters: {"value":30}
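Logit softcapping squashes logits smoothly into a bounded range. The tanh form (cap * tanh(x / cap), as popularized by Gemma 2) is assumed here with the PR's cap of 30:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Tanh soft-capping: values are squashed smoothly into (-cap, cap).

    Near zero the map is approximately the identity; large logits
    saturate at +/- cap.
    """
    return cap * np.tanh(logits / cap)
```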
Weight Averaging
EMA
parameters: {"decay":0.997}
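EMA weight averaging keeps a slow-moving copy of the weights for evaluation. One update step with the PR's decay=0.997, sketched over a dict of parameters:

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step: avg <- decay * avg + (1 - decay) * params.

    Evaluation would read from `avg` rather than the live weights.
    """
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}
```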
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"embed_lr":0.035}
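Muon's core step orthogonalizes the (momentum-averaged) gradient matrix with a quintic Newton-Schulz iteration before applying it. A sketch using the commonly published Muon coefficients; the step count and other details are assumptions, not from the PR:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration that maps a matrix toward an
    (approximately) orthogonal one with the same row/column spaces."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
U = newton_schulz(rng.standard_normal((8, 8)))
```

After a few iterations the singular values cluster near 1, which is what makes the update direction scale-free.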
Compression
zstd
level: 16
Quantization
int6
bits: 6
scope: all
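Six bits give integer levels in [-31, 31]. A symmetric quantize/dequantize round trip; per-tensor scaling is an assumption, since the PR only states bits=6 with scope "all":

```python
import numpy as np

def quantize_int6(w):
    """Symmetric int6 quantization: map floats onto [-31, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(64)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

The round-trip error is bounded by half a quantization step per weight.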
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.002,"epochs":10,"chunk_size":32768,"frozen_blocks":2,"grad_clip":1,"stride":64}
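The TTT loop fine-tunes on the eval stream in chunks (SGD at lr=0.002, 10 epochs, first 2 blocks frozen, per the parameters above). Two of the mechanical pieces, chunked iteration and global-norm gradient clipping, can be sketched; the exact role of stride=64 is not spelled out in the card, so only chunking is shown:

```python
import numpy as np

def ttt_chunks(n_tokens, chunk_size=32768):
    """Yield (start, end) boundaries for chunked test-time training;
    chunk_size matches the PR's parameters."""
    for start in range(0, n_tokens, chunk_size):
        yield start, min(start + chunk_size, n_tokens)

def clip_grad(grads, max_norm=1.0):
    """Global-norm gradient clipping, as in the PR's grad_clip=1."""
    norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-8))
    return [g * scale for g in grads]
```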
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"peak_lr_phase_steps":8000,"warmdown_steps":12000,"warmup_steps":20}
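The schedule parameters describe a 20-step warmup, a hold at peak through step 8000, then a 12000-step warmdown to zero. A sketch assuming linear ramps (the card only names the phases):

```python
def lr_at(step, peak_lr, warmup_steps=20, peak_steps=8000, warmdown_steps=12000):
    """LR at a given step: linear warmup, hold at peak, linear warmdown.

    Phase lengths are the PR's values; the linear shapes are assumed.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < peak_steps:
        return peak_lr
    t = (step - peak_steps) / warmdown_steps
    return peak_lr * max(0.0, 1.0 - t)
```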

Novel Contributions

  • 20k-step scaling study of the PR940 architecture stack
  • Legal score-first test-time training achieving 1.0929 BPB
  • FlowRefiner variant showing the auxiliary flow head is essentially neutral at 20k steps
  • All-layer XSA, gated attention, value residual, and LeakyReLU² applied at 20k scale
  • Demonstration that warmdown from 8k to 20k steps drives most of the improvement