PR #1812
openRecords: SP8192 + LegalTTT 4ep — 1.0729 (Δ -0.0081 vs 04-09, p<1e-7)
by EthanNing
val_bpb: 1.0729
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~16.00 MB
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
Partial RoPE
Rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
depth recurrence
Layer recurrence over a subset of layers during training.
parameters: {"layers":[3,5],"num_loops":2}
parallel residuals
Parallel residual pathway introduced in later layers.
parameters: {"start_layer":7}
XSA
Exclusive self-attention: subtracts the normalized value projection from the attention output.
parameters: null
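The description is terse, so the sketch below is one reading only: standard causal attention whose pre-projection output has each token's normalized value vector subtracted, excluding the token's own contribution:

```python
import torch.nn.functional as F

def xsa(q, k, v, w_o):
    # q, k, v: (B, H, T, head_dim); w_o: output projection Linear
    attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    out = attn - F.normalize(v, dim=-1)   # subtract each token's normalized V
    B, H, T, D = out.shape
    return w_o(out.transpose(1, 2).reshape(B, T, H * D))
```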
LeakyReLU
LeakyReLU squared MLP activation.
parameters: {"negative_slope":0.5}
Gated Attention
Per-head attention-output sigmoid gate.
parameters: {"gate_width":12}
Regularization
weight decay
parameters: {"mlp":0.115,"attn":0.095}
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: embeddings
Compression
lzma
level: null
brotli
level: 11
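The record lists lzma (level unspecified) and brotli at quality 11; a sketch of compressing the GPTQ-packed bytes, with keep-the-smaller as an assumed selection rule:

```python
import lzma
import brotli  # pip install brotli

def compress_artifact(packed: bytes) -> bytes:
    candidates = [
        lzma.compress(packed),                 # level left at the default
        brotli.compress(packed, quality=11),
    ]
    return min(candidates, key=len)            # keep whichever codec wins
```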
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"row_normalized":true,"newton_schulz_steps":5,"nesterov":true}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings and scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
Evaluation
sliding window eval
parameters: {"causal":true}
Test-Time Training
score-first TTT
parameters: {"epochs":4,"learning_rate":0.005,"chunk_size":32000,"momentum":0.9,"nesterov":true}
LR Schedule
warmdown
parameters: {"final_fraction":0.72}
cosine decay
parameters: {"applied_to":"TTT per-chunk LR"}
Sequence Length
sequence_length
train_length: 32000
eval_length: 32000
Novel Contributions
- Score-first legal test-time training with 4 epochs per chunk
- Split weight decay with stronger regularization on MLP matrices than on attention
- Per-head attention-output gating
- Continuation of the SP8192 + depth recurrence + parallel residuals + QK-Gain stack
- GPTQ SDClip quantization with byte-shuffle and Brotli compression