PR #1936

open

[Record]: MUDD Connections + SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0769 (3-seed mean)

by hilbertmeng
val_bpb
1.0769
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: attention/MLP matrices and part of dynamic dense matrices
GPTQ
bits: 8
scope: token embeddings
Architecture
depth recurrence
3-layer recurrence expanding 11 physical layers into 17 virtual layers; loops layers 3-5.
parameters: {"layers":[3,4,5]}
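A minimal sketch of how 11 physical layers could expand into the 17 virtual layers this record reports, assuming the loop block [3, 4, 5] simply repeats in place (the function name and expansion scheme are illustrative, not the PR's actual code):

```python
def recurrence_schedule(n_physical=11, loop=(3, 4, 5), n_virtual=17):
    """Expand physical layers into a virtual-layer schedule by repeating
    the loop block enough times to reach n_virtual entries."""
    extra = n_virtual - n_physical          # 6 extra virtual layers
    repeats = 1 + extra // len(loop)        # loop block runs 3 times total
    schedule = []
    for i in range(n_physical):
        if i == loop[0]:
            schedule.extend(list(loop) * repeats)
        elif i in loop:
            continue                        # already emitted inside the block
        else:
            schedule.append(i)
    return schedule

# recurrence_schedule() → [0, 1, 2, 3, 4, 5, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
```

The forward pass would then run the physical layers in this order, reusing the weights of layers 3-5 on each pass through the block.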
parallel residuals
GPT-J style parallel residuals where attention and MLP read from the same pre-residual input.
parameters: {"start_layer":7}
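The GPT-J parallel-residual form can be sketched as follows: both sublayers read the same normalized input and their outputs are summed into the residual stream, instead of the usual sequential attention-then-MLP ordering (vectors as plain lists; `attn`, `mlp`, `norm` are stand-ins for the real modules):

```python
def parallel_block(x, attn, mlp, norm):
    """GPT-J style parallel residual: attention and MLP both read the same
    normalized pre-residual input; their outputs are added back to x."""
    h = norm(x)
    a, m = attn(h), mlp(h)
    return [xi + ai + mi for xi, ai, mi in zip(x, a, m)]
```

Per the record, blocks from layer 7 onward use this form; earlier layers would keep the sequential layout.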
MUDD Connections
Lite multiway dynamic dense (MUDD) connections, replacing the previous sigmoid-gated U-Net connections and the residual mixing with x0.
parameters: {"query_layers":[2,4,6,8,10,12,15,16],"key_windows":[2,null,2,null,2,null,null,null],"num_ways":[1,1,1,1,1,2,2,1]}
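The "lite" dynamic dense idea can be caricatured as: a query layer's input is a weighted sum over a small window of recent hidden states (key_windows of 2 above), with mixing weights that in the real MUDD formulation are predicted from the current hidden state. This sketch takes the weights as a plain argument and is purely illustrative:

```python
def dynamic_dense_mix(layer_states, mix_weights):
    """Mix recent layers' hidden states (one weight per state) into a single
    input vector for the next layer. In actual MUDD, `mix_weights` would be
    input-dependent and there would be one mixture per way/stream."""
    dim = len(layer_states[0])
    return [sum(w * h[d] for w, h in zip(mix_weights, layer_states))
            for d in range(dim)]
```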
GQA
Switched back to MHA because MUDD Connections prefers a wide V-stream.
parameters: {"kv_heads":8}
LeakyReLU
Uses LeakyReLU squared activation.
parameters: {"alpha":0.5,"power":2}
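One plausible reading of the alpha=0.5, power=2 parameters is LeakyReLU followed by squaring. Whether the negative branch keeps its sign after squaring is not specified in the record; this sketch squares directly:

```python
def leaky_relu_squared(x, alpha=0.5):
    """LeakyReLU(x) raised to the 2nd power (alpha=0.5, power=2)."""
    y = x if x >= 0 else alpha * x
    return y * y
```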
Partial RoPE
Partial rotary position embeddings on a subset of dimensions.
parameters: {"dimensions":"16/64"}
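A sketch of partial RoPE on a single 64-dim head vector, rotating only the first 16 dimensions and passing the rest through; the base frequency of 10000 is the common default, not stated in the record:

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` dimensions in (x, y) pairs by a
    position-dependent angle; leave the remaining dimensions untouched."""
    out = list(vec)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```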
weight tying
Tied input and output embeddings.
parameters: null
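Weight tying in miniature: the output projection reuses the input embedding table, so logits are dot products of the hidden state against each embedding row (class and method names are illustrative):

```python
class TiedEmbedding:
    """One vocab x dim table shared between input embedding and output head."""
    def __init__(self, table):
        self.table = table

    def embed(self, token):
        return self.table[token]

    def logits(self, hidden):
        # Output projection reuses the same rows as the input embedding.
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.table]
```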
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
AdamW
weight_decay: 0.095
momentum: null
other_params: {"mlr":0.022}
Weight Averaging
EMA
parameters: {"decay":0.9965}
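One EMA step with the record's decay of 0.9965 (parameters as flat lists for illustration):

```python
def ema_update(avg, params, decay=0.9965):
    """One EMA step over the weights: avg <- decay*avg + (1 - decay)*params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

The averaged copy, not the raw training weights, is typically what gets evaluated and shipped in the artifact.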
Compression
lzma
level: null
brotli
level: 11
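Only the LZMA half of the compression step can be sketched with the standard library; the record's Brotli pass (quality 11) needs the third-party `brotli` package, and whether the pipeline chains the two codecs or keeps the smaller output is not stated:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress the serialized weights with LZMA at the maximum preset."""
    return lzma.compress(raw, preset=9)
```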
Evaluation
sliding window eval
parameters: {"causal":true}
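Causal sliding-window evaluation, sketched per token: each token is scored exactly once, conditioned only on up to `window` preceding tokens. The `token_loss(context, target)` callback and the window size are hypothetical, standing in for a real model forward pass:

```python
def sliding_window_eval(tokens, token_loss, window=1024):
    """Average per-token loss where token i sees at most `window` tokens of
    causal context; no token is ever scored more than once."""
    losses = [token_loss(tokens[max(0, i - window):i], tokens[i])
              for i in range(len(tokens))]
    return sum(losses) / len(losses)
```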
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"chunk_tokens":32000,"optimizer":"SGD"}
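The "score-first" structure that makes the TTT legal can be sketched as: each ~32k-token chunk is scored with the current weights before any gradient steps are taken on it, so the reported loss never benefits from fitting the chunk being measured. The `score` and `sgd_step` callbacks are stand-ins for the real eval and optimizer:

```python
def score_first_ttt(chunks, score, sgd_step, lr=0.005, epochs=3):
    """Legal score-first TTT loop: evaluate each chunk first, then adapt."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))   # score with pre-update weights
        for _ in range(epochs):
            sgd_step(chunk, lr)       # then take SGD steps on that chunk
    return sum(losses) / len(losses)
```

Later chunks still benefit from adaptation on earlier ones, which is where the bpb gain comes from.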
LR Schedule
cosine decay
parameters: {"during_ttt":true}
warmdown
parameters: {"final_fraction":0.72}

Novel Contributions

  • Lite MUDD Connections with constrained query/key/stream connectivity
  • 3-layer depth recurrence with 17 virtual layers from 11 physical layers
  • Parallel residuals from layer 7
  • QK-Gain 5.25
  • Legal score-first test-time training
  • Mixed int6/int8 GPTQ quantization with SDClip
  • Artifact compression using LZMA and Brotli