PR #1936

open

[Record]: MUDD Connections + SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0769 (3-seed mean)

by hilbertmeng
val_bpb
1.0769
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: attention/MLP matrices and part of dynamic dense matrices
GPTQ
bits: 8
scope: token embeddings
Architecture
depth recurrence
3-layer recurrence expanding 11 physical layers into 17 virtual layers; loops layers 3-5.
parameters: {"layers":[3,4,5]}
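A minimal sketch of how 11 physical layers could expand into the 17 virtual layers this record reports, assuming the loop block [3, 4, 5] simply repeats in place (the function name and expansion scheme are illustrative, not the PR's actual code):

```python
def recurrence_schedule(n_physical=11, loop=(3, 4, 5), n_virtual=17):
    """Expand physical layers into a virtual-layer schedule by repeating
    the loop block enough times to reach n_virtual entries."""
    extra = n_virtual - n_physical          # 6 extra virtual layers
    repeats = 1 + extra // len(loop)        # loop block runs 3 times total
    schedule = []
    for i in range(n_physical):
        if i == loop[0]:
            schedule.extend(list(loop) * repeats)
        elif i in loop:
            continue                        # already emitted inside the block
        else:
            schedule.append(i)
    return schedule

# recurrence_schedule() → [0, 1, 2, 3, 4, 5, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
```

The forward pass would then run the physical layers in this order, reusing the weights of layers 3-5 on each pass through the block.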
parallel residuals
GPT-J style parallel residuals where attention and MLP read from the same pre-residual input.
parameters: {"start_layer":7}
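The GPT-J parallel-residual form can be sketched as follows: both sublayers read the same normalized input and their outputs are summed into the residual stream, instead of the usual sequential attention-then-MLP ordering (vectors as plain lists; `attn`, `mlp`, `norm` are stand-ins for the real modules):

```python
def parallel_block(x, attn, mlp, norm):
    """GPT-J style parallel residual: attention and MLP both read the same
    normalized pre-residual input; their outputs are added back to x."""
    h = norm(x)
    a, m = attn(h), mlp(h)
    return [xi + ai + mi for xi, ai, mi in zip(x, a, m)]
```

Per the record, blocks from layer 7 onward use this form; earlier layers would keep the sequential layout.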
MUDD Connections
Lite multiway dynamic dense (MUDD) connections, replacing the previous sigmoid-gated U-Net connections and the residual mixing with x0.
parameters: {"query_layers":[2,4,6,8,10,12,15,16],"key_windows":[2,null,2,null,2,null,null,null],"num_ways":[1,1,1,1,1,2,2,1]}
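The "lite" dynamic dense idea can be caricatured as: a query layer's input is a weighted sum over a small window of recent hidden states (key_windows of 2 above), with mixing weights that in the real MUDD formulation are predicted from the current hidden state. This sketch takes the weights as a plain argument and is purely illustrative:

```python
def dynamic_dense_mix(layer_states, mix_weights):
    """Mix recent layers' hidden states (one weight per state) into a single
    input vector for the next layer. In actual MUDD, `mix_weights` would be
    input-dependent and there would be one mixture per way/stream."""
    dim = len(layer_states[0])
    return [sum(w * h[d] for w, h in zip(mix_weights, layer_states))
            for d in range(dim)]
```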
GQA
Switched back to MHA because MUDD Connections prefers a wide V-stream.
parameters: {"kv_heads":8}
LeakyReLU
Uses LeakyReLU squared activation.
parameters: {"alpha":0.5,"power":2}
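One plausible reading of the alpha=0.5, power=2 parameters is LeakyReLU followed by squaring. Whether the negative branch keeps its sign after squaring is not specified in the record; this sketch squares directly:

```python
def leaky_relu_squared(x, alpha=0.5):
    """LeakyReLU(x) raised to the 2nd power (alpha=0.5, power=2)."""
    y = x if x >= 0 else alpha * x
    return y * y
```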
Partial RoPE
Partial rotary position embeddings on a subset of dimensions.
parameters: {"dimensions":"16/64"}
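A sketch of partial RoPE on a single 64-dim head vector, rotating only the first 16 dimensions and passing the rest through; the base frequency of 10000 is the common default, not stated in the record:

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` dimensions in (x, y) pairs by a
    position-dependent angle; leave the remaining dimensions untouched."""
    out = list(vec)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```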
weight tying
Tied input and output embeddings.
parameters: null
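Weight tying in miniature: the output projection reuses the input embedding table, so logits are dot products of the hidden state against each embedding row (class and method names are illustrative):

```python
class TiedEmbedding:
    """One vocab x dim table shared between input embedding and output head."""
    def __init__(self, table):
        self.table = table

    def embed(self, token):
        return self.table[token]

    def logits(self, hidden):
        # Output projection reuses the same rows as the input embedding.
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.table]
```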
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
AdamW
weight_decay: 0.095
momentum: null
other_params: {"mlr":0.022}
Weight Averaging
EMA
parameters: {"decay":0.9965}
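One EMA step with the record's decay of 0.9965 (parameters as flat lists for illustration):

```python
def ema_update(avg, params, decay=0.9965):
    """One EMA step over the weights: avg <- decay*avg + (1 - decay)*params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

The averaged copy, not the raw training weights, is typically what gets evaluated and shipped in the artifact.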
Compression
lzma
level: null
brotli
level: 11
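Only the LZMA half of the compression step can be sketched with the standard library; the record's Brotli pass (quality 11) needs the third-party `brotli` package, and whether the pipeline chains the two codecs or keeps the smaller output is not stated:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress the serialized weights with LZMA at the maximum preset."""
    return lzma.compress(raw, preset=9)
```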
Evaluation
sliding window eval
parameters: {"causal":true}
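Causal sliding-window evaluation, sketched per token: each token is scored exactly once, conditioned only on up to `window` preceding tokens. The `token_loss(context, target)` callback and the window size are hypothetical, standing in for a real model forward pass:

```python
def sliding_window_eval(tokens, token_loss, window=1024):
    """Average per-token loss where token i sees at most `window` tokens of
    causal context; no token is ever scored more than once."""
    losses = [token_loss(tokens[max(0, i - window):i], tokens[i])
              for i in range(len(tokens))]
    return sum(losses) / len(losses)
```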
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"chunk_tokens":32000,"optimizer":"SGD"}
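The "score-first" structure that makes the TTT legal can be sketched as: each ~32k-token chunk is scored with the current weights before any gradient steps are taken on it, so the reported loss never benefits from fitting the chunk being measured. The `score` and `sgd_step` callbacks are stand-ins for the real eval and optimizer:

```python
def score_first_ttt(chunks, score, sgd_step, lr=0.005, epochs=3):
    """Legal score-first TTT loop: evaluate each chunk first, then adapt."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))   # score with pre-update weights
        for _ in range(epochs):
            sgd_step(chunk, lr)       # then take SGD steps on that chunk
    return sum(losses) / len(losses)
```

Later chunks still benefit from adaptation on earlier ones, which is where the bpb gain comes from.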
LR Schedule
cosine decay
parameters: {"during_ttt":true}
warmdown
parameters: {"final_fraction":0.72}

Novel Contributions

  • Lite MUDD Connections with constrained query/key/stream connectivity
  • 3-layer depth recurrence with 17 virtual layers from 11 physical layers
  • Parallel residuals from layer 7
  • QK-Gain 5.25
  • Legal score-first test-time training
  • Mixed int6/int8 GPTQ quantization with SDClip
  • Artifact compression using LZMA and Brotli