PR #1936
[Record]: MUDD Connections + SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT - val_bpb 1.0769 (3-seed mean)
by hilbertmeng
val_bpb
1.0769
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices and part of dynamic dense matrices
GPTQ
bits: 8
scope: token embeddings
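As a minimal sketch of the symmetric quantization underlying this mixed int6/int8 scheme, here is plain round-to-nearest; GPTQ proper additionally applies Hessian-aware error compensation, and the SDClip step is not modeled here:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Round-to-nearest symmetric quantization to `bits` (baseline
    sketch only; GPTQ adds Hessian-aware error correction on top)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

w = np.array([0.5, -0.25, 0.1, -0.9])
q, s = quantize_symmetric(w, 6)   # int6 as used for attention/MLP
deq = q * s                       # dequantized weights
```

With 6 bits the quantized values lie in [-32, 31], and each weight is reconstructed to within half a scale step.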
Architecture
depth recurrence
3-layer recurrence producing 17 virtual layers from 11 physical layers; loops layers 3-5.
parameters: {"layers":[3,4,5]}
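A minimal sketch of how looping a 3-layer block expands depth. The loop count of three passes is inferred from the 17-virtual-layer figure (3 + 3x3 + 5 = 17) and is an assumption:

```python
def depth_recurrence_schedule(num_physical, loop_layers, num_loops):
    """Expand physical layer indices into a virtual-depth execution
    schedule by repeating the looped block `num_loops` times."""
    pre = list(range(loop_layers[0]))                      # layers before the loop
    post = list(range(loop_layers[-1] + 1, num_physical))  # layers after it
    return pre + list(loop_layers) * num_loops + post

# 11 physical layers, looping layers 3-5 three times -> 17 virtual layers
sched = depth_recurrence_schedule(11, [3, 4, 5], 3)
```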
parallel residuals
GPT-J style parallel residuals where attention and MLP read from the same pre-residual input.
parameters: {"start_layer":7}
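The GPT-J parallel-residual pattern can be sketched as follows; the sublayers here are toy stand-ins, not the record's actual attention/MLP:

```python
import numpy as np

def parallel_residual_block(x, attn, mlp, ln):
    """GPT-J style parallel residual: attention and MLP both read the
    same normalized input, and their outputs are summed onto the stream."""
    h = ln(x)
    return x + attn(h) + mlp(h)

# toy stand-ins for the real sublayers (illustration only)
ln = lambda x: (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)
attn = lambda h: 0.1 * h
mlp = lambda h: 0.2 * h
x = np.arange(8.0).reshape(2, 4)
y = parallel_residual_block(x, attn, mlp, ln)
```

Because both branches read the same pre-residual input, they can run concurrently rather than sequentially.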
MUDD Connections
Lite multiway dynamic dense connections replacing sigmoid-gated U-Net connections and residual mixing with x0.
parameters: {"query_layers":[2,4,6,8,10,12,15,16],"key_windows":[2,null,2,null,2,null,null,null],"num_ways":[1,1,1,1,1,2,2,1]}
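The core dynamic-dense idea can be sketched generically: each layer's input is a per-token weighted mix of earlier layer outputs (including x0), with input-dependent weights. This is only the generic mechanism; the lite MUDD parameterization above restricts which query layers, key windows, and streams participate:

```python
import numpy as np

def dynamic_dense_mix(histories, w):
    """Dynamic dense connection: the next layer's input is a per-token
    weighted mix of all retained earlier layer outputs (incl. x0)."""
    stacked = np.stack(histories, axis=-1)    # (seq, dim, n_layers)
    return np.einsum('sdl,sl->sd', stacked, w)

x0 = np.ones((3, 2))                          # embedding-layer output
x1 = 2 * np.ones((3, 2))                      # a later layer's output
w = np.tile(np.array([0.25, 0.75]), (3, 1))   # per-token mixing weights
mixed = dynamic_dense_mix([x0, x1], w)
```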
GQA
Switched back to MHA because MUDD Connections prefers a wide V-stream.
parameters: {"kv_heads":8}
LeakyReLU
Uses LeakyReLU squared activation.
parameters: {"alpha":0.5,"power":2}
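One plausible reading of this activation, squaring the LeakyReLU output elementwise with the stated alpha=0.5 and power=2:

```python
def leaky_relu_squared(x, alpha=0.5, power=2):
    """LeakyReLU with negative slope `alpha`, raised to `power`.
    Squaring the leaky output is an assumed interpretation."""
    l = x if x > 0 else alpha * x
    return l ** power

pos = leaky_relu_squared(2.0)    # (2.0)^2
neg = leaky_relu_squared(-2.0)   # (0.5 * -2.0)^2
```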
Partial RoPE
Partial rotary position embeddings on a subset of dimensions.
parameters: {"dimensions":"16/64"}
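A sketch of partial RoPE matching the 16/64 split: rotary embeddings are applied to the first 16 dimensions of each head and the remaining 48 pass through unchanged (which 16 dims are rotated is an assumption):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` of the head dimension by
    position-dependent angles; leave the rest untouched."""
    seq, dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))   # (seq, head_dim)
y = partial_rope(x)
```

Position 0 is left unrotated (angle zero), and rotation preserves the norm of the rotated slice.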
weight tying
Tied input and output embeddings.
parameters: null
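Weight tying reuses one table for both the input embedding and the output head, which is what keeps the artifact small; a minimal sketch:

```python
import numpy as np

vocab, dim = 10, 4
E = np.random.default_rng(0).normal(size=(vocab, dim))  # shared table

def embed(ids):
    return E[ids]        # input embedding: row lookup

def unembed(h):
    return h @ E.T       # output logits reuse the same table (tied)

logits = unembed(embed(np.array([3])))
```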
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
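Muon's distinguishing step is orthogonalizing the momentum update via Newton-Schulz iteration. The sketch below uses the standard cubic iteration with the stated 5 steps; the "MuonEq-R" variant's exact iteration and coefficients are not detailed in this record, and production Muon implementations typically use a tuned quintic form:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximate the nearest (semi-)orthogonal matrix to G with a
    cubic Newton-Schulz iteration: X <- 1.5*X - 0.5*X X^T X."""
    X = G / (np.linalg.norm(G) + 1e-7)   # scale singular values into (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

X = newton_schulz_orthogonalize(np.diag([0.9, 0.8]), steps=5)
```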
AdamW
weight_decay: 0.095
momentum: null
other_params: {"mlr":0.022}
Weight Averaging
EMA
parameters: {"decay":0.9965}
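One EMA step over the weights with the stated decay of 0.9965:

```python
def ema_update(avg, current, decay=0.9965):
    """Exponential moving average over a dict of parameters."""
    return {k: decay * avg[k] + (1 - decay) * current[k] for k in avg}

avg = {"w": 1.0}
avg = ema_update(avg, {"w": 2.0})   # moves 0.35% of the way toward 2.0
```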
Compression
lzma
level: null
brotli
level: 11
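The LZMA leg of the artifact packing can be sketched with the standard library; Brotli at quality 11 requires the third-party `brotli` package and is omitted here:

```python
import lzma

# toy stand-in for the serialized artifact bytes
blob = bytes(range(256)) * 64
packed = lzma.compress(blob, preset=9)   # highest stdlib preset
restored = lzma.decompress(packed)
```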
Evaluation
sliding window eval
parameters: {"causal":true}
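Causal sliding-window evaluation scores every token exactly once while giving each scored span as much left context as the window allows. A sketch of the span bookkeeping, with assumed window/stride values for illustration:

```python
def sliding_window_positions(n_tokens, window=1024, stride=512):
    """Yield (ctx_start, target_start, target_end) spans: each stride
    of targets is scored once, with up to `window` tokens of causal
    context ending at target_end."""
    spans = []
    target_start = 0
    while target_start < n_tokens:
        target_end = min(target_start + stride, n_tokens)
        ctx_start = max(0, target_end - window)
        spans.append((ctx_start, target_start, target_end))
        target_start = target_end
    return spans

spans = sliding_window_positions(2000)
```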
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"chunk_tokens":32000,"optimizer":"SGD"}
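The "legal" property of score-first TTT is the ordering: each chunk is scored with the current weights before any gradient step is taken on it, so no token is ever evaluated by a model that has already trained on that token. A toy 1-D sketch of that loop (the real model, chunking, and SGD details are abstracted away):

```python
def score_first_ttt(loss_and_grad, params, chunks, lr=0.005, epochs=3):
    """Score each chunk first, then adapt on it with plain SGD."""
    total = 0.0
    for chunk in chunks:
        loss, _ = loss_and_grad(params, chunk)
        total += loss                       # scored before any update
        for _ in range(epochs):             # then adapt on the chunk
            _, g = loss_and_grad(params, chunk)
            params = params - lr * g
    return total / len(chunks), params

# toy 1-D "model": loss = (params - chunk)^2
toy = lambda p, c: ((p - c) ** 2, 2 * (p - c))
mean_loss, adapted = score_first_ttt(toy, 0.0, [1.0], lr=0.1, epochs=1)
```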
LR Schedule
cosine decay
parameters: {"during_ttt":true}
warmdown
parameters: {"final_fraction":0.72}
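The two schedule pieces can be sketched as below. Reading `final_fraction: 0.72` as "the warmdown spans the final 72% of steps" is an assumption; the record does not spell out its semantics:

```python
import math

def cosine_decay(step, total_steps, base_lr):
    """Cosine decay from base_lr to 0 (the record also applies a
    cosine schedule during TTT)."""
    return base_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))

def warmdown(step, total_steps, base_lr, final_fraction=0.72):
    """Constant LR, then linear warmdown over the final
    `final_fraction` of steps (assumed interpretation)."""
    start = int(total_steps * (1 - final_fraction))
    if step < start:
        return base_lr
    return base_lr * (1 - (step - start) / (total_steps - start))
```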
Novel Contributions
- Lite MUDD Connections with constrained query/key/stream connectivity
- 3-layer depth recurrence with 17 virtual layers from 11 physical layers
- Parallel residuals from layer 7
- QK-Gain 5.25
- Legal score-first test-time training
- Mixed int6/int8 GPTQ quantization with SDClip
- Artifact compression using LZMA and Brotli