PR #1493

RECORD · open

Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean)

val_bpb: 1.0810
Architecture: Transformer
Optimizer: SGD
Artifact Size: ~15.99 MB

Training Techniques

Quantization
  GPTQ
    bits: 6
    scope: attention/MLP matrices
  int8
    bits: 8
    scope: embeddings
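The mixed-precision scheme above puts attention/MLP matrices on a 6-bit grid and embeddings on an 8-bit grid. A minimal sketch of the uniform symmetric rounding these settings target (illustrative only: GPTQ additionally applies Hessian-based error compensation while rounding, which is not shown here):

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Uniform symmetric quantization to a signed `bits`-bit grid.
    Shows only the target grid, not GPTQ's error compensation."""
    qmax = 2 ** (bits - 1) - 1           # 31 for 6-bit, 127 for 8-bit
    scale = np.abs(w).max() / qmax       # one scale per tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q6, s6 = quantize_symmetric(w, bits=6)   # attention/MLP matrices
q8, s8 = quantize_symmetric(w, bits=8)   # embeddings
```

Per-tensor scales are the simplest choice; per-channel scales would shrink the reconstruction error further at a small metadata cost.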
Architecture
  depth recurrence
    3-layer recurrence over layers 3-5, creating 17 virtual layers from 11 physical layers.
    parameters: {"layers":[3,4,5],"virtual_layers":17,"physical_layers":11}
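The depth-recurrence entry above is consistent with re-applying the contiguous block of layers 3-5 three times: 8 non-recurrent layers + 3·3 recurrent applications = 17 virtual layers from 11 physical ones. The exact wiring is not stated in the record; a sketch under that assumption:

```python
def recurrent_forward(x, layers, recur=(3, 4, 5), passes=3):
    """Run `layers` in order, but apply the contiguous block `recur`
    `passes` times, turning 11 physical layers into 17 virtual ones."""
    for i, layer in enumerate(layers):
        if i == recur[0]:
            for _ in range(passes):   # 3 passes over layers 3-5
                for j in recur:
                    x = layers[j](x)
        elif i in recur:
            continue                  # already consumed by the loop above
        else:
            x = layer(x)
    return x
```

Weight sharing across passes is what keeps the parameter count (and artifact size) at the 11-layer level while spending 17 layers of compute.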
  parallel residuals
    GPT-J-style parallel residual pathway where attention and MLP read from the same input.
    parameters: {"layers":"7+"}
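The parallel pathway replaces the usual sequential residual structure (MLP reads the attention output) with a single residual add over both branches; a framework-free sketch with callables standing in for the sublayers:

```python
def parallel_block(x, norm, attn, mlp):
    """GPT-J-style parallel residual: attention and MLP both read the
    same normalized input, and one residual add sums both branches."""
    h = norm(x)
    return x + attn(h) + mlp(h)

def sequential_block(x, norm, attn, mlp):
    """Conventional pre-norm block, for comparison: the MLP reads the
    attention output, so the two branches cannot run concurrently."""
    x = x + attn(norm(x))
    return x + mlp(norm(x))
```

Besides letting the two branches execute in parallel, the shared input also removes one LayerNorm per block.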
  weight tying
    Input and output embeddings are tied.
    parameters: null
  LeakyReLU
    LeakyReLU activation used in the MLP.
    parameters: {"slope":0.5}
  Partial RoPE
    Rotary position embeddings applied to a subset of the head dimensions (16 of 64).
    parameters: {"dimensions":"16/64"}
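With the "16/64" setting, only the first 16 of 64 head dimensions are rotated and the rest pass through position-free. A sketch assuming the half-split pairing convention (the record does not say whether pairs are split or interleaved):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply RoPE to the first `rot_dims` of the head dimension of
    x (shape: seq_len x head_dim); leave the rest unchanged."""
    seq, dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)           # (half,)
    theta = np.arange(seq)[:, None] * inv_freq[None, :]    # (seq, half)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, :half], x[:, half:rot_dims]              # rotated pairs
    rotated = np.concatenate(
        [x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

The unrotated 48 dimensions give attention a content-only channel, which partial-RoPE setups rely on.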
  U-Net skip connections
    Skip gates / U-Net-style skip connections are used.
    parameters: null
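The record does not specify the gating scheme. A common form adds each early layer's input into its mirrored late layer through a learned scalar gate; a sketch under that assumption, with plain callables standing in for layers:

```python
def unet_skips(x, layers, gates):
    """U-Net-style long skips over a symmetric stack: inputs to the
    first half are saved and added, scaled by a per-skip gate, into
    the mirrored second-half layers. Gate placement is an assumption."""
    n = len(layers)
    saved = []
    for i, layer in enumerate(layers):
        if i < n // 2:
            saved.append(x)                         # encoder half: stash
            x = layer(x)
        else:
            skip = saved[n - 1 - i]                 # mirrored activation
            x = layer(x + gates[n - 1 - i] * skip)  # gated long skip
    return x
```

Initializing the gates at zero recovers the plain stack, so the skips can be learned in rather than imposed.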
Regularization
  logit softcap
    parameters: {"value":30}
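A logit softcap of 30 corresponds to the standard `cap * tanh(logits / cap)` squashing (as popularized by Gemma 2): near-identity for small logits, saturating smoothly at ±30.

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Soft-cap logits into (-cap, cap) via cap * tanh(logits / cap)."""
    return cap * np.tanh(np.asarray(logits) / cap)
```

Unlike hard clipping, the gradient never goes exactly to zero, so extreme logits are still trainable.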
Optimizer
  SGD
    weight_decay: 0.095
    momentum: 0.9
    other_params: {"learning_rate":0.005}
Weight Averaging
  EMA
    parameters: {"decay":0.9965}
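EMA with decay 0.9965 maintains a shadow copy of the weights, updated after every optimizer step; the effective averaging horizon is roughly 1/(1 - 0.9965) ≈ 286 steps. A minimal sketch over a flat parameter dict (the dict layout is illustrative):

```python
class EMA:
    """Shadow-weight EMA: shadow = decay * shadow + (1 - decay) * current."""
    def __init__(self, params, decay=0.9965):
        self.decay = decay
        self.shadow = {k: float(v) for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1 - d) * float(v)
```

Evaluation (and the final artifact) would use `ema.shadow` rather than the raw trained weights.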
Compression
  lzma
    level: null
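The artifact is packed with LZMA; since the record lists level: null, the sketch below leaves the preset at the library default. Using Python's stdlib `lzma`:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """LZMA-pack serialized (quantized) weights at the default preset."""
    return lzma.compress(raw)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```

Quantizing to int6/int8 first (as above) makes the byte stream far more repetitive, which is what lets LZMA pull the artifact down to ~15.99 MB.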
Evaluation
  sliding window eval
    parameters: null
Test-Time Training
  score-first TTT
    parameters: {"learning_rate":0.005,"epochs":3}
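"Score-first" presumably means each test chunk is scored with the current weights before the model trains on it, so no chunk's score ever depends on its own contents; this prequential ordering is what keeps TTT legal. The actual protocol of Issue #1017 may differ; a toy sketch under that reading, using the record's epochs setting:

```python
def score_first_ttt(chunks, score, train, epochs=3):
    """Prequential test-time training: score each chunk, *then* adapt
    on it before moving to the next. `score` and `train` are callables
    closing over the model state (names here are illustrative)."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))   # evaluate with pre-update weights
        for _ in range(epochs):
            train(chunk)              # then adapt on the scored chunk
    return losses
```

Later chunks thus benefit from adaptation to earlier ones, while each individual score stays untouched by its own data.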
LR Schedule
  cosine decay
    parameters: null
  warmdown
    parameters: {"warmdown":0.72}
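The record lists both cosine decay and a warmdown at 0.72 without saying how they combine. One plausible reading (an assumption): cosine decay over the run, with an additional linear warmdown to zero over the final segment starting at 72% of the steps. All names below are illustrative:

```python
import math

def lr_at(step, total, base_lr=0.005, warmdown_frac=0.72):
    """Cosine decay from base_lr, with a linear warmdown to zero over
    the last (1 - warmdown_frac) of training. How the record's two
    schedule entries actually interact is not specified."""
    t = step / total
    lr = base_lr * 0.5 * (1 + math.cos(math.pi * t))   # cosine decay
    if t > warmdown_frac:                              # linear tail to 0
        lr *= (1 - t) / (1 - warmdown_frac)
    return lr
```

The base_lr of 0.005 matches the SGD learning_rate listed above.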

Novel Contributions

  • SP8192 with GPTQ SDClip quantization
  • 3-layer depth recurrence over layers 3-5
  • Parallel residuals from layer 7 onward
  • QK-Gain 5.25
  • Legal score-first test-time training under Issue #1017 constraints
  • Mixed int6/int8 model compression with LZMA code wrapper