PR #1492
closedRecord: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean)
by bigbag
val_bpb
1.0810
Architecture
Transformer
Optimizer
SGD
Artifact Size
~15.99 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
int8
bits: 8
scope: embeddings
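GPTQ's Hessian-aware rounding is too involved for a short sketch, but the effect of b-bit weight storage can be illustrated with a minimal symmetric round-to-nearest quantizer. This is a generic illustration, not the submission's actual quantization code.

```python
import numpy as np

def quantize(w, bits):
    # Symmetric per-tensor quantization: map the largest magnitude to the
    # top of the signed integer range, round everything else to that grid.
    qmax = 2 ** (bits - 1) - 1                   # 127 for int8, 31 for 6-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q * scale

w = np.random.default_rng(0).normal(size=(64, 64))
q8, s8 = quantize(w, bits=8)
err8 = np.abs(dequantize(q8, s8) - w).max()      # bounded by half a scale step
```

Lower bit widths coarsen the grid (larger `scale`), which is why the 6-bit setting is reserved for the attention/MLP matrices that GPTQ can compensate for.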
Architecture
depth recurrence
A 3-layer block (layers 3-5) applied recurrently, expanding 11 physical layers into 17 virtual layers.
parameters: {"layers":[3,4,5],"virtual_layers":17,"physical_layers":11}
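One way to read these parameters: the 3-layer block is looped twice more after its first pass, so the forward pass visits 17 layer slots while only 11 sets of weights exist. A sketch of that execution schedule (the scheduling function is a hypothetical stand-in, not the submission's code):

```python
def virtual_layer_schedule(physical_layers=11, recurrent=(3, 4, 5), virtual_layers=17):
    """Expand physical layer indices into a virtual execution order by
    repeating the recurrent block until the target depth is reached."""
    base = list(range(physical_layers))              # physical layers 0..10
    extra = virtual_layers - physical_layers         # 6 extra layer passes needed
    repeats = extra // len(recurrent)                # 2 extra loops of the block
    schedule = []
    for i in base:
        schedule.append(i)
        if i == recurrent[-1]:                       # after the block's first pass,
            schedule.extend(list(recurrent) * repeats)  # run it again
    return schedule

order = virtual_layer_schedule()
# 17 entries, with layers 3-5 each executed 3 times on shared weights
```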
parallel residuals
GPT-J style parallel residual connections where attention and MLP read from the same pre-residual input.
parameters: {"layers":"7+"}
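In the GPT-J formulation, attention and MLP are computed from one shared pre-norm of the residual stream instead of being chained. A minimal sketch, with identity-like stand-ins for the sub-modules:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def parallel_block(x, attn, mlp):
    h = layer_norm(x)                  # one shared pre-norm
    return x + attn(h) + mlp(h)        # both branches read the same input

def sequential_block(x, attn, mlp):
    x = x + attn(layer_norm(x))        # standard formulation, for contrast
    return x + mlp(layer_norm(x))

x = np.random.default_rng(0).normal(size=(4, 8))
out = parallel_block(x, attn=lambda h: 0.5 * h, mlp=lambda h: 0.1 * h)
```

The two branches can run concurrently since neither depends on the other's output; the submission applies this only from layer 7 onward.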
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
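With 8 query heads and 4 KV heads, each K/V head serves 2 query heads, halving KV storage. A numpy sketch under assumed toy shapes:

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    # q: (T, n_heads, d); k, v: (T, n_kv_heads, d)
    T, _, d = q.shape
    group = n_heads // n_kv_heads                    # query heads per KV head
    k = np.repeat(k, group, axis=1)                  # share each KV head
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                 # softmax over keys
    return np.einsum('hqk,khd->qhd', w, v)

rng = np.random.default_rng(0)
T, d = 5, 16
out = gqa(rng.normal(size=(T, 8, d)),
          rng.normal(size=(T, 4, d)),
          rng.normal(size=(T, 4, d)))
```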
LeakyReLU
LeakyReLU activation used in the MLP.
parameters: {"slope":0.5}
Partial RoPE
Rotary position embeddings applied to 16 of the 64 head dimensions; the rest carry no positional rotation.
parameters: {"dimensions":"16/64"}
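A sketch of partial RoPE, rotating only the first 16 of 64 head dimensions (rotating the leading dims and a base frequency of 10000 are conventional assumptions, not stated in the card):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # x: (T, head_dim); rotate dims [0, rot_dims), pass the rest through
    T, head_dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(T), inv_freq)        # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.random.default_rng(0).normal(size=(7, 64))
out = partial_rope(x)
```

Position 0 is left unrotated (all angles are zero), and the trailing 48 dimensions are position-independent everywhere.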
weight tying
Input and output embeddings are tied.
parameters: null
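Tying means the output projection reuses the input embedding matrix, so no separate `lm_head` weights are stored. A toy sketch:

```python
import numpy as np

vocab, d = 100, 16
embed = np.random.default_rng(0).normal(size=(vocab, d))

h = embed[3]            # some hidden state (toy: an embedding row itself)
logits = h @ embed.T    # tied lm_head: score against the same matrix
```

At ~16 MB total artifact size, dropping the duplicate `vocab x d` matrix is a meaningful saving.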
MLP3x
Expanded MLP width relative to the base transformer.
parameters: {"multiplier":4}
Regularization
logit softcap
parameters: {"value":30}
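Logit soft-capping squashes logits through a scaled tanh so they saturate smoothly at the cap. The `cap * tanh(x / cap)` form below is the common (Gemma-2-style) formulation; that this exact form is used here is an assumption.

```python
import numpy as np

def softcap(logits, cap=30.0):
    # Near-identity for |x| << cap, saturates at +/- cap for large |x|
    return cap * np.tanh(logits / cap)

z = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
capped = softcap(z)
```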
layerwise LN scale
parameters: null
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.005,"epochs":3,"gradient_clipping":1}
AdamW
weight_decay: 0.095
momentum: null
other_params: {"mlr":0.022}
Weight Averaging
EMA
parameters: {"decay":0.9965}
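With decay 0.9965, the EMA shadow moves 0.35% toward the live weights per step, and the shadow (not the live weights) is what gets evaluated. A minimal sketch with a dict of scalars; initializing the shadow to zero is a toy simplification (it is normally initialized to the starting weights):

```python
def ema_update(shadow, weights, decay=0.9965):
    # One EMA step per optimizer step
    return {k: decay * shadow[k] + (1.0 - decay) * weights[k] for k in shadow}

shadow = {"w": 0.0}
for step in range(3):             # pretend the live weight sits at 1.0
    shadow = ema_update(shadow, {"w": 1.0})
# after n steps: shadow = 1 - decay**n
```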
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"chunk_size":32000}
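"Score-first" means each evaluation chunk is scored with the current weights before the model is allowed to train on it, so no token is ever scored by a model that has already seen it (the compliance condition referenced in the novel contributions). A toy sketch with stand-in model/score/train functions:

```python
def score_first_ttt(chunks, model, score, train, lr=0.005, epochs=3):
    total, count = 0.0, 0
    for chunk in chunks:
        total += score(model, chunk)      # evaluate first (legal ordering)
        count += len(chunk)
        for _ in range(epochs):           # then adapt on the same chunk
            model = train(model, chunk, lr)
    return total / count

# Toy instantiation: "model" is a scalar bias nudged toward the data.
chunks = [[1.0, 1.0], [3.0, 3.0]]
score = lambda m, c: sum((x - m) ** 2 for x in c)
train = lambda m, c, lr: m + lr * sum(c)  # crude gradient-like step
avg = score_first_ttt(chunks, 0.0, score, train, lr=0.1, epochs=3)
```

In the submission the chunk size (32000) matches the eval length, so adaptation carries over between chunks but never leaks within one.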
LR Schedule
cosine decay
parameters: null
warmdown
parameters: {"warmdown":0.72}
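Reading `warmdown: 0.72` as the fraction of training spent decaying (an assumption; how it composes with the cosine decay listed above is not specified), the schedule holds the LR flat and then ramps linearly to zero over the final 72% of steps:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.72):
    start = total_steps * (1.0 - warmdown_frac)   # decay begins here
    if step < start:
        return base_lr                            # flat phase
    return base_lr * (total_steps - step) / (total_steps - start)

lrs = [warmdown_lr(s, total_steps=100, base_lr=0.005) for s in range(101)]
```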
Sequence Length
sequence_length
train_length: 8192
eval_length: 32000
Novel Contributions
- SP8192 with GPTQ SDClip quantization and selective pruning
- 3-layer depth recurrence producing 17 virtual layers from 11 physical layers
- Parallel residual connections in later layers
- QK-Gain 5.25 with monotonic improvement over lower gains
- Legal score-first test-time training under Issue #1017 compliance
- Artifact compression via LZMA code wrapper to fit under the size limit