PR #1492
closedRecord: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean)
by bigbag
val_bpb
1.0810
Architecture
Transformer
Optimizer
SGD
Artifact Size
~15.99 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
int8
bits: 8
scope: embeddings
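GPTQ's Hessian-aware rounding is too involved for a short sketch, but the effect of b-bit weight storage can be illustrated with a minimal symmetric round-to-nearest quantizer. This is a generic illustration, not the submission's actual quantization code.

```python
import numpy as np

def quantize(w, bits):
    # Symmetric per-tensor quantization: map the largest magnitude to the
    # top of the signed integer range, round everything else to that grid.
    qmax = 2 ** (bits - 1) - 1                   # 127 for int8, 31 for 6-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q * scale

w = np.random.default_rng(0).normal(size=(64, 64))
q8, s8 = quantize(w, bits=8)
err8 = np.abs(dequantize(q8, s8) - w).max()      # bounded by half a scale step
```

Lower bit widths coarsen the grid (larger `scale`), which is why the 6-bit setting is reserved for the attention/MLP matrices that GPTQ can compensate for.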
Architecture
depth recurrence
A 3-layer block (layers 3-5) applied recurrently, expanding 11 physical layers into 17 virtual layers.
parameters: {"layers":[3,4,5],"virtual_layers":17,"physical_layers":11}
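One way to read these parameters: the 3-layer block is looped twice more after its first pass, so the forward pass visits 17 layer slots while only 11 sets of weights exist. A sketch of that execution schedule (the scheduling function is a hypothetical stand-in, not the submission's code):

```python
def virtual_layer_schedule(physical_layers=11, recurrent=(3, 4, 5), virtual_layers=17):
    """Expand physical layer indices into a virtual execution order by
    repeating the recurrent block until the target depth is reached."""
    base = list(range(physical_layers))              # physical layers 0..10
    extra = virtual_layers - physical_layers         # 6 extra layer passes needed
    repeats = extra // len(recurrent)                # 2 extra loops of the block
    schedule = []
    for i in base:
        schedule.append(i)
        if i == recurrent[-1]:                       # after the block's first pass,
            schedule.extend(list(recurrent) * repeats)  # run it again
    return schedule

order = virtual_layer_schedule()
# 17 entries, with layers 3-5 each executed 3 times on shared weights
```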
parallel residuals
GPT-J style parallel residual connections where attention and MLP read from the same pre-residual input.
parameters: {"layers":"7+"}
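In the GPT-J formulation, attention and MLP are computed from one shared pre-norm of the residual stream instead of being chained. A minimal sketch, with identity-like stand-ins for the sub-modules:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def parallel_block(x, attn, mlp):
    h = layer_norm(x)                  # one shared pre-norm
    return x + attn(h) + mlp(h)        # both branches read the same input

def sequential_block(x, attn, mlp):
    x = x + attn(layer_norm(x))        # standard formulation, for contrast
    return x + mlp(layer_norm(x))

x = np.random.default_rng(0).normal(size=(4, 8))
out = parallel_block(x, attn=lambda h: 0.5 * h, mlp=lambda h: 0.1 * h)
```

The two branches can run concurrently since neither depends on the other's output; the submission applies this only from layer 7 onward.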
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
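With 8 query heads and 4 KV heads, each K/V head serves 2 query heads, halving KV storage. A numpy sketch under assumed toy shapes:

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    # q: (T, n_heads, d); k, v: (T, n_kv_heads, d)
    T, _, d = q.shape
    group = n_heads // n_kv_heads                    # query heads per KV head
    k = np.repeat(k, group, axis=1)                  # share each KV head
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                 # softmax over keys
    return np.einsum('hqk,khd->qhd', w, v)

rng = np.random.default_rng(0)
T, d = 5, 16
out = gqa(rng.normal(size=(T, 8, d)),
          rng.normal(size=(T, 4, d)),
          rng.normal(size=(T, 4, d)))
```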
LeakyReLU
LeakyReLU activation used in the MLP.
parameters: {"slope":0.5}
Partial RoPE
Rotary position embeddings applied to 16 of the 64 head dimensions; the rest carry no positional rotation.
parameters: {"dimensions":"16/64"}
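A sketch of partial RoPE, rotating only the first 16 of 64 head dimensions (rotating the leading dims and a base frequency of 10000 are conventional assumptions, not stated in the card):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # x: (T, head_dim); rotate dims [0, rot_dims), pass the rest through
    T, head_dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(T), inv_freq)        # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.random.default_rng(0).normal(size=(7, 64))
out = partial_rope(x)
```

Position 0 is left unrotated (all angles are zero), and the trailing 48 dimensions are position-independent everywhere.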
weight tying
Input and output embeddings are tied.
parameters: null
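Tying means the output projection reuses the input embedding matrix, so no separate `lm_head` weights are stored. A toy sketch:

```python
import numpy as np

vocab, d = 100, 16
embed = np.random.default_rng(0).normal(size=(vocab, d))

h = embed[3]            # some hidden state (toy: an embedding row itself)
logits = h @ embed.T    # tied lm_head: score against the same matrix
```

At ~16 MB total artifact size, dropping the duplicate `vocab x d` matrix is a meaningful saving.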
MLP3x
Expanded MLP width relative to the base transformer.
parameters: {"multiplier":4}
Regularization
logit softcap
parameters: {"value":30}
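Logit soft-capping squashes logits through a scaled tanh so they saturate smoothly at the cap. The `cap * tanh(x / cap)` form below is the common (Gemma-2-style) formulation; that this exact form is used here is an assumption.

```python
import numpy as np

def softcap(logits, cap=30.0):
    # Near-identity for |x| << cap, saturates at +/- cap for large |x|
    return cap * np.tanh(logits / cap)

z = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
capped = softcap(z)
```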
layerwise LN scale
parameters: null
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.005,"epochs":3,"gradient_clipping":1}
AdamW
weight_decay: 0.095
momentum: null
other_params: {"mlr":0.022}
Weight Averaging
EMA
parameters: {"decay":0.9965}
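With decay 0.9965, the EMA shadow moves 0.35% toward the live weights per step, and the shadow (not the live weights) is what gets evaluated. A minimal sketch with a dict of scalars; initializing the shadow to zero is a toy simplification (it is normally initialized to the starting weights):

```python
def ema_update(shadow, weights, decay=0.9965):
    # One EMA step per optimizer step
    return {k: decay * shadow[k] + (1.0 - decay) * weights[k] for k in shadow}

shadow = {"w": 0.0}
for step in range(3):             # pretend the live weight sits at 1.0
    shadow = ema_update(shadow, {"w": 1.0})
# after n steps: shadow = 1 - decay**n
```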
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"chunk_size":32000}
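"Score-first" means each evaluation chunk is scored with the current weights before the model is allowed to train on it, so no token is ever scored by a model that has already seen it (the compliance condition referenced in the novel contributions). A toy sketch with stand-in model/score/train functions:

```python
def score_first_ttt(chunks, model, score, train, lr=0.005, epochs=3):
    total, count = 0.0, 0
    for chunk in chunks:
        total += score(model, chunk)      # evaluate first (legal ordering)
        count += len(chunk)
        for _ in range(epochs):           # then adapt on the same chunk
            model = train(model, chunk, lr)
    return total / count

# Toy instantiation: "model" is a scalar bias nudged toward the data.
chunks = [[1.0, 1.0], [3.0, 3.0]]
score = lambda m, c: sum((x - m) ** 2 for x in c)
train = lambda m, c, lr: m + lr * sum(c)  # crude gradient-like step
avg = score_first_ttt(chunks, 0.0, score, train, lr=0.1, epochs=3)
```

In the submission the chunk size (32000) matches the eval length, so adaptation carries over between chunks but never leaks within one.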
LR Schedule
cosine decay
parameters: null
warmdown
parameters: {"warmdown":0.72}
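Reading `warmdown: 0.72` as the fraction of training spent decaying (an assumption; how it composes with the cosine decay listed above is not specified), the schedule holds the LR flat and then ramps linearly to zero over the final 72% of steps:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.72):
    start = total_steps * (1.0 - warmdown_frac)   # decay begins here
    if step < start:
        return base_lr                            # flat phase
    return base_lr * (total_steps - step) / (total_steps - start)

lrs = [warmdown_lr(s, total_steps=100, base_lr=0.005) for s in range(101)]
```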
Sequence Length
sequence_length
train_length: 8192
eval_length: 32000
Novel Contributions
- SP8192 with GPTQ SDClip quantization and selective pruning
- 3-layer depth recurrence producing 17 virtual layers from 11 physical layers
- Parallel residual connections in later layers
- QK-Gain 5.25 with monotonic improvement over lower gains
- Legal score-first test-time training under Issue #1017 compliance
- Artifact compression via LZMA code wrapper to fit under the size limit