PR #1616

open

Attempt/qk gain 5.5 deeper recurrence

by Vickyrrrrrr
val_bpb: 1.4100
Architecture: Transformer
Optimizer: SGD
Artifact Size: 11.16 MB

Training Techniques

Quantization
GPTQ
bits: 8
scope: weights
Architecture
depth recurrence
Extended recurrence from 3 layers to 4 layers, looping layers 2-5 to increase virtual depth.
parameters: {"layers":4,"recurrent_layers":[2,3,4,5],"virtual_depth":18}
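The recurrence above can be sketched as a plain forward loop. The block bounds and loop count below are illustrative assumptions chosen so that two leading layers plus four passes over layers 2-5 reproduce the stated virtual depth of 18; the actual PR may partition layers differently.

```python
def recurrent_forward(x, blocks, loop_start=2, loop_end=5, loops=4):
    """Hypothetical depth-recurrence sketch: blocks[loop_start:loop_end+1]
    are re-applied `loops` times so the virtual depth exceeds the
    physical layer count."""
    applied = 0  # counts layer applications, i.e. the virtual depth
    for i in range(loop_start):                    # leading layers, once
        x = blocks[i](x)
        applied += 1
    for _ in range(loops):                         # looped span
        for i in range(loop_start, loop_end + 1):
            x = blocks[i](x)
            applied += 1
    for i in range(loop_end + 1, len(blocks)):     # trailing layers, once
        x = blocks[i](x)
        applied += 1
    return x, applied
```

With six physical blocks this yields 2 + 4×4 = 18 layer applications, matching the listed virtual_depth.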
GQA
Raised QK gain from 5.25 to 5.5 to test whether the improvement continues monotonically at higher values.
parameters: {"qk_gain":5.5}
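One plausible reading of the QK gain, sketched below as an assumption rather than the PR's actual code: a scalar multiplier on the scaled dot-product logits before softmax, which sharpens the attention distribution as the gain grows.

```python
import numpy as np

def attention_weights(q, k, qk_gain=5.5):
    """Hypothetical QK-gain sketch: scale the attention logits by a
    scalar gain (5.5 per the PR) before the softmax."""
    d = q.shape[-1]
    logits = qk_gain * (q @ k.T) / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)
```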
Parallel Residuals
Uses GPT-J style parallel residual connections from layer 7 onward.
parameters: {"start_layer":7}
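The GPT-J-style parallel residual can be contrasted with the standard sequential block in a few lines; `attn`, `mlp`, and `norm` stand in for the real submodules.

```python
def sequential_block(x, attn, mlp, norm):
    """Standard pre-norm block: the MLP sees the attention output."""
    x = x + attn(norm(x))
    return x + mlp(norm(x))

def parallel_block(x, attn, mlp, norm):
    """GPT-J-style parallel residual sketch: attention and MLP both read
    the same normalized input, and their outputs are summed into the
    residual stream in one step."""
    h = norm(x)
    return x + attn(h) + mlp(h)
```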
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
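A minimal sketch of partial RoPE under the PR's parameters: only the first 16 of 64 head dimensions are rotated, the rest pass through untouched. The pairing convention (first half against second half of the rotated slice) is an assumption.

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Hypothetical partial-RoPE sketch. x: (seq, head_dim) array,
    pos: (seq,) positions; only x[:, :rot_dims] is rotated."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = np.outer(pos, freqs)                  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```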
weight tying
Tied input embeddings and output embeddings.
parameters: null
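Weight tying needs no parameters because the output head simply reuses the input embedding matrix; a minimal sketch (shapes are illustrative):

```python
import numpy as np

# Hypothetical weight-tying sketch: one shared embedding table serves as
# both the input lookup and the output projection, so no separate
# unembedding matrix is stored.
rng = np.random.default_rng(0)
vocab_size, d_model = 100, 32
E = rng.normal(size=(vocab_size, d_model))  # shared embedding table

def embed(token_ids):
    return E[token_ids]                     # input embedding lookup

def tied_logits(hidden):
    return hidden @ E.T                     # output projection via E, tied
```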
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"negative_slope":0.5}
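One plausible reading of "LeakyReLU squared", by analogy with the squared-ReLU activations used in some fast-training MLPs, is to square the LeakyReLU output; the PR does not spell this out, so the sketch below is an assumption (negative_slope taken from its parameters).

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    """Hypothetical reading of "LeakyReLU squared": apply LeakyReLU,
    then square the result elementwise."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```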
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3}
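"Score-first" TTT presumably means each chunk is scored before the model adapts on it, so no token is ever evaluated by weights that have already trained on it; that ordering is what keeps the procedure legal. A toy sketch of that control flow, with `score` and `update` as stand-in closures:

```python
def score_first_ttt(chunks, score, update, epochs=3):
    """Hypothetical "score-first" TTT sketch: evaluate each chunk with
    the current weights first, then take gradient steps on it
    (epochs per chunk from the PR's parameters)."""
    total = 0.0
    for chunk in chunks:
        total += score(chunk)      # score BEFORE any adaptation
        for _ in range(epochs):    # then adapt on the same chunk
            update(chunk)
    return total / len(chunks)
```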
Optimizer
SGD
weight_decay: 0.095
momentum: 0.9
other_params: {"mlr":0.022}
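One common formulation of the SGD update with momentum and decoupled weight decay, sketched with the PR's listed hyperparameters; the extra "mlr" entry is left out because its meaning isn't stated.

```python
def sgd_step(w, grad, buf, lr, momentum=0.9, weight_decay=0.095):
    """Hypothetical SGD step: momentum buffer update followed by the
    parameter update with decoupled weight decay."""
    buf = momentum * buf + grad              # momentum accumulation
    w = w - lr * buf - lr * weight_decay * w # step plus decoupled decay
    return w, buf
```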
Weight Averaging
EMA
parameters: {"decay":0.9965}
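The EMA update itself is one line per parameter, with the PR's decay of 0.9965:

```python
def ema_update(avg, w, decay=0.9965):
    """EMA weight-averaging step: avg <- decay * avg + (1 - decay) * w,
    applied elementwise to every parameter after each optimizer step."""
    return decay * avg + (1.0 - decay) * w
```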
LR Schedule
warmdown
parameters: {"warmdown":0.72}
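A plausible reading of the warmdown schedule, sketched as an assumption: hold the base LR, then decay linearly to zero over the final 0.72 fraction of training (the parameter from the PR).

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.72):
    """Hypothetical warmdown sketch: constant LR for the first
    (1 - warmdown_frac) of training, then linear decay to zero."""
    decay_start = total_steps * (1.0 - warmdown_frac)
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - decay_start)
```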
Regularization
logit softcap
parameters: {"value":30}
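Logit soft-capping with value 30 is typically a tanh squash that bounds every logit's magnitude by the cap while leaving small logits almost unchanged:

```python
import math

def softcap(logit, cap=30.0):
    """Logit soft-cap sketch: cap * tanh(logit / cap) keeps the output
    within [-cap, cap] and is near-identity for |logit| << cap."""
    return cap * math.tanh(logit / cap)
```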
layerwise LN scale
parameters: null
Compression
zlib
level: null
lzma
level: null

Novel Contributions

  • Pushed QK-Gain from 5.25 to 5.5
  • Extended depth recurrence from 3 layers to 4 layers (layers 2-5)
  • Increased virtual depth from 15 to 18 layers
  • Combined SP8192, parallel residuals, legal score-first TTT, and recurrence in one submission
  • Used int8 + zlib roundtrip compression to fit artifact constraints
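The int8 + zlib roundtrip in the last bullet can be sketched as follows; the per-tensor symmetric scaling is an assumption about how the quantization was done, not the PR's actual code.

```python
import zlib
import numpy as np

# Hypothetical int8 + zlib roundtrip: quantize weights to int8 with a
# per-tensor scale, zlib-compress the bytes for the artifact, and
# dequantize (lossily) on load.
def pack(w):
    scale = float(np.abs(w).max()) / 127.0 or 1.0   # avoid a zero scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale, w.shape

def unpack(blob, scale, shape):
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return q.astype(np.float32).reshape(shape) * scale
```

The roundtrip error is bounded by half the quantization step, which is what makes the size/quality trade acceptable for fitting the artifact limit.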