PR #1616

open

Attempt/qk gain 5.5 deeper recurrence

by Vickyrrrrrr
val_bpb: 1.4100
Architecture: Transformer
Optimizer: SGD
Artifact Size: 11.16 MB

Training Techniques

Quantization
GPTQ
bits: 8
scope: weights
Architecture
depth recurrence
Extended recurrence from 3 layers to 4 layers, looping layers 2-5 to increase virtual depth.
parameters: {"layers":4,"recurrent_layers":[2,3,4,5],"virtual_depth":18}
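The recurrence above can be sketched as a plain forward loop. The block bounds and loop count below are illustrative assumptions chosen so that two leading layers plus four passes over layers 2-5 reproduce the stated virtual depth of 18; the actual PR may partition layers differently.

```python
def recurrent_forward(x, blocks, loop_start=2, loop_end=5, loops=4):
    """Hypothetical depth-recurrence sketch: blocks[loop_start:loop_end+1]
    are re-applied `loops` times so the virtual depth exceeds the
    physical layer count."""
    applied = 0  # counts layer applications, i.e. the virtual depth
    for i in range(loop_start):                    # leading layers, once
        x = blocks[i](x)
        applied += 1
    for _ in range(loops):                         # looped span
        for i in range(loop_start, loop_end + 1):
            x = blocks[i](x)
            applied += 1
    for i in range(loop_end + 1, len(blocks)):     # trailing layers, once
        x = blocks[i](x)
        applied += 1
    return x, applied
```

With six physical blocks this yields 2 + 4×4 = 18 layer applications, matching the listed virtual_depth.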
GQA
Raised QK gain from 5.25 to 5.5 to test whether the improvement continues monotonically at higher values.
parameters: {"qk_gain":5.5}
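One plausible reading of the QK gain, sketched below as an assumption rather than the PR's actual code: a scalar multiplier on the scaled dot-product logits before softmax, which sharpens the attention distribution as the gain grows.

```python
import numpy as np

def attention_weights(q, k, qk_gain=5.5):
    """Hypothetical QK-gain sketch: scale the attention logits by a
    scalar gain (5.5 per the PR) before the softmax."""
    d = q.shape[-1]
    logits = qk_gain * (q @ k.T) / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)
```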
Parallel Residuals
Uses GPT-J style parallel residual connections from layer 7 onward.
parameters: {"start_layer":7}
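The GPT-J-style parallel residual can be contrasted with the standard sequential block in a few lines; `attn`, `mlp`, and `norm` stand in for the real submodules.

```python
def sequential_block(x, attn, mlp, norm):
    """Standard pre-norm block: the MLP sees the attention output."""
    x = x + attn(norm(x))
    return x + mlp(norm(x))

def parallel_block(x, attn, mlp, norm):
    """GPT-J-style parallel residual sketch: attention and MLP both read
    the same normalized input, and their outputs are summed into the
    residual stream in one step."""
    h = norm(x)
    return x + attn(h) + mlp(h)
```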
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
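A minimal sketch of partial RoPE under the PR's parameters: only the first 16 of 64 head dimensions are rotated, the rest pass through untouched. The pairing convention (first half against second half of the rotated slice) is an assumption.

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Hypothetical partial-RoPE sketch. x: (seq, head_dim) array,
    pos: (seq,) positions; only x[:, :rot_dims] is rotated."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = np.outer(pos, freqs)                  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```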
weight tying
Tied input embeddings and output embeddings.
parameters: null
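Weight tying needs no parameters because the output head simply reuses the input embedding matrix; a minimal sketch (shapes are illustrative):

```python
import numpy as np

# Hypothetical weight-tying sketch: one shared embedding table serves as
# both the input lookup and the output projection, so no separate
# unembedding matrix is stored.
rng = np.random.default_rng(0)
vocab_size, d_model = 100, 32
E = rng.normal(size=(vocab_size, d_model))  # shared embedding table

def embed(token_ids):
    return E[token_ids]                     # input embedding lookup

def tied_logits(hidden):
    return hidden @ E.T                     # output projection via E, tied
```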
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"negative_slope":0.5}
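One plausible reading of "LeakyReLU squared", by analogy with the squared-ReLU activations used in some fast-training MLPs, is to square the LeakyReLU output; the PR does not spell this out, so the sketch below is an assumption (negative_slope taken from its parameters).

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    """Hypothetical reading of "LeakyReLU squared": apply LeakyReLU,
    then square the result elementwise."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```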
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3}
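"Score-first" TTT presumably means each chunk is scored before the model adapts on it, so no token is ever evaluated by weights that have already trained on it; that ordering is what keeps the procedure legal. A toy sketch of that control flow, with `score` and `update` as stand-in closures:

```python
def score_first_ttt(chunks, score, update, epochs=3):
    """Hypothetical "score-first" TTT sketch: evaluate each chunk with
    the current weights first, then take gradient steps on it
    (epochs per chunk from the PR's parameters)."""
    total = 0.0
    for chunk in chunks:
        total += score(chunk)      # score BEFORE any adaptation
        for _ in range(epochs):    # then adapt on the same chunk
            update(chunk)
    return total / len(chunks)
```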
Optimizer
SGD
weight_decay: 0.095
momentum: 0.9
other_params: {"mlr":0.022}
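One common formulation of the SGD update with momentum and decoupled weight decay, sketched with the PR's listed hyperparameters; the extra "mlr" entry is left out because its meaning isn't stated.

```python
def sgd_step(w, grad, buf, lr, momentum=0.9, weight_decay=0.095):
    """Hypothetical SGD step: momentum buffer update followed by the
    parameter update with decoupled weight decay."""
    buf = momentum * buf + grad              # momentum accumulation
    w = w - lr * buf - lr * weight_decay * w # step plus decoupled decay
    return w, buf
```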
Weight Averaging
EMA
parameters: {"decay":0.9965}
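The EMA update itself is one line per parameter, with the PR's decay of 0.9965:

```python
def ema_update(avg, w, decay=0.9965):
    """EMA weight-averaging step: avg <- decay * avg + (1 - decay) * w,
    applied elementwise to every parameter after each optimizer step."""
    return decay * avg + (1.0 - decay) * w
```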
LR Schedule
warmdown
parameters: {"warmdown":0.72}
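A plausible reading of the warmdown schedule, sketched as an assumption: hold the base LR, then decay linearly to zero over the final 0.72 fraction of training (the parameter from the PR).

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.72):
    """Hypothetical warmdown sketch: constant LR for the first
    (1 - warmdown_frac) of training, then linear decay to zero."""
    decay_start = total_steps * (1.0 - warmdown_frac)
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - decay_start)
```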
Regularization
logit softcap
parameters: {"value":30}
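Logit soft-capping with value 30 is typically a tanh squash that bounds every logit's magnitude by the cap while leaving small logits almost unchanged:

```python
import math

def softcap(logit, cap=30.0):
    """Logit soft-cap sketch: cap * tanh(logit / cap) keeps the output
    within [-cap, cap] and is near-identity for |logit| << cap."""
    return cap * math.tanh(logit / cap)
```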
layerwise LN scale
parameters: null
Compression
zlib
level: null
lzma
level: null

Novel Contributions

  • Pushed QK-Gain from 5.25 to 5.5
  • Extended depth recurrence from 3 layers to 4 layers (layers 2-5)
  • Increased virtual depth from 15 to 18 layers
  • Combined SP8192, parallel residuals, legal score-first TTT, and recurrence in one submission
  • Used int8 + zlib roundtrip compression to fit artifact constraints
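The int8 + zlib roundtrip in the last bullet can be sketched as follows; the per-tensor symmetric scaling is an assumption about how the quantization was done, not the PR's actual code.

```python
import zlib
import numpy as np

# Hypothetical int8 + zlib roundtrip: quantize weights to int8 with a
# per-tensor scale, zlib-compress the bytes for the artifact, and
# dequantize (lossily) on load.
def pack(w):
    scale = float(np.abs(w).max()) / 127.0 or 1.0   # avoid a zero scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale, w.shape

def unpack(blob, scale, shape):
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return q.astype(np.float32).reshape(shape) * scale
```

The roundtrip error is bounded by half the quantization step, which is what makes the size/quality trade acceptable for fitting the artifact limit.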