PR #1532
openRecord: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT + Asynchronous Data Loader - val_bpb 1.0803
by nogakeren
val_bpb
1.0803
Architecture
Transformer
Optimizer
SGD
Artifact Size
~15.99 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: weights
int8
bits: 8
scope: embeddings
Architecture
depth recurrence
A block of three consecutive layers (physical layers 3, 4, and 5) is re-applied during the forward pass, unrolling 11 physical layers into 17 virtual layers; the recurrence activates at 35% of training.
parameters: {"layers":3,"activate_at_frac":0.35,"virtual_layers":17,"physical_layers":11}
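The parameters work out if the 3-layer block runs three times in total (one normal pass plus two extra loops): 11 + 2 × 3 = 17 virtual layers. A minimal sketch of that forward loop, with callables standing in for transformer layers (the function name and loop structure are illustrative assumptions, not the record's actual code):

```python
def run_with_recurrence(x, layers, recur_ids=(3, 4, 5), loops=3):
    """Apply `layers` in order, but run the block at `recur_ids` a total of
    `loops` times, so 11 physical layers unroll into 17 virtual layers."""
    applied = []                      # which physical layer ran at each virtual step
    i = 0
    while i < len(layers):
        if i == recur_ids[0]:
            for _ in range(loops):    # recurrent block: 3 layers x 3 passes
                for j in recur_ids:
                    x = layers[j](x)
                    applied.append(j)
            i = recur_ids[-1] + 1     # skip past the block after looping
        else:
            x = layers[i](x)          # ordinary, run-once layer
            applied.append(i)
            i += 1
    return x, applied
```

With 11 identity-like layers, `applied` has length 17, matching the virtual_layers parameter.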
parallel residuals
GPT-J style parallel residual connections where attention and MLP read from the same input.
parameters: {"layers":"7+"}
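The GPT-J parallel-residual ordering can be sketched next to the standard sequential one; plain callables stand in for the attention, MLP, and norm modules:

```python
def sequential_block(x, attn, mlp, norm1, norm2):
    # standard GPT-2 ordering: the MLP reads the attention output
    x = x + attn(norm1(x))
    return x + mlp(norm2(x))

def parallel_block(x, attn, mlp, norm):
    # GPT-J ordering: attention and MLP both read the same normalized
    # input, so the two sublayers can run concurrently and share one norm
    h = norm(x)
    return x + attn(h) + mlp(h)
```

Per the parameters, this record applies the parallel form only from layer 7 onward.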
QK-Gain
Learnable per-head query scaling.
parameters: {"gain":5.25}
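A minimal sketch of per-head query scaling as a single attention logit, assuming the gain is a learnable per-head scalar multiplied into the scaled dot product (whether 5.25 is the initialization or a fixed value is not stated in the record):

```python
import math

def attention_logit(q, k, gain):
    # per-head gain scales the query-key dot product on top of the usual
    # 1/sqrt(d) factor; the record sets gain = 5.25 for this run
    d = len(q)
    return gain * sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
```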
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"momentum":0.9}
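Assuming "score-first" means each evaluation chunk is scored before any gradient step has seen it (which is what keeps the TTT legal: the reported loss never benefits from training on its own data), the ordering can be sketched with SGD+momentum at the listed hyperparameters. The `ToyModel` and chunk structure are illustrative stand-ins, not the record's actual model:

```python
class ToyModel:
    """Stand-in for the language model: one scalar parameter w,
    squared-error loss against the chunk values."""
    def __init__(self):
        self.w = 0.0
    def loss(self, chunk):
        return sum((c - self.w) ** 2 for c in chunk) / len(chunk)
    def grad(self, chunk):
        return sum(2 * (self.w - c) for c in chunk) / len(chunk)

def score_first_ttt(model, chunks, lr=0.005, epochs=3, momentum=0.9):
    velocity, scores = 0.0, []
    for chunk in chunks:
        scores.append(model.loss(chunk))   # score with pre-update weights
        for _ in range(epochs):            # only then adapt on the chunk
            velocity = momentum * velocity + model.grad(chunk)
            model.w -= lr * velocity
    return scores
```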
Optimizer
SGD
weight_decay: 0.095
momentum: 0.9
other_params: {"lr":0.005}
Weight Averaging
EMA
parameters: {"decay":0.9965}
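EMA weight averaging keeps a shadow copy of the parameters that is updated once per step and used for evaluation instead of the raw weights. A minimal sketch with plain floats standing in for parameter tensors:

```python
class EMA:
    """Exponential moving average of model weights, decay 0.9965 per step."""
    def __init__(self, params, decay=0.9965):
        self.decay = decay
        self.shadow = list(params)          # averaged copy used for eval

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current weights
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]
```

After n updates toward a constant value v from a zero start, the shadow equals v * (1 - decay**n).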
LR Schedule
cosine decay
parameters: {"warmdown":0.72}
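Assuming warmdown 0.72 means the learning rate is held constant and then cosine-decayed over the final 72% of training (an interpretation of the parameter, not stated in the record), the schedule can be sketched as:

```python
import math

def lr_at(step, total_steps, base_lr=0.005, warmdown_frac=0.72):
    # constant LR for the first (1 - warmdown_frac) of training,
    # then cosine decay from base_lr to 0 over the remaining steps
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return base_lr
    progress = (step - start) / max(1, total_steps - start)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```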
Compression
lzma
level: null
Other
other
Asynchronous multi-threaded data loader with producer-consumer queue, batch prefetching, and pinned-memory transfer to hide CPU-to-GPU latency.
parameters: null
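The producer-consumer structure can be sketched with a bounded queue and a background thread: the producer prefetches batches while the training loop consumes them, hiding data-prep latency. Plain lists stand in for tensors; in the real loader the producer would also pin host memory and issue non-blocking CPU-to-GPU copies:

```python
import queue
import threading

class AsyncLoader:
    """Background producer thread prefetches batches into a bounded queue
    so the consumer (training loop) never waits on batch assembly."""

    def __init__(self, batches, prefetch=4):
        self.q = queue.Queue(maxsize=prefetch)   # bounds memory held in flight
        self.t = threading.Thread(target=self._produce, args=(batches,),
                                  daemon=True)
        self.t.start()

    def _produce(self, batches):
        for b in batches:
            # real loader: b = b.pin_memory() here, then the consumer does
            # b.to(device, non_blocking=True) to overlap copy with compute
            self.q.put(b)                        # blocks when queue is full
        self.q.put(None)                         # sentinel: end of data

    def __iter__(self):
        while True:
            b = self.q.get()
            if b is None:
                return
            yield b
```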
Novel Contributions
- Migrated ShuffledSequenceLoader next_batch logic to numpy to reduce redundant copies and aten::copy_ overhead.
- Implemented a multi-threaded asynchronous producer-consumer batch loader with prefetching and pinned-memory transfers.
- Combined SP8192 with 3-layer recurrence, parallel residuals, QK-Gain 5.25, and legal score-first TTT.
- Achieved val_bpb 1.0803 (mean over 3 seeds) while keeping the artifact under the 16 MB limit.
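The numpy migration in the first contribution can be sketched as follows, assuming batches are sliced out of one contiguous token array so each batch is a single fancy-indexing gather rather than many small per-sequence copies (the function signature is illustrative, not the record's actual `next_batch`):

```python
import numpy as np

def next_batch(tokens, batch_size, seq_len, rng):
    # sample random window starts, then gather inputs and shifted targets
    # with two vectorized index operations instead of per-sequence copies
    starts = rng.integers(0, len(tokens) - seq_len - 1, size=batch_size)
    idx = starts[:, None] + np.arange(seq_len)   # (batch, seq_len) indices
    x = tokens[idx]                              # inputs, one gather
    y = tokens[idx + 1]                          # targets shifted by one
    return x, y
```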