PR #1532
openRecord: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT + Asynchronous Data Loader - val_bpb 1.0803
by nogakeren
val_bpb
1.0803
Architecture
Transformer
Optimizer
SGD
Artifact Size
~15.99 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: weights
int8
bits: 8
scope: embeddings
Architecture
depth recurrence
A block of three consecutive layers (physical layers 3, 4, and 5) is re-applied during the forward pass, unrolling 11 physical layers into 17 virtual layers; the recurrence activates at 35% of training.
parameters: {"layers":3,"activate_at_frac":0.35,"virtual_layers":17,"physical_layers":11}
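The parameters work out if the 3-layer block runs three times in total (one normal pass plus two extra loops): 11 + 2 × 3 = 17 virtual layers. A minimal sketch of that forward loop, with callables standing in for transformer layers (the function name and loop structure are illustrative assumptions, not the record's actual code):

```python
def run_with_recurrence(x, layers, recur_ids=(3, 4, 5), loops=3):
    """Apply `layers` in order, but run the block at `recur_ids` a total of
    `loops` times, so 11 physical layers unroll into 17 virtual layers."""
    applied = []                      # which physical layer ran at each virtual step
    i = 0
    while i < len(layers):
        if i == recur_ids[0]:
            for _ in range(loops):    # recurrent block: 3 layers x 3 passes
                for j in recur_ids:
                    x = layers[j](x)
                    applied.append(j)
            i = recur_ids[-1] + 1     # skip past the block after looping
        else:
            x = layers[i](x)          # ordinary, run-once layer
            applied.append(i)
            i += 1
    return x, applied
```

With 11 identity-like layers, `applied` has length 17, matching the virtual_layers parameter.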
parallel residuals
GPT-J style parallel residual connections where attention and MLP read from the same input.
parameters: {"layers":"7+"}
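The GPT-J parallel-residual ordering can be sketched next to the standard sequential one; plain callables stand in for the attention, MLP, and norm modules:

```python
def sequential_block(x, attn, mlp, norm1, norm2):
    # standard GPT-2 ordering: the MLP reads the attention output
    x = x + attn(norm1(x))
    return x + mlp(norm2(x))

def parallel_block(x, attn, mlp, norm):
    # GPT-J ordering: attention and MLP both read the same normalized
    # input, so the two sublayers can run concurrently and share one norm
    h = norm(x)
    return x + attn(h) + mlp(h)
```

Per the parameters, this record applies the parallel form only from layer 7 onward.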
QK-Gain
Learnable per-head query scaling.
parameters: {"gain":5.25}
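A minimal sketch of per-head query scaling as a single attention logit, assuming the gain is a learnable per-head scalar multiplied into the scaled dot product (whether 5.25 is the initialization or a fixed value is not stated in the record):

```python
import math

def attention_logit(q, k, gain):
    # per-head gain scales the query-key dot product on top of the usual
    # 1/sqrt(d) factor; the record sets gain = 5.25 for this run
    d = len(q)
    return gain * sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
```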
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"momentum":0.9}
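Assuming "score-first" means each evaluation chunk is scored before any gradient step has seen it (which is what keeps the TTT legal: the reported loss never benefits from training on its own data), the ordering can be sketched with SGD+momentum at the listed hyperparameters. The `ToyModel` and chunk structure are illustrative stand-ins, not the record's actual model:

```python
class ToyModel:
    """Stand-in for the language model: one scalar parameter w,
    squared-error loss against the chunk values."""
    def __init__(self):
        self.w = 0.0
    def loss(self, chunk):
        return sum((c - self.w) ** 2 for c in chunk) / len(chunk)
    def grad(self, chunk):
        return sum(2 * (self.w - c) for c in chunk) / len(chunk)

def score_first_ttt(model, chunks, lr=0.005, epochs=3, momentum=0.9):
    velocity, scores = 0.0, []
    for chunk in chunks:
        scores.append(model.loss(chunk))   # score with pre-update weights
        for _ in range(epochs):            # only then adapt on the chunk
            velocity = momentum * velocity + model.grad(chunk)
            model.w -= lr * velocity
    return scores
```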
Optimizer
SGD
weight_decay: 0.095
momentum: 0.9
other_params: {"lr":0.005}
Weight Averaging
EMA
parameters: {"decay":0.9965}
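EMA weight averaging keeps a shadow copy of the parameters that is updated once per step and used for evaluation instead of the raw weights. A minimal sketch with plain floats standing in for parameter tensors:

```python
class EMA:
    """Exponential moving average of model weights, decay 0.9965 per step."""
    def __init__(self, params, decay=0.9965):
        self.decay = decay
        self.shadow = list(params)          # averaged copy used for eval

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current weights
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]
```

After n updates toward a constant value v from a zero start, the shadow equals v * (1 - decay**n).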
LR Schedule
cosine decay
parameters: {"warmdown":0.72}
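Assuming warmdown 0.72 means the learning rate is held constant and then cosine-decayed over the final 72% of training (an interpretation of the parameter, not stated in the record), the schedule can be sketched as:

```python
import math

def lr_at(step, total_steps, base_lr=0.005, warmdown_frac=0.72):
    # constant LR for the first (1 - warmdown_frac) of training,
    # then cosine decay from base_lr to 0 over the remaining steps
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return base_lr
    progress = (step - start) / max(1, total_steps - start)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```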
Compression
lzma
level: null
Other
other
Asynchronous multi-threaded data loader with producer-consumer queue, batch prefetching, and pinned-memory transfer to hide CPU-to-GPU latency.
parameters: null
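The producer-consumer structure can be sketched with a bounded queue and a background thread: the producer prefetches batches while the training loop consumes them, hiding data-prep latency. Plain lists stand in for tensors; in the real loader the producer would also pin host memory and issue non-blocking CPU-to-GPU copies:

```python
import queue
import threading

class AsyncLoader:
    """Background producer thread prefetches batches into a bounded queue
    so the consumer (training loop) never waits on batch assembly."""

    def __init__(self, batches, prefetch=4):
        self.q = queue.Queue(maxsize=prefetch)   # bounds memory held in flight
        self.t = threading.Thread(target=self._produce, args=(batches,),
                                  daemon=True)
        self.t.start()

    def _produce(self, batches):
        for b in batches:
            # real loader: b = b.pin_memory() here, then the consumer does
            # b.to(device, non_blocking=True) to overlap copy with compute
            self.q.put(b)                        # blocks when queue is full
        self.q.put(None)                         # sentinel: end of data

    def __iter__(self):
        while True:
            b = self.q.get()
            if b is None:
                return
            yield b
```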
Novel Contributions
- Migrated ShuffledSequenceLoader next_batch logic to numpy to reduce redundant copies and aten::copy_ overhead.
- Implemented a multi-threaded asynchronous producer-consumer batch loader with prefetching and pinned-memory transfers.
- Combined SP8192 with 3-layer recurrence, parallel residuals, QK-Gain 5.25, and legal score-first TTT.
- Achieved val_bpb 1.0803 (mean over 3 seeds) while keeping the artifact under the 16 MB limit.
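The numpy migration in the first contribution can be sketched as follows, assuming batches are sliced out of one contiguous token array so each batch is a single fancy-indexing gather rather than many small per-sequence copies (the function signature is illustrative, not the record's actual `next_batch`):

```python
import numpy as np

def next_batch(tokens, batch_size, seq_len, rng):
    # sample random window starts, then gather inputs and shifted targets
    # with two vectorized index operations instead of per-sequence copies
    starts = rng.integers(0, len(tokens) - seq_len - 1, size=batch_size)
    idx = starts[:, None] + np.arange(seq_len)   # (batch, seq_len) indices
    x = tokens[idx]                              # inputs, one gather
    y = tokens[idx + 1]                          # targets shifted by one
    return x, y
```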