PR #1647
openSP8192 + SLOT-4 + TTT + 3-Layer Recurrence + Parallel Residuals (1.0616 BPB)
by powerpratik
val_bpb
1.0616
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~16.0MB
Training Techniques
Quantization
mixed int6/int8
bits: 6
scope: model weights
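A minimal sketch of symmetric per-tensor quantization for the int6/int8 weight storage. The PR does not specify the mixing policy or scale scheme, so the per-tensor absmax scaling below is an assumption:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Symmetric per-tensor quantization of weights to signed `bits`-bit ints."""
    qmax = 2 ** (bits - 1) - 1            # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax if w.size else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 16)).astype(np.float32)
q6, s6 = quantize_symmetric(w, bits=6)
q8, s8 = quantize_symmetric(w, bits=8)
err6 = np.abs(dequantize(q6, s6) - w).max()
err8 = np.abs(dequantize(q8, s8) - w).max()
assert err8 < err6  # int8 round-trips more precisely than int6
```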
Compression
Brotli
level: null
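The quantized weights are then entropy-coded; the PR uses Brotli at its default level. Brotli needs a third-party package in Python, so zlib stands in below to show the same serialize-then-compress flow:

```python
import io
import zlib
import numpy as np

# Serialize the quantized tensor, then compress. The PR uses Brotli
# (e.g. via the third-party `brotli` package); zlib stands in here so
# the sketch needs only the standard library.
rng = np.random.default_rng(0)
q = rng.integers(-32, 32, size=(256, 256)).astype(np.int8)  # toy int6-range weights

buf = io.BytesIO()
np.save(buf, q)
raw = buf.getvalue()
packed = zlib.compress(raw, level=9)
assert len(packed) < len(raw)  # entropy coding shrinks the low-bit weights
```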
Architecture
depth recurrence
3-layer depth recurrence activated during training
parameters: {"layers":3}
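A sketch of depth recurrence: a small stack of layers is applied repeatedly, so effective depth grows without new parameters. `{"layers": 3}` is read here as a shared 3-layer block; the recurrence count and the toy residual layer are assumptions:

```python
import numpy as np

def layer(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One toy residual layer (stand-in for a transformer block)."""
    return x + np.tanh(x @ w)

rng = np.random.default_rng(0)
block = [0.1 * rng.normal(size=(8, 8)) for _ in range(3)]  # 3 shared layers

def forward(x, recurrences=2):
    for _ in range(recurrences):      # reuse the same 3 layers each pass
        for w in block:
            x = layer(x, w)
    return x

x = rng.normal(size=(4, 8))
y = forward(x)
assert y.shape == x.shape
```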
parallel residuals
GPT-J style parallel residual connections
parameters: null
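In GPT-J style parallel residuals, the attention and MLP branches both read the same normalized input and their outputs are summed into one residual update, instead of the MLP seeing the attention output. A minimal sketch with linear stand-ins for both sublayers:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w_attn = 0.1 * rng.normal(size=(d, d))   # stand-in for the attention sublayer
w_mlp = 0.1 * rng.normal(size=(d, d))    # stand-in for the MLP sublayer

def layernorm(x):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True) + 1e-5
    return (x - mu) / sd

def sequential_block(x):
    # Standard pre-norm block: the MLP sees the attention output.
    x = x + layernorm(x) @ w_attn
    x = x + layernorm(x) @ w_mlp
    return x

def parallel_block(x):
    # GPT-J style: both branches read the same input; one residual add.
    h = layernorm(x)
    return x + h @ w_attn + h @ w_mlp

x = rng.normal(size=(4, d))
assert parallel_block(x).shape == x.shape
```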
LeakyReLU
LeakyReLU activation used in the MLP
parameters: {"slope":0.5}
MLP3x
4x MLP expansion in the base stack
parameters: {"multiplier":4}
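The MLP with the listed parameters, sketched directly: a 4x hidden expansion and LeakyReLU with the unusually high negative slope of 0.5 (weight init here is an assumption):

```python
import numpy as np

def leaky_relu(x, slope=0.5):
    # Negative slope 0.5, per the PR's parameters.
    return np.where(x >= 0, x, slope * x)

d = 8
rng = np.random.default_rng(0)
w_in = 0.1 * rng.normal(size=(d, 4 * d))   # 4x expansion
w_out = 0.1 * rng.normal(size=(4 * d, d))

def mlp(x):
    return leaky_relu(x @ w_in) @ w_out

x = rng.normal(size=(4, d))
assert mlp(x).shape == (4, d)
```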
Test-Time Training
score-first TTT
parameters: {"epochs_per_chunk":3}
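"Score-first" TTT means each evaluation chunk is scored with the current weights before the model trains on it, so the reported loss is never measured on data the model has already fit. A toy sketch with a scalar predictor; `epochs_per_chunk=3` matches the PR's parameter, everything else is illustrative:

```python
def evaluate_with_ttt(chunks, epochs_per_chunk=3, lr=0.1):
    w = 0.0                       # toy parameter: predict each value as w
    total, count = 0.0, 0
    for chunk in chunks:
        for x in chunk:           # 1) score first, with current w
            total += (w - x) ** 2
            count += 1
        for _ in range(epochs_per_chunk):   # 2) then adapt on the chunk
            for x in chunk:
                w -= lr * 2 * (w - x)       # gradient step on squared error
    return total / count

drifting = [[1.0] * 8, [2.0] * 8, [3.0] * 8]   # distribution shifts per chunk
static = evaluate_with_ttt(drifting, epochs_per_chunk=0)
adapted = evaluate_with_ttt(drifting, epochs_per_chunk=3)
assert adapted < static   # adapting between chunks lowers the score
```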
Evaluation
sliding window eval
parameters: null
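Sliding-window evaluation scores every token with (up to) a full window of preceding context, sliding one position at a time rather than chopping the text into disjoint blocks. The scoring function below is a toy stand-in for the model's per-token loss:

```python
def score(token, context_tokens):
    # Toy per-token "loss": distance from the mean of the visible context.
    if not context_tokens:
        return float(abs(token))
    mean = sum(context_tokens) / len(context_tokens)
    return abs(token - mean)

def sliding_window_eval(tokens, context=4):
    losses = []
    for i, tok in enumerate(tokens):
        window = tokens[max(0, i - context):i]   # full context for every token
        losses.append(score(tok, window))
    return sum(losses) / len(losses)

seq = [1, 1, 1, 5, 5, 5, 5, 5]
val = sliding_window_eval(seq, context=4)
assert val >= 0
```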
Optimizer
AdamW
weight_decay: 0.01
momentum: null
other_params: {"lr":0.01}
Regularization
weight decay
parameters: {"value":0.01}
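One AdamW update with the listed settings (lr 0.01, decoupled weight decay 0.01), sketched in numpy. The betas and eps are not given in the PR, so the common defaults below are assumptions:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=0.01, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW step: Adam moments plus *decoupled* weight decay."""
    m = betas[0] * m + (1 - betas[0]) * g
    v = betas[1] * v + (1 - betas[1]) * g * g
    m_hat = m / (1 - betas[0] ** t)      # bias-corrected first moment
    v_hat = v / (1 - betas[1] ** t)      # bias-corrected second moment
    # Decay is applied directly to w, not folded into the gradient.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w = np.ones(3)
m = np.zeros(3)
v = np.zeros(3)
g = np.array([0.1, -0.2, 0.3])
w, m, v = adamw_step(w, g, m, v, t=1)
assert w.shape == (3,)
```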
Weight Averaging
EMA
parameters: {"decay":0.9965}
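EMA weight averaging with the listed decay 0.9965: the evaluation weights track an exponential moving average of the training weights. A minimal sketch:

```python
def ema_update(ema, w, decay=0.9965):
    # New EMA = decay * old EMA + (1 - decay) * current weights.
    return [decay * e + (1 - decay) * x for e, x in zip(ema, w)]

weights = [0.0, 0.0]
ema = list(weights)
for step in range(1000):
    weights = [1.0, -1.0]          # pretend training converged here
    ema = ema_update(ema, weights)
assert abs(ema[0] - 1.0) < 0.05    # EMA tracks the trained weights
```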
Novel Contributions
- SLOT (Sample-Level Optimization at Test-time) with per-window logit bias optimization
- 4-step AdamW optimization of a zero-initialized delta tensor at evaluation time
- Combining SLOT with the existing PR #1493 stack to improve validation BPB
- 3-seed evaluation showing improved mean BPB to 1.0616
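The SLOT idea above can be sketched as optimizing a zero-initialized logit-bias tensor against the current window for a few steps. Plain gradient steps stand in for the PR's 4-step AdamW, and the frozen logits and vocabulary size are toy assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def window_nll(logits, delta, tokens):
    """Mean NLL of the window's tokens under biased logits."""
    p = softmax(logits + delta)
    return -np.mean(np.log(p[tokens]))

vocab, lr = 16, 0.5
rng = np.random.default_rng(0)
logits = rng.normal(size=vocab)          # model's frozen logits (toy)
tokens = np.array([3, 3, 3, 7, 3])       # tokens seen in this window

delta = np.zeros(vocab)                  # zero-initialized delta tensor
before = window_nll(logits, delta, tokens)
for _ in range(4):                       # 4 optimization steps, as in the PR
    p = softmax(logits + delta)
    counts = np.bincount(tokens, minlength=vocab) / len(tokens)
    grad = p - counts                    # d(mean NLL)/d(delta)
    delta -= lr * grad
after = window_nll(logits, delta, tokens)
assert after < before                    # the bias improves window fit
```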