PR #1688

Status: open

Add SP8192 qkramp05 + par-residual L6 + legal TTT systems rerun (1.080885 seed 42)

val_bpb: 1.0809
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,993,776 bytes

Training Techniques

Architecture
Partial RoPE
Applies rotary positional embeddings to 16 of the 64 head dimensions; the remaining dimensions are passed through unrotated.
parameters: {"dimensions":"16/64"}
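A minimal sketch of what partial RoPE with 16/64 dimensions could look like: rotate the first 16 dimensions of each head and pass the remaining 48 through unchanged. The function name, tensor layout, and 10000 frequency base are illustrative assumptions, not the PR's actual code.

```python
import torch

def partial_rope(x, rot_dims=16):
    """Apply rotary embeddings to the first `rot_dims` dimensions of each
    head, leaving the rest untouched. Layout assumed:
    x: (batch, seq, n_heads, head_dim), head_dim = 64 here."""
    seq_len = x.shape[1]
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    # Standard RoPE frequencies over the rotated half-dimensions (base 10000 assumed).
    half = rot_dims // 2
    freqs = 1.0 / (10000 ** (torch.arange(half) / half))
    angles = torch.arange(seq_len)[:, None] * freqs[None, :]          # (seq, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```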
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
weight tying
Tied input and output embeddings.
parameters: null
depth recurrence
Loops (reuses) layers in the depth stack, increasing effective depth without adding parameters.
parameters: {"layers":[3,5]}
U-Net skip connections
Includes skip-gated U-Net style connections.
parameters: null
parallel residual lanes
Adds parallel attention/MLP residual lanes starting from layer 6.
parameters: {"start_layer":6}
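A sketch of one plausible reading of "parallel residual lanes": from the start layer onward, attention and MLP both branch from the same residual-stream state and their outputs are summed, instead of running sequentially. Class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One transformer block. `parallel=True` switches from sequential
    (x -> attn -> mlp) to parallel lanes (attn and mlp both read x)."""
    def __init__(self, dim, attn, mlp, parallel):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn, self.mlp, self.parallel = attn, mlp, parallel

    def forward(self, x):
        if self.parallel:
            # Both lanes branch from the same residual-stream state.
            return x + self.attn(self.norm1(x)) + self.mlp(self.norm2(x))
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))

# Lanes become parallel from layer 6 onward (start_layer = 6):
# blocks = [Block(d, a, m, parallel=(i >= 6)) for i in range(n_layers)]
```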
QK depth ramp
Applies a depth-dependent QK gain ramp instead of a flat gain.
parameters: {"qk_gain_init":5,"qk_gain_depth_ramp":0.5}
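With qk_gain_init = 5 and qk_gain_depth_ramp = 0.5, the gain rises from 5.0 at the first layer to 5.5 at the last (matching the contributions list below). A small sketch, assuming the ramp is a linear interpolation over layer index:

```python
def qk_gain(layer, n_layers, init=5.0, ramp=0.5):
    """Depth-dependent QK gain: linear ramp from `init` at layer 0 to
    `init + ramp` at the final layer (interpolation scheme assumed)."""
    frac = layer / max(n_layers - 1, 1)
    return init + ramp * frac

# e.g. a 12-layer stack gets gains 5.0, 5.045, ..., 5.5
gains = [qk_gain(i, 12) for i in range(12)]
```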
Weight Averaging
EMA
parameters: {"decay":0.9965}
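The EMA update with decay 0.9965 is the standard exponential moving average over weights; a minimal sketch (function name illustrative):

```python
import torch

@torch.no_grad()
def ema_update(ema_params, model_params, decay=0.9965):
    """Exponential moving average of weights: ema <- decay*ema + (1-decay)*w."""
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```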
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: token embeddings
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"matrix_lr":0.022}
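Muon's distinguishing step is approximately orthogonalizing each matrix update via a quintic Newton-Schulz iteration before applying it. A sketch of that step, using the coefficients from the public reference implementation (the surrounding momentum/LR plumbing is omitted):

```python
import torch

def zeropower_via_newtonschulz(G, steps=5):
    """Approximately orthogonalize matrix G (push its singular values
    toward ~1) with a quintic Newton-Schulz iteration, as Muon does to
    the momentum-smoothed gradient before the weight update."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)          # normalize so the iteration converges
    if G.size(0) > G.size(1):          # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X
```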
LR Schedule
warmdown
parameters: {"warmdown_fraction":0.72}
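With warmdown_fraction = 0.72, the schedule would hold the LR constant and then decay it linearly to zero over the final 72% of training. A sketch, assuming the common constant-then-linear-warmdown shape:

```python
def lr_scale(step, total_steps, warmdown_fraction=0.72):
    """LR multiplier: 1.0 for the first 28% of steps, then linear
    decay to 0 over the last 72% (schedule shape assumed)."""
    warmdown_start = int(total_steps * (1.0 - warmdown_fraction))
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```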
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
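A logit softcap of 30 typically means a scaled tanh that bounds logits to (-30, 30) while staying near-identity for small values; a sketch assuming that tanh form:

```python
import torch

def softcap(logits, cap=30.0):
    """Soft-cap logits into (-cap, cap) via a scaled tanh; approximately
    the identity for |logits| << cap."""
    return cap * torch.tanh(logits / cap)
```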
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"chunk_size":32768,"learning_rate":0.005,"momentum":0.9,"epochs":3}
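"Score-first" TTT evaluates each chunk with the current weights before adapting on it, so no token is ever scored by weights that have already seen it (which is what makes it legal). A sketch of that loop under the assumption that `model(inputs, targets)` returns a scalar loss; the function name and loop structure are illustrative:

```python
import torch

def score_first_ttt(model, optimizer, chunks, epochs=3):
    """Score-first test-time training: score each chunk with the current
    weights, then adapt on that chunk before moving to the next one."""
    total_loss, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        model.eval()
        with torch.no_grad():                    # score before training
            loss = model(inputs, targets)
            total_loss += loss.item() * targets.numel()
            total_tokens += targets.numel()
        model.train()
        for _ in range(epochs):                  # then adapt on the chunk
            optimizer.zero_grad()
            model(inputs, targets).backward()
            optimizer.step()
    return total_loss / total_tokens             # bits/loss per scored token
```

In the PR's configuration the chunks would be 32768 tokens each, with SGD-style adaptation at lr 0.005, momentum 0.9, for 3 epochs per chunk.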

Novel Contributions

  • SP8192 qkramp05_par0 rerun on a corrected CUDA 12.8 / PyTorch 2.9.1 + flash_attn_3 runtime stack
  • QK depth ramp with gain increasing from 5.0 to 5.5 across depth
  • Parallel residual lanes starting at layer 6
  • Legal score-first TTT that trains only on tokens after they have been scored
  • Systems rerun that recovered throughput and fit more training steps into the same wallclock budget