PR #1688

Status: open

Add SP8192 qkramp05 + par-residual L6 + legal TTT systems rerun (1.080885 seed 42)

val_bpb: 1.0809
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,993,776 bytes

Training Techniques

Architecture
Partial RoPE
Applies rotary positional embeddings to 16 of the 64 head dimensions; the remaining dimensions are passed through unrotated.
parameters: {"dimensions":"16/64"}
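A minimal sketch of what partial RoPE with 16/64 dimensions could look like: rotate the first 16 dimensions of each head and pass the remaining 48 through unchanged. The function name, tensor layout, and 10000 frequency base are illustrative assumptions, not the PR's actual code.

```python
import torch

def partial_rope(x, rot_dims=16):
    """Apply rotary embeddings to the first `rot_dims` dimensions of each
    head, leaving the rest untouched. Layout assumed:
    x: (batch, seq, n_heads, head_dim), head_dim = 64 here."""
    seq_len = x.shape[1]
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    # Standard RoPE frequencies over the rotated half-dimensions (base 10000 assumed).
    half = rot_dims // 2
    freqs = 1.0 / (10000 ** (torch.arange(half) / half))
    angles = torch.arange(seq_len)[:, None] * freqs[None, :]          # (seq, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```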
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
weight tying
Tied input and output embeddings.
parameters: null
depth recurrence
Loops (reuses) layers in the depth stack, increasing effective depth without adding parameters.
parameters: {"layers":[3,5]}
U-Net skip connections
Includes skip-gated U-Net style connections.
parameters: null
parallel residual lanes
Adds parallel attention/MLP residual lanes starting from layer 6.
parameters: {"start_layer":6}
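A sketch of one plausible reading of "parallel residual lanes": from the start layer onward, attention and MLP both branch from the same residual-stream state and their outputs are summed, instead of running sequentially. Class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One transformer block. `parallel=True` switches from sequential
    (x -> attn -> mlp) to parallel lanes (attn and mlp both read x)."""
    def __init__(self, dim, attn, mlp, parallel):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn, self.mlp, self.parallel = attn, mlp, parallel

    def forward(self, x):
        if self.parallel:
            # Both lanes branch from the same residual-stream state.
            return x + self.attn(self.norm1(x)) + self.mlp(self.norm2(x))
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))

# Lanes become parallel from layer 6 onward (start_layer = 6):
# blocks = [Block(d, a, m, parallel=(i >= 6)) for i in range(n_layers)]
```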
QK depth ramp
Applies a depth-dependent QK gain ramp instead of a flat gain.
parameters: {"qk_gain_init":5,"qk_gain_depth_ramp":0.5}
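With qk_gain_init = 5 and qk_gain_depth_ramp = 0.5, the gain rises from 5.0 at the first layer to 5.5 at the last (matching the contributions list below). A small sketch, assuming the ramp is a linear interpolation over layer index:

```python
def qk_gain(layer, n_layers, init=5.0, ramp=0.5):
    """Depth-dependent QK gain: linear ramp from `init` at layer 0 to
    `init + ramp` at the final layer (interpolation scheme assumed)."""
    frac = layer / max(n_layers - 1, 1)
    return init + ramp * frac

# e.g. a 12-layer stack gets gains 5.0, 5.045, ..., 5.5
gains = [qk_gain(i, 12) for i in range(12)]
```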
Weight Averaging
EMA
parameters: {"decay":0.9965}
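The EMA update with decay 0.9965 is the standard exponential moving average over weights; a minimal sketch (function name illustrative):

```python
import torch

@torch.no_grad()
def ema_update(ema_params, model_params, decay=0.9965):
    """Exponential moving average of weights: ema <- decay*ema + (1-decay)*w."""
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```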
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: token embeddings
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"matrix_lr":0.022}
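Muon's distinguishing step is approximately orthogonalizing each matrix update via a quintic Newton-Schulz iteration before applying it. A sketch of that step, using the coefficients from the public reference implementation (the surrounding momentum/LR plumbing is omitted):

```python
import torch

def zeropower_via_newtonschulz(G, steps=5):
    """Approximately orthogonalize matrix G (push its singular values
    toward ~1) with a quintic Newton-Schulz iteration, as Muon does to
    the momentum-smoothed gradient before the weight update."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)          # normalize so the iteration converges
    if G.size(0) > G.size(1):          # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X
```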
LR Schedule
warmdown
parameters: {"warmdown_fraction":0.72}
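With warmdown_fraction = 0.72, the schedule would hold the LR constant and then decay it linearly to zero over the final 72% of training. A sketch, assuming the common constant-then-linear-warmdown shape:

```python
def lr_scale(step, total_steps, warmdown_fraction=0.72):
    """LR multiplier: 1.0 for the first 28% of steps, then linear
    decay to 0 over the last 72% (schedule shape assumed)."""
    warmdown_start = int(total_steps * (1.0 - warmdown_fraction))
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```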
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
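A logit softcap of 30 typically means a scaled tanh that bounds logits to (-30, 30) while staying near-identity for small values; a sketch assuming that tanh form:

```python
import torch

def softcap(logits, cap=30.0):
    """Soft-cap logits into (-cap, cap) via a scaled tanh; approximately
    the identity for |logits| << cap."""
    return cap * torch.tanh(logits / cap)
```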
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"chunk_size":32768,"learning_rate":0.005,"momentum":0.9,"epochs":3}
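"Score-first" TTT evaluates each chunk with the current weights before adapting on it, so no token is ever scored by weights that have already seen it (which is what makes it legal). A sketch of that loop under the assumption that `model(inputs, targets)` returns a scalar loss; the function name and loop structure are illustrative:

```python
import torch

def score_first_ttt(model, optimizer, chunks, epochs=3):
    """Score-first test-time training: score each chunk with the current
    weights, then adapt on that chunk before moving to the next one."""
    total_loss, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        model.eval()
        with torch.no_grad():                    # score before training
            loss = model(inputs, targets)
            total_loss += loss.item() * targets.numel()
            total_tokens += targets.numel()
        model.train()
        for _ in range(epochs):                  # then adapt on the chunk
            optimizer.zero_grad()
            model(inputs, targets).backward()
            optimizer.step()
    return total_loss / total_tokens             # bits/loss per scored token
```

In the PR's configuration the chunks would be 32768 tokens each, with SGD-style adaptation at lr 0.005, momentum 0.9, for 3 epochs per chunk.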

Novel Contributions

  • SP8192 qkramp05_par0 rerun on a corrected CUDA 12.8 / PyTorch 2.9.1 + flash_attn_3 runtime stack
  • QK depth ramp with gain increasing from 5.0 to 5.5 across depth
  • Parallel residual lanes starting at layer 6
  • Legal score-first TTT that trains only on tokens after they have been scored
  • Systems rerun that recovered throughput and fit more training steps into the same wallclock budget