PR #1412
Non-record: Parallel Residuals + Hessian-Aware SDClip (3-seed mean 1.08354 BPB)
by Robby955
val_bpb
1.0835
Architecture
Transformer
Optimizer
—
Artifact Size
15,978,121
Training Techniques
Architecture
depth recurrence
Depth recurrence with two-phase loop scheduling; the first loop is enabled at 50% of training and the second at 65%.
parameters: {"phase1_at":0.5,"phase2_at":0.65}
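The staggered activation can be sketched as a small schedule function; the function name and return convention are illustrative, only the 0.5/0.65 phase fractions come from the PR:

```python
def active_loops(step: int, total_steps: int,
                 phase1_at: float = 0.5, phase2_at: float = 0.65) -> int:
    """Number of extra recurrence loops enabled at this training step.

    Both loops are off early in training; the first turns on once the
    completed fraction reaches phase1_at, the second at phase2_at
    (the PR's {"phase1_at": 0.5, "phase2_at": 0.65}).
    """
    frac = step / total_steps
    loops = 0
    if frac >= phase1_at:
        loops += 1
    if frac >= phase2_at:
        loops += 1
    return loops
```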
parallel residuals
GPT-J style parallel attention and MLP residual path for the last 4 layers, with both branches reading from the same normalized input.
parameters: {"start_layer":7}
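A minimal sketch of the GPT-J-style block, assuming a standard PyTorch layout; module sizes and names are illustrative, not the PR's actual config. The key property is that attention and MLP both read one shared LayerNorm output and their residuals are summed in parallel rather than applied sequentially:

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """GPT-J-style parallel residual block (sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln(x)             # single normalization shared by both branches
        a, _ = self.attn(h, h, h)  # attention branch
        m = self.mlp(h)            # MLP branch
        return x + a + m           # parallel residual sum
```

In the PR this layout replaces the sequential block only from `start_layer` 7 onward, i.e. the last four layers.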
Quantization
GPTQ
bits: null
scope: model weights
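The Hessian diagonal that the SDClip modulation reads comes from GPTQ's layer-wise objective, where the Hessian is H = 2 X Xᵀ over calibration inputs. A minimal sketch of extracting its diagonal (GPTQ proper also applies damping and quantizes column by column, which is omitted here):

```python
import numpy as np

def gptq_hessian_diag(X: np.ndarray) -> np.ndarray:
    """Diagonal of the GPTQ layer Hessian H = 2 * X X^T / n, where the
    columns of X (shape: in_features x n_samples) are the layer's
    calibration inputs. Sketch only: no damping, no grouping."""
    n = X.shape[1]
    return 2.0 * (X * X).sum(axis=1) / n
```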
Regularization
SDClip
parameters: {"hessian_aware":true,"lambda":0.175}
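The PR does not spell out the exact SDClip rule, so the following is a hedged sketch under stated assumptions: each weight row is clipped at a threshold proportional to `lambda` times the row's standard deviation, and the threshold is loosened for rows that the GPTQ Hessian diagonal marks as important. The function name and the modulation formula are assumptions; only `lambda = 0.175` comes from the PR:

```python
import numpy as np

def hessian_aware_sdclip(W, hess_diag, lam=0.175, eps=1e-8):
    """Clip weight rows at a Hessian-modulated multiple of their std (sketch)."""
    W = np.asarray(W, dtype=np.float64)
    hess_diag = np.asarray(hess_diag, dtype=np.float64)
    # normalized per-input-dimension importance from the GPTQ Hessian diagonal
    imp = hess_diag / (hess_diag.mean() + eps)
    # row-wise importance: |W|-weighted average of input-dimension importance
    row_imp = (np.abs(W) * imp).sum(axis=1) / (np.abs(W).sum(axis=1) + eps)
    # row-wise threshold, looser for rows the Hessian marks as important
    thresh = lam * W.std(axis=1) * (1.0 + row_imp)
    return np.clip(W, -thresh[:, None], thresh[:, None])
```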
Evaluation
sliding window eval
parameters: {"stride":64}
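Stride-64 sliding-window evaluation scores each token exactly once while giving later tokens extra left context. A sketch of the window bookkeeping, where the helper name and return shape are illustrative and only the stride value comes from the PR:

```python
def sliding_windows(seq_len: int, window: int, stride: int = 64):
    """Yield (begin, end, score_from) spans for sliding-window eval.

    The model sees tokens [begin, end) as context, but only tokens in
    [score_from, end) contribute to the BPB sum, so every token is scored
    exactly once with up to window - stride tokens of extra left context.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        yield (begin, end, prev_end)  # score tokens [prev_end, end)
        prev_end = end
        if end == seq_len:
            break
```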
Novel Contributions
- Parallel residual connections in the last four layers to reduce attention/MLP interference during GPTQ calibration
- Hessian-aware SDClip with row-wise threshold modulation using GPTQ Hessian diagonal importance
- Progressive recurrence with staggered loop activation at 50% and 65% of training
- Cross-seed Hessian analysis showing stable group-level traces but noisy per-row importance