PR #1412


Non-record: Parallel Residuals + Hessian-Aware SDClip (3-seed mean 1.08354 BPB)

by Robby955
val_bpb
1.0835
Architecture
Transformer
Optimizer
Artifact Size
15,978,121

Training Techniques

Architecture
depth recurrence
Depth recurrence with two-phase loop scheduling; the first loop is enabled at 50% of training and the second at 65%.
parameters: {"phase1_at":0.5,"phase2_at":0.65}
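The staggered activation can be sketched as a simple schedule function that maps training progress to the number of enabled recurrence loops (a hypothetical sketch; the function name and signature are illustrative, not the submission's code — only the 0.5/0.65 thresholds come from the parameters above):

```python
def active_loops(step: int, total_steps: int,
                 phase1_at: float = 0.5, phase2_at: float = 0.65) -> int:
    """Return how many extra depth-recurrence loops are enabled at this step.

    Loops switch on as training progresses: none before phase1_at,
    one between phase1_at and phase2_at, two afterwards.
    """
    progress = step / total_steps
    loops = 0
    if progress >= phase1_at:
        loops += 1
    if progress >= phase2_at:
        loops += 1
    return loops
```

With the defaults above, a 100-step run enables the first loop at step 50 and the second at step 65.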
parallel residuals
GPT-J-style parallel attention and MLP residual path for the last four layers, with both branches reading from the same normalized input.
parameters: {"start_layer":7}
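The parallel residual layout differs from the usual sequential block in that attention and MLP both read one shared normalized input and their outputs are summed into the residual in a single step. A minimal NumPy sketch (the callable interfaces are illustrative stand-ins for the real attention/MLP/norm modules):

```python
import numpy as np

def sequential_block(x, attn, mlp, norm):
    """Standard pre-norm transformer block: MLP sees attention's output."""
    x = x + attn(norm(x))
    x = x + mlp(norm(x))
    return x

def parallel_block(x, attn, mlp, norm):
    """GPT-J-style parallel residual: both branches read the same
    normalized input, so neither depends on the other's output."""
    h = norm(x)
    return x + attn(h) + mlp(h)
```

Because the two branches are independent given `h`, quantization error introduced in the attention branch does not propagate into the MLP branch within the same layer, which is the interference-reduction argument made in the contributions below.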
Quantization
GPTQ
bits: null
scope: model weights
Regularization
SDClip
parameters: {"hessian_aware":true,"lambda":0.175}
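One plausible reading of "Hessian-aware" row-wise modulation: each row's clipping threshold is a multiple of its standard deviation, scaled by that row's normalized GPTQ Hessian-diagonal importance so that more important rows are clipped less aggressively. This is an assumed sketch, not the submission's implementation — the direction of the modulation and the mean-1 normalization are guesses:

```python
import numpy as np

def hessian_aware_sdclip(W, row_importance, lam=0.175):
    """Clip each row of W to +/- (lam * row_std * importance_scale).

    row_importance: per-row importance derived from the GPTQ Hessian
    diagonal (assumption). It is normalized to mean 1, so rows with
    above-average importance get a looser threshold.
    """
    row_std = W.std(axis=1, keepdims=True)            # (rows, 1)
    scale = row_importance / row_importance.mean()    # mean-1 scaling
    thresh = lam * row_std * scale[:, None]           # (rows, 1)
    return np.clip(W, -thresh, thresh)
```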
Evaluation
sliding window eval
parameters: {"stride":64}
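Sliding-window evaluation with a short stride scores each token exactly once while giving it the longest left context the window allows. A generic sketch of the span schedule (only the stride comes from the parameters above; the window size and function name are illustrative):

```python
def sliding_windows(n_tokens: int, window: int, stride: int):
    """Return (start, end, n_scored) spans for strided evaluation.

    Each window advances by `stride` tokens but only the tokens past
    the previous window's end are scored, so every token contributes
    to the loss exactly once.
    """
    spans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

With stride 64 this makes eval roughly `window / 64` times more expensive than a non-overlapping pass, in exchange for a lower (and more comparable) BPB estimate.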

Novel Contributions

  • Parallel residual connections in the last four layers to reduce attention/MLP interference during GPTQ calibration
  • Hessian-aware SDClip with row-wise threshold modulation using GPTQ Hessian diagonal importance
  • Progressive recurrence with staggered loop activation at 50% and 65% of training
  • Cross-seed Hessian analysis showing stable group-level traces but noisy per-row importance