PR #1412
Non-record: Parallel Residuals + Hessian-Aware SDClip (3-seed mean 1.08354 BPB)
by Robby955
val_bpb
1.0835
Architecture
Transformer
Optimizer
—
Artifact Size
15,978,121
Training Techniques
Architecture
depth recurrence
Depth recurrence with two-phase loop scheduling; the first loop is enabled at 50% of training and the second at 65%.
parameters: {"phase1_at":0.5,"phase2_at":0.65}
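The staggered activation can be sketched as a small schedule function; the function name and return convention are illustrative, only the 0.5/0.65 phase fractions come from the PR:

```python
def active_loops(step: int, total_steps: int,
                 phase1_at: float = 0.5, phase2_at: float = 0.65) -> int:
    """Number of extra recurrence loops enabled at this training step.

    Both loops are off early in training; the first turns on once the
    completed fraction reaches phase1_at, the second at phase2_at
    (the PR's {"phase1_at": 0.5, "phase2_at": 0.65}).
    """
    frac = step / total_steps
    loops = 0
    if frac >= phase1_at:
        loops += 1
    if frac >= phase2_at:
        loops += 1
    return loops
```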
parallel residuals
GPT-J style parallel attention and MLP residual path for the last 4 layers, with both branches reading from the same normalized input.
parameters: {"start_layer":7}
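A minimal sketch of the GPT-J-style block, assuming a standard PyTorch layout; module sizes and names are illustrative, not the PR's actual config. The key property is that attention and MLP both read one shared LayerNorm output and their residuals are summed in parallel rather than applied sequentially:

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """GPT-J-style parallel residual block (sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln(x)             # single normalization shared by both branches
        a, _ = self.attn(h, h, h)  # attention branch
        m = self.mlp(h)            # MLP branch
        return x + a + m           # parallel residual sum
```

In the PR this layout replaces the sequential block only from `start_layer` 7 onward, i.e. the last four layers.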
Quantization
GPTQ
bits: null
scope: model weights
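The Hessian diagonal that the SDClip modulation reads comes from GPTQ's layer-wise objective, where the Hessian is H = 2 X Xᵀ over calibration inputs. A minimal sketch of extracting its diagonal (GPTQ proper also applies damping and quantizes column by column, which is omitted here):

```python
import numpy as np

def gptq_hessian_diag(X: np.ndarray) -> np.ndarray:
    """Diagonal of the GPTQ layer Hessian H = 2 * X X^T / n, where the
    columns of X (shape: in_features x n_samples) are the layer's
    calibration inputs. Sketch only: no damping, no grouping."""
    n = X.shape[1]
    return 2.0 * (X * X).sum(axis=1) / n
```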
Regularization
SDClip
parameters: {"hessian_aware":true,"lambda":0.175}
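The PR does not spell out the exact SDClip rule, so the following is a hedged sketch under stated assumptions: each weight row is clipped at a threshold proportional to `lambda` times the row's standard deviation, and the threshold is loosened for rows that the GPTQ Hessian diagonal marks as important. The function name and the modulation formula are assumptions; only `lambda = 0.175` comes from the PR:

```python
import numpy as np

def hessian_aware_sdclip(W, hess_diag, lam=0.175, eps=1e-8):
    """Clip weight rows at a Hessian-modulated multiple of their std (sketch)."""
    W = np.asarray(W, dtype=np.float64)
    hess_diag = np.asarray(hess_diag, dtype=np.float64)
    # normalized per-input-dimension importance from the GPTQ Hessian diagonal
    imp = hess_diag / (hess_diag.mean() + eps)
    # row-wise importance: |W|-weighted average of input-dimension importance
    row_imp = (np.abs(W) * imp).sum(axis=1) / (np.abs(W).sum(axis=1) + eps)
    # row-wise threshold, looser for rows the Hessian marks as important
    thresh = lam * W.std(axis=1) * (1.0 + row_imp)
    return np.clip(W, -thresh[:, None], thresh[:, None])
```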
Evaluation
sliding window eval
parameters: {"stride":64}
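Stride-64 sliding-window evaluation scores each token exactly once while giving later tokens extra left context. A sketch of the window bookkeeping, where the helper name and return shape are illustrative and only the stride value comes from the PR:

```python
def sliding_windows(seq_len: int, window: int, stride: int = 64):
    """Yield (begin, end, score_from) spans for sliding-window eval.

    The model sees tokens [begin, end) as context, but only tokens in
    [score_from, end) contribute to the BPB sum, so every token is scored
    exactly once with up to window - stride tokens of extra left context.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        yield (begin, end, prev_end)  # score tokens [prev_end, end)
        prev_end = end
        if end == seq_len:
            break
```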
Novel Contributions
- Parallel residual connections in the last four layers to reduce attention/MLP interference during GPTQ calibration
- Hessian-aware SDClip with row-wise threshold modulation using GPTQ Hessian diagonal importance
- Progressive recurrence with staggered loop activation at 50% and 65% of training
- Cross-seed Hessian analysis showing stable group-level traces but noisy per-row importance