PR #1204
Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA
by msisovic
val_bpb
1.1063
Architecture
Transformer
Optimizer
—
Artifact Size
~15.94 MB
Training Techniques
Quantization
GPTQ
parameters: {"bits":6,"scope":"mixed quantization"}
mixed int6/int8
parameters: {"bits":null,"scope":"model weights"}
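The record lists GPTQ at 6 bits inside a mixed int6/int8 scheme. A minimal sketch of the underlying precision trade-off, using plain round-to-nearest fake quantization as a stand-in (real GPTQ additionally compensates rounding error column by column using second-order statistics from calibration data, which this PR self-generates):

```python
def fake_quantize(weights, bits):
    """Symmetric round-to-nearest fake quantization of a weight vector.

    A simplified stand-in for GPTQ: it shows only the effect of the bit
    width, not GPTQ's Hessian-based error compensation.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 31 for int6, 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        return list(weights)            # all-zero tensor: nothing to quantize
    return [round(w / scale) * scale for w in weights]


weights = [0.5, -1.0, 0.25, 0.125]
err6 = max(abs(w - q) for w, q in zip(weights, fake_quantize(weights, 6)))
err8 = max(abs(w - q) for w, q in zip(weights, fake_quantize(weights, 8)))
```

A mixed scheme keeps error-sensitive tensors at int8 and quantizes the rest to int6, which is consistent with the `bits: null` entry for the mixed scope.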
Architecture
depth recurrence
Repeated a small set of middle layers instead of the full stack; recurrence was activated only partway through training, and the repeated MLPs were untied.
parameters: {"layers":[4,5],"num_layers":11,"start_step":3000,"untie_mlp":true}
parallel residuals
Split attention and MLP into separate residual lanes from layer 7 onward, with learned routing of each sublayer's output back into both lanes.
parameters: {"start_layer":7}
Evaluation
sliding window eval
parameters: null
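The record gives no window or stride for the sliding-window eval, so the values below are illustrative. The usual scheme strides a fixed context window over the sequence and scores each token exactly once while giving it up to a full window of left context:

```python
def sliding_window_spans(n_tokens, window=512, stride=256):
    """(ctx_start, ctx_end, score_start) triples for sliding-window eval.

    Each window covers `window` tokens and advances by `stride`; only
    tokens at positions >= score_start are scored, so every token is
    counted exactly once. Window/stride are assumptions, not from the PR.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))  # score only unseen tokens
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

The per-token losses over the scored positions are then averaged to get nats per token and converted to bits per byte (BPB) using the corpus byte count.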
Novel Contributions
- Parallel residual lanes for attention and MLP starting at layer 7
- Mini depth recurrence by repeating only middle layers 4 and 5
- Delayed activation of recurrence during training
- Untying repeated MLP weights while keeping the rest of the recurrent block shared
- Mixed int6/int8 quantization with GPTQ calibration on autoregressively self-generated data