PR #1509
open
Non-record: DepthScale — Parameter-Shared Iterative Transformer (1.1962 BPB)
by Lumi-node
val_bpb: 1.1962
Architecture: Transformer
Optimizer: —
Artifact Size: 30MB
Training Techniques

Architecture: depth recurrence
Reuses the same 5 physical transformer layers across two iterations to create 10 effective layers of depth with shared weights.
parameters: {"layers": 5, "iterations": 2, "effective_depth": 10}
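A minimal sketch of the depth-recurrence pattern, assuming a standard PyTorch encoder block as the physical layer (the submission's actual block structure is not given): 5 physical layers applied for 2 iterations give 10 effective layers at the parameter cost of 5.

```python
import torch
import torch.nn as nn

class IterativeTransformer(nn.Module):
    """Reuse the same physical layers across several forward iterations.

    Illustrative stand-in, not the submission's code: the block type,
    widths, and head count here are assumptions.
    """

    def __init__(self, d_model=64, n_heads=4, n_layers=5, n_iterations=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=128, batch_first=True
            )
            for _ in range(n_layers)
        )
        self.n_iterations = n_iterations

    def forward(self, x):
        # 5 physical layers x 2 iterations = 10 effective layers;
        # the weights are shared, so iterating adds no parameters.
        for _ in range(self.n_iterations):
            for layer in self.layers:
                x = layer(x)
        return x
```

Since iterating reuses the same weights, a 1-iteration and a 2-iteration model have identical parameter counts; only compute and effective depth grow.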
RoPE
Iteration-aware RoPE shifts positional frequencies by an iteration-dependent offset so repeated passes can learn distinct attention patterns.
parameters: {"epsilon":0.1}
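The exact offset formula is not specified in the metadata, so the following NumPy sketch is one plausible reading: each RoPE frequency is shifted by epsilon * iteration (epsilon = 0.1 from the parameters above), so the two passes through the shared layers see different rotation geometry.

```python
import numpy as np

def rope_angles(seq_len, head_dim, iteration, epsilon=0.1, base=10000.0):
    # Standard RoPE inverse frequencies, one per (even, odd) channel pair.
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    pos = np.arange(seq_len)[:, None]
    # Assumed form of "iteration-aware": shift every frequency by an
    # iteration-dependent offset before multiplying by position.
    return pos * (inv_freq + epsilon * iteration)

def apply_rope(x, iteration, epsilon=0.1):
    """Rotate channel pairs of x (shape seq_len x head_dim, head_dim even)."""
    seq_len, head_dim = x.shape
    ang = rope_angles(seq_len, head_dim, iteration, epsilon)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

At iteration 0 this reduces to standard RoPE; being a pure rotation, it preserves per-position norms while giving each iteration distinct attention phases.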
Quantization
int8 (bits: 8, scope: all)
STE QAT (bits: 4, scope: all)
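A minimal sketch of 4-bit straight-through-estimator (STE) fake quantization for QAT: the forward pass sees quantized weights, while the backward pass treats the rounding as identity so gradients still flow. The symmetric per-tensor scheme is an assumption; the metadata only states 4-bit STE over all weights.

```python
import torch

def ste_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Fake-quantize w: quantized values forward, identity gradient backward."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for symmetric int4
    # Per-tensor scale from the max magnitude (assumed calibration choice).
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward evaluates to q, but the
    # (q - w).detach() term carries no gradient, so d(out)/d(w) = 1.
    return w + (q - w).detach()
```

The output takes at most 2^4 = 16 distinct values, yet `w.grad` after a backward pass is the unmodified upstream gradient, which is what makes training through the quantizer possible.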
Compression: zlib (level: null)
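The artifact is stored zlib-compressed with the level left unspecified (null) in the metadata; a minimal round-trip with Python's zlib at its default level looks like this (the payload is a placeholder, not the actual 30MB artifact):

```python
import zlib

# Round-trip a placeholder payload at zlib's default compression level.
payload = b"model-weights-placeholder" * 1000
blob = zlib.compress(payload)
restored = zlib.decompress(blob)
assert restored == payload
assert len(blob) < len(payload)   # repetitive data compresses well
```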
Sequence Length
train_length: null
eval_length: null
Novel Contributions
- Parameter-shared iterative transformer that reuses 5 physical layers across multiple passes
- Iteration-aware RoPE to distinguish different iterations of the shared-depth model
- Demonstration of 10 effective layers of depth at constant parameter cost
- 3-seed reproducibility result of 1.1962 BPB on 8×H100 SXM
- Quantization-aware training with 4-bit STE for robustness to extreme quantization