PR #288
Status: open
Non-record: Hybrid Depth-Recurrent Transformer + Int5 Quantization Studies
by trasnake87
val_bpb: 1.2334
Architecture: Hybrid Depth-Recurrent Transformer
Optimizer: —
Artifact Size: —
Training Techniques
Architecture
depth recurrence
Hybrid depth-recurrent transformer with 8 physical layers looped twice for an effective depth of 16.
parameters: {"layers":8,"loops":2,"effective_depth":16}
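The depth recurrence above (8 physical layers, 2 loops, effective depth 16) can be sketched as a weight-tied pass over the same stack. This is a minimal illustration, not the PR's actual model: the block choice (`nn.TransformerEncoderLayer`) and dimensions are assumptions; only `layers=8` and `loops=2` come from the parameters.

```python
import torch
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    """Sketch: 8 physical transformer blocks applied in 2 loops, giving 16
    effective layers while storing only 8 layers' worth of parameters."""
    def __init__(self, d_model=64, n_heads=4, layers=8, loops=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(layers)
        )
        self.loops = loops

    def forward(self, x):
        # Re-use the same weights on each pass over the stack (depth recurrence).
        for _ in range(self.loops):
            for block in self.blocks:
                x = block(x)
        return x

model = DepthRecurrentStack()
x = torch.randn(2, 10, 64)
y = model(x)
print(y.shape)  # torch.Size([2, 10, 64])
```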
SmearGate
Content-dependent gating variant that modulates token blending using adjacent token embedding similarity.
parameters: {"content_scale":0.1}
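One plausible reading of the content-dependent SmearGate is a previous-token blend whose gate is nudged by adjacent-token similarity. The sketch below is an assumption-heavy illustration: `base_gate` and the use of cosine similarity are hypothetical; only `content_scale=0.1` comes from the parameters above.

```python
import torch
import torch.nn.functional as F

def content_smear_gate(x, base_gate=0.5, content_scale=0.1):
    """Sketch of a content-dependent smear gate: blend each token embedding
    with its predecessor, with the blend weight modulated by the cosine
    similarity between adjacent token embeddings."""
    prev = torch.roll(x, shifts=1, dims=1)
    prev[:, 0] = x[:, 0]  # the first token has no predecessor
    sim = F.cosine_similarity(x, prev, dim=-1, eps=1e-6)      # (B, T)
    gate = torch.sigmoid(base_gate + content_scale * sim)     # content-dependent
    gate = gate.unsqueeze(-1)
    return gate * x + (1 - gate) * prev
```

Note the per-step overhead the PR flags as the scaling problem: the similarity and sigmoid are recomputed for every token at every application of the gate.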
Quantization
int5
bits: 5
scope: all
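A minimal sketch of what 5-bit quantization over all weights could look like, assuming symmetric per-tensor quantization with integer levels in [-15, 15] (the symmetric-range and per-tensor choices are assumptions; the PR only specifies `bits: 5`, `scope: all`):

```python
import torch

def quantize_int5(w):
    """Symmetric per-tensor int5 sketch: map weights onto integer levels in
    [-15, 15] so that zero stays exactly representable."""
    scale = w.abs().max() / 15.0
    q = torch.clamp(torch.round(w / scale), -15, 15)
    return q, scale

def dequantize(q, scale):
    return q * scale

torch.manual_seed(0)
w = torch.randn(4, 4)
q, s = quantize_int5(w)
w_hat = dequantize(q, s)
err = (w - w_hat).abs().max().item()
print(err <= s.item() / 2)  # True: rounding error is bounded by half a step
```

In a looped-layer model this rounding error is applied by the same weights on every pass, which is the compounding effect the contributions below report closing from 0.40 BPB to 0.007.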
Evaluation
sliding window eval
parameters: {"stride":64}
standard eval
parameters: null
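The sliding-window evaluation can be sketched as a scheduling problem: slide a fixed context window forward by `stride` tokens and score only the last `stride` tokens of each window, so every token is predicted with near-maximal left context. The window size below is a hypothetical choice; only `stride=64` comes from the parameters.

```python
def sliding_window_spans(n_tokens, window=128, stride=64):
    """Sketch of sliding-window eval scheduling: yield (context_start,
    score_start, score_end) triples; only [score_start, score_end) is
    scored, the rest of the window is context."""
    spans = []
    start = 0
    while start < n_tokens:
        ctx_start = max(0, start + stride - window)
        end = min(start + stride, n_tokens)
        spans.append((ctx_start, start, end))
        start = end
    return spans

print(sliding_window_spans(200, window=128, stride=64))
# [(0, 0, 64), (0, 64, 128), (64, 128, 192), (128, 192, 200)]
```

Standard eval, by contrast, scores every token of each non-overlapping chunk, so early tokens in a chunk are predicted with little context.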
Novel Contributions
- Solved quantization compounding for looped layers, reducing the quantization gap from 0.40 BPB to 0.007 BPB.
- Used a hybrid depth-recurrent transformer with 8 physical layers and 2 loops to reach an effective depth of 16 from 20M stored parameters.
- Added novel input features including word-position, copy flags, and unigram frequency.
- Studied content-dependent SmearGate and reported it as a negative result at scale, owing to its per-step compute overhead.
- Analyzed the tradeoff between content-dependent gating quality gains and wall-clock training efficiency.
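The auxiliary input features named above (word-position, copy flags, unigram frequency) can be computed in a single pass over the token stream. This is a hypothetical illustration of what such features might look like; the PR does not specify their exact definitions.

```python
from collections import Counter

def token_features(tokens):
    """Sketch of per-token auxiliary features: position in the sequence,
    a copy flag (has this token appeared earlier?), and a running
    unigram frequency (occurrences so far / tokens seen so far)."""
    counts = Counter()
    feats = []
    for i, tok in enumerate(tokens):
        copy_flag = 1 if counts[tok] > 0 else 0
        unigram_freq = counts[tok] / i if i > 0 else 0.0
        feats.append({"position": i,
                      "copy_flag": copy_flag,
                      "unigram_freq": unigram_freq})
        counts[tok] += 1
    return feats

print(token_features(["the", "cat", "the"]))
```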