PR #288

open

Non-record: Hybrid Depth-Recurrent Transformer + Int5 Quantization Studies

by trasnake87
val_bpb: 1.2334
Architecture: Hybrid Depth-Recurrent Transformer
Optimizer:
Artifact Size:
Training Techniques:
Architecture
depth recurrence
Hybrid depth-recurrent transformer with 8 physical layers looped twice for an effective depth of 16.
parameters: {"layers":8,"loops":2,"effective_depth":16}
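The looping scheme above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the "layer" is a toy affine map, and all names are placeholders; only the 8-layers-times-2-loops structure comes from the PR.

```python
import numpy as np

def depth_recurrent_forward(x, layers, loops=2):
    """Apply the same stack of physical layers `loops` times.

    With 8 physical layers and 2 loops, each token passes through
    16 layer applications while only 8 layers' weights are stored.
    Each "layer" here is a toy affine map standing in for a
    transformer block.
    """
    for _ in range(loops):
        for W, b in layers:
            x = np.tanh(x @ W + b)
    return x

rng = np.random.default_rng(0)
d = 16
# 8 physical layers: (weight, bias) pairs
layers = [(rng.normal(scale=0.1, size=(d, d)), np.zeros(d)) for _ in range(8)]
x = rng.normal(size=(4, d))  # batch of 4 token vectors
y = depth_recurrent_forward(x, layers, loops=2)
```

The stored parameter count stays at 8 layers; the compute (and effective depth) doubles with `loops=2`.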
SmearGate
Content-dependent gating variant that modulates token blending using adjacent token embedding similarity.
parameters: {"content_scale":0.1}
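One plausible reading of the SmearGate description, sketched below: each token is blended with its predecessor, and the blend weight is modulated by the cosine similarity of the adjacent embeddings, scaled by `content_scale=0.1`. The exact gating function is an assumption; only the content-dependent blending and the scale come from the PR.

```python
import numpy as np

def smear_gate(x, content_scale=0.1):
    """Blend each token with its predecessor, gated by adjacent-token
    embedding similarity (hypothetical reading of SmearGate).

    x: (seq, dim) token embeddings.
    """
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]  # first token has no predecessor
    # cosine similarity between each token and its neighbour
    num = (x * prev).sum(-1)
    den = np.linalg.norm(x, axis=-1) * np.linalg.norm(prev, axis=-1) + 1e-8
    gate = content_scale * (num / den)        # in [-0.1, 0.1]
    gate = np.clip(gate, 0.0, 1.0)[:, None]   # keep blend weight valid
    return (1 - gate) * x + gate * prev

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 8))
y = smear_gate(x)
```

Note the per-token similarity computation: this data-dependent work on every step is the kind of overhead the "negative result at scale" contribution below refers to.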
Quantization
int5
bits: 5
scope: all
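A minimal sketch of symmetric per-tensor int5 quantization (signed range [-15, 15]). The choice of symmetric, per-tensor scaling is an assumption for illustration; the PR specifies only 5 bits applied to all weights.

```python
import numpy as np

def quantize_int5(w):
    """Symmetric per-tensor int5 quantization.

    Returns integer codes in [-15, 15] plus the scale needed to
    dequantize. Scheme details are illustrative assumptions.
    """
    qmax = 2 ** (5 - 1) - 1  # 15
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, s = quantize_int5(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()  # bounded by half a quantization step
```

The worst-case per-weight error is half a quantization step (`scale / 2`), which is what compounds when the same quantized layers are reused across loops.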
Evaluation
sliding window eval
parameters: {"stride":64}
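Sliding-window evaluation with `stride=64` can be sketched as below: the window advances by the stride and only the newest `stride` tokens in each window are scored, so every scored token keeps maximal left context. The window length of 256 is illustrative; the PR specifies only the stride.

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Yield (start, end, score_from) spans for sliding-window eval.

    Each span covers context tokens [start, end); only tokens
    [score_from, end) are scored, so every token is scored exactly
    once with up to `window` tokens of left context.
    """
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))
        pos = end
    return spans

spans = sliding_windows(300, window=256, stride=64)
```

Compared with the non-overlapping "standard eval" below, this scores the same tokens but with more context per token, at the cost of re-running the model over overlapping spans.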
standard eval
parameters: null

Novel Contributions

  • Solved quantization compounding for looped layers, reducing the gap from 0.40 BPB to 0.007 BPB.
  • Used a hybrid depth-recurrent transformer with 8 physical layers and 2 loops, achieving an effective depth of 16 from 20M stored parameters.
  • Added novel input features including word-position, copy flags, and unigram frequency.
  • Studied content-dependent SmearGate and found it to be a negative result at scale due to per-step overhead.
  • Analyzed the tradeoff between content-dependent gating quality gains and wall-clock training efficiency.
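The compounding problem in the first contribution can be demonstrated with a toy experiment: when the same quantized layer is reused in each loop, its rounding error is re-applied at every pass, so the quantized and full-precision activations typically drift further apart as loop count grows. This sketch only reproduces the phenomenon; the PR's fix is not shown here.

```python
import numpy as np

def quantize(w, bits=5):
    """Symmetric per-tensor fake-quantization (round-trip to float)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(3)
d = 32
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
Wq = quantize(W)          # same quantized weights reused every loop
x = rng.normal(size=d)

def run(weight, loops):
    h = x.copy()
    for _ in range(loops):
        h = np.tanh(h @ weight)  # weight-tied loop, as in the PR
    return h

# activation drift after 1 loop vs 2 loops of the shared layer
err1 = np.linalg.norm(run(Wq, 1) - run(W, 1))
err2 = np.linalg.norm(run(Wq, 2) - run(W, 2))
```

Because the second loop consumes already-perturbed activations and applies the same perturbed weights again, the error has two chances to accumulate per physical layer, which is why looped architectures are unusually sensitive to quantization.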