PR #288

open

Non-record: Hybrid Depth-Recurrent Transformer + Int5 Quantization Studies

by trasnake87
val_bpb: 1.2334
Architecture: Hybrid Depth-Recurrent Transformer
Optimizer:
Artifact Size:
Training Techniques:
Architecture
depth recurrence
Hybrid depth-recurrent transformer with 8 physical layers looped twice for an effective depth of 16.
parameters: {"layers":8,"loops":2,"effective_depth":16}
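The looping scheme above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the "layer" is a toy affine map, and all names are placeholders; only the 8-layers-times-2-loops structure comes from the PR.

```python
import numpy as np

def depth_recurrent_forward(x, layers, loops=2):
    """Apply the same stack of physical layers `loops` times.

    With 8 physical layers and 2 loops, each token passes through
    16 layer applications while only 8 layers' weights are stored.
    Each "layer" here is a toy affine map standing in for a
    transformer block.
    """
    for _ in range(loops):
        for W, b in layers:
            x = np.tanh(x @ W + b)
    return x

rng = np.random.default_rng(0)
d = 16
# 8 physical layers: (weight, bias) pairs
layers = [(rng.normal(scale=0.1, size=(d, d)), np.zeros(d)) for _ in range(8)]
x = rng.normal(size=(4, d))  # batch of 4 token vectors
y = depth_recurrent_forward(x, layers, loops=2)
```

The stored parameter count stays at 8 layers; the compute (and effective depth) doubles with `loops=2`.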
SmearGate
Content-dependent gating variant that modulates token blending using adjacent token embedding similarity.
parameters: {"content_scale":0.1}
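One plausible reading of the SmearGate description, sketched below: each token is blended with its predecessor, and the blend weight is modulated by the cosine similarity of the adjacent embeddings, scaled by `content_scale=0.1`. The exact gating function is an assumption; only the content-dependent blending and the scale come from the PR.

```python
import numpy as np

def smear_gate(x, content_scale=0.1):
    """Blend each token with its predecessor, gated by adjacent-token
    embedding similarity (hypothetical reading of SmearGate).

    x: (seq, dim) token embeddings.
    """
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]  # first token has no predecessor
    # cosine similarity between each token and its neighbour
    num = (x * prev).sum(-1)
    den = np.linalg.norm(x, axis=-1) * np.linalg.norm(prev, axis=-1) + 1e-8
    gate = content_scale * (num / den)        # in [-0.1, 0.1]
    gate = np.clip(gate, 0.0, 1.0)[:, None]   # keep blend weight valid
    return (1 - gate) * x + gate * prev

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 8))
y = smear_gate(x)
```

Note the per-token similarity computation: this data-dependent work on every step is the kind of overhead the "negative result at scale" contribution below refers to.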
Quantization
int5
bits: 5
scope: all
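A minimal sketch of symmetric per-tensor int5 quantization (signed range [-15, 15]). The choice of symmetric, per-tensor scaling is an assumption for illustration; the PR specifies only 5 bits applied to all weights.

```python
import numpy as np

def quantize_int5(w):
    """Symmetric per-tensor int5 quantization.

    Returns integer codes in [-15, 15] plus the scale needed to
    dequantize. Scheme details are illustrative assumptions.
    """
    qmax = 2 ** (5 - 1) - 1  # 15
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, s = quantize_int5(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()  # bounded by half a quantization step
```

The worst-case per-weight error is half a quantization step (`scale / 2`), which is what compounds when the same quantized layers are reused across loops.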
Evaluation
sliding window eval
parameters: {"stride":64}
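Sliding-window evaluation with `stride=64` can be sketched as below: the window advances by the stride and only the newest `stride` tokens in each window are scored, so every scored token keeps maximal left context. The window length of 256 is illustrative; the PR specifies only the stride.

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Yield (start, end, score_from) spans for sliding-window eval.

    Each span covers context tokens [start, end); only tokens
    [score_from, end) are scored, so every token is scored exactly
    once with up to `window` tokens of left context.
    """
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))
        pos = end
    return spans

spans = sliding_windows(300, window=256, stride=64)
```

Compared with the non-overlapping "standard eval" below, this scores the same tokens but with more context per token, at the cost of re-running the model over overlapping spans.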
standard eval
parameters: null

Novel Contributions

  • Solved quantization compounding for looped layers, reducing the gap from 0.40 BPB to 0.007 BPB.
  • Used a hybrid depth-recurrent transformer with 8 physical layers and 2 loops, achieving an effective depth of 16 from 20M stored parameters.
  • Added novel input features including word-position, copy flags, and unigram frequency.
  • Studied content-dependent SmearGate and found it to be a negative result at scale due to per-step overhead.
  • Analyzed the tradeoff between content-dependent gating quality gains and wall-clock training efficiency.
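The compounding problem in the first contribution can be demonstrated with a toy experiment: when the same quantized layer is reused in each loop, its rounding error is re-applied at every pass, so the quantized and full-precision activations typically drift further apart as loop count grows. This sketch only reproduces the phenomenon; the PR's fix is not shown here.

```python
import numpy as np

def quantize(w, bits=5):
    """Symmetric per-tensor fake-quantization (round-trip to float)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(3)
d = 32
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
Wq = quantize(W)          # same quantized weights reused every loop
x = rng.normal(size=d)

def run(weight, loops):
    h = x.copy()
    for _ in range(loops):
        h = np.tanh(h @ weight)  # weight-tied loop, as in the PR
    return h

# activation drift after 1 loop vs 2 loops of the shared layer
err1 = np.linalg.norm(run(Wq, 1) - run(W, 1))
err2 = np.linalg.norm(run(Wq, 2) - run(W, 2))
```

Because the second loop consumes already-perturbed activations and applies the same perturbed weights again, the error has two chances to accumulate per physical layer, which is why looped architectures are unusually sensitive to quantization.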