PR #552

open

Non-record submission: RecurrentTiedDepth_8x2_FiLM records

by loveless2001
val_bpb
1.1634
Architecture
Transformer with recurrent tied-depth and FiLM conditioning
Optimizer
Artifact Size
15.34MB

Training Techniques

Quantization
int6 QAT
bits: 6
scope: null
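The int6 QAT entry above can be sketched as symmetric fake quantization: during training, weights are rounded to a 6-bit integer grid in the forward pass. This is a minimal numpy illustration of the forward math only (the per-tensor scale and the example weights are assumptions, not values from the submission):

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Fake-quantize weights to a symmetric signed int grid (QAT forward pass)."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    amax = np.max(np.abs(w))
    scale = amax / qmax if amax > 0 else 1.0        # per-tensor scale (assumed scheme)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale                          # dequantized weights + scale

w = np.array([0.8, -0.31, 0.02, -1.0])
wq, s = fake_quant_int6(w)                           # wq lies on a 6-bit grid
```

In full QAT the rounding would be paired with a straight-through gradient estimator; only the quantization grid itself is shown here.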
Architecture
depth recurrence
8 unique transformer blocks looped 2 times (16 effective layers) with FiLM scale/shift conditioning per iteration
parameters: {"unique_blocks":8,"loops":2,"effective_layers":16}
FiLM conditioning
Learned scale and shift parameters per loop iteration to condition the recurrent blocks
parameters: {"params_count":3072}
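The depth-recurrence and FiLM entries above combine as: the same 8 blocks are applied twice, and each loop iteration selects its own learned scale/shift. A minimal numpy sketch, with toy dimensions and a plain matrix standing in for a full transformer block (both assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
D, UNIQUE_BLOCKS, LOOPS = 16, 8, 2   # 8 unique blocks looped twice -> 16 effective layers

# One small weight matrix stands in for each unique transformer block.
blocks = [rng.standard_normal((D, D)) * 0.05 for _ in range(UNIQUE_BLOCKS)]
# FiLM: one learned (scale, shift) vector pair per loop iteration.
film = [(np.ones(D) + 0.1 * i, np.full(D, 0.01 * i)) for i in range(LOOPS)]

effective_layers = 0
def forward(x):
    global effective_layers
    for scale, shift in film:            # each loop iteration has its own FiLM params
        for W in blocks:                 # the same 8 blocks are reused every loop
            x = x + np.tanh((scale * x + shift) @ W)   # FiLM-conditioned residual block
            effective_layers += 1
    return x

y = forward(rng.standard_normal(D))
```

The parameter cost of FiLM is only the scale/shift vectors per iteration, which is how the conditioning stays cheap relative to adding real depth.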
BigramHash + TrigramHash
Hashed 2- and 3-token lexical sidecars for richer local context
parameters: {"bigram_vocab_size":20480,"trigram_vocab_size":8192}
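The hashed n-gram sidecars above map each 2- or 3-token window to an index into a fixed-size embedding table. A minimal sketch of the hashing side, using the submission's table sizes; the polynomial rolling hash is an assumption, not the submission's hash function:

```python
import numpy as np

BIGRAM_VOCAB, TRIGRAM_VOCAB = 20480, 8192   # table sizes from the submission

def ngram_hash(ids, n, table_size):
    """Hash each length-n window of token ids to a sidecar-table index."""
    out = []
    for i in range(len(ids) - n + 1):
        h = 0
        for t in ids[i:i + n]:
            h = (h * 1000003 + t) & 0xFFFFFFFF   # illustrative polynomial hash
        out.append(h % table_size)
    return out

ids = [5, 17, 17, 900]
bi = ngram_hash(ids, 2, BIGRAM_VOCAB)     # one index per bigram window
tri = ngram_hash(ids, 3, TRIGRAM_VOCAB)   # one index per trigram window
```

Each index would then look up a learned embedding that is added to the token's representation, giving the model cheap local lexical memory.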
U-Net skip connections
Collect skip connections during loop 0 and inject during loop 1
parameters: null
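The loop-spanning skip connections above can be sketched as: stash each block's input during loop 0, then add the stashed activations back during loop 1. A minimal numpy sketch; the reverse (U-Net-style) injection order and the toy block are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, LOOPS = 8, 2
blocks = [rng.standard_normal((D, D)) * 0.05 for _ in range(4)]  # toy shared blocks

def block_fn(x, W):
    return x + np.tanh(x @ W)

x = rng.standard_normal(D)
skips = []
for loop in range(LOOPS):
    for W in blocks:
        if loop == 0:
            skips.append(x.copy())   # collect activations during loop 0
        else:
            x = x + skips.pop()      # inject in reverse order during loop 1 (assumed)
        x = block_fn(x, W)
```

By the end of loop 1 every collected skip has been consumed, mirroring how a U-Net pairs encoder and decoder stages.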
activation
LeakyReLU(0.5) squared activation in MLP
parameters: {"activation":"LeakyReLU(0.5)^2"}
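The MLP activation above composes a LeakyReLU with negative slope 0.5 and an elementwise square. A minimal sketch of that forward function:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU with negative slope 0.5, then squared elementwise."""
    y = np.where(x > 0, x, slope * x)
    return y * y

a = leaky_relu_sq(np.array([2.0, -2.0, 0.0]))
```

Note the square makes the output non-negative on both branches; the negative branch is simply attenuated by slope² (0.25) relative to the positive one.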
tied embeddings
Input and output embeddings are tied
parameters: null
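Tied embeddings mean a single matrix serves as both the input embedding and the output projection, halving that part of the parameter budget. A minimal sketch with toy dimensions (assumed, not the submission's):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 100, 16                           # toy vocab and hidden size
E = rng.standard_normal((V, D)) * 0.02   # one matrix for both roles

def embed(ids):
    return E[ids]        # input embedding lookup

def lm_logits(h):
    return h @ E.T       # output head reuses the same matrix (tied)

h = embed([3, 7])
logits = lm_logits(h)    # one logit per vocab entry, per position
```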
KV heads
4 KV heads with GQA
parameters: {"kv_heads":4,"attention_heads":8}
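With GQA, the 8 query heads above share 4 KV heads, so each KV head serves a group of 2 query heads. A minimal single-position attention sketch using the submission's head counts (head dimension and inputs are illustrative):

```python
import numpy as np

H_Q, H_KV, D_HEAD, T = 8, 4, 8, 5   # 8 query heads, 4 KV heads (from the PR)
GROUP = H_Q // H_KV                  # 2 query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((H_Q, T, D_HEAD))
k = rng.standard_normal((H_KV, T, D_HEAD))
v = rng.standard_normal((H_KV, T, D_HEAD))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

out = np.empty_like(q)
for h in range(H_Q):
    kv = h // GROUP                                    # map query head -> shared KV head
    att = softmax(q[h] @ k[kv].T / np.sqrt(D_HEAD))    # (T, T) attention weights
    out[h] = att @ v[kv]
```

Halving the KV heads halves the KV cache and the K/V projection parameters, which matters when artifact size is scored.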
Weight Averaging
SWA
parameters: {"checkpoints":24}
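The SWA entry above averages weights across checkpoints; with 24 checkpoints that is a uniform mean per tensor. A minimal sketch with toy one-dimensional "checkpoints":

```python
import numpy as np

def swa_average(checkpoints):
    """Stochastic weight averaging: uniform mean over checkpoint tensors."""
    return sum(checkpoints) / len(checkpoints)

# 24 toy checkpoints, matching the submission's checkpoint count.
ckpts = [np.full(3, float(i)) for i in range(24)]
avg = swa_average(ckpts)   # mean of 0..23 = 11.5 in every entry
```

In practice the same mean would be taken per parameter tensor across saved checkpoints, often followed by a batch-norm/statistics refresh where applicable.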
Compression
zstd
level: 22

Novel Contributions

  • Use of recurrent tied-depth transformer blocks (8 unique blocks looped 2 times) with FiLM conditioning per iteration
  • Augmentation with BigramHash and TrigramHash lexical sidecars for richer local context
  • Exploration of L(N) optimization frontier by reusing fewer parameters more times and allocating budget to lexical memory
  • Demonstration that trigram hashing provides the strongest lexical leverage with significant BPB improvement
  • Finding that recurrence is viable and stable with competitive BPB at smaller artifact sizes
  • Identification of failure modes for test-time training (TTT) and EMA in recurrent setups
  • Discovery of a sweet spot in hash table size: a trigram vocabulary of 8192 outperforms 12288