PR #552

open

Non-record submission: RecurrentTiedDepth_8x2_FiLM records

by loveless2001
val_bpb
1.1634
Architecture
Transformer with recurrent tied-depth and FiLM conditioning
Optimizer
Artifact Size
15.34MB

Training Techniques

Quantization
int6 QAT
bits: 6
scope: null
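The int6 QAT entry above can be sketched as symmetric fake quantization: during training, weights are rounded to a 6-bit integer grid in the forward pass. This is a minimal numpy illustration of the forward math only (the per-tensor scale and the example weights are assumptions, not values from the submission):

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Fake-quantize weights to a symmetric signed int grid (QAT forward pass)."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    amax = np.max(np.abs(w))
    scale = amax / qmax if amax > 0 else 1.0        # per-tensor scale (assumed scheme)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale                          # dequantized weights + scale

w = np.array([0.8, -0.31, 0.02, -1.0])
wq, s = fake_quant_int6(w)                           # wq lies on a 6-bit grid
```

In full QAT the rounding would be paired with a straight-through gradient estimator; only the quantization grid itself is shown here.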
Architecture
depth recurrence
8 unique transformer blocks looped 2 times (16 effective layers) with FiLM scale/shift conditioning per iteration
parameters: {"unique_blocks":8,"loops":2,"effective_layers":16}
FiLM conditioning
Learned scale and shift parameters per loop iteration to condition the recurrent blocks
parameters: {"params_count":3072}
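The depth-recurrence and FiLM entries above combine as: the same 8 blocks are applied twice, and each loop iteration selects its own learned scale/shift. A minimal numpy sketch, with toy dimensions and a plain matrix standing in for a full transformer block (both assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
D, UNIQUE_BLOCKS, LOOPS = 16, 8, 2   # 8 unique blocks looped twice -> 16 effective layers

# One small weight matrix stands in for each unique transformer block.
blocks = [rng.standard_normal((D, D)) * 0.05 for _ in range(UNIQUE_BLOCKS)]
# FiLM: one learned (scale, shift) vector pair per loop iteration.
film = [(np.ones(D) + 0.1 * i, np.full(D, 0.01 * i)) for i in range(LOOPS)]

effective_layers = 0
def forward(x):
    global effective_layers
    for scale, shift in film:            # each loop iteration has its own FiLM params
        for W in blocks:                 # the same 8 blocks are reused every loop
            x = x + np.tanh((scale * x + shift) @ W)   # FiLM-conditioned residual block
            effective_layers += 1
    return x

y = forward(rng.standard_normal(D))
```

The parameter cost of FiLM is only the scale/shift vectors per iteration, which is how the conditioning stays cheap relative to adding real depth.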
BigramHash + TrigramHash
Hashed 2- and 3-token lexical sidecars for richer local context
parameters: {"bigram_vocab_size":20480,"trigram_vocab_size":8192}
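The hashed n-gram sidecars above map each 2- or 3-token window to an index into a fixed-size embedding table. A minimal sketch of the hashing side, using the submission's table sizes; the polynomial rolling hash is an assumption, not the submission's hash function:

```python
import numpy as np

BIGRAM_VOCAB, TRIGRAM_VOCAB = 20480, 8192   # table sizes from the submission

def ngram_hash(ids, n, table_size):
    """Hash each length-n window of token ids to a sidecar-table index."""
    out = []
    for i in range(len(ids) - n + 1):
        h = 0
        for t in ids[i:i + n]:
            h = (h * 1000003 + t) & 0xFFFFFFFF   # illustrative polynomial hash
        out.append(h % table_size)
    return out

ids = [5, 17, 17, 900]
bi = ngram_hash(ids, 2, BIGRAM_VOCAB)     # one index per bigram window
tri = ngram_hash(ids, 3, TRIGRAM_VOCAB)   # one index per trigram window
```

Each index would then look up a learned embedding that is added to the token's representation, giving the model cheap local lexical memory.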
U-Net skip connections
Collect skip connections during loop 0 and inject during loop 1
parameters: null
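The loop-spanning skip connections above can be sketched as: stash each block's input during loop 0, then add the stashed activations back during loop 1. A minimal numpy sketch; the reverse (U-Net-style) injection order and the toy block are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, LOOPS = 8, 2
blocks = [rng.standard_normal((D, D)) * 0.05 for _ in range(4)]  # toy shared blocks

def block_fn(x, W):
    return x + np.tanh(x @ W)

x = rng.standard_normal(D)
skips = []
for loop in range(LOOPS):
    for W in blocks:
        if loop == 0:
            skips.append(x.copy())   # collect activations during loop 0
        else:
            x = x + skips.pop()      # inject in reverse order during loop 1 (assumed)
        x = block_fn(x, W)
```

By the end of loop 1 every collected skip has been consumed, mirroring how a U-Net pairs encoder and decoder stages.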
activation
LeakyReLU(0.5) squared activation in MLP
parameters: {"activation":"LeakyReLU(0.5)^2"}
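The MLP activation above composes a LeakyReLU with negative slope 0.5 and an elementwise square. A minimal sketch of that forward function:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU with negative slope 0.5, then squared elementwise."""
    y = np.where(x > 0, x, slope * x)
    return y * y

a = leaky_relu_sq(np.array([2.0, -2.0, 0.0]))
```

Note the square makes the output non-negative on both branches; the negative branch is simply attenuated by slope² (0.25) relative to the positive one.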
tied embeddings
Input and output embeddings are tied
parameters: null
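Tied embeddings mean a single matrix serves as both the input embedding and the output projection, halving that part of the parameter budget. A minimal sketch with toy dimensions (assumed, not the submission's):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 100, 16                           # toy vocab and hidden size
E = rng.standard_normal((V, D)) * 0.02   # one matrix for both roles

def embed(ids):
    return E[ids]        # input embedding lookup

def lm_logits(h):
    return h @ E.T       # output head reuses the same matrix (tied)

h = embed([3, 7])
logits = lm_logits(h)    # one logit per vocab entry, per position
```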
KV heads
4 KV heads with GQA
parameters: {"kv_heads":4,"attention_heads":8}
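With GQA, the 8 query heads above share 4 KV heads, so each KV head serves a group of 2 query heads. A minimal single-position attention sketch using the submission's head counts (head dimension and inputs are illustrative):

```python
import numpy as np

H_Q, H_KV, D_HEAD, T = 8, 4, 8, 5   # 8 query heads, 4 KV heads (from the PR)
GROUP = H_Q // H_KV                  # 2 query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((H_Q, T, D_HEAD))
k = rng.standard_normal((H_KV, T, D_HEAD))
v = rng.standard_normal((H_KV, T, D_HEAD))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

out = np.empty_like(q)
for h in range(H_Q):
    kv = h // GROUP                                    # map query head -> shared KV head
    att = softmax(q[h] @ k[kv].T / np.sqrt(D_HEAD))    # (T, T) attention weights
    out[h] = att @ v[kv]
```

Halving the KV heads halves the KV cache and the K/V projection parameters, which matters when artifact size is scored.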
Weight Averaging
SWA
parameters: {"checkpoints":24}
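The SWA entry above averages weights across checkpoints; with 24 checkpoints that is a uniform mean per tensor. A minimal sketch with toy one-dimensional "checkpoints":

```python
import numpy as np

def swa_average(checkpoints):
    """Stochastic weight averaging: uniform mean over checkpoint tensors."""
    return sum(checkpoints) / len(checkpoints)

# 24 toy checkpoints, matching the submission's checkpoint count.
ckpts = [np.full(3, float(i)) for i in range(24)]
avg = swa_average(ckpts)   # mean of 0..23 = 11.5 in every entry
```

In practice the same mean would be taken per parameter tensor across saved checkpoints, often followed by a batch-norm/statistics refresh where applicable.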
Compression
zstd
level: 22

Novel Contributions

  • Use of recurrent tied-depth transformer blocks (8 unique blocks looped 2 times) with FiLM conditioning per iteration
  • Augmentation with BigramHash and TrigramHash lexical sidecars for richer local context
  • Exploration of L(N) optimization frontier by reusing fewer parameters more times and allocating budget to lexical memory
  • Demonstration that trigram hashing provides the strongest lexical leverage with significant BPB improvement
  • Finding that recurrence is viable and stable with competitive BPB at smaller artifact sizes
  • Identification of failure modes for test-time training (TTT) and EMA in recurrent setups
  • Discovery of a sweet spot in hash table size: a trigram vocabulary of 8192 outperforms 12288