val_bpb: 1.1634
Architecture: Transformer with recurrent tied-depth and FiLM conditioning
Optimizer: —
Artifact Size: 15.34 MB
Training Techniques
Quantization
- int6 QAT (bits: 6, scope: null)
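As a rough illustration (not the project's actual training code), int6 quantization-aware training typically inserts a fake-quantize step in the forward pass: weights are rounded to one of 2^6 signed levels and immediately dequantized, while the backward pass treats the step as identity (straight-through estimator). A minimal pure-Python sketch of symmetric per-tensor fake quantization:

```python
def fake_quant_int6(weights, bits=6):
    """Symmetric per-tensor fake quantization: round each weight to one of
    2**bits signed levels, then map back to float. In QAT this runs in the
    forward pass; gradients flow through as if it were the identity."""
    qmax = 2 ** (bits - 1) - 1                     # 31 for int6
    scale = max(abs(w) for w in weights) / qmax or 1.0
    quantized = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]

w = [0.8, -0.31, 0.02, -1.0]
print(fake_quant_int6(w))
```

With 6 bits the per-tensor rounding error is bounded by half a quantization step, which is what makes the 15.34 MB artifact size achievable at modest BPB cost.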
Architecture
- Depth recurrence: 8 unique transformer blocks looped 2 times (16 effective layers) with FiLM scale/shift conditioning per iteration. parameters: {"unique_blocks":8,"loops":2,"effective_layers":16}
- FiLM conditioning: learned scale and shift parameters per loop iteration to condition the recurrent blocks. parameters: {"params_count":3072}
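A hypothetical sketch of the loop structure these two entries describe: the same 8 blocks are applied twice, and each loop iteration has its own learned FiLM scale/shift that modulates the hidden state before the blocks run. Names, shapes, and the point where FiLM is applied are illustrative assumptions, not the project's code:

```python
def film(hidden, scale, shift):
    """FiLM conditioning: elementwise affine modulation of the hidden state."""
    return [h * s + b for h, s, b in zip(hidden, scale, shift)]

def recurrent_forward(hidden, blocks, film_params, loops=2):
    """Apply the same stack of unique blocks `loops` times; each loop
    iteration conditions the input with its own learned scale/shift,
    so the shared weights can behave differently per iteration."""
    for i in range(loops):
        scale, shift = film_params[i]      # per-iteration FiLM parameters
        hidden = film(hidden, scale, shift)
        for block in blocks:               # 8 unique blocks, reused each loop
            hidden = block(hidden)
    return hidden

# Toy demo: "blocks" that add 1 to every element, identity FiLM.
blocks = [lambda h: [x + 1 for x in h]] * 8
film_params = [([1.0] * 4, [0.0] * 4)] * 2
print(recurrent_forward([0.0] * 4, blocks, film_params))  # 16 effective layers
```

The per-iteration FiLM parameters are what let 8 blocks act as 16 distinct effective layers at roughly half the parameter cost.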
- BigramHash + TrigramHash: hashed 2- and 3-token lexical sidecars for richer local context. parameters: {"bigram_vocab_size":20480,"trigram_vocab_size":8192}
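One plausible reading of the hashed n-gram sidecars: the most recent 2 (or 3) token ids are hashed into a fixed-size embedding table, giving the model cheap local lexical memory alongside the transformer. The table sizes below come from the parameters above; the mixing constants are illustrative:

```python
def ngram_hash(token_ids, table_size):
    """Hash a tuple of recent token ids into a fixed-size table index.
    Any decent integer mix works; this multiply-xor scheme is illustrative."""
    h = 0
    for t in token_ids:
        h = (h * 1000003 ^ t) & 0xFFFFFFFF   # order-sensitive mixing
    return h % table_size

BIGRAM_VOCAB, TRIGRAM_VOCAB = 20480, 8192
tokens = [17, 4021, 88, 310]
bigram_idx = ngram_hash(tokens[-2:], BIGRAM_VOCAB)    # last 2 tokens
trigram_idx = ngram_hash(tokens[-3:], TRIGRAM_VOCAB)  # last 3 tokens
print(bigram_idx, trigram_idx)
```

Hash collisions are tolerated by design: a few n-grams sharing an embedding slot costs little, while the fixed table keeps the artifact size bounded.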
- U-Net skip connections: collect skip connections during loop 0 and inject them during loop 1. parameters: null
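A sketch of how the cross-loop skips might work, assuming a U-Net-style pairing where block i of loop 1 receives the output of block i of loop 0 (the exact pairing is not specified above, so this is one reasonable interpretation):

```python
def unet_recurrent_forward(hidden, blocks):
    """Loop 0: run the blocks and stash each block's output as a skip.
    Loop 1: run the same blocks again, adding the saved skip from the
    matching block of loop 0 to each block's output."""
    skips = []
    for block in blocks:                        # loop 0: collect
        hidden = block(hidden)
        skips.append(hidden)
    for block, skip in zip(blocks, skips):      # loop 1: inject
        hidden = [h + s for h, s in zip(block(hidden), skip)]
    return hidden

blocks = [lambda h: [x + 1 for x in h]] * 3    # toy 3-block stack
print(unet_recurrent_forward([0.0, 0.0], blocks))
```

The skips give loop-1 blocks a direct path back to shallower features, which helps gradients cross the recurrence boundary.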
- Activation: squared LeakyReLU(0.5) in the MLP. parameters: {"activation":"LeakyReLU(0.5)^2"}
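The MLP activation can be written directly; assuming 0.5 is the negative slope, the leaky-rectified value is squared, which keeps outputs non-negative and smooths the kink at zero (similar in spirit to the squared-ReLU activations used in some small-LM training recipes):

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU(0.5)^2: leaky-rectify, then square. Note the square makes
    the negative branch positive: an input of -x maps to (0.5*x)**2."""
    y = x if x >= 0 else negative_slope * x
    return y * y

print([leaky_relu_squared(x) for x in (-2.0, -0.5, 0.0, 1.0, 3.0)])
```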
- Tied embeddings: input and output embeddings share weights. parameters: null
- KV heads: 4 KV heads with grouped-query attention (GQA). parameters: {"kv_heads":4,"attention_heads":8}
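With 8 attention heads and 4 KV heads, grouped-query attention shares each K/V projection between 2 query heads, halving the KV cache and KV parameter count. The standard head-to-group mapping looks like this (a sketch, not the project's code):

```python
def kv_head_for(query_head, n_heads=8, n_kv_heads=4):
    """GQA mapping: query heads are split into n_kv_heads contiguous
    groups; every query head in a group attends over the same K/V."""
    group_size = n_heads // n_kv_heads   # 2 query heads per KV head
    return query_head // group_size

print([kv_head_for(h) for h in range(8)])  # -> [0, 0, 1, 1, 2, 2, 3, 3]
```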
Weight Averaging
- SWA (stochastic weight averaging). parameters: {"checkpoints":24}
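SWA over 24 checkpoints is a uniform average of their parameter vectors; a running-mean form avoids holding all checkpoints in memory at once. A minimal sketch (flat parameter lists stand in for real state dicts):

```python
def swa_average(checkpoints):
    """Stochastic weight averaging: uniform running mean of the parameter
    vectors from successive checkpoints, updated incrementally."""
    avg = list(checkpoints[0])
    for n, ckpt in enumerate(checkpoints[1:], start=2):
        avg = [a + (w - a) / n for a, w in zip(avg, ckpt)]
    return avg

# Toy demo with 3 two-parameter "checkpoints".
print(swa_average([[0.0, 3.0], [1.0, 3.0], [2.0, 3.0]]))  # -> [1.0, 3.0]
```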
Compression
- zstd (level: 22)
Novel Contributions
- Use of recurrent tied-depth transformer blocks (8 unique blocks looped 2 times) with FiLM conditioning per iteration
- Augmentation with BigramHash and TrigramHash lexical sidecars for richer local context
- Exploration of the L(N) optimization frontier by reusing fewer parameters more times and allocating the freed budget to lexical memory
- Demonstration that trigram hashing provides the strongest lexical leverage, with a significant BPB improvement
- Finding that recurrence is viable and stable, with competitive BPB at smaller artifact sizes
- Identification of failure modes for test-time training (TTT) and EMA in recurrent setups
- Discovery of a sweet spot in hash table size: a trigram vocabulary of 8192 outperforms 12288