val_bpb: 1.1412
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13,841,922 bytes
Training Techniques
Architecture
depth recurrence
Universal Transformer-style recurrence with 6 unique blocks applied 4 times each for 24 effective layers.
parameters: {"unique_blocks":6,"recurrence_steps":4,"effective_layers":24}
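A minimal sketch of the recurrence schedule, in plain Python; the toy block functions below stand in for full transformer blocks and are purely illustrative:

```python
def run_recurrent_stack(x, blocks, recurrence_steps=4):
    """Apply each unique block once per recurrence step.

    With 6 unique blocks and 4 steps this yields 6 * 4 = 24 effective
    layers while storing parameters for only 6 blocks (weight tying).
    """
    for _ in range(recurrence_steps):
        for block in blocks:      # the same 6 blocks are reused every step
            x = block(x)
    return x

# Toy blocks: each just perturbs the state; real blocks are transformers.
blocks = [lambda v, i=i: [e + 0.01 * i for e in v] for i in range(6)]
out = run_recurrent_stack([1.0, 2.0], blocks)
```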
weight tying
Shared block weights reused across recurrence steps.
parameters: null
BigramHash
Bigram context embeddings indexed by hashed token bigrams.
parameters: {"buckets":2048,"dim":512}
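A sketch of the hashed-bigram lookup with the bucket count and dimension above; the mixing hash is an assumption, as the card does not specify the hash function:

```python
import random

BUCKETS, DIM = 2048, 512  # from the parameters above
random.seed(0)
bigram_table = [[random.gauss(0.0, 0.02) for _ in range(DIM)]
                for _ in range(BUCKETS)]

def bigram_bucket(prev_token: int, token: int) -> int:
    """Hash a token bigram into one of BUCKETS embedding rows.
    Illustrative mixing hash; the real hash is not specified."""
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % BUCKETS

def bigram_embedding(tokens, pos):
    """Look up the bigram embedding for position `pos`."""
    prev_token = tokens[pos - 1] if pos > 0 else 0
    return bigram_table[bigram_bucket(prev_token, tokens[pos])]
```

The looked-up row would typically be added to the ordinary token embedding at each position, giving cheap local-context signal at a cost of only 2048 x 512 extra parameters.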
U-Net skip connections
Stored intermediate states are reused in reverse order during later recurrence steps.
parameters: {"steps":24}
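A minimal sketch of the skip pattern: states saved during the first half of the recurrence are added back in reverse (LIFO) order during the second half. Whether skips are applied before or after each block, and how they interleave with the 6-block schedule, is not specified here, so the details below are assumptions:

```python
def run_unet_recurrence(x, block, total_steps=24):
    """Recurrence loop with U-Net-style skips: states stored in the
    first half are reused in reverse order in the second half."""
    half = total_steps // 2
    saved = []
    for _ in range(half):
        saved.append(x)                    # store intermediate state
        x = block(x)
    for _ in range(half):
        x = block(x)
        skip = saved.pop()                 # reverse order (LIFO)
        x = [a + b for a, b in zip(x, skip)]
    return x
```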
LeakyReLU
Uses LeakyReLU(0.5) squared activation in the MLP.
parameters: {"alpha":0.5}
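The activation itself is one line: apply LeakyReLU with negative slope 0.5, then square the result.

```python
def leaky_relu_squared(x: float, alpha: float = 0.5) -> float:
    """Squared LeakyReLU: LeakyReLU(alpha)(x) ** 2.

    Negative inputs are damped by alpha before squaring, so they
    contribute alpha**2 * x**2 = 0.25 * x**2 to the output.
    """
    y = x if x > 0 else alpha * x
    return y * y

leaky_relu_squared(2.0)   # -> 4.0
leaky_relu_squared(-2.0)  # -> 1.0
```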
GQA
Grouped query attention with fewer KV heads than query heads.
parameters: {"num_heads":8,"kv_heads":4}
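The head grouping implied by these parameters can be written out directly: consecutive query heads share one KV head, halving the KV cache relative to full multi-head attention.

```python
def kv_head_for_query_head(q_head: int, num_heads: int = 8, kv_heads: int = 4) -> int:
    """GQA head mapping: with 8 query heads and 4 KV heads, each KV
    head serves 8 // 4 = 2 consecutive query heads."""
    group_size = num_heads // kv_heads
    return q_head // group_size

# Query heads 0..7 map onto KV heads [0, 0, 1, 1, 2, 2, 3, 3].
mapping = [kv_head_for_query_head(h) for h in range(8)]
```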
Quantization
INT6 QAT
bits: 6
scope: block weights
STE QAT
bits: 6
scope: all linear weights
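A sketch of the symmetric INT6 fake-quantization forward pass used in QAT. The per-tensor absmax scaling is an assumption; in training, the straight-through estimator (STE) treats the round/clamp as identity so gradients flow to the latent full-precision weights.

```python
def fake_quant_int6(w):
    """Symmetric INT6 fake quantization (forward pass only).

    Weights are scaled by absmax / 31 (symmetric int6 uses +/-31),
    rounded, clamped, and rescaled back to floats on the int6 grid.
    """
    qmax = 31                                   # 2**(6 - 1) - 1
    absmax = max(abs(v) for v in w) or 1.0
    scale = absmax / qmax
    q = [max(-qmax, min(qmax, round(v / scale))) for v in w]
    return [qi * scale for qi in q]
```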
GPTQ
bits: 6
scope: linear layers
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"lr":0.04}
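Muon accumulates momentum per 2-D weight matrix and then approximately orthogonalizes the update with a Newton-Schulz iteration. The sketch below uses the widely published quintic coefficients and a placeholder momentum value (the card leaves momentum unspecified); weight decay is applied in decoupled form, which is an assumption:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration that approximately orthogonalizes
    the momentum matrix G (the core of Muon). Coefficients are the
    common quintic variant; the exact values used here are assumed."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.04, momentum=0.95, weight_decay=0.04):
    """One Muon update with decoupled weight decay (lr and weight_decay
    from the card; the momentum value is a placeholder)."""
    buf = momentum * buf + grad
    w = w * (1.0 - lr * weight_decay) - lr * newton_schulz(buf)
    return w, buf
```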
Compression
zlib
level: 9
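The final artifact compression is stock `zlib` at its maximum level:

```python
import zlib

def compress_artifact(raw: bytes) -> bytes:
    """Compress the serialized model bytes with zlib at level 9 (max)."""
    return zlib.compress(raw, level=9)

blob = bytes(range(256)) * 64            # stand-in for the model bytes
packed = compress_artifact(blob)
restored = zlib.decompress(packed)       # round-trips losslessly
```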
Evaluation
sliding window eval
parameters: {"stride":64}
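With stride 64 and a 1024-token window, each window scores only the tokens not covered by the previous window, so every scored token after the first window sees at least 1024 - 64 = 960 tokens of context. A sketch of the span bookkeeping (the exact implementation is not specified):

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Return (start, end, score_from) spans for sliding-window eval.

    Each window covers tokens [start, end); only tokens in
    [score_from, end) contribute to the loss, so every token is
    scored exactly once."""
    spans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```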
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Regularization
weight decay
parameters: {"value":0.04}
Novel Contributions
- Universal Transformer-style depth recurrence for parameter sharing across layers
- FiLM conditioning to specialize shared blocks across recurrence steps
- U-Net skip connections adapted to the recurrence loop
- BigramHash embeddings for cheap local context modeling
- LeakyReLU(0.5)^2 activation for tied-weight recurrence
- INT6 QAT with STE and GPTQ-style per-row clipping
- Muon optimization with weight decay tuned for compression
- Sliding-window evaluation with stride 64