val_bpb: 1.1908 (validation bits per byte)
Architecture: Transformer
Optimizer: —
Artifact Size: 13,701,318 bytes
Training Techniques

Quantization: int8 (bits: 8, scope: model weights)
Compression: zlib (level: null)
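The quantization and compression entries above describe a two-stage packaging pipeline: quantize the float weights to int8, then zlib-compress the serialized bytes. The sketch below is a hypothetical illustration, not the submission's actual code; `quantize_int8`, `package`, and `unpackage` are invented names, and `level: null` is read here as "use zlib's library default level".

```python
import struct
import zlib

def quantize_int8(weights):
    # Symmetric per-tensor int8 quantization (bits: 8, scope: model weights).
    amax = max(abs(w) for w in weights) or 1.0
    scale = amax / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def package(q, scale):
    # Serialize one float32 scale followed by the int8 weights, then compress.
    # Omitting the level argument uses zlib's default, matching "level: null".
    raw = struct.pack(f"<f{len(q)}b", scale, *q)
    return zlib.compress(raw)

def unpackage(blob, n):
    # Inverse of package(): decompress, then split scale from weights.
    vals = struct.unpack(f"<f{n}b", zlib.decompress(blob))
    return list(vals[1:]), vals[0]
```

Under this reading, `len(package(...))` over all tensors would be the quantity reported as Artifact Size.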
Architecture

Depth recurrence: folded recurrent transformer that reuses shared transformer blocks across 5 recurrent folds with fold-specific modulation.
  parameters: {"folds": 5, "shared_blocks": 2, "fold_state_dim": 496, "visible_dim": 576, "exit_blocks": 4, "stem_blocks": 1}
Weight tying: tied input embeddings and output head.
  parameters: null
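The depth-recurrence idea above can be sketched minimally: the same shared blocks are run once per fold, and only small fold-specific modulation vectors change between folds. This is a toy stand-in under stated assumptions, not the submission's implementation: blocks are reduced to random linear maps, modulation is a per-fold gain, and dimensions are shrunk (the card's run uses visible_dim=576 and fold_state_dim=496).

```python
import random

# Toy dimensions; the card's run uses visible_dim=576, fold_state_dim=496.
FOLDS, SHARED_BLOCKS, DIM = 5, 2, 8

random.seed(0)

def make_block(dim):
    # One "transformer block" stand-in: a random square linear map.
    w = [[random.gauss(0.0, dim ** -0.5) for _ in range(dim)] for _ in range(dim)]
    return lambda x: [sum(row[j] * x[j] for j in range(dim)) for row in w]

shared = [make_block(DIM) for _ in range(SHARED_BLOCKS)]   # weights stored once
fold_gain = [[1.0 + 0.1 * f] * DIM for f in range(FOLDS)]  # fold-specific modulation

def forward(x):
    # The same shared blocks execute FOLDS times; only the small per-fold
    # gain vectors differ, so effective depth grows without new block weights.
    for f in range(FOLDS):
        for block in shared:
            h = block([xi * gi for xi, gi in zip(x, fold_gain[f])])
            x = [xi + hi for xi, hi in zip(x, h)]  # residual add
    return x
```

The trade is visible in the loop structure: 5 × 2 = 10 block applications per token, backed by only 2 stored blocks plus the tiny gain vectors.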
Sequence Length (sequence_length)
  train_length: 1024
  eval_length: null
Test-Time Training (test_time_training)
  parameters: null
Novel Contributions
- Paper-folded recurrent transformer architecture
- Uses 5 recurrent folds with shared transformer blocks and fold-specific controls
- Trades stored parameters for repeated computation under a 16MB artifact budget
- Demonstrates a non-record SP8192 run with no TTT
- Achieves 1.19084839 BPB with a 13,701,318-byte packaged artifact
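The parameters-for-computation trade claimed above can be checked against the card's own block counts (stem_blocks=1, folds=5, shared_blocks=2, exit_blocks=4), assuming the stem and exit blocks each run once and the shared blocks run once per fold:

```python
# Block counts taken from the architecture parameters in this card.
stem, folds, shared, exit_blocks = 1, 5, 2, 4

# Assumption: stem and exit run once; shared blocks run once per fold.
effective_depth = stem + folds * shared + exit_blocks  # blocks executed per token
stored_blocks = stem + shared + exit_blocks            # distinct blocks stored

print(effective_depth, stored_blocks)  # 15 executed vs 7 stored
```

Under that assumption the model executes 15 blocks of depth while storing weights for only 7, which is the mechanism behind fitting the run into the 16MB artifact budget.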