PR #1949 (open)

Non-record: Paper-Folded Recurrent GPT SP8192 5-fold

by ymrohit
val_bpb: 1.1908
Architecture: Transformer
Optimizer:
Artifact Size: 13,701,318 bytes

Training Techniques

  • Quantization: int8 (bits: 8, scope: model weights)
  • Compression: zlib (level: null)
  • Architecture:
    - Depth recurrence: folded recurrent transformer that reuses shared transformer blocks across 5 recurrent folds with fold-specific modulation (a minimal sketch follows this list). Parameters: {"folds":5,"shared_blocks":2,"fold_state_dim":496,"visible_dim":576,"exit_blocks":4,"stem_blocks":1}
    - Weight tying: tied input embeddings and output head (parameters: null)
  • Sequence Length: train_length: 1024, eval_length: null
  • Test-Time Training: test_time_training (parameters: null)
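
The sketch below illustrates one way a depth-recurrent ("paper-folded") forward pass of this shape can be organized: a stem block, a small shared stack re-applied for 5 folds with per-fold modulation, exit blocks, and a tied output head. All names here (FoldedGPT, Block, fold_scale, fold_shift) are illustrative assumptions rather than the PR's actual code, and the fold_state_dim=496 recurrent-state channel is omitted for brevity.

```python
# Hypothetical sketch of a folded (depth-recurrent) GPT with shared blocks and
# fold-specific modulation; not the PR's implementation.
import torch
import torch.nn as nn

class Block(nn.Module):
    """Standard pre-norm transformer block (causal mask omitted for brevity)."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        a = self.ln1(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))

class FoldedGPT(nn.Module):
    def __init__(self, vocab_size, visible_dim=576, folds=5,
                 stem_blocks=1, shared_blocks=2, exit_blocks=4):
        super().__init__()
        self.folds = folds
        self.embed = nn.Embedding(vocab_size, visible_dim)
        self.stem = nn.ModuleList(Block(visible_dim) for _ in range(stem_blocks))
        # The shared stack is stored once but executed `folds` times:
        # stored parameters are traded for repeated computation.
        self.shared = nn.ModuleList(Block(visible_dim) for _ in range(shared_blocks))
        # Fold-specific modulation: a cheap per-fold scale/shift (FiLM-style guess).
        self.fold_scale = nn.Parameter(torch.ones(folds, visible_dim))
        self.fold_shift = nn.Parameter(torch.zeros(folds, visible_dim))
        self.exit_stack = nn.ModuleList(Block(visible_dim) for _ in range(exit_blocks))
        self.norm = nn.LayerNorm(visible_dim)
        self.head = nn.Linear(visible_dim, vocab_size, bias=False)
        self.head.weight = self.embed.weight  # weight tying: embedding reused as output head

    def forward(self, idx):
        x = self.embed(idx)
        for blk in self.stem:
            x = blk(x)
        for f in range(self.folds):  # depth recurrence: same weights, 5 passes
            h = x * self.fold_scale[f] + self.fold_shift[f]
            for blk in self.shared:
                h = blk(h)
            x = h
        for blk in self.exit_stack:
            x = blk(x)
        return self.head(self.norm(x))
```

With this layout the effective depth grows with folds × shared_blocks at run time while the stored block weights stay constant, which is the trade the contributions list describes.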

Novel Contributions

  • Paper-folded recurrent transformer architecture
  • Uses 5 recurrent folds with shared transformer blocks and fold-specific controls
  • Trades stored parameters for repeated computation under a 16MB artifact budget
  • Demonstrates a non-record SP8192 run with no TTT
  • Achieves 1.19084839 BPB with a 13,701,318-byte packaged artifact (a packaging sketch follows this list)
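
The artifact-size figures above imply a quantize-then-compress packaging step (int8 weights, then zlib). The sketch below shows one plausible serialization; quantize_int8, pack_artifact, and the on-disk layout are assumptions for illustration, not the PR's actual packer.

```python
# Hypothetical int8 + zlib packaging of model weights under a byte budget.
import io, json, zlib
import numpy as np
import torch

def quantize_int8(t: torch.Tensor):
    """Per-tensor symmetric quantization: int8 values plus one float scale."""
    scale = t.abs().max().item() / 127.0 or 1.0  # guard against all-zero tensors
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q.numpy(), scale

def pack_artifact(state_dict, path):
    buf, meta = io.BytesIO(), {}
    for name, tensor in state_dict.items():
        q, scale = quantize_int8(tensor.float().cpu())
        meta[name] = {"shape": list(q.shape), "scale": scale, "offset": buf.tell()}
        buf.write(q.tobytes())
    header = json.dumps(meta).encode()
    payload = len(header).to_bytes(8, "little") + header + buf.getvalue()
    with open(path, "wb") as f:
        # zlib on top of the int8 payload; level left at the library default,
        # matching the "level: null" entry in the techniques list.
        f.write(zlib.compress(payload))
```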