PR #1949 (open)

Non-record: Paper-Folded Recurrent GPT SP8192 5-fold

by ymrohit
val_bpb: 1.1908
Architecture: Transformer
Optimizer:
Artifact Size: 13,701,318 bytes

Training Techniques

  • Quantization: int8 (bits: 8, scope: model weights)
  • Compression: zlib (level: null)
  • Architecture:
    - Depth recurrence: folded recurrent transformer that reuses shared transformer blocks across 5 recurrent folds with fold-specific modulation (a minimal sketch follows this list). Parameters: {"folds":5,"shared_blocks":2,"fold_state_dim":496,"visible_dim":576,"exit_blocks":4,"stem_blocks":1}
    - Weight tying: tied input embeddings and output head (parameters: null)
  • Sequence Length: train_length: 1024, eval_length: null
  • Test-Time Training: test_time_training (parameters: null)
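
The sketch below illustrates one way a depth-recurrent ("paper-folded") forward pass of this shape can be organized: a stem block, a small shared stack re-applied for 5 folds with per-fold modulation, exit blocks, and a tied output head. All names here (FoldedGPT, Block, fold_scale, fold_shift) are illustrative assumptions rather than the PR's actual code, and the fold_state_dim=496 recurrent-state channel is omitted for brevity.

```python
# Hypothetical sketch of a folded (depth-recurrent) GPT with shared blocks and
# fold-specific modulation; not the PR's implementation.
import torch
import torch.nn as nn

class Block(nn.Module):
    """Standard pre-norm transformer block (causal mask omitted for brevity)."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        a = self.ln1(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))

class FoldedGPT(nn.Module):
    def __init__(self, vocab_size, visible_dim=576, folds=5,
                 stem_blocks=1, shared_blocks=2, exit_blocks=4):
        super().__init__()
        self.folds = folds
        self.embed = nn.Embedding(vocab_size, visible_dim)
        self.stem = nn.ModuleList(Block(visible_dim) for _ in range(stem_blocks))
        # The shared stack is stored once but executed `folds` times:
        # stored parameters are traded for repeated computation.
        self.shared = nn.ModuleList(Block(visible_dim) for _ in range(shared_blocks))
        # Fold-specific modulation: a cheap per-fold scale/shift (FiLM-style guess).
        self.fold_scale = nn.Parameter(torch.ones(folds, visible_dim))
        self.fold_shift = nn.Parameter(torch.zeros(folds, visible_dim))
        self.exit_stack = nn.ModuleList(Block(visible_dim) for _ in range(exit_blocks))
        self.norm = nn.LayerNorm(visible_dim)
        self.head = nn.Linear(visible_dim, vocab_size, bias=False)
        self.head.weight = self.embed.weight  # weight tying: embedding reused as output head

    def forward(self, idx):
        x = self.embed(idx)
        for blk in self.stem:
            x = blk(x)
        for f in range(self.folds):  # depth recurrence: same weights, 5 passes
            h = x * self.fold_scale[f] + self.fold_shift[f]
            for blk in self.shared:
                h = blk(h)
            x = h
        for blk in self.exit_stack:
            x = blk(x)
        return self.head(self.norm(x))
```

With this layout the effective depth grows with folds × shared_blocks at run time while the stored block weights stay constant, which is the trade the contributions list describes.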

Novel Contributions

  • Paper-folded recurrent transformer architecture
  • Uses 5 recurrent folds with shared transformer blocks and fold-specific controls
  • Trades stored parameters for repeated computation under a 16MB artifact budget
  • Demonstrates a non-record SP8192 run with no TTT
  • Achieves 1.19084839 BPB with a 13,701,318-byte packaged artifact (a packaging sketch follows this list)
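
The artifact-size figures above imply a quantize-then-compress packaging step (int8 weights, then zlib). The sketch below shows one plausible serialization; quantize_int8, pack_artifact, and the on-disk layout are assumptions for illustration, not the PR's actual packer.

```python
# Hypothetical int8 + zlib packaging of model weights under a byte budget.
import io, json, zlib
import numpy as np
import torch

def quantize_int8(t: torch.Tensor):
    """Per-tensor symmetric quantization: int8 values plus one float scale."""
    scale = t.abs().max().item() / 127.0 or 1.0  # guard against all-zero tensors
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q.numpy(), scale

def pack_artifact(state_dict, path):
    buf, meta = io.BytesIO(), {}
    for name, tensor in state_dict.items():
        q, scale = quantize_int8(tensor.float().cpu())
        meta[name] = {"shape": list(q.shape), "scale": scale, "offset": buf.tell()}
        buf.write(q.tobytes())
    header = json.dumps(meta).encode()
    payload = len(header).to_bytes(8, "little") + header + buf.getvalue()
    with open(path, "wb") as f:
        # zlib on top of the int8 payload; level left at the library default,
        # matching the "level: null" entry in the techniques list.
        f.write(zlib.compress(payload))
```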