| Metric | Value |
| --- | --- |
| val_bpb | 1.6644 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 11.74 MB |
Training Techniques
Architecture
- depth recurrence: physically instantiates 6 unique Transformer blocks and routes data through them in a palindrome loop to simulate 12 logical layers. Parameters: `{"unique_blocks": 6, "logical_layers": 12}`
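A minimal sketch of the palindrome routing, with simple residual layers standing in for real Transformer blocks (hidden width, block internals, and the exact visit order are assumptions not stated in the card):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hidden width (assumption)

# 6 physically unique blocks; a residual tanh layer stands in for a
# full attention+MLP Transformer block
blocks = [0.1 * rng.standard_normal((D, D)) for _ in range(6)]

def block_forward(W, x):
    return x + np.tanh(x @ W)  # residual update

def palindrome_forward(x):
    # Visit blocks 0..5 and then 5..0: 12 logical layers from 6 unique
    # blocks, and the visit order reads the same forwards and backwards.
    order = list(range(len(blocks))) + list(range(len(blocks) - 1, -1, -1))
    for i in order:
        x = block_forward(blocks[i], x)
    return x, order
```

Weights are shared between the first and second half of the stack, so the 12 logical layers cost only 6 blocks' worth of parameters.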
- parallel residuals: computes the attention and MLP branches simultaneously and injects them into the residual stream together to help gradient flow. Parameters: none
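A toy contrast between the usual sequential block and a parallel-residual block; the `tanh` projections are hypothetical stand-ins for the attention and MLP branches:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # hidden width (assumption)
W_attn = 0.1 * rng.standard_normal((D, D))  # stand-in for attention
W_mlp = 0.1 * rng.standard_normal((D, D))   # stand-in for the MLP

def sequential_block(x):
    # conventional ordering: the MLP sees the attention-updated stream
    x = x + np.tanh(x @ W_attn)
    return x + np.tanh(x @ W_mlp)

def parallel_block(x):
    # parallel residuals: both branches read the same input and are
    # added to the residual stream in one step
    return x + np.tanh(x @ W_attn) + np.tanh(x @ W_mlp)
```

In the parallel form the two branches have no data dependency, so they can be computed concurrently, and each branch's gradient reaches the input through a single residual hop.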
Test-Time Training
- full TTT: parameters: `{"micro_batching": true}`
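Micro-batched test-time training might look like the sketch below, which adapts the model on each evaluation micro-batch before scoring the next; the linear least-squares model, loss, and learning rate are illustrative assumptions, not the card's actual setup:

```python
import numpy as np

def ttt_eval(W, xs, ys, lr=0.05, micro_batch=4):
    """Full-TTT sketch: score each eval micro-batch, then take a gradient
    step on it so later micro-batches see an adapted model."""
    W = W.copy()
    losses = []
    for i in range(0, len(xs), micro_batch):
        xb, yb = xs[i:i + micro_batch], ys[i:i + micro_batch]
        pred = xb @ W
        losses.append(float(np.mean((pred - yb) ** 2)))
        grad = 2.0 * xb.T @ (pred - yb) / len(xb)  # MSE gradient
        W -= lr * grad  # update all parameters ("full" TTT)
    return W, losses
```

Splitting the eval stream into micro-batches bounds the memory of each test-time gradient step while still letting every later prediction benefit from earlier adaptation.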
Quantization
- QAT: bits: 6, scope: all
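QAT centers on a fake-quantization step in the forward pass; a symmetric per-tensor 6-bit version could look like the following (the rounding scheme and clipping range are assumptions):

```python
import numpy as np

def fake_quantize(w, bits=6):
    """Symmetric per-tensor fake quantization (QAT forward pass).
    During training, gradients would pass through unchanged via the
    straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1        # 31 for 6-bit signed
    scale = np.max(np.abs(w)) / qmax
    if scale == 0.0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

Because the network trains against the rounded weights, it learns to tolerate 6-bit precision, which is what allows the artifact to be stored compactly.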
Compression
- zlib: level: null
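Packing the artifact with zlib could be as simple as the sketch below; the serialization format (`np.savez`) is an assumption, and the compression level is left to the library default to match `level: null`:

```python
import io
import zlib

import numpy as np

def pack_weights(arrays):
    """Serialize arrays to bytes, then zlib-compress them
    (default compression level)."""
    buf = io.BytesIO()
    np.savez(buf, *arrays)
    raw = buf.getvalue()
    return zlib.compress(raw), len(raw)
```

Quantized weights have few distinct values, so a generic byte-level compressor like zlib recovers much of the redundancy left after QAT.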
Sequence Length
- sequence_length: train_length: 65536, eval_length: null
LR Schedule
- warmup: parameters: `{"warmup_steps": 20}`
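A 20-step warmup could be implemented as below; the linear ramp and the constant rate after warmup are assumptions, since the card only specifies `warmup_steps`:

```python
def warmup_lr(step, base_lr, warmup_steps=20):
    """Linear warmup to base_lr over warmup_steps; the post-warmup
    schedule (held constant here) is an assumption."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```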
Novel Contributions
- Symmetrical modulo routing with palindrome depth recurrence
- Parallel residual computation for improved gradient flow
- TTT micro-batching during evaluation
- 6-bit quantization-aware training for artifact size reduction