PR #1569

open

[Non Record] Fractal recurrent primitive hybrid - SP1024 1xH100

by abbudjoe
val_bpb
1.3576
Architecture
Hybrid
Optimizer
Artifact Size
14,440,584 bytes

Training Techniques

Architecture
depth recurrence
Replaced a single middle transformer block with a Fractal recurrent primitive in an otherwise transformer-derived 11L/512 SP1024 model.
parameters: {"layers":11,"dimension":512,"schedule":"AAAAAPAAAAA"}
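The schedule string can be read as one letter per layer. A minimal sketch of expanding it into module types, assuming "A" denotes a standard attention block and "P" the Fractal recurrent primitive (the letter encoding is inferred from the single-middle-block swap described above, not spelled out in the PR):

```python
# Hypothetical expansion of the 11-layer schedule string into layer types.
# "A" = attention block, "P" = Fractal recurrent primitive (assumed encoding).
SCHEDULE = "AAAAAPAAAAA"

def build_layer_types(schedule: str) -> list[str]:
    mapping = {"A": "attention_block", "P": "fractal_recurrent_primitive"}
    return [mapping[c] for c in schedule]

layers = build_layer_types(SCHEDULE)
# The single "P" sits at index 5, the middle of the 11-layer stack.
```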
Quantization
int8
bits: 8
scope: all
mixed int6
bits: 6
scope: default export
Weight Averaging
EMA
parameters: {"decay":0.9965}
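A minimal EMA update sketch using the decay value from the parameters above; the actual training loop's EMA implementation is not shown in this summary:

```python
def ema_update(ema_params: list[float], params: list[float],
               decay: float = 0.9965) -> list[float]:
    """One exponential-moving-average step over flat parameter lists
    (decay taken from the PR's parameters; everything else illustrative)."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# One step starting from a zeroed average:
ema = ema_update([0.0], [1.0])  # -> [0.0035]
```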
Compression
zstd
level: null
Test-Time Training
TTT
parameters: {"mode":"off"}
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
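One common reading of a "warmdown" schedule is a constant LR with a linear decay to zero over the final steps. A sketch under that assumption, using the warmdown_steps value above (the PR gives only the step count, not the exact shape):

```python
def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_steps: int = 4000) -> float:
    """Constant LR, then linear warmdown to zero over the last
    `warmdown_steps` steps (assumed shape; warmdown_steps from the PR)."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```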
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Used a Triton runtime and a block-structured recurrent state path for the Fractal prototype.
parameters: {"backend":"triton","state_blocks":"auto"}
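The Triton kernel itself is not included in this summary. As a shape-level sketch only, a block-structured recurrent state update can be pictured as a block-diagonal transition where each state block evolves independently; all names and shapes here are hypothetical:

```python
import numpy as np

def block_recurrent_step(state: np.ndarray, x: np.ndarray,
                         A_blocks: np.ndarray) -> np.ndarray:
    """One recurrent step with a block-diagonal state transition
    (illustrative NumPy stand-in for the Triton state path).
    state: (n_blocks, block_dim)
    x:     (n_blocks, block_dim)  input contribution per block
    A_blocks: (n_blocks, block_dim, block_dim) per-block transitions."""
    return np.einsum("bij,bj->bi", A_blocks, state) + x

# Identity transitions reduce the step to state + x:
n_blocks, block_dim = 4, 8
A_blocks = np.stack([np.eye(block_dim)] * n_blocks)
new_state = block_recurrent_step(np.zeros((n_blocks, block_dim)),
                                 np.ones((n_blocks, block_dim)), A_blocks)
```

The block structure is what makes the state path amenable to a tiled Triton kernel: each block's update is an independent small matmul.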

Novel Contributions

  • Controlled non-record ablation replacing one middle transformer block with a Fractal recurrent primitive.
  • Comparison against a pure-attention control under the same SP1024 tokenizer, optimizer, evaluation path, and quantization sweep.
  • Demonstration that applying int8 to all large tensors largely removes quantization damage for the recurrent hybrid while staying under the 16 MB cap in the 10-minute export.
  • Longer 60-minute probe showing the recurrent primitive continues improving with more wall-clock time, though it remains outside official constraints.
  • Reusable baseline suggesting future recurrent work should use side-channel or context-state insertion rather than direct attention replacement.