PR #1569

open

[Non Record] Fractal recurrent primitive hybrid - SP1024 1xH100

by abbudjoe
val_bpb
1.3576
Architecture
Hybrid
Optimizer
Artifact Size
14,440,584 bytes

Training Techniques

Architecture
depth recurrence
Replaced a single middle transformer block with a Fractal recurrent primitive in an otherwise transformer-derived 11L/512 SP1024 model.
parameters: {"layers":11,"dimension":512,"schedule":"AAAAAPAAAAA"}
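The schedule string can be read as one letter per layer. A minimal sketch of expanding it into module types, assuming "A" denotes a standard attention block and "P" the Fractal recurrent primitive (the letter encoding is inferred from the single-middle-block swap described above, not spelled out in the PR):

```python
# Hypothetical expansion of the 11-layer schedule string into layer types.
# "A" = attention block, "P" = Fractal recurrent primitive (assumed encoding).
SCHEDULE = "AAAAAPAAAAA"

def build_layer_types(schedule: str) -> list[str]:
    mapping = {"A": "attention_block", "P": "fractal_recurrent_primitive"}
    return [mapping[c] for c in schedule]

layers = build_layer_types(SCHEDULE)
# The single "P" sits at index 5, the middle of the 11-layer stack.
```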
Quantization
int8
bits: 8
scope: all
mixed int6
bits: 6
scope: default export
Weight Averaging
EMA
parameters: {"decay":0.9965}
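A minimal EMA update sketch using the decay value from the parameters above; the actual training loop's EMA implementation is not shown in this summary:

```python
def ema_update(ema_params: list[float], params: list[float],
               decay: float = 0.9965) -> list[float]:
    """One exponential-moving-average step over flat parameter lists
    (decay taken from the PR's parameters; everything else illustrative)."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# One step starting from a zeroed average:
ema = ema_update([0.0], [1.0])  # -> [0.0035]
```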
Compression
zstd
level: null
Test-Time Training
TTT
parameters: {"mode":"off"}
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
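One common reading of a "warmdown" schedule is a constant LR with a linear decay to zero over the final steps. A sketch under that assumption, using the warmdown_steps value above (the PR gives only the step count, not the exact shape):

```python
def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_steps: int = 4000) -> float:
    """Constant LR, then linear warmdown to zero over the last
    `warmdown_steps` steps (assumed shape; warmdown_steps from the PR)."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```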
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Used a Triton runtime and a block-structured recurrent state path for the Fractal prototype.
parameters: {"backend":"triton","state_blocks":"auto"}
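The Triton kernel itself is not included in this summary. As a shape-level sketch only, a block-structured recurrent state update can be pictured as a block-diagonal transition where each state block evolves independently; all names and shapes here are hypothetical:

```python
import numpy as np

def block_recurrent_step(state: np.ndarray, x: np.ndarray,
                         A_blocks: np.ndarray) -> np.ndarray:
    """One recurrent step with a block-diagonal state transition
    (illustrative NumPy stand-in for the Triton state path).
    state: (n_blocks, block_dim)
    x:     (n_blocks, block_dim)  input contribution per block
    A_blocks: (n_blocks, block_dim, block_dim) per-block transitions."""
    return np.einsum("bij,bj->bi", A_blocks, state) + x

# Identity transitions reduce the step to state + x:
n_blocks, block_dim = 4, 8
A_blocks = np.stack([np.eye(block_dim)] * n_blocks)
new_state = block_recurrent_step(np.zeros((n_blocks, block_dim)),
                                 np.ones((n_blocks, block_dim)), A_blocks)
```

The block structure is what makes the state path amenable to a tiled Triton kernel: each block's update is an independent small matmul.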

Novel Contributions

  • Controlled non-record ablation replacing one middle transformer block with a Fractal recurrent primitive.
  • Comparison against a pure-attention control under the same SP1024 tokenizer, optimizer, evaluation path, and quantization sweep.
  • Demonstration that applying int8 to all large tensors largely removes quantization damage for the recurrent hybrid while staying under the 16 MB cap in the 10-minute export.
  • Longer 60-minute probe showing the recurrent primitive continues improving with more wall-clock time, though it remains outside official constraints.
  • Reusable baseline suggesting future recurrent work should use side-channel or context-state insertion rather than direct attention replacement.