PR #2081

open

Add LoopFullAttnRes + LoopQ + XSA submission

val_bpb

1.1887

Architecture

Transformer

Optimizer

—

Artifact Size

~14.24 MB

Training Techniques

Architecture

weight tying

Recurrent middle section shares parameters across loop passes in a prelude-core-coda layout.

parameters: {"loops":3,"shared_blocks":2}

depth recurrence

The model runs a recurrent core for multiple loop passes between the prelude and coda blocks.

parameters: {"prelude_blocks":2,"core_blocks":2,"coda_blocks":2,"loop_passes":3}
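The prelude-core-coda layout and its weight sharing can be illustrated by listing the order in which blocks are applied. This is a minimal sketch that uses only the parameter values listed above; the function name `build_schedule` and the block naming scheme are hypothetical and are not taken from the PR.

```python
def build_schedule(prelude_blocks=2, core_blocks=2, coda_blocks=2, loop_passes=3):
    """Return the ordered list of block names applied in one forward pass."""
    prelude = [f"prelude.{i}" for i in range(prelude_blocks)]
    core = [f"core.{i}" for i in range(core_blocks)]
    coda = [f"coda.{i}" for i in range(coda_blocks)]
    # The same core blocks are revisited on every loop pass, so their
    # parameters are stored once but applied loop_passes times.
    return prelude + core * loop_passes + coda


schedule = build_schedule()
effective_depth = len(schedule)      # block applications per forward pass
unique_blocks = len(set(schedule))   # distinct parameter sets that must be stored

print(schedule)
print(effective_depth, unique_blocks)
```

With the listed values, the core accounts for 6 of the block applications while contributing only 2 stored blocks, which is how the layout trades compute for artifact size.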

XSA

Exclusive self-attention removes the self-aligned component from attention output in the recurrent core.

parameters: null
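The card does not give a formula for XSA. One plausible reading, assumed here, is that the "self-aligned component" is the projection of a token's attention output onto that token's own value vector, which is then subtracted. The sketch below shows only that projection step for a single token; the helper names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_self_component(attn_out, self_value, eps=1e-8):
    """Subtract from attn_out its projection onto self_value.

    Assumption: this projection is what the card calls the
    self-aligned component; the PR may define it differently.
    """
    scale = dot(attn_out, self_value) / (dot(self_value, self_value) + eps)
    return [o - scale * v for o, v in zip(attn_out, self_value)]


y = [3.0, 4.0]   # attention output for one token
v = [1.0, 0.0]   # that token's own value vector
z = remove_self_component(y, v)
print(z)
```

After the subtraction, the result is orthogonal to the token's own value vector up to the `eps` stabilizer, so only information gathered from other positions remains along that direction.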

attention residual mixing

Full attention residuals mix the earlier embedding with residual states from previous loop passes and depths before each attention and MLP sublayer.

parameters: null

learned depth queries

Loop-specific learned queries are used to route over depth/loop history.

parameters: null
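As a rough illustration of routing over depth/loop history, the sketch below gives each loop pass its own query vector and uses it to form a softmax-weighted mix of stored states. The scoring rule (scaled dot product), the toy values, and the names `route_over_history` and `loop_queries` are assumptions for illustration; the card lists no parameters for this technique.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_over_history(query, history):
    """Mix history states using softmax weights from query-state dot products."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * h for q, h in zip(query, state)) for state in history]
    weights = softmax(scores)
    dim = len(query)
    mixed = [sum(w * state[d] for w, state in zip(weights, history)) for d in range(dim)]
    return mixed, weights


# Toy history: embedding, state after pass 1, state after pass 2 (hypothetical values).
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# One query per loop pass; in a real model these would be trained parameters.
loop_queries = {0: [0.0, 0.0], 1: [2.0, 0.0], 2: [0.0, 2.0]}

mixed, weights = route_over_history(loop_queries[0], history)
print(weights, mixed)
```

A zero query weights every stored state equally, while a trained query can favor the embedding or a particular earlier pass.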

Quantization

int8

bits: 8

scope: model

Compression

zlib

level: null
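The int8 plus zlib pipeline can be sketched with the standard library alone. This example assumes symmetric per-tensor quantization and compression level 9; the card specifies neither the quantization scheme nor the zlib level, and the serialization layout here is hypothetical.

```python
import struct
import zlib


def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: returns (int values, scale)."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def pack(q, scale, level=9):
    """Serialize the scale (float32) plus the int8 payload, then zlib-compress."""
    raw = struct.pack("<f", scale) + struct.pack(f"<{len(q)}b", *q)
    return zlib.compress(raw, level)


def unpack(blob):
    """Decompress and dequantize back to floats."""
    raw = zlib.decompress(blob)
    (scale,) = struct.unpack("<f", raw[:4])
    q = struct.unpack(f"<{len(raw) - 4}b", raw[4:])
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0, 1.0, -0.75] * 100  # toy tensor
q, scale = quantize_int8(weights)
blob = pack(q, scale)
restored = unpack(blob)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(blob), max_err)
```

The reconstruction error per weight is bounded by roughly half the quantization step, and the compressed size depends heavily on how repetitive the int8 payload is.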

Sequence Length

sequence_length

train_length: 2048

eval_length: null

Other

other

10-minute / 16MB leaderboard-format submission with no test-time training.

parameters: {"wallclock_seconds":600,"artifact_limit_bytes":16000000}
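The listed limits can be turned into a simple budget check. This sketch assumes the reported "~14.24 MB" uses decimal megabytes; the function name `within_budget` is hypothetical and the check is not the leaderboard's own validation.

```python
ARTIFACT_LIMIT_BYTES = 16_000_000
WALLCLOCK_LIMIT_SECONDS = 600


def within_budget(artifact_bytes, train_seconds):
    """True if both the artifact size and the training time fit the limits."""
    return (artifact_bytes <= ARTIFACT_LIMIT_BYTES
            and train_seconds <= WALLCLOCK_LIMIT_SECONDS)


# Assumption: "~14.24 MB" means 14.24 * 10**6 bytes.
approx_artifact_bytes = round(14.24 * 1_000_000)
headroom = ARTIFACT_LIMIT_BYTES - approx_artifact_bytes
print(within_budget(approx_artifact_bytes, 600), headroom)
```

Under this assumption the artifact leaves about 1.76 million bytes of headroom; if the figure were in binary mebibytes instead, the artifact would still fit but with less margin.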