PR #2081

Status: open

Add LoopFullAttnRes + LoopQ + XSA submission

by maxrubin629
val_bpb: 1.1887
Architecture: Transformer
Artifact Size: ~14.24 MB

Training Techniques

Architecture
  • weight tying — Recurrent middle section shares parameters across loop passes in a prelude-core-coda layout. parameters: {"loops": 3, "shared_blocks": 2} (see the first sketch after this list)
  • depth recurrence — The model runs a recurrent core for multiple loop passes between prelude and coda blocks. parameters: {"prelude_blocks": 2, "core_blocks": 2, "coda_blocks": 2, "loop_passes": 3} (also covered by the first sketch)
  • XSA — Exclusive self-attention removes the self-aligned component from the attention output in the recurrent core. (second sketch below)
  • attention residual mixing — Full attention residuals mix the prior embedding and earlier loop/depth residual states before the attention and MLP sublayers. (third sketch below)
  • learned depth queries — Loop-specific learned queries route over the depth/loop history. (also covered by the third sketch)
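
The prelude-core-coda layout is concrete enough from the listed parameters to sketch. A minimal PyTorch sketch follows; the class name, the `block_fn` factory argument, and the returned `history` list are illustrative assumptions, while the block counts and loop count come from the parameters above.

```python
import torch.nn as nn

class LoopedTransformer(nn.Module):
    """Prelude-core-coda layout: the core's parameters are created once and
    reused on every loop pass, so the effective depth is
    prelude + core * passes + coda (here 2 + 2*3 + 2 = 10 blocks)
    while only 6 blocks of weights exist."""
    def __init__(self, block_fn, prelude_blocks=2, core_blocks=2,
                 coda_blocks=2, loop_passes=3):
        super().__init__()
        self.prelude = nn.ModuleList([block_fn() for _ in range(prelude_blocks)])
        self.core = nn.ModuleList([block_fn() for _ in range(core_blocks)])
        self.coda = nn.ModuleList([block_fn() for _ in range(coda_blocks)])
        self.loop_passes = loop_passes

    def forward(self, x):
        for blk in self.prelude:
            x = blk(x)
        history = [x]  # per-pass residual states, consumed by the third sketch
        for _ in range(self.loop_passes):
            for blk in self.core:  # same weights every pass (the "weight tying")
                x = blk(x)
            history.append(x)
        for blk in self.coda:
            x = blk(x)
        return x, history
```

Reusing the core's weights across passes is what the listing files under "weight tying": depth grows with loop passes, parameter count does not.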
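The XSA entry is terse, so the following is one plausible reading rather than the PR's confirmed formulation: after the softmax, subtract each token's own diagonal contribution a_ii · v_i from the attention output. The function name and causal-mask handling are assumptions.

```python
import math
import torch

def exclusive_self_attention(q, k, v, causal=True):
    """Scaled dot-product attention with the self-aligned (diagonal) term
    removed from the output: out_i = sum_j a_ij v_j  -  a_ii v_i.
    q, k, v: (batch, heads, seq, head_dim). Assumed reading of "XSA"."""
    scale = 1.0 / math.sqrt(q.size(-1))
    scores = (q @ k.transpose(-2, -1)) * scale              # (B, H, T, T)
    if causal:
        T = scores.size(-1)
        future = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=q.device), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    attn = scores.softmax(dim=-1)
    out = attn @ v
    # Remove the self-aligned component: the weight each token puts on itself.
    self_w = attn.diagonal(dim1=-2, dim2=-1).unsqueeze(-1)  # a_ii: (B, H, T, 1)
    return out - self_w * v
```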
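The last two entries plausibly work together: earlier loop states (plus the original embedding) are mixed back into the stream before each sublayer, with a loop-specific learned query choosing the mixture. The sketch below is a minimal combined reading under that assumption; the module name, shapes, and softmax routing over the depth axis are all illustrative, and `history` is the list produced by the first sketch.

```python
import torch
import torch.nn as nn

class DepthHistoryMixer(nn.Module):
    """Mixes the current state with the token embedding and earlier loop/depth
    residual states, weighted per token by a per-loop learned query."""
    def __init__(self, d_model, loop_passes=3):
        super().__init__()
        # One learned query per loop pass: the "loop-specific learned queries".
        self.depth_queries = nn.Parameter(torch.randn(loop_passes, d_model) * 0.02)

    def forward(self, x, embed, history, loop_idx):
        # Candidates routed over: the original embedding + earlier loop states.
        cands = torch.stack([embed] + history, dim=0)      # (S, B, T, D)
        q = self.depth_queries[loop_idx]                   # (D,)
        scores = torch.einsum("sbtd,d->sbt", cands, q)     # per-token depth scores
        weights = scores.softmax(dim=0)                    # route over the S axis
        mixed = torch.einsum("sbt,sbtd->btd", weights, cands)
        # The mixed history is folded into the residual stream before the
        # attention / MLP sublayer consumes x ("attention residual mixing").
        return x + mixed
```

In this reading the mixer would run inside the core loop, once before each sublayer, with `loop_idx` selecting that pass's query.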
Quantization
  • int8 (bits: 8, scope: model)

Compression
  • zlib (level: null; see the packing sketch below)
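
A minimal sketch of how int8 quantization plus zlib could yield the ~14.24 MB artifact under the 16 MB cap. The per-tensor symmetric scheme, the helper names, and the torch.save packing format are assumptions, not taken from the PR; zlib is left at its default level to match "level: null".

```python
import io
import zlib
import torch

def pack_state_dict(state_dict):
    """Per-tensor symmetric int8 quantization, then zlib over the bytes."""
    payload = {}
    for name, t in state_dict.items():
        t = t.detach().float()
        scale = t.abs().max().clamp(min=1e-8) / 127.0   # symmetric per-tensor scale
        q = (t / scale).round().clamp(-127, 127).to(torch.int8)
        payload[name] = (q, scale)
    buf = io.BytesIO()
    torch.save(payload, buf)
    return zlib.compress(buf.getvalue())                # default compression level

def unpack_state_dict(blob):
    """Inverse: decompress, then dequantize each tensor back to float."""
    payload = torch.load(io.BytesIO(zlib.decompress(blob)))
    return {name: q.float() * scale for name, (q, scale) in payload.items()}
```

Presumably val_bpb is measured on the unpacked weights after this round-trip, matching the "post-int8 evaluation" noted in the contributions list.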
Sequence Length
  • train_length: 2048, eval_length: null

Other
  • 10-minute / 16 MB leaderboard-format submission with no test-time training. parameters: {"wallclock_seconds": 600, "artifact_limit_bytes": 16000000}

Novel Contributions

  • Prelude-core-coda recurrent layout
  • Full attention residual mixing
  • Loop-specific learned depth queries
  • Exclusive self-attention in the recurrent core
  • Post-int8 evaluation with 3-seed mean reporting