val_bpb: 1.0812
Architecture: Transformer
Optimizer: —
Artifact Size: 15,983,090 bytes
Training Techniques
Architecture
Gated Attention
Pass-gated recurrent attention in the looped band: a learned recurrent-attention gate modulates reused blocks on later passes so repeats are not exact copies.
parameters: {"looped_band_layers":"3..5","recur_attn_gate":1,"recur_attn_gate_scale":0.5}
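A minimal sketch of the pass-gating idea, not the submission's code: a single-head self-attention whose output is scaled by a sigmoid gate only on recurrent passes. The weight shapes, the sigmoid choice, and the gate initialisation at 0.5 (taken from recur_attn_gate_scale) are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pass_gated_attention(x, Wq, Wk, Wv, gate, pass_idx):
    """Single-head self-attention with a per-channel recurrence gate.

    The gate is applied only when pass_idx > 0, so layers reused in the
    looped band do not compute an exact repeat of their first pass.
    (Hypothetical interface; gate init of 0.5 mirrors recur_attn_gate_scale.)
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v
    if pass_idx > 0:
        out = out * (1.0 / (1.0 + np.exp(-gate)))  # sigmoid gate
    return x + out  # residual connection
```

The gate leaves the first pass untouched, so pre-loop behaviour is identical to a plain residual attention block.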
Depth Recurrence
Recurrent SP8192 stack with looping over a subset of layers and delayed loop activation.
parameters: {"enable_looping_at_step":2600}
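A sketch of depth recurrence with delayed loop activation, under the assumption that the model is a list of callable layers: the band of layers 3..5 is run for extra passes, but only once training has reached enable_looping_at_step = 2600. The two-pass count is illustrative.

```python
def looped_forward(x, layers, step, band=(3, 5), n_passes=2,
                   enable_looping_at_step=2600):
    """Run a layer stack, looping the layers in `band` (inclusive) for
    `n_passes` passes once `step` reaches enable_looping_at_step.

    Hypothetical interface: `layers` is a list of callables x -> x.
    """
    lo, hi = band
    for layer in layers[:lo]:          # pre-band layers, run once
        x = layer(x)
    passes = n_passes if step >= enable_looping_at_step else 1
    for _ in range(passes):            # looped band
        for layer in layers[lo:hi + 1]:
            x = layer(x)
    for layer in layers[hi + 1:]:      # post-band layers, run once
        x = layer(x)
    return x
```

Before the onset step the band runs once, so early training matches the non-recurrent stack.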
Quantization
mixed int6/int8
bits: 6 and 8 (mixed)
scope: attention and MLP matrices, embeddings
int8
bits: 8
scope: small control tensors
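A sketch of symmetric per-tensor quantisation at the two listed bit widths (6 for the large attention/MLP/embedding matrices, 8 for the small control tensors). The rounding and clipping scheme is an assumption, not the submission's actual packing code.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantisation to `bits` bits.

    A sketch of the mixed int6/int8 scheme: one float scale per tensor,
    integer codes in [-2**(bits-1), 2**(bits-1) - 1].
    """
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(w).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Per-tensor symmetric scaling keeps the stored artifact to one integer array plus one scale per tensor, which is what makes the 16 MB budget reachable.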
Test-Time Training
full TTT
parameters: {"learning_rate":0.005,"epochs":3}
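Full test-time training, sketched with the listed learning rate and epoch count: the weights are fine-tuned on the test document's own chunks before scoring. The gradient callable is a stand-in for a real backward pass; the plain-SGD update rule is an assumption.

```python
def full_ttt(w, chunks, grad_fn, lr=0.005, epochs=3):
    """Fine-tune weights `w` on the test chunks themselves (full TTT).

    `grad_fn(w, chunk)` is a hypothetical callable returning dLoss/dw
    for one chunk; lr=0.005 and epochs=3 match the listed parameters.
    """
    for _ in range(epochs):
        for chunk in chunks:
            w = w - lr * grad_fn(w, chunk)
    return w
```

With a toy quadratic loss the update contracts toward the chunk target by a factor of (1 - 2*lr) per step, which is easy to verify by hand.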
Other
other
Easy-chunk legal TTT with lighter adaptation on easy chunks and stronger adaptation on harder chunks.
parameters: {"ttt_easy_chunk_ratio":0.998,"ttt_easy_chunk_epochs":1,"ttt_outlier_drop_fraction":0.03,"ttt_score_weight_power":0.5}
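A sketch of how the listed easy-chunk parameters could fit together; the quantile-based thresholds and the exact weighting rule are assumptions, with only the parameter names and values taken from the entry above. Chunks below the 0.998 difficulty quantile get one light epoch, the hardest 3% are dropped as outliers, and remaining chunks are weighted by score**0.5.

```python
import numpy as np

def easy_chunk_ttt_schedule(scores, easy_ratio=0.998, easy_epochs=1,
                            hard_epochs=3, outlier_drop_fraction=0.03,
                            score_weight_power=0.5):
    """Per-chunk TTT schedule from non-negative difficulty scores.

    Returns (epochs, weights):
      * chunks at or below the easy_ratio quantile get easy_epochs,
        the rest get hard_epochs;
      * the top outlier_drop_fraction of scores get weight 0;
      * remaining weights are scores ** score_weight_power, normalised.
    """
    scores = np.asarray(scores, dtype=np.float64)
    easy_cut = np.quantile(scores, easy_ratio)
    epochs = np.where(scores <= easy_cut, easy_epochs, hard_epochs)
    outlier_cut = np.quantile(scores, 1.0 - outlier_drop_fraction)
    weights = scores ** score_weight_power
    weights[scores > outlier_cut] = 0.0
    if weights.sum() > 0:
        weights = weights / weights.sum()
    return epochs, weights
```

The extreme easy_ratio of 0.998 means almost every chunk gets the light schedule, concentrating the expensive adaptation on a handful of hard chunks.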
other
Late step-based loop-onset sweep over candidate onset steps to find the best activation point.
parameters: {"swept_values":[1600,2000,2400,2600,2800,3000],"best_value":2600}
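The sweep itself is a simple argmin over candidate onset steps. A sketch, where `train_and_eval(onset) -> val_bpb` is a hypothetical callable standing in for a full training run at that onset:

```python
def sweep_loop_onset(train_and_eval,
                     candidates=(1600, 2000, 2400, 2600, 2800, 3000)):
    """Try each candidate loop-onset step and keep the one with the
    lowest validation bpb. Candidates match the swept_values above."""
    results = {c: train_and_eval(c) for c in candidates}
    best = min(results, key=results.get)
    return best, results
```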
other
Control-int8 packing for small scalar/control tensors to fit under the 16 MB submission limit.
parameters: {"tensors":["attn_scale","mlp_scale","resid_mix","recur_attn_delta","q_gain","skip_weights","skip_gates"]}
Novel Contributions
- Pass-gated recurrent attention in the looped band
- Easy-chunk legal TTT recipe
- Late step-based loop-onset sweep showing 2600 as best among tested values
- Control-int8 packing to fit under the 16 MB limit
- Competitive non-record SP8192 submission, with a best single-seed result of 1.08065825 bpb