val_bpb: 1.0812
Architecture: Transformer
Optimizer: —
Artifact Size: 15,983,090 bytes
Training Techniques
Architecture
Gated Attention
Pass-gated recurrent attention in the looped band: a learned recurrent-attention gate modulates reused blocks on later passes so repeats are not exact copies.
parameters: {"looped_band_layers":"3..5","recur_attn_gate":1,"recur_attn_gate_scale":0.5}
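A minimal sketch of the pass-gating idea, not the submission's code: a single-head self-attention whose output is scaled by a sigmoid gate only on recurrent passes. The weight shapes, the sigmoid choice, and the gate initialisation at 0.5 (taken from recur_attn_gate_scale) are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pass_gated_attention(x, Wq, Wk, Wv, gate, pass_idx):
    """Single-head self-attention with a per-channel recurrence gate.

    The gate is applied only when pass_idx > 0, so layers reused in the
    looped band do not compute an exact repeat of their first pass.
    (Hypothetical interface; gate init of 0.5 mirrors recur_attn_gate_scale.)
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v
    if pass_idx > 0:
        out = out * (1.0 / (1.0 + np.exp(-gate)))  # sigmoid gate
    return x + out  # residual connection
```

The gate leaves the first pass untouched, so pre-loop behaviour is identical to a plain residual attention block.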
Depth Recurrence
Recurrent SP8192 stack with looping over a subset of layers and delayed loop activation.
parameters: {"enable_looping_at_step":2600}
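A sketch of depth recurrence with delayed loop activation, under the assumption that the model is a list of callable layers: the band of layers 3..5 is run for extra passes, but only once training has reached enable_looping_at_step = 2600. The two-pass count is illustrative.

```python
def looped_forward(x, layers, step, band=(3, 5), n_passes=2,
                   enable_looping_at_step=2600):
    """Run a layer stack, looping the layers in `band` (inclusive) for
    `n_passes` passes once `step` reaches enable_looping_at_step.

    Hypothetical interface: `layers` is a list of callables x -> x.
    """
    lo, hi = band
    for layer in layers[:lo]:          # pre-band layers, run once
        x = layer(x)
    passes = n_passes if step >= enable_looping_at_step else 1
    for _ in range(passes):            # looped band
        for layer in layers[lo:hi + 1]:
            x = layer(x)
    for layer in layers[hi + 1:]:      # post-band layers, run once
        x = layer(x)
    return x
```

Before the onset step the band runs once, so early training matches the non-recurrent stack.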
Quantization
mixed int6/int8
bits: 6 and 8 (mixed)
scope: attention and MLP matrices, embeddings
int8
bits: 8
scope: small control tensors
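A sketch of symmetric per-tensor quantisation at the two listed bit widths (6 for the large attention/MLP/embedding matrices, 8 for the small control tensors). The rounding and clipping scheme is an assumption, not the submission's actual packing code.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantisation to `bits` bits.

    A sketch of the mixed int6/int8 scheme: one float scale per tensor,
    integer codes in [-2**(bits-1), 2**(bits-1) - 1].
    """
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(w).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Per-tensor symmetric scaling keeps the stored artifact to one integer array plus one scale per tensor, which is what makes the 16 MB budget reachable.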
Test-Time Training
full TTT
parameters: {"learning_rate":0.005,"epochs":3}
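Full test-time training, sketched with the listed learning rate and epoch count: the weights are fine-tuned on the test document's own chunks before scoring. The gradient callable is a stand-in for a real backward pass; the plain-SGD update rule is an assumption.

```python
def full_ttt(w, chunks, grad_fn, lr=0.005, epochs=3):
    """Fine-tune weights `w` on the test chunks themselves (full TTT).

    `grad_fn(w, chunk)` is a hypothetical callable returning dLoss/dw
    for one chunk; lr=0.005 and epochs=3 match the listed parameters.
    """
    for _ in range(epochs):
        for chunk in chunks:
            w = w - lr * grad_fn(w, chunk)
    return w
```

With a toy quadratic loss the update contracts toward the chunk target by a factor of (1 - 2*lr) per step, which is easy to verify by hand.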
Other
other
Easy-chunk legal TTT with lighter adaptation on easy chunks and stronger adaptation on harder chunks.
parameters: {"ttt_easy_chunk_ratio":0.998,"ttt_easy_chunk_epochs":1,"ttt_outlier_drop_fraction":0.03,"ttt_score_weight_power":0.5}
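A sketch of how the listed easy-chunk parameters could fit together; the quantile-based thresholds and the exact weighting rule are assumptions, with only the parameter names and values taken from the entry above. Chunks below the 0.998 difficulty quantile get one light epoch, the hardest 3% are dropped as outliers, and remaining chunks are weighted by score**0.5.

```python
import numpy as np

def easy_chunk_ttt_schedule(scores, easy_ratio=0.998, easy_epochs=1,
                            hard_epochs=3, outlier_drop_fraction=0.03,
                            score_weight_power=0.5):
    """Per-chunk TTT schedule from non-negative difficulty scores.

    Returns (epochs, weights):
      * chunks at or below the easy_ratio quantile get easy_epochs,
        the rest get hard_epochs;
      * the top outlier_drop_fraction of scores get weight 0;
      * remaining weights are scores ** score_weight_power, normalised.
    """
    scores = np.asarray(scores, dtype=np.float64)
    easy_cut = np.quantile(scores, easy_ratio)
    epochs = np.where(scores <= easy_cut, easy_epochs, hard_epochs)
    outlier_cut = np.quantile(scores, 1.0 - outlier_drop_fraction)
    weights = scores ** score_weight_power
    weights[scores > outlier_cut] = 0.0
    if weights.sum() > 0:
        weights = weights / weights.sum()
    return epochs, weights
```

The extreme easy_ratio of 0.998 means almost every chunk gets the light schedule, concentrating the expensive adaptation on a handful of hard chunks.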
other
Late step-based loop-onset sweep over candidate onset steps to find the best activation point.
parameters: {"swept_values":[1600,2000,2400,2600,2800,3000],"best_value":2600}
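The sweep itself is a simple argmin over candidate onset steps. A sketch, where `train_and_eval(onset) -> val_bpb` is a hypothetical callable standing in for a full training run at that onset:

```python
def sweep_loop_onset(train_and_eval,
                     candidates=(1600, 2000, 2400, 2600, 2800, 3000)):
    """Try each candidate loop-onset step and keep the one with the
    lowest validation bpb. Candidates match the swept_values above."""
    results = {c: train_and_eval(c) for c in candidates}
    best = min(results, key=results.get)
    return best, results
```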
other
Control-int8 packing for small scalar/control tensors to fit under the 16 MB submission limit.
parameters: {"tensors":["attn_scale","mlp_scale","resid_mix","recur_attn_delta","q_gain","skip_weights","skip_gates"]}
Novel Contributions
- Pass-gated recurrent attention in the looped band
- Easy-chunk legal TTT recipe
- Late step-based loop-onset sweep showing 2600 as best among tested values
- Control-int8 packing to fit under the 16 MB limit
- Competitive non-record SP8192 submission, with a best single-seed result of 1.08065825 bpb