PR #1769
Record: SP8192 CaseOps stack retune (MLP clip 10→12) → 1.06453
by dexhunter
val_bpb: 1.0645
Architecture: Transformer
Optimizer: —
Artifact Size: ~15.98 MB
Training Techniques
Quantization
GPTQ
parameters: {"bits":6,"scope":"MLP"}
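Novel Contributions below mention retuning the MLP GPTQ outlier clip from 10.0 to 12.0. A minimal sketch of symmetric per-channel int6 quantization with such a clip multiplier follows; the clip semantics (a multiple of the per-channel std) are an assumption, and plain round-to-nearest stands in for GPTQ's Hessian-compensated updates.

```python
import torch

def quantize_int6_clipped(w: torch.Tensor, clip: float = 12.0):
    """Symmetric per-output-channel int6 quantization with outlier clipping.

    Sketch only: assumes weights are clipped at clip * per-channel std
    before scales are computed, so a larger clip keeps more tail mass
    at the cost of coarser resolution near zero. Round-to-nearest is
    used here instead of GPTQ's Hessian-based error compensation.
    """
    std = w.std(dim=1, keepdim=True)            # per-output-channel spread
    bound = clip * std                          # assumed clip semantics
    w_clipped = w.clamp(-bound, bound)
    qmax = 2 ** (6 - 1) - 1                     # symmetric int6 range: [-31, 31]
    scale = w_clipped.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.round(w_clipped / scale).clamp(-qmax, qmax).to(torch.int8)
    return q, scale                             # dequantize as q.float() * scale
```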
Architecture
Gated Attention
Attention with learned scalar out-gates per head and quantized gating enabled.
parameters: {"init_std":0.005,"quant_gate":true}
depth recurrence
Recurrent depth structure looping layers 4-5.
parameters: {"loop_start":3,"loop_end":5,"num_loops":2}
weight tying
The CaseOps + SP8192 stack uses the same tokenizer/model setup as the base submission; no explicit weight tying is described.
parameters: null
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2000}
Sequence Length
sequence_length
parameters: {"train_length":null,"eval_length":2048}
Regularization
logit softcap
parameters: {"value":30}
weight decay
parameters: null
Novel Contributions
- Retuned MLP GPTQ outlier clipping from 10.0 to 12.0
- Preserved MLP tail mass during int6 calibration for 4x-width MLPs
- Achieved 5-seed mean val_bpb of 1.06453
- Maintained compliance with 16 MB artifact cap and 600s train/eval budgets
- Disclosed 7 seeds, with the 5 lowest-BPB seeds used for the official score (selection rule sketched below)
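The stated selection rule, made explicit as a one-liner (any example inputs would be illustrative; the PR's per-seed values are not listed here):

```python
def official_score(seed_bpbs: list[float]) -> float:
    # Mean of the 5 lowest val_bpb values among the 7 disclosed seeds.
    best_five = sorted(seed_bpbs)[:5]
    return sum(best_five) / len(best_five)
```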