PR #1968 (open)

Notable non-record: Control+Tail score-first TTT on accepted SP8192 stack

by Gotnhub
val_bpb: 1.0773
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,990,737 bytes

Training Techniques

Quantization: GPTQ
  bits: 6
  scope: weights and embeddings
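
For orientation, here is a minimal sketch of 6-bit weight quantization. It uses plain per-row round-to-nearest, which is a deliberate simplification of GPTQ (GPTQ additionally corrects quantization error column-by-column using second-order activation statistics); all names are illustrative, not the submission's code.

```python
import torch

def quantize_6bit_rtn(w: torch.Tensor):
    """Round-to-nearest 6-bit quantization of one 2-D weight matrix,
    with a per-output-row scale and offset. A simplified stand-in for
    GPTQ, which also applies Hessian-based error correction."""
    qmax = 2 ** 6 - 1                                  # 6 bits -> levels 0..63
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    q = torch.round((w - w_min) / scale).clamp(0, qmax)
    return q.to(torch.uint8), scale, w_min             # store codes + row stats

def dequantize(q, scale, w_min):
    """Reconstruct an approximate float matrix from the 6-bit codes."""
    return q.to(torch.float32) * scale + w_min
```
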
Architecture: depth recurrence
  Accepted SP8192 stack: layers 3-5 of the 11-layer Transformer form a 3-layer block that is applied recurrently.
  parameters: {"layers": 3}
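
A minimal sketch of that depth recurrence, assuming a standard block-list layout: layers 3-5 share one 3-layer core that the forward pass can loop over. Class and argument names here are hypothetical, not the submission's actual code.

```python
import torch.nn as nn

class RecurrentDepthTransformer(nn.Module):
    """Illustrative 11-layer model whose 3-layer core (layers 3-5)
    can be applied more than once per forward pass."""
    def __init__(self, Block, n_pre=3, n_core=3, n_post=5, core_loops=1):
        super().__init__()
        self.pre = nn.ModuleList(Block() for _ in range(n_pre))    # layers 0-2
        self.core = nn.ModuleList(Block() for _ in range(n_core))  # layers 3-5, shared
        self.post = nn.ModuleList(Block() for _ in range(n_post))  # layers 6-10
        self.core_loops = core_loops

    def forward(self, x):
        for blk in self.pre:
            x = blk(x)
        for _ in range(self.core_loops):   # depth recurrence: reuse the same weights
            for blk in self.core:
                x = blk(x)
        for blk in self.post:
            x = blk(x)
        return x
```
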
Architecture: GQA
  Grouped-query attention with 8 query heads and 4 KV heads.
  parameters: {"attention_heads": 8, "kv_heads": 4}
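
A minimal GQA sketch under the stated 8-query-head / 4-KV-head split, implemented by repeating each KV head across its group of two query heads; weight shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def gqa_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: 8 query heads share 4 KV heads,
    so each KV head serves n_heads // n_kv_heads = 2 query heads."""
    B, T, C = x.shape
    hd = C // n_heads                                   # per-head dim
    q = (x @ wq).view(B, T, n_heads, hd).transpose(1, 2)
    k = (x @ wk).view(B, T, n_kv_heads, hd).transpose(1, 2)
    v = (x @ wv).view(B, T, n_kv_heads, hd).transpose(1, 2)
    rep = n_heads // n_kv_heads
    k = k.repeat_interleave(rep, dim=1)                 # broadcast KV to query heads
    v = v.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(B, T, C)
```
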
Optimizer: Muon
  weight_decay: null
  momentum: 0.98
  other_params: {"min_lr": 0.1}
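
For reference, a sketch of a Muon step following the public reference implementation: the momentum buffer is approximately orthogonalized with a quintic Newton-Schulz iteration before being applied. This omits the reference's shape-dependent LR rescaling and is not the submission's exact optimizer code.

```python
import torch

@torch.no_grad()
def newton_schulz5(G, steps=5):
    """Quintic Newton-Schulz iteration that approximately orthogonalizes G
    (coefficients from the public Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    if G.size(0) > G.size(1):
        X = X.T
    X = X / (X.norm() + 1e-7)              # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)

@torch.no_grad()
def muon_step(params, bufs, lr, momentum=0.98):
    """One Muon update for 2-D weight matrices: nesterov momentum,
    then an orthogonalized update. No weight decay (weight_decay: null)."""
    for p, buf in zip(params, bufs):
        g = p.grad
        buf.mul_(momentum).add_(g)
        g = g.add(buf, alpha=momentum)     # nesterov-style lookahead
        u = newton_schulz5(g)
        p.add_(u, alpha=-lr)               # reference impl also rescales by matrix shape
```
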
LR Schedule: warmdown
  parameters: {"min_lr": 0.1}
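
A sketch of a warmdown schedule, assuming the common reading of min_lr = 0.1 as a fraction of the peak LR; if the submission instead means an absolute LR of 0.1, only the floor term changes.

```python
def warmdown_lr(step, total_steps, warmdown_steps, peak_lr, min_lr_frac=0.1):
    """Constant peak LR, then a linear 'warmdown' to min_lr_frac * peak_lr
    over the final warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if warmdown_steps <= 0 or step < decay_start:
        return peak_lr
    frac = min(1.0, (step - decay_start) / warmdown_steps)  # 0 -> 1 across warmdown
    return peak_lr * (1.0 - frac * (1.0 - min_lr_frac))
```
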
Test-Time Training: score-first TTT
  parameters: {"learning_rate": 0.005, "epochs": 3, "train_last_n_blocks": 3, "updated_parameters": 8681560, "mode": "control_tail_all"}
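
The score-first ordering is the load-bearing detail: each evaluation chunk is scored with the current weights before any gradient step touches it, so the measured bpb never benefits from training on the chunk being scored. A sketch, with a hypothetical chunk iterator and plain SGD standing in for whatever optimizer the submission actually uses:

```python
import math
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, trainable_params, lr=0.005, epochs=3):
    """Score-first TTT loop: for each eval chunk, (1) score it with the
    current weights, (2) only then take gradient steps on it."""
    opt = torch.optim.SGD(trainable_params, lr=lr)
    total_nll, total_bytes = 0.0, 0
    for x, y, n_bytes in chunks:           # token ids, targets, byte count
        model.eval()
        with torch.no_grad():              # (1) score before any update
            logits = model(x)
            nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                  y.view(-1), reduction="sum")
        total_nll += nll.item()
        total_bytes += n_bytes
        model.train()
        for _ in range(epochs):            # (2) then adapt on the same chunk
            out = model(x)
            loss = F.cross_entropy(out.view(-1, out.size(-1)), y.view(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return total_nll / (math.log(2) * total_bytes)   # bits per byte
```
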
Compression: brotli
  level: null
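
level: null presumably means the library default (quality 11 for the Python brotli bindings). A minimal example with a hypothetical artifact path:

```python
import brotli

with open("artifact.bin", "rb") as f:       # hypothetical artifact filename
    raw = f.read()
compressed = brotli.compress(raw)           # level: null -> library default quality
with open("artifact.bin.br", "wb") as f:
    f.write(compressed)
assert brotli.decompress(compressed) == raw  # round-trip check
```
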
Sequence Length
  train_length: 8192
  eval_length: 8192

Novel Contributions

  • Non-record, unlimited-compute-style submission documenting a legal score-first TTT ablation
  • Control+tail-all TTT that updates only the global control/gating parameters and the last 3 transformer blocks (see the parameter-selection sketch after this list)
  • Uses only existing model weights with no added adapter modules
  • Shows that control-only TTT is too weak, while control+tail-all improves over the accepted base family on seed 42
  • Maintains score-first evaluation order without SLOT, pre-quant TTT, ETLB, n-gram cache, or logit-bias shortcuts
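
A sketch of the control+tail-all parameter selection: freeze everything, then re-enable the global control/gating parameters plus all parameters of the last 3 blocks. The name patterns ("gate"/"control") and the model.blocks attribute are assumptions; the submission's actual module names are not shown here.

```python
def select_control_tail_all(model, n_tail_blocks=3):
    """Freeze all parameters, then re-enable (a) global control/gating
    parameters and (b) everything in the last n_tail_blocks transformer
    blocks. Name matching below is hypothetical."""
    for p in model.parameters():
        p.requires_grad_(False)
    n_blocks = len(model.blocks)                          # assumed attribute
    tail = [f"blocks.{i}." for i in range(n_blocks - n_tail_blocks, n_blocks)]
    trainable = []
    for name, p in model.named_parameters():
        is_control = "gate" in name or "control" in name  # assumed naming
        is_tail = any(name.startswith(pref) for pref in tail)
        if is_control or is_tail:
            p.requires_grad_(True)
            trainable.append(p)
    return trainable                                      # feed to the TTT optimizer
```
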