PR #1968 (open)

Notable non-record: Control+Tail score-first TTT on accepted SP8192 stack

by Gotnhub
val_bpb: 1.0773
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,990,737 bytes

Training Techniques

Quantization: GPTQ
  bits: 6
  scope: weights and embeddings
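
For orientation, here is a minimal sketch of 6-bit weight quantization. It uses plain per-row round-to-nearest, which is a deliberate simplification of GPTQ (GPTQ additionally corrects quantization error column-by-column using second-order activation statistics); all names are illustrative, not the submission's code.

```python
import torch

def quantize_6bit_rtn(w: torch.Tensor):
    """Round-to-nearest 6-bit quantization of one 2-D weight matrix,
    with a per-output-row scale and offset. A simplified stand-in for
    GPTQ, which also applies Hessian-based error correction."""
    qmax = 2 ** 6 - 1                                  # 6 bits -> levels 0..63
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    q = torch.round((w - w_min) / scale).clamp(0, qmax)
    return q.to(torch.uint8), scale, w_min             # store codes + row stats

def dequantize(q, scale, w_min):
    """Reconstruct an approximate float matrix from the 6-bit codes."""
    return q.to(torch.float32) * scale + w_min
```
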
Architecture: depth recurrence
  Accepted SP8192 stack: layers 3-5 of the 11-layer Transformer form a 3-layer block that is applied recurrently.
  parameters: {"layers": 3}
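
A minimal sketch of that depth recurrence, assuming a standard block-list layout: layers 3-5 share one 3-layer core that the forward pass can loop over. Class and argument names here are hypothetical, not the submission's actual code.

```python
import torch.nn as nn

class RecurrentDepthTransformer(nn.Module):
    """Illustrative 11-layer model whose 3-layer core (layers 3-5)
    can be applied more than once per forward pass."""
    def __init__(self, Block, n_pre=3, n_core=3, n_post=5, core_loops=1):
        super().__init__()
        self.pre = nn.ModuleList(Block() for _ in range(n_pre))    # layers 0-2
        self.core = nn.ModuleList(Block() for _ in range(n_core))  # layers 3-5, shared
        self.post = nn.ModuleList(Block() for _ in range(n_post))  # layers 6-10
        self.core_loops = core_loops

    def forward(self, x):
        for blk in self.pre:
            x = blk(x)
        for _ in range(self.core_loops):   # depth recurrence: reuse the same weights
            for blk in self.core:
                x = blk(x)
        for blk in self.post:
            x = blk(x)
        return x
```
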
Architecture: GQA
  Grouped-query attention with 8 query heads and 4 KV heads.
  parameters: {"attention_heads": 8, "kv_heads": 4}
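
A minimal GQA sketch under the stated 8-query-head / 4-KV-head split, implemented by repeating each KV head across its group of two query heads; weight shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def gqa_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: 8 query heads share 4 KV heads,
    so each KV head serves n_heads // n_kv_heads = 2 query heads."""
    B, T, C = x.shape
    hd = C // n_heads                                   # per-head dim
    q = (x @ wq).view(B, T, n_heads, hd).transpose(1, 2)
    k = (x @ wk).view(B, T, n_kv_heads, hd).transpose(1, 2)
    v = (x @ wv).view(B, T, n_kv_heads, hd).transpose(1, 2)
    rep = n_heads // n_kv_heads
    k = k.repeat_interleave(rep, dim=1)                 # broadcast KV to query heads
    v = v.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(B, T, C)
```
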
Optimizer: Muon
  weight_decay: null
  momentum: 0.98
  other_params: {"min_lr": 0.1}
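
For reference, a sketch of a Muon step following the public reference implementation: the momentum buffer is approximately orthogonalized with a quintic Newton-Schulz iteration before being applied. This omits the reference's shape-dependent LR rescaling and is not the submission's exact optimizer code.

```python
import torch

@torch.no_grad()
def newton_schulz5(G, steps=5):
    """Quintic Newton-Schulz iteration that approximately orthogonalizes G
    (coefficients from the public Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    if G.size(0) > G.size(1):
        X = X.T
    X = X / (X.norm() + 1e-7)              # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)

@torch.no_grad()
def muon_step(params, bufs, lr, momentum=0.98):
    """One Muon update for 2-D weight matrices: nesterov momentum,
    then an orthogonalized update. No weight decay (weight_decay: null)."""
    for p, buf in zip(params, bufs):
        g = p.grad
        buf.mul_(momentum).add_(g)
        g = g.add(buf, alpha=momentum)     # nesterov-style lookahead
        u = newton_schulz5(g)
        p.add_(u, alpha=-lr)               # reference impl also rescales by matrix shape
```
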
LR Schedule: warmdown
  parameters: {"min_lr": 0.1}
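
A sketch of a warmdown schedule, assuming the common reading of min_lr = 0.1 as a fraction of the peak LR; if the submission instead means an absolute LR of 0.1, only the floor term changes.

```python
def warmdown_lr(step, total_steps, warmdown_steps, peak_lr, min_lr_frac=0.1):
    """Constant peak LR, then a linear 'warmdown' to min_lr_frac * peak_lr
    over the final warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if warmdown_steps <= 0 or step < decay_start:
        return peak_lr
    frac = min(1.0, (step - decay_start) / warmdown_steps)  # 0 -> 1 across warmdown
    return peak_lr * (1.0 - frac * (1.0 - min_lr_frac))
```
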
Test-Time Training: score-first TTT
  parameters: {"learning_rate": 0.005, "epochs": 3, "train_last_n_blocks": 3, "updated_parameters": 8681560, "mode": "control_tail_all"}
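
The score-first ordering is the load-bearing detail: each evaluation chunk is scored with the current weights before any gradient step touches it, so the measured bpb never benefits from training on the chunk being scored. A sketch, with a hypothetical chunk iterator and plain SGD standing in for whatever optimizer the submission actually uses:

```python
import math
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, trainable_params, lr=0.005, epochs=3):
    """Score-first TTT loop: for each eval chunk, (1) score it with the
    current weights, (2) only then take gradient steps on it."""
    opt = torch.optim.SGD(trainable_params, lr=lr)
    total_nll, total_bytes = 0.0, 0
    for x, y, n_bytes in chunks:           # token ids, targets, byte count
        model.eval()
        with torch.no_grad():              # (1) score before any update
            logits = model(x)
            nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                  y.view(-1), reduction="sum")
        total_nll += nll.item()
        total_bytes += n_bytes
        model.train()
        for _ in range(epochs):            # (2) then adapt on the same chunk
            out = model(x)
            loss = F.cross_entropy(out.view(-1, out.size(-1)), y.view(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return total_nll / (math.log(2) * total_bytes)   # bits per byte
```
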
Compression: brotli
  level: null
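
level: null presumably means the library default (quality 11 for the Python brotli bindings). A minimal example with a hypothetical artifact path:

```python
import brotli

with open("artifact.bin", "rb") as f:       # hypothetical artifact filename
    raw = f.read()
compressed = brotli.compress(raw)           # level: null -> library default quality
with open("artifact.bin.br", "wb") as f:
    f.write(compressed)
assert brotli.decompress(compressed) == raw  # round-trip check
```
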
Sequence Length
  train_length: 8192
  eval_length: 8192

Novel Contributions

  • Non-record, unlimited-compute-style submission documenting a legal score-first TTT ablation
  • Control+tail-all TTT that updates only the global control/gating parameters and the last 3 transformer blocks (see the parameter-selection sketch after this list)
  • Uses only existing model weights with no added adapter modules
  • Shows that control-only TTT is too weak, while control+tail-all improves over the accepted base family on seed 42
  • Maintains score-first evaluation order without SLOT, pre-quant TTT, ETLB, n-gram cache, or logit-bias shortcuts
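
A sketch of the control+tail-all parameter selection: freeze everything, then re-enable the global control/gating parameters plus all parameters of the last 3 blocks. The name patterns ("gate"/"control") and the model.blocks attribute are assumptions; the submission's actual module names are not shown here.

```python
def select_control_tail_all(model, n_tail_blocks=3):
    """Freeze all parameters, then re-enable (a) global control/gating
    parameters and (b) everything in the last n_tail_blocks transformer
    blocks. Name matching below is hypothetical."""
    for p in model.parameters():
        p.requires_grad_(False)
    n_blocks = len(model.blocks)                          # assumed attribute
    tail = [f"blocks.{i}." for i in range(n_blocks - n_tail_blocks, n_blocks)]
    trainable = []
    for name, p in model.named_parameters():
        is_control = "gate" in name or "control" in name  # assumed naming
        is_tail = any(name.startswith(pref) for pref in tail)
        if is_control or is_tail:
            p.requires_grad_(True)
            trainable.append(p)
    return trainable                                      # feed to the TTT optimizer
```
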