PR #1744 (open)

[submission] SP8192 + QK5 + Freeze10 Loss-Gated Legal TTT (1.08885521)

by MuhammedErinArchitectureView on GitHub
val_bpb: 1.0889
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,994,383 bytes

Training Techniques

Quantization
GPTQ
parameters: {"bits":6,"scope":"matrices; int8 embeddings"}
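A minimal sketch of the bit-width/scope split described above. Plain round-to-nearest quantization stands in for GPTQ, which additionally applies Hessian-based error compensation when rounding; all shapes here are illustrative, not taken from the submission artifact.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor round-to-nearest quantization to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1            # 31 for 6-bit, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                      # dequantized weights

rng = np.random.default_rng(0)
attn_matrix = rng.normal(size=(512, 512))  # weight matrices -> 6-bit
embeddings = rng.normal(size=(8192, 512))  # embeddings -> int8

attn_q = quantize_symmetric(attn_matrix, bits=6)
emb_q = quantize_symmetric(embeddings, bits=8)
```

The lower bit width goes to the bulk of the parameters (the weight matrices), while the embeddings keep the cheaper-to-store int8 format.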
Architecture
reduced KV head count (grouped-query attention)
SP8192-family Transformer with 11 layers, hidden size 512, 8 attention heads, and 4 KV heads.
parameters: {"layers":11,"dimensions":512,"heads":8,"kv_heads":4}
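A shape walk-through of the grouped-query attention layout implied by the parameters above (8 query heads sharing 4 KV heads at hidden size 512); this is a sketch of the head bookkeeping, not the submission's implementation.

```python
import numpy as np

layers, dim, heads, kv_heads = 11, 512, 8, 4
head_dim = dim // heads                  # 64
group = heads // kv_heads                # 2 query heads per KV head

seq = 16
q = np.zeros((heads, seq, head_dim))     # 8 query heads
k = np.zeros((kv_heads, seq, head_dim))  # only 4 K heads are stored
v = np.zeros((kv_heads, seq, head_dim))

# Each stored KV head is repeated to serve its group of query heads.
k_full = np.repeat(k, group, axis=0)
v_full = np.repeat(v, group, axis=0)
assert k_full.shape == q.shape

# KV cache cost relative to full multi-head attention:
kv_cache_ratio = kv_heads / heads        # 0.5
```

Halving the KV heads halves KV-cache memory at inference while keeping the full set of query heads.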
Test-Time Training
score-first TTT
parameters: {"freeze_blocks":10,"param_mode":"all","loss_gate_mode":"running_mean","loss_gate_margin":0,"final_block_only":true}
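A hedged sketch of the gating logic the parameters above describe: the first 10 blocks stay frozen, only the final block is adapted, and an update window is skipped unless its loss exceeds a running mean of past window losses (margin 0). The `apply_update` callback is a hypothetical stand-in for the gradient step on the final block.

```python
FREEZE_BLOCKS = 10        # first 10 transformer blocks stay frozen
LOSS_GATE_MARGIN = 0.0    # "loss_gate_margin": 0

def ttt_step(losses_so_far, window_loss, apply_update):
    """Decide whether to spend a test-time-training update on this window.

    losses_so_far: list of previously observed window losses (mutated)
    window_loss:   loss of the current window
    apply_update:  callback adapting only the final (unfrozen) block
    Returns True if an update was applied, False if the gate skipped it.
    """
    if losses_so_far:
        running_mean = sum(losses_so_far) / len(losses_so_far)
        # Running-mean gate: only update on harder-than-average windows.
        if window_loss <= running_mean + LOSS_GATE_MARGIN:
            losses_so_far.append(window_loss)
            return False                  # skip low-value update
    losses_so_far.append(window_loss)
    apply_update()                        # one gradient step, final block only
    return True
```

The first window always triggers an update (there is no running mean yet); afterwards, easy windows are recorded but skipped, so the per-window compute budget is spent where the loss suggests adaptation will pay off.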
Regularization
weight decay
parameters: null

Novel Contributions

  • Legal 8xH100 / 10 minute / 16 MB submission in the SP8192 + QK5 + LegalTTT family
  • Freeze the first 10 transformer blocks during test-time training, adapting only the final block
  • Use a running-mean loss gate to skip low-value update windows
  • Demonstrate a competitive single-seed legal submission under the runtime and size constraints