PR #456

open

Non-record submission: Depth Recurrence + Legal Score-First TTT (10L, 1.1532 BPB)

by Christopher-Lee-McClendon
val_bpb
1.1532
Architecture
10-layer GPT / Transformer
Optimizer
Muon + AdamW
Artifact Size
15,980,085 bytes

Training Techniques

Architecture
BigramHash
Hashes consecutive token pairs into a fixed bucket embedding to provide cheap bigram context.
parameters: {"dimensions":10240}
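As a rough illustration of the bucket lookup described above, the sketch below hashes each consecutive token pair into one of 10240 buckets. The multiplicative hash, the constant `HASH_MULT`, and the function names are hypothetical, and reading `dimensions` as the bucket count is an assumption; the submission's actual hash may differ.

```python
NUM_BUCKETS = 10240   # "dimensions" from the card, read here as the bucket count (assumption)
HASH_MULT = 1000003   # arbitrary odd multiplier, chosen for illustration only


def bigram_bucket(prev_token: int, token: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a (previous, current) token pair to a bucket index in [0, num_buckets)."""
    return (prev_token * HASH_MULT + token) % num_buckets


def bigram_buckets(tokens: list[int]) -> list[int]:
    """Return one bucket index per position.

    Position 0 has no predecessor, so it is assigned bucket 0 (an assumption).
    """
    if not tokens:
        return []
    out = [0]
    for prev, cur in zip(tokens, tokens[1:]):
        out.append(bigram_bucket(prev, cur))
    return out
```

Each index would then select a row of a small embedding table that is added to the token embedding, so identical bigrams always share a row and distinct bigrams may collide.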
SmearGate
Sigmoid gating mechanism applied to MLP outputs before residual addition.
parameters: null
XSA
Cross-layer shared attention used in the last 3 layers.
parameters: {"layers":3}
U-Net skip connections
Skip connections between paired layers (e.g., 0↔9, 1↔8) to improve residual flow.
parameters: null
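The pairing pattern can be sketched as follows, assuming the examples 0↔9 and 1↔8 extend symmetrically to every layer in the first half. The function names and the additive `skip_weight` are hypothetical; the submission may combine the skip differently (for example with learned per-layer weights).

```python
def unet_skip_pairs(num_layers: int) -> list[tuple[int, int]]:
    """Pair first-half layer i with second-half layer num_layers - 1 - i."""
    return [(i, num_layers - 1 - i) for i in range(num_layers // 2)]


def forward_with_skips(x, layers, skip_weight=1.0):
    """Run layers in order, feeding saved first-half outputs into the second half.

    Outputs of the first half are pushed on a stack; each second-half layer
    pops one and adds it to its input, which yields the pairing above.
    """
    n = len(layers)
    saved = []
    for i, layer in enumerate(layers):
        if i >= n - n // 2 and saved:
            x = x + skip_weight * saved.pop()  # skip from the mirrored layer
        x = layer(x)
        if i < n // 2:
            saved.append(x)
    return x
```

For the 10-layer model this gives the pairs (0, 9), (1, 8), (2, 7), (3, 6), (4, 5).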
MLP3x
3× expansion MLP with relu² activation.
parameters: {"expansion":3}
KV head count
Grouped-query attention with 8 attention heads and 4 KV heads (2:1 GQA).
parameters: {"heads":8,"kv_heads":4}
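To make the 2:1 sharing concrete, the sketch below maps each query head to the KV head it reads from. It assumes contiguous grouping (heads 0 and 1 share KV head 0, and so on), which is the common convention but is not stated in the card; the function name is hypothetical.

```python
def kv_head_for_query_head(q_head: int, num_heads: int = 8, num_kv_heads: int = 4) -> int:
    """Return the KV head index used by a given query head under grouped-query attention."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    if not 0 <= q_head < num_heads:
        raise ValueError("q_head out of range")
    group_size = num_heads // num_kv_heads  # 2 query heads per KV head here
    return q_head // group_size
```

With 8 query heads and 4 KV heads, the key and value projections are half the size they would be under standard multi-head attention.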
depth recurrence
Depth recurrence infrastructure exists but is not active in the final config; no weight sharing used.
parameters: {"unique_layers":10,"num_layers":10}
Quantization
mixed int5/int6
bits: null
scope: MLP int5, attention int6
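A minimal sketch of what a mixed-width scheme could look like, assuming symmetric round-to-nearest quantization with a single scale per group of values. Only the bit widths (5 for MLP, 6 for attention) come from the card; the scale granularity, the symmetric range, and all names are assumptions.

```python
BITS_BY_GROUP = {"mlp": 5, "attention": 6}  # from the card's scope line


def quantize_symmetric(values: list[float], bits: int) -> tuple[list[int], float]:
    """Quantize to signed integers in [-qmax, qmax] with qmax = 2**(bits-1) - 1."""
    qmax = 2 ** (bits - 1) - 1               # 15 for int5, 31 for int6
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float values from integers and their shared scale."""
    return [qi * scale for qi in q]
```

Under this scheme the worst-case rounding error per value is half a scale step, so the 6-bit attention weights are reconstructed roughly twice as finely as the 5-bit MLP weights.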
GPTQ-lite
bits: null
scope: 75% of layers
Late QAT
bits: null
scope: full model
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"used_for":"matrices"}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"tied token embeddings, scalars, and TTT","ttt_lr":0.0005}
Weight Averaging
SWA
parameters: {"start_step":4650,"interval":50}
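The snapshot schedule and the averaging itself might look like the sketch below. It assumes snapshots are taken at `start_step` and every `interval` steps after it, through the final step of a 5200-step run, and that they are combined with a uniform running mean; the class and function names are hypothetical.

```python
def swa_snapshot_steps(start_step: int = 4650, interval: int = 50,
                       total_steps: int = 5200) -> list[int]:
    """Steps at which weights are folded into the average (final step included, by assumption)."""
    return list(range(start_step, total_steps + 1, interval))


class RunningAverage:
    """Uniform running mean over weight snapshots, stored as flat lists of floats."""

    def __init__(self) -> None:
        self.count = 0
        self.mean: list[float] | None = None

    def update(self, weights: list[float]) -> None:
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            # Incremental mean: m += (w - m) / n
            self.mean = [m + (w - m) / self.count for m, w in zip(self.mean, weights)]
```

Under these assumptions the average would cover 12 snapshots, all taken late in the warmdown phase.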
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
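One common way to implement strided evaluation is sketched below: each window advances by `stride` tokens, and only the tokens not already scored by an earlier window are counted, so every token is scored once with as much left context as the window allows. The window length of 2048 is borrowed from the training length as an assumption, since the card leaves the evaluation length unset; the function name is hypothetical.

```python
def sliding_windows(num_tokens: int, window: int = 2048,
                    stride: int = 64) -> list[tuple[int, int, int]]:
    """Return (start, end, score_from) spans.

    The model reads tokens[start:end]; only positions score_from..end-1
    contribute to the loss, so no token is counted twice.
    """
    spans = []
    scored_until = 0
    start = 0
    while scored_until < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_until))
        scored_until = end
        start += stride
    return spans
```

After the first window, each span scores at most 64 new tokens, which means roughly 32 forward passes per 2048 tokens of validation data.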
Test-Time Training
score-first full-model TTT
parameters: {"chunk_size":32768,"epochs_per_chunk":1,"learning_rate":0.0005,"freeze_blocks":0,"cosine_decay":true,"persistent_across_documents":true}
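The ordering that makes this procedure score-first can be shown with a toy model. In the sketch below a Laplace-smoothed unigram counter stands in for the real network, and "training" just means adding counts; the class and function names are hypothetical, and only the chunk size, the single pass per chunk, and the persistence of state across chunks come from the parameters above.

```python
import math


class ToyUnigramModel:
    """Stand-in for the real network: a Laplace-smoothed unigram counter."""

    def __init__(self, vocab_size: int) -> None:
        self.counts = [1] * vocab_size
        self.total = vocab_size

    def score_bits(self, tokens: list[int]) -> float:
        """Total negative log2-likelihood of tokens under the current state."""
        return sum(-math.log2(self.counts[t] / self.total) for t in tokens)

    def train(self, tokens: list[int]) -> None:
        """One pass over the chunk (epochs_per_chunk = 1)."""
        for t in tokens:
            self.counts[t] += 1
            self.total += 1


def score_first_ttt(model: ToyUnigramModel, tokens: list[int],
                    chunk_size: int = 32768) -> float:
    """Score each chunk before adapting on it; adapted state carries forward."""
    total_bits = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_bits += model.score_bits(chunk)  # 1) score with a state that has not seen this chunk
        model.train(chunk)                     # 2) only then adapt on the chunk
    return total_bits
```

Because every token is scored before the model has trained on it, the reported loss never benefits from having seen the targets, while later chunks still gain from adaptation on earlier ones.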
LR Schedule
warmup + warmdown + cosine decay
parameters: {"warmup_steps":20,"warmdown_steps":3000,"total_steps":5200}
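A sketch of how these step counts could combine into a learning-rate multiplier, assuming linear warmup, a flat plateau, and a linear warmdown to zero over the final 3000 steps. The shapes of the ramps are assumptions, and the cosine component is left out because the card does not say which phase it applies to; the function name is hypothetical.

```python
def lr_multiplier(step: int, warmup_steps: int = 20, warmdown_steps: int = 3000,
                  total_steps: int = 5200) -> float:
    """Multiplier applied to the base learning rate at a given training step."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps          # linear ramp up to 1.0
    warmdown_start = total_steps - warmdown_steps  # step 2200 with the card's values
    if step < warmdown_start:
        return 1.0                                 # plateau
    return max(0.0, (total_steps - step) / warmdown_steps)  # linear ramp to 0.0
```

With these values the plateau spans steps 20 to 2199, and more than half of training is spent in warmdown.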
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
weight decay
parameters: {"weight_decay":0.04}

Novel Contributions

  • Competition-legal score-first full-model test-time training integrated into sliding-window evaluation
  • Chunked evaluation loop that scores each chunk before training on it, enabling persistent adaptation across the validation set
  • Depth recurrence infrastructure included in code but disabled in the final configuration
  • Mixed int5/int6 quantization with zstd-22 compression to fit within the artifact budget