PR #456

open

Non-record submission: Depth Recurrence + Legal Score-First TTT (10L, 1.1532 BPB)

by Christopher-Lee-McClendonView on GitHub

val_bpb

1.1532

Architecture

10-layer GPT / Transformer

Optimizer

Muon + AdamW

Artifact Size

15,980,085 bytes

Training Techniques

Architecture

BigramHash

Hashes consecutive token pairs into a fixed bucket embedding to provide cheap bigram context.

parameters: {"dimensions":10240}

SmearGate

Sigmoid gating mechanism applied to MLP outputs before residual addition.

parameters: null

XSA

Cross-layer shared attention used in the last 3 layers.

parameters: {"layers":3}

U-Net skip connections

Skip connections between paired layers (e.g., 0↔9, 1↔8) to improve residual flow.

parameters: null

MLP3x

3× expansion MLP with relu² activation.

parameters: {"expansion":3}

KV head count

Grouped-query attention with 8 attention heads and 4 KV heads (2:1 GQA).

parameters: {"heads":8,"kv_heads":4}

depth recurrence

Depth recurrence infrastructure exists but is not active in the final config; no weight sharing used.

parameters: {"unique_layers":10,"num_layers":10}

Quantization

mixed int5/int6

bits: null

scope: MLP int5, attention int6

GPTQ-lite

bits: null

scope: 75% of layers

Late QAT

bits: null

scope: full model

Optimizer

Muon

weight_decay: 0.04

momentum: null

other_params: {"used_for":"matrices"}

AdamW

weight_decay: 0.04

momentum: null

other_params: {"used_for":"tied token embeddings, scalars, and TTT","ttt_lr":0.0005}

Weight Averaging

SWA

parameters: {"start_step":4650,"interval":50}

Compression

zstd

level: 22

Evaluation

sliding window eval

parameters: {"stride":64}

Test-Time Training

score-first full-model TTT

parameters: {"chunk_size":32768,"epochs_per_chunk":1,"learning_rate":0.0005,"freeze_blocks":0,"cosine_decay":true,"persistent_across_documents":true}

LR Schedule

warmup + warmdown + cosine decay

parameters: {"warmup_steps":20,"warmdown_steps":3000,"total_steps":5200}

Sequence Length

sequence_length

train_length: 2048

eval_length: null

Regularization

weight decay

parameters: {"weight_decay":0.04}

Novel Contributions

Competition-legal score-first full-model test-time training integrated into sliding-window evaluation
Chunked evaluation loop that scores each chunk before training on it, enabling persistent adaptation across the validation set
Depth recurrence infrastructure included in code but disabled in the final configuration
Mixed int5/int6 quantization with zstd-22 compression to fit within the artifact budget