PR #1676
openRecord-track: Trajectory-State Readout + Muon 0.98 + Legal TTT (1.0788)
by aazizyan
val_bpb
1.0788
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB
Training Techniques
Architecture
depth recurrence
Looped blocks apply three recurrence passes over layers 3-5, with separate encoder and decoder passes surrounding the loop.
parameters: {"loops":3,"layers":[3,4,5]}
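A minimal sketch of the looped-block forward pass, matching the reported config (`loops=3` over layers 3-5). The layer functions and scalar state here are stand-ins for real transformer blocks, not the PR's code:

```python
# Hypothetical depth-recurrence sketch: encoder layers run once, the looped
# block (layers 3-5) is reapplied for three passes, then the decoder runs.

def make_layer(scale):
    # stand-in for a transformer block: a simple affine map on a scalar state
    return lambda h: h * scale + 0.1

encoder = [make_layer(0.9), make_layer(0.9), make_layer(0.9)]   # layers 0-2
looped  = [make_layer(0.8), make_layer(0.8), make_layer(0.8)]   # layers 3-5
decoder = [make_layer(0.9)]                                     # later layers

def forward(h, loops=3):
    for layer in encoder:
        h = layer(h)
    for _ in range(loops):          # {"loops": 3, "layers": [3, 4, 5]}
        for layer in looped:
            h = layer(h)
    for layer in decoder:
        h = layer(h)
    return h

print(forward(1.0))
```

The recurrence count is a pure forward-pass knob: the same weights are reused, so parameter count is unchanged while effective depth grows.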
weight tying
Tied embeddings are used.
parameters: null
Partial RoPE
Partial rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
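A sketch of partial RoPE with the reported split (rotate 16 of 64 head dimensions). The `10000**(-i/rot)` frequency convention is an assumption borrowed from standard RoPE implementations:

```python
import math

# Partial RoPE sketch: rotate only the first `rot` of `dim` head dimensions
# by position-dependent angles; the remaining dimensions pass through as-is.

def partial_rope(x, pos, rot=16, dim=64):
    out = list(x)
    for i in range(0, rot, 2):
        theta = pos * (10000.0 ** (-i / rot))   # assumed frequency schedule
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i], out[i + 1] = a * c - b * s, a * s + b * c
    return out

x = [1.0] * 64
y = partial_rope(x, pos=5)
print(y[:2], y[16:18])  # rotated pair vs. untouched tail
```

Each rotated pair keeps its norm (it is a 2D rotation), while the unrotated tail gives the model position-free channels.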
LeakyReLU
LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
U-Net skip connections
Sigmoid-gated U-Net style skip connections.
parameters: null
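A hedged sketch of a sigmoid-gated U-Net skip: a decoder activation is mixed with the matching encoder activation through a gate. The scalar gate parameter `g` stands in for whatever learned weight the PR actually uses:

```python
import math

# Sigmoid-gated skip connection: the encoder activation is added to the
# decoder activation, scaled by sigmoid(g) so the model can learn to
# open or close each skip.

def gated_skip(decoder_h, encoder_h, g):
    gate = 1.0 / (1.0 + math.exp(-g))   # sigmoid(g) in (0, 1)
    return decoder_h + gate * encoder_h

print(gated_skip(0.5, 1.0, g=0.0))  # gate = 0.5
```

Initializing `g` near zero starts every skip half-open; strongly negative `g` effectively removes the skip.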
trajectory-state readout
Grouped readout blends hidden states from all three recurrence passes instead of using only the final pass.
parameters: {"groups":16,"learned_parameters":32}
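An illustrative sketch of the trajectory-state readout: hidden states saved after each of the three recurrence passes are blended with per-group weights before the final readout. With 16 groups and 3 passes this naive form has 48 mixing scalars, whereas the PR reports 32 learned parameters, so the exact parameterization below is an assumption:

```python
# Grouped trajectory-state readout sketch: blend the hidden state from every
# recurrence pass, with one learned mixing weight per (pass, group) pair,
# instead of reading out only the final pass.

def trajectory_readout(pass_states, weights, groups=16):
    # pass_states: one hidden vector per pass; dim must divide by `groups`
    dim = len(pass_states[0])
    gsize = dim // groups
    out = [0.0] * dim
    for g in range(groups):
        for p, state in enumerate(pass_states):
            w = weights[p][g]
            for j in range(g * gsize, (g + 1) * gsize):
                out[j] += w * state[j]
    return out

states = [[1.0] * 64, [2.0] * 64, [3.0] * 64]          # 3 passes
weights = [[0.2] * 16, [0.3] * 16, [0.5] * 16]          # per-pass, per-group
out = trajectory_readout(states, weights)
print(out[0])  # 0.2*1 + 0.3*2 + 0.5*3 = 2.3
```

The idea is that intermediate passes carry information the final pass overwrites; a cheap learned blend recovers it at readout time.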
Optimizer
Muon
weight_decay: null
momentum: 0.98
other_params: {"row_normalized":true,"newton_schulz_steps":5}
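A rough sketch of the Muon update on a single 2x2 weight: momentum 0.98 accumulates the gradient, then a 5-step Newton-Schulz iteration approximately orthogonalizes the momentum before it is applied. The quintic coefficients follow the public Muon reference; the tiny hand-rolled linear algebra is for illustration only and omits the row normalization the PR mentions:

```python
# Muon-style update sketch (2x2 toy): momentum buffer -> Newton-Schulz
# orthogonalization -> SGD-like step along the orthogonalized direction.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def newton_schulz(G, steps=5):
    a, b, c = 3.4445, -4.7750, 2.0315     # quintic coefficients from Muon
    norm = sum(v * v for row in G for v in row) ** 0.5
    X = [[v / (norm + 1e-7) for v in row] for row in G]
    for _ in range(steps):
        A = matmul(X, transpose(X))
        A2 = matmul(A, A)
        B = [[b * A[i][j] + c * A2[i][j] for j in range(2)] for i in range(2)]
        BX = matmul(B, X)
        X = [[a * X[i][j] + BX[i][j] for j in range(2)] for i in range(2)]
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.98):
    buf = [[momentum * buf[i][j] + grad[i][j] for j in range(2)] for i in range(2)]
    O = newton_schulz(buf)
    W = [[W[i][j] - lr * O[i][j] for j in range(2)] for i in range(2)]
    return W, buf

W = newton_schulz([[2.0, 0.0], [0.0, 0.5]])   # singular values pushed toward 1
W2, buf = muon_step([[1.0, 0.0], [0.0, 1.0]],
                    [[0.1, 0.0], [0.0, 0.1]],
                    [[0.0, 0.0], [0.0, 0.0]])
print(W, W2)
```

The quintic iteration does not converge exactly to orthonormal; it deliberately lands singular values in a band around 1, which is sufficient for the update direction.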
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
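A minimal EMA-of-weights sketch with the reported decay 0.9965: a shadow copy is updated after every optimizer step and is what gets evaluated:

```python
# Exponential moving average of weights: shadow <- d*shadow + (1-d)*weights,
# applied once per optimizer step; evaluation uses `shadow`, not `weights`.

def ema_update(shadow, weights, decay=0.9965):
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]

shadow = [0.0, 0.0]
for step in range(3):
    weights = [1.0, 2.0]          # pretend the optimizer produced these
    shadow = ema_update(shadow, weights)
print(shadow)
```

With decay 0.9965 the effective averaging window is roughly 1/(1-0.9965) ≈ 286 steps.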
Evaluation
sliding window eval
parameters: {"causal":true}
Test-Time Training
full TTT
parameters: {"score_before_update":true,"single_pass":true,"learning_rate":0.005,"momentum":0.9,"epochs_per_chunk":3}
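A schematic of the legal "score first, then update" full-parameter TTT loop: each eval chunk is scored with the current weights before any gradient step, so the reported loss never comes from weights adapted on that same chunk. The model here is a toy scalar predictor; only the hyperparameters (lr 0.005, momentum 0.9, 3 epochs per chunk, single pass over chunks) mirror the PR's reported values:

```python
# Score-before-update TTT sketch: one pass over eval chunks, scoring each
# chunk BEFORE adapting on it with SGD + momentum.

def ttt_eval(chunks, w=0.0, lr=0.005, momentum=0.9, epochs=3):
    total, vel = 0.0, 0.0
    for x, y in chunks:
        pred = w * x
        total += (pred - y) ** 2          # score with pre-update weights
        for _ in range(epochs):           # then adapt on the chunk
            grad = 2 * (w * x - y) * x
            vel = momentum * vel + grad
            w -= lr * vel
    return total / len(chunks), w

loss, w = ttt_eval([(1.0, 1.0), (1.0, 1.0), (1.0, 1.0)])
print(loss, w)
```

Because later chunks are scored with weights adapted on earlier chunks, the measured loss still improves over the eval stream, which is the point of TTT, without ever leaking a chunk into its own score.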
Quantization
GPTQ
bits: 6
scope: model weights
GPTQ
bits: 8
scope: embeddings
Regularization
logit softcap
parameters: {"value":30}
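Logit softcapping with the reported value 30 squashes logits through a scaled tanh, bounding them to (-30, 30) while staying near-identity for small logits:

```python
import math

# Logit softcap: cap * tanh(logit / cap). Approximately the identity for
# |logit| << cap, saturating smoothly at +/- cap.

def softcap(logit, cap=30.0):
    return cap * math.tanh(logit / cap)

print(softcap(5.0), softcap(100.0))
```

This keeps the loss well-behaved when logits spike without the hard clipping that would zero gradients.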
layerwise LN scale
parameters: null
LR Schedule
warmdown
parameters: {"final_fraction":0.72}
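A hedged sketch of the warmdown schedule. The PR reports only `final_fraction: 0.72`; the code below assumes this means the linear decay to zero occupies the final 72% of training, which is one plausible reading of the parameter:

```python
# Warmdown (trapezoidal) LR schedule sketch: hold peak LR, then decay
# linearly to zero over the final `final_fraction` of training (assumed
# interpretation of the reported parameter).

def warmdown_lr(step, total_steps, peak_lr=1.0, final_fraction=0.72):
    decay_start = int(total_steps * (1.0 - final_fraction))
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / (total_steps - decay_start)
    return peak_lr * (1.0 - progress)

print(warmdown_lr(0, 1000), warmdown_lr(640, 1000), warmdown_lr(1000, 1000))
```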
Sequence Length
sequence_length
train_length: 32000
eval_length: null
Novel Contributions
- Trajectory-state readout that learns a grouped correction from all three recurrence passes
- Step-based loop activation at step 2000 to remove a wallclock curriculum confound
- Adoption of Muon momentum 0.98 and tighter GPTQ calibration budget from prior work
- Legal score-first full-parameter TTT under the competition constraints