PR #1676
openRecord-track: Trajectory-State Readout + Muon 0.98 + Legal TTT (1.0788)
by aazizyan
val_bpb
1.0788
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB
Training Techniques
Architecture
depth recurrence
Looped blocks apply three recurrence passes over layers 3-5, with separate encoder and decoder passes surrounding the loop.
parameters: {"loops":3,"layers":[3,4,5]}
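A minimal sketch of the looped-block forward pass, matching the reported config (`loops=3` over layers 3-5). The layer functions and scalar state here are stand-ins for real transformer blocks, not the PR's code:

```python
# Hypothetical depth-recurrence sketch: encoder layers run once, the looped
# block (layers 3-5) is reapplied for three passes, then the decoder runs.

def make_layer(scale):
    # stand-in for a transformer block: a simple affine map on a scalar state
    return lambda h: h * scale + 0.1

encoder = [make_layer(0.9), make_layer(0.9), make_layer(0.9)]   # layers 0-2
looped  = [make_layer(0.8), make_layer(0.8), make_layer(0.8)]   # layers 3-5
decoder = [make_layer(0.9)]                                     # later layers

def forward(h, loops=3):
    for layer in encoder:
        h = layer(h)
    for _ in range(loops):          # {"loops": 3, "layers": [3, 4, 5]}
        for layer in looped:
            h = layer(h)
    for layer in decoder:
        h = layer(h)
    return h

print(forward(1.0))
```

The recurrence count is a pure forward-pass knob: the same weights are reused, so parameter count is unchanged while effective depth grows.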
weight tying
Tied embeddings are used.
parameters: null
Partial RoPE
Partial rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
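A sketch of partial RoPE with the reported split (rotate 16 of 64 head dimensions). The `10000**(-i/rot)` frequency convention is an assumption borrowed from standard RoPE implementations:

```python
import math

# Partial RoPE sketch: rotate only the first `rot` of `dim` head dimensions
# by position-dependent angles; the remaining dimensions pass through as-is.

def partial_rope(x, pos, rot=16, dim=64):
    out = list(x)
    for i in range(0, rot, 2):
        theta = pos * (10000.0 ** (-i / rot))   # assumed frequency schedule
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i], out[i + 1] = a * c - b * s, a * s + b * c
    return out

x = [1.0] * 64
y = partial_rope(x, pos=5)
print(y[:2], y[16:18])  # rotated pair vs. untouched tail
```

Each rotated pair keeps its norm (it is a 2D rotation), while the unrotated tail gives the model position-free channels.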
LeakyReLU
LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
U-Net skip connections
Sigmoid-gated U-Net style skip connections.
parameters: null
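A hedged sketch of a sigmoid-gated U-Net skip: a decoder activation is mixed with the matching encoder activation through a gate. The scalar gate parameter `g` stands in for whatever learned weight the PR actually uses:

```python
import math

# Sigmoid-gated skip connection: the encoder activation is added to the
# decoder activation, scaled by sigmoid(g) so the model can learn to
# open or close each skip.

def gated_skip(decoder_h, encoder_h, g):
    gate = 1.0 / (1.0 + math.exp(-g))   # sigmoid(g) in (0, 1)
    return decoder_h + gate * encoder_h

print(gated_skip(0.5, 1.0, g=0.0))  # gate = 0.5
```

Initializing `g` near zero starts every skip half-open; strongly negative `g` effectively removes the skip.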
trajectory-state readout
Grouped readout blends hidden states from all three recurrence passes instead of using only the final pass.
parameters: {"groups":16,"learned_parameters":32}
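An illustrative sketch of the trajectory-state readout: hidden states saved after each of the three recurrence passes are blended with per-group weights before the final readout. With 16 groups and 3 passes this naive form has 48 mixing scalars, whereas the PR reports 32 learned parameters, so the exact parameterization below is an assumption:

```python
# Grouped trajectory-state readout sketch: blend the hidden state from every
# recurrence pass, with one learned mixing weight per (pass, group) pair,
# instead of reading out only the final pass.

def trajectory_readout(pass_states, weights, groups=16):
    # pass_states: one hidden vector per pass; dim must divide by `groups`
    dim = len(pass_states[0])
    gsize = dim // groups
    out = [0.0] * dim
    for g in range(groups):
        for p, state in enumerate(pass_states):
            w = weights[p][g]
            for j in range(g * gsize, (g + 1) * gsize):
                out[j] += w * state[j]
    return out

states = [[1.0] * 64, [2.0] * 64, [3.0] * 64]          # 3 passes
weights = [[0.2] * 16, [0.3] * 16, [0.5] * 16]          # per-pass, per-group
out = trajectory_readout(states, weights)
print(out[0])  # 0.2*1 + 0.3*2 + 0.5*3 = 2.3
```

The idea is that intermediate passes carry information the final pass overwrites; a cheap learned blend recovers it at readout time.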
Optimizer
Muon
weight_decay: null
momentum: 0.98
other_params: {"row_normalized":true,"newton_schulz_steps":5}
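A rough sketch of the Muon update on a single 2x2 weight: momentum 0.98 accumulates the gradient, then a 5-step Newton-Schulz iteration approximately orthogonalizes the momentum before it is applied. The quintic coefficients follow the public Muon reference; the tiny hand-rolled linear algebra is for illustration only and omits the row normalization the PR mentions:

```python
# Muon-style update sketch (2x2 toy): momentum buffer -> Newton-Schulz
# orthogonalization -> SGD-like step along the orthogonalized direction.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def newton_schulz(G, steps=5):
    a, b, c = 3.4445, -4.7750, 2.0315     # quintic coefficients from Muon
    norm = sum(v * v for row in G for v in row) ** 0.5
    X = [[v / (norm + 1e-7) for v in row] for row in G]
    for _ in range(steps):
        A = matmul(X, transpose(X))
        A2 = matmul(A, A)
        B = [[b * A[i][j] + c * A2[i][j] for j in range(2)] for i in range(2)]
        BX = matmul(B, X)
        X = [[a * X[i][j] + BX[i][j] for j in range(2)] for i in range(2)]
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.98):
    buf = [[momentum * buf[i][j] + grad[i][j] for j in range(2)] for i in range(2)]
    O = newton_schulz(buf)
    W = [[W[i][j] - lr * O[i][j] for j in range(2)] for i in range(2)]
    return W, buf

W = newton_schulz([[2.0, 0.0], [0.0, 0.5]])   # singular values pushed toward 1
W2, buf = muon_step([[1.0, 0.0], [0.0, 1.0]],
                    [[0.1, 0.0], [0.0, 0.1]],
                    [[0.0, 0.0], [0.0, 0.0]])
print(W, W2)
```

The quintic iteration does not converge exactly to orthonormal; it deliberately lands singular values in a band around 1, which is sufficient for the update direction.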
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
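A minimal EMA-of-weights sketch with the reported decay 0.9965: a shadow copy is updated after every optimizer step and is what gets evaluated:

```python
# Exponential moving average of weights: shadow <- d*shadow + (1-d)*weights,
# applied once per optimizer step; evaluation uses `shadow`, not `weights`.

def ema_update(shadow, weights, decay=0.9965):
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]

shadow = [0.0, 0.0]
for step in range(3):
    weights = [1.0, 2.0]          # pretend the optimizer produced these
    shadow = ema_update(shadow, weights)
print(shadow)
```

With decay 0.9965 the effective averaging window is roughly 1/(1-0.9965) ≈ 286 steps.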
Evaluation
sliding window eval
parameters: {"causal":true}
Test-Time Training
full TTT
parameters: {"score_before_update":true,"single_pass":true,"learning_rate":0.005,"momentum":0.9,"epochs_per_chunk":3}
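A schematic of the legal "score first, then update" full-parameter TTT loop: each eval chunk is scored with the current weights before any gradient step, so the reported loss never comes from weights adapted on that same chunk. The model here is a toy scalar predictor; only the hyperparameters (lr 0.005, momentum 0.9, 3 epochs per chunk, single pass over chunks) mirror the PR's reported values:

```python
# Score-before-update TTT sketch: one pass over eval chunks, scoring each
# chunk BEFORE adapting on it with SGD + momentum.

def ttt_eval(chunks, w=0.0, lr=0.005, momentum=0.9, epochs=3):
    total, vel = 0.0, 0.0
    for x, y in chunks:
        pred = w * x
        total += (pred - y) ** 2          # score with pre-update weights
        for _ in range(epochs):           # then adapt on the chunk
            grad = 2 * (w * x - y) * x
            vel = momentum * vel + grad
            w -= lr * vel
    return total / len(chunks), w

loss, w = ttt_eval([(1.0, 1.0), (1.0, 1.0), (1.0, 1.0)])
print(loss, w)
```

Because later chunks are scored with weights adapted on earlier chunks, the measured loss still improves over the eval stream, which is the point of TTT, without ever leaking a chunk into its own score.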
Quantization
GPTQ
bits: 6
scope: model weights
GPTQ
bits: 8
scope: embeddings
Regularization
logit softcap
parameters: {"value":30}
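Logit softcapping with the reported value 30 squashes logits through a scaled tanh, bounding them to (-30, 30) while staying near-identity for small logits:

```python
import math

# Logit softcap: cap * tanh(logit / cap). Approximately the identity for
# |logit| << cap, saturating smoothly at +/- cap.

def softcap(logit, cap=30.0):
    return cap * math.tanh(logit / cap)

print(softcap(5.0), softcap(100.0))
```

This keeps the loss well-behaved when logits spike without the hard clipping that would zero gradients.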
layerwise LN scale
parameters: null
LR Schedule
warmdown
parameters: {"final_fraction":0.72}
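A hedged sketch of the warmdown schedule. The PR reports only `final_fraction: 0.72`; the code below assumes this means the linear decay to zero occupies the final 72% of training, which is one plausible reading of the parameter:

```python
# Warmdown (trapezoidal) LR schedule sketch: hold peak LR, then decay
# linearly to zero over the final `final_fraction` of training (assumed
# interpretation of the reported parameter).

def warmdown_lr(step, total_steps, peak_lr=1.0, final_fraction=0.72):
    decay_start = int(total_steps * (1.0 - final_fraction))
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / (total_steps - decay_start)
    return peak_lr * (1.0 - progress)

print(warmdown_lr(0, 1000), warmdown_lr(640, 1000), warmdown_lr(1000, 1000))
```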
Sequence Length
sequence_length
train_length: 32000
eval_length: null
Novel Contributions
- Trajectory-state readout that learns a grouped correction from all three recurrence passes
- Step-based loop activation at step 2000 to remove a wallclock curriculum confound
- Adoption of Muon momentum 0.98 and tighter GPTQ calibration budget from prior work
- Legal score-first full-parameter TTT under the competition constraints