val_bpb: 1.1184
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15,882,595 bytes
Training Techniques
Architecture
LeakyReLU
Uses a squared LeakyReLU (LeakyReLU²) activation in the model.
parameters: {"power":2,"slope":0.5}
BigramHash
Uses a hashed bigram embedding.
parameters: {"vocab_size":1536}
XSA
Uses XSA in the last 4 layers.
parameters: {"last_n_layers":4}
Partial RoPE
Applies RoPE to only a subset of dimensions (16), leaving the rest unrotated.
parameters: {"dimensions":16}
VE128
Uses value residual embeddings/paths with dimension 128.
parameters: {"dimension":128}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50}
Quantization
late QAT
bits: 6
scope: model
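A sketch of late quantization-aware training, assuming fake 6-bit symmetric per-tensor quantization of the weights with a straight-through gradient, switched on only for the final stretch of training; the per-tensor scaling choice is an assumption.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()  # forward uses w_q, backward sees identity
```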
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"warmdown_iters":3500,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
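A sketch of the momentum warmup implied by the hyperparameters above: Muon's momentum ramping from 0.92 to 0.99 over the first 1,500 steps, then holding. Linear interpolation is an assumption.

```python
def muon_momentum(step: int, start: float = 0.92, final: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Assumed linear momentum warmup for Muon, then constant."""
    if step >= warmup_steps:
        return final
    return start + (final - start) * step / warmup_steps
```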
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0025,"epochs":4,"chunk_tokens":32768,"momentum":0.9,"freeze_blocks":0,"batch_seqs":32,"grad_clip":1}
Evaluation
sliding window eval
parameters: {"stride":64}
LR Schedule
cosine decay
parameters: {"ttt":true}
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
LN scale
parameters: {"enabled":true}
Novel Contributions
- Improved Legal TTT submission based on the prior LeakyReLU LegalTTT Parallel Muon run
- Increased legal TTT learning rate from 0.002 to 0.0025
- Increased legal TTT epochs from 3 to 4
- Skipped diagnostic pre-TTT evaluations to keep evaluation under the time limit
- Added eval-only checkpoint loading for faster TTT sweeps
- Combined LeakyReLU² with Parallel Muon, EMA, SWA, and late QAT