val_bpb: 1.2880
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,929,139 bytes
Training Techniques
- Test-Time Training: full TTT (validation-time TTT is disabled in this control run)
  - parameters: {"enabled":false}
- Quantization: GPTQ
  - bits: null
  - scope: serialization
- Architecture: SmearGate
  - The CaseOps/LQER/SparseAttnGate stack includes sparse attention gating and smear gate components.
  - parameters: {"gate_window":12,"sparse_attn_gate_scale":0.5}
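The submission does not spell out what `gate_window` controls; one plausible reading is a causal local-attention window of 12 positions, with the gate's contribution scaled by `sparse_attn_gate_scale`. A minimal sketch under that assumption (both function names and the parameter semantics are guesses, not from the submission):

```python
def local_window_mask(seq_len: int, gate_window: int = 12) -> list[list[bool]]:
    """True where query i may attend to key j: causal, limited to the
    last `gate_window` positions. This window semantics is an assumed
    reading of the submission's gate_window parameter."""
    return [[0 <= i - j < gate_window for j in range(seq_len)]
            for i in range(seq_len)]

def gate_scores(scores: list[float], scale: float = 0.5) -> list[float]:
    """Assumed use of sparse_attn_gate_scale: uniformly scale the
    gated attention scores before they are combined."""
    return [s * scale for s in scores]
```

With `gate_window=2`, position 3 would attend only to positions 2 and 3, which is the kind of sparsity such a gate would enforce.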
- Gated Attention
  - Quantized gated attention with the sparse attention gate enabled.
  - parameters: {"quant_gate":1,"enabled":1}
- weight tying
  - Not mentioned explicitly in the submission.
  - parameters: null
- Regularization: weight decay
  - parameters: {"ttt_weight_decay":0.5}
- LR Schedule: warmdown
  - parameters: {"warmdown_frac":0.85,"warmup_steps":20}
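A plausible reading of these parameters is a trapezoidal schedule: linear warmup for `warmup_steps`, a constant plateau, then a linear warmdown over the final `warmdown_frac` of training. A minimal sketch (the function name and the total-step count in the example are illustrative, not from the submission):

```python
def lr_multiplier(step: int, total_steps: int,
                  warmup_steps: int = 20, warmdown_frac: float = 0.85) -> float:
    """Trapezoidal LR multiplier: linear warmup, flat plateau, then a
    linear warmdown to zero over the last `warmdown_frac` of training."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmup_steps:
        return (step + 1) / warmup_steps            # linear warmup to 1.0
    if step < warmdown_start:
        return 1.0                                  # constant plateau
    return (total_steps - step) / (total_steps - warmdown_start)  # decay to 0
```

With `warmdown_frac = 0.85` the decay phase covers the final 85% of steps, so most of training runs on a falling learning rate.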
- Optimizer: Muon
  - weight_decay: null
  - momentum: 0.9
  - other_params: {"muon_backend_steps":5}
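Muon's distinguishing step is orthogonalizing each 2-D momentum update with a quintic Newton-Schulz iteration; the public reference implementation runs 5 iterations, which may be what `muon_backend_steps` refers to (an assumption). On a matrix the iteration is X ← aX + b(XXᵀ)X + c(XXᵀ)²X, which acts on each singular value of the Frobenius-normalized update independently, so a scalar sketch shows the effect:

```python
def ns_quintic(s: float, steps: int = 5) -> float:
    """Scalar view of Muon's Newton-Schulz orthogonalization: each
    singular value s of the normalized update is driven into a loose
    band around 1, making the update approximately orthogonal.
    Coefficients are from the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    for _ in range(steps):
        s = a * s + b * s**3 + c * s**5
    return s
```

After 5 steps, inputs across most of (0, 1] land within a band around 1 rather than converging exactly; that looseness is a deliberate speed/accuracy trade-off in the reference coefficients.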
- Other
  - CaseOps / LQER / SparseAttnGate ablation with rank-0-only GPTQ serialization to avoid duplicated work across torchrun ranks.
  - parameters: {"posttrain_single_rank":1,"world_size":1}
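Rank-0-only serialization works because torchrun exports a RANK environment variable to every process; gating the GPTQ pass and the artifact write on it means only one rank does the work. A minimal sketch (the function names and the byte-level interface are hypothetical; RANK defaulting to "0" covers single-process runs):

```python
import os

def is_primary_rank() -> bool:
    # torchrun sets RANK per process; single-process runs default to rank 0
    return int(os.environ.get("RANK", "0")) == 0

def serialize_on_rank0(artifact: bytes, path: str) -> bool:
    """Write the (already GPTQ-quantized) artifact from rank 0 only.
    Other ranks return False immediately; in a real distributed run a
    barrier would follow so all ranks wait for the file to exist."""
    if not is_primary_rank():
        return False
    with open(path, "wb") as f:
        f.write(artifact)
    return True
```

With `world_size: 1` in this run the gate is trivially satisfied, but the same guard prevents N identical GPTQ passes under multi-rank torchrun launches.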
Novel Contributions
- Non-record control submission for the PR #1855 CaseOps/LQER/SparseAttnGate stack
- Disables validation-time TTT while keeping full validation and artifact accounting
- Rank-0-only GPTQ serialization to avoid duplicated GPTQ work across torchrun ranks
- Reproducible systems ablation demonstrating a successful packaged run with full validation