PR #2015

open

Non-record: Rank-0 GPTQ + no-TTT CaseOps ablation

by Muhtasham
val_bpb: 1.2880
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,929,139 bytes

Training Techniques

Test-Time Training: full TTT
parameters: {"enabled":false}

Quantization: GPTQ
bits: null
scope: serialization

Architecture: SmearGate
CaseOps/LQER/SparseAttnGate stack includes sparse attention gating and smear gate components.
parameters: {"gate_window":12,"sparse_attn_gate_scale":0.5}

Gated Attention
Quantized gated attention with sparse attention gate enabled.
parameters: {"quant_gate":1,"enabled":1}

Weight tying
Not mentioned explicitly in the submission.
parameters: null
Regularization: weight decay
parameters: {"ttt_weight_decay":0.5}

LR Schedule: warmdown
parameters: {"warmdown_frac":0.85,"warmup_steps":20}
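The submission reports only the two schedule hyperparameters. A minimal sketch of one common interpretation — linear warmup for 20 steps, then a linear warmdown over the final 85% of training — could look like the following; the function name and the piecewise-linear shape are assumptions, not the PR's actual schedule:

```python
def lr_multiplier(step: int, total_steps: int,
                  warmup_steps: int = 20, warmdown_frac: float = 0.85) -> float:
    """Piecewise-linear LR multiplier: linear warmup, constant plateau,
    then linear warmdown over the final `warmdown_frac` of training.
    (Assumed shape -- the PR only reports the two hyperparameters.)"""
    warmdown_steps = int(total_steps * warmdown_frac)
    if step < warmup_steps:
        return (step + 1) / warmup_steps                      # ramp 1/20 ... 1.0
    if step >= total_steps - warmdown_steps:
        return max(0.0, (total_steps - step) / warmdown_steps)  # decay toward 0
    return 1.0
```

With warmdown_frac this large, the plateau is short: for a 1000-step run the warmdown already begins at step 150, immediately after warmup ends for most practical step counts.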
Optimizer: Muon
weight_decay: null
momentum: 0.9
other_params: {"muon_backend_steps":5}
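The muon_backend_steps value plausibly refers to the number of Newton–Schulz iterations Muon's backend uses to approximately orthogonalize each momentum-smoothed 2-D update. A sketch of that backend step, with quintic coefficients taken from the public Muon reference implementation (the function name and NumPy framing are illustrative, not this PR's code):

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a 2-D update with a quintic
    Newton-Schulz iteration, as in the public Muon reference code.
    `steps` would correspond to muon_backend_steps (assumption)."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from the reference impl
    X = g / (np.linalg.norm(g) + 1e-7)  # scale so the spectral norm is <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X
```

Five iterations do not drive the singular values exactly to 1; they land in a loose band around 1, which is reportedly sufficient for the optimizer.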
Other
CaseOps / LQER / SparseAttnGate ablation with rank-0-only GPTQ serialization to avoid duplicated work across torchrun ranks.
parameters: {"posttrain_single_rank":1,"world_size":1}
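The posttrain_single_rank setting amounts to a simple guard: only the rank-0 worker runs the expensive GPTQ quantization and artifact serialization, while any other torchrun ranks skip it (and would typically wait at a barrier). The function name and the usage around it below are illustrative assumptions, not this PR's code:

```python
import os

def is_serialization_rank() -> bool:
    """True only on the worker that should run GPTQ + artifact serialization.
    torchrun exports RANK for every worker; a bare single-process run
    (world_size 1) has no RANK set and defaults to rank 0."""
    return int(os.environ.get("RANK", "0")) == 0

# Hypothetical usage around the post-training step:
# if is_serialization_rank():
#     gptq_quantize_and_save(model)   # expensive; run once, not once per rank
# torch.distributed.barrier()         # other ranks wait for the artifact
```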

Novel Contributions

  • Non-record control submission for the PR #1855 CaseOps/LQER/SparseAttnGate stack
  • Disables validation-time TTT while keeping full validation and artifact accounting
  • Rank-0-only GPTQ serialization to avoid duplicated GPTQ work across torchrun ranks
  • Reproducible systems ablation demonstrating a successful packaged run with full validation