val_bpb: 1.2880
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,929,139 bytes
Training Techniques
- Test-Time Training: full TTT (validation-time TTT is disabled in this control run)
  - parameters: {"enabled":false}
- Quantization: GPTQ
  - bits: null
  - scope: serialization
- Architecture: SmearGate
  - The CaseOps/LQER/SparseAttnGate stack includes sparse attention gating and smear gate components.
  - parameters: {"gate_window":12,"sparse_attn_gate_scale":0.5}
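The submission does not spell out what `gate_window` controls; one plausible reading is a causal local-attention window of 12 positions, with the gate's contribution scaled by `sparse_attn_gate_scale`. A minimal sketch under that assumption (both function names and the parameter semantics are guesses, not from the submission):

```python
def local_window_mask(seq_len: int, gate_window: int = 12) -> list[list[bool]]:
    """True where query i may attend to key j: causal, limited to the
    last `gate_window` positions. This window semantics is an assumed
    reading of the submission's gate_window parameter."""
    return [[0 <= i - j < gate_window for j in range(seq_len)]
            for i in range(seq_len)]

def gate_scores(scores: list[float], scale: float = 0.5) -> list[float]:
    """Assumed use of sparse_attn_gate_scale: uniformly scale the
    gated attention scores before they are combined."""
    return [s * scale for s in scores]
```

With `gate_window=2`, position 3 would attend only to positions 2 and 3, which is the kind of sparsity such a gate would enforce.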
- Gated Attention
  - Quantized gated attention with the sparse attention gate enabled.
  - parameters: {"quant_gate":1,"enabled":1}
- weight tying
  - Not mentioned explicitly in the submission.
  - parameters: null
- Regularization: weight decay
  - parameters: {"ttt_weight_decay":0.5}
- LR Schedule: warmdown
  - parameters: {"warmdown_frac":0.85,"warmup_steps":20}
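A plausible reading of these parameters is a trapezoidal schedule: linear warmup for `warmup_steps`, a constant plateau, then a linear warmdown over the final `warmdown_frac` of training. A minimal sketch (the function name and the total-step count in the example are illustrative, not from the submission):

```python
def lr_multiplier(step: int, total_steps: int,
                  warmup_steps: int = 20, warmdown_frac: float = 0.85) -> float:
    """Trapezoidal LR multiplier: linear warmup, flat plateau, then a
    linear warmdown to zero over the last `warmdown_frac` of training."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmup_steps:
        return (step + 1) / warmup_steps            # linear warmup to 1.0
    if step < warmdown_start:
        return 1.0                                  # constant plateau
    return (total_steps - step) / (total_steps - warmdown_start)  # decay to 0
```

With `warmdown_frac = 0.85` the decay phase covers the final 85% of steps, so most of training runs on a falling learning rate.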
- Optimizer: Muon
  - weight_decay: null
  - momentum: 0.9
  - other_params: {"muon_backend_steps":5}
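Muon's distinguishing step is orthogonalizing each 2-D momentum update with a quintic Newton-Schulz iteration; the public reference implementation runs 5 iterations, which may be what `muon_backend_steps` refers to (an assumption). On a matrix the iteration is X ← aX + b(XXᵀ)X + c(XXᵀ)²X, which acts on each singular value of the Frobenius-normalized update independently, so a scalar sketch shows the effect:

```python
def ns_quintic(s: float, steps: int = 5) -> float:
    """Scalar view of Muon's Newton-Schulz orthogonalization: each
    singular value s of the normalized update is driven into a loose
    band around 1, making the update approximately orthogonal.
    Coefficients are from the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    for _ in range(steps):
        s = a * s + b * s**3 + c * s**5
    return s
```

After 5 steps, inputs across most of (0, 1] land within a band around 1 rather than converging exactly; that looseness is a deliberate speed/accuracy trade-off in the reference coefficients.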
- Other
  - CaseOps / LQER / SparseAttnGate ablation with rank-0-only GPTQ serialization to avoid duplicated work across torchrun ranks.
  - parameters: {"posttrain_single_rank":1,"world_size":1}
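Rank-0-only serialization works because torchrun exports a RANK environment variable to every process; gating the GPTQ pass and the artifact write on it means only one rank does the work. A minimal sketch (the function names and the byte-level interface are hypothetical; RANK defaulting to "0" covers single-process runs):

```python
import os

def is_primary_rank() -> bool:
    # torchrun sets RANK per process; single-process runs default to rank 0
    return int(os.environ.get("RANK", "0")) == 0

def serialize_on_rank0(artifact: bytes, path: str) -> bool:
    """Write the (already GPTQ-quantized) artifact from rank 0 only.
    Other ranks return False immediately; in a real distributed run a
    barrier would follow so all ranks wait for the file to exist."""
    if not is_primary_rank():
        return False
    with open(path, "wb") as f:
        f.write(artifact)
    return True
```

With `world_size: 1` in this run the gate is trivially satisfied, but the same guard prevents N identical GPTQ passes under multi-rank torchrun launches.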
Novel Contributions
- Non-record control submission for the PR #1855 CaseOps/LQER/SparseAttnGate stack
- Disables validation-time TTT while keeping full validation and artifact accounting
- Rank-0-only GPTQ serialization to avoid duplicated GPTQ work across torchrun ranks
- Reproducible systems ablation demonstrating a successful packaged run with full validation