PR #2020

open

Record: PR1851 + 9-hparam stack + wd_strong + GPTQ AR + pergroup - val_bpb 1.05957 (1 seed)

by Itssshikhar
val_bpb
1.0596
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,901,624 bytes

Training Techniques

Architecture
SmearGate
Preserves the PR #1851 graph with BOS-fixed SmearGate, asymmetric LQER, and the SparseAttnGate/PolarNS/FusedCE stack.
parameters: null
weight tying
Uses tied embeddings in the CaseOps stack.
parameters: null
Quantization
GPTQ
bits: 6
scope: all ranks / model weights
GPTQ
bits: 7
scope: embeddings
Optimizer
Muon
weight_decay: 0.5
momentum: null
other_params: {"wd_schedule_enabled":true,"wd_sched_low_factor":0.5,"wd_sched_high_factor":1.75}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
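As a rough illustration of what warmdown_frac 0.85 could mean, the sketch below holds the learning rate at its peak for the first 15% of training and then decays it linearly to zero over the remaining 85%. This is the common reading of a "warmdown" schedule and is an assumption here; the PR does not state the exact shape, and the function name `lr_multiplier` is hypothetical.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_frac: float = 0.85) -> float:
    """Return an LR scale in [0, 1] for a hold-then-linear-decay schedule.

    Assumption: the final `warmdown_frac` of training decays linearly to zero.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if warmdown_frac <= 0.0:
        return 1.0  # no warmdown phase: keep the peak LR throughout
    progress = step / total_steps
    warmdown_start = 1.0 - warmdown_frac
    if progress < warmdown_start:
        return 1.0  # still in the constant-LR phase
    # Linear decay from 1.0 at warmdown_start down to 0.0 at the final step.
    return max(0.0, (1.0 - progress) / warmdown_frac)
```

With these assumptions, the multiplier stays at 1.0 until 15% of the steps have elapsed and reaches 0.5 at 57.5% of training.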
Regularization
weight decay
parameters: {"schedule":"wd_strong","low_factor":0.5,"high_factor":1.75}
Test-Time Training
LoRA TTT
parameters: {"phased":true,"num_phases":3,"rank":80,"prefix_docs":2500}
Compression
custom
level: null
Evaluation
phased TTT eval
parameters: {"phases":3}
Other
other
Averages the GPTQ Hessians across all ranks during calibration.
parameters: {"all_reduce":true}
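A minimal sketch of the idea behind all-rank Hessian averaging, assuming each rank builds a Hessian proportional to X^T X from its own calibration activations and the results are then averaged element-wise (the effect of an all-reduce sum divided by the world size). The helper names are hypothetical and plain Python lists stand in for tensors; the PR's actual implementation is not shown in the source.

```python
from typing import List

Matrix = List[List[float]]


def local_hessian(x: Matrix) -> Matrix:
    """Compute X^T X / n for one rank's activations (n rows, d columns)."""
    n, d = len(x), len(x[0])
    return [[sum(row[i] * row[j] for row in x) / n for j in range(d)]
            for i in range(d)]


def average_hessians(per_rank: List[Matrix]) -> Matrix:
    """Element-wise mean over ranks, mimicking all_reduce(sum) / world_size."""
    world_size = len(per_rank)
    d = len(per_rank[0])
    return [[sum(h[i][j] for h in per_rank) / world_size for j in range(d)]
            for i in range(d)]
```

Under this reading, every rank ends up quantizing against the same averaged statistics rather than only the calibration batches it happened to see.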
other
Per-group lrzip+brotli compression ported into the PR #1851 graph to fit under the 16 MB cap.
parameters: {"compressor":"pergroup"}
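The per-group idea can be sketched as follows: split the serialized weights into named groups, compress each group independently, and keep whichever codec yields the smaller output for that group. This sketch swaps in Python's standard-library zlib and lzma as stand-ins for lrzip and brotli, which are not in the standard library. The grouping, the function names, and the decimal reading of the 16 MB cap are all assumptions, not the PR's actual compressor.

```python
import lzma
import zlib
from typing import Dict, Tuple

SIZE_CAP_BYTES = 16_000_000  # assumption: "16 MB" read as decimal megabytes


def compress_per_group(groups: Dict[str, bytes]) -> Dict[str, Tuple[str, bytes]]:
    """Compress each group on its own and keep the smaller of two codecs."""
    packed = {}
    for name, payload in groups.items():
        candidates = {
            "zlib": zlib.compress(payload, 9),  # stand-in codec, not brotli
            "lzma": lzma.compress(payload),     # stand-in codec, not lrzip
        }
        best = min(candidates, key=lambda codec: len(candidates[codec]))
        packed[name] = (best, candidates[best])
    return packed


def decompress_group(codec: str, blob: bytes) -> bytes:
    """Invert compress_per_group for a single group."""
    if codec == "zlib":
        return zlib.decompress(blob)
    if codec == "lzma":
        return lzma.decompress(blob)
    raise ValueError(f"unknown codec: {codec}")


def total_size(packed: Dict[str, Tuple[str, bytes]]) -> int:
    """Sum of compressed payload sizes (ignores any container overhead)."""
    return sum(len(blob) for _, blob in packed.values())
```

Under the assumed decimal cap, the reported 15,901,624-byte artifact would leave 98,376 bytes of headroom.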

Novel Contributions

  • PR #1851 graph preserved while importing PR #1855's 9-hparam stack
  • Stronger Muon weight-decay schedule ('wd_strong')
  • GPTQ all-rank Hessian averaging
  • Port of PR #1855 pergroup lrzip+brotli compressor into the PR #1851 graph
  • Valid-size recovery of a previously over-cap run with nearly identical val_bpb