PR #2020

open

Record: PR1851 + 9-hparam stack + wd_strong + GPTQ AR + pergroup - val_bpb 1.05957 (1 seed)

by Itssshikhar

val_bpb

1.0596

Architecture

Transformer

Optimizer

Muon

Artifact Size

15,901,624 bytes
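
As a quick sanity check on the size budget, the sketch below computes the headroom between the reported artifact size and the 16 MB cap mentioned for this record. The page does not say whether the cap is decimal (16,000,000 bytes) or binary (16 × 1024² bytes), so both readings are treated as assumptions.

```python
# Reported artifact size from this PR summary.
ARTIFACT_BYTES = 15_901_624

# Assumption: the page does not define "16 MB", so both readings are checked.
CAP_DECIMAL = 16 * 1000**2   # 16,000,000 bytes
CAP_BINARY = 16 * 1024**2    # 16,777,216 bytes


def headroom(cap_bytes: int, artifact_bytes: int = ARTIFACT_BYTES) -> int:
    """Return the number of bytes left under the cap (negative if over)."""
    return cap_bytes - artifact_bytes


if __name__ == "__main__":
    print("decimal cap headroom:", headroom(CAP_DECIMAL))  # 98376
    print("binary cap headroom: ", headroom(CAP_BINARY))   # 875592
```

Under either reading the artifact fits, with roughly 98 KB of slack in the stricter decimal case.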

Training Techniques

Architecture

SmearGate

Preserves the PR #1851 graph with BOS-fixed SmearGate, asymmetric LQER, and the SparseAttnGate/PolarNS/FusedCE stack.

parameters: null

weight tying

Uses tied embeddings in the CaseOps stack.

parameters: null

Quantization

GPTQ

bits: 6

scope: all ranks / model weights

GPTQ

bits: 7

scope: embeddings

Optimizer

Muon

weight_decay: 0.5

momentum: null

other_params: {"wd_schedule_enabled":true,"wd_sched_low_factor":0.5,"wd_sched_high_factor":1.75}

LR Schedule

warmdown

parameters: {"warmdown_frac":0.85}
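
To make `warmdown_frac` concrete, here is a minimal sketch of a warmdown learning-rate multiplier. It assumes the common shape where the rate stays at its peak and then decays linearly to zero over the final `warmdown_frac` of training; the PR summary does not state the exact curve, and the function name `warmdown_lr_scale` is hypothetical.

```python
def warmdown_lr_scale(step: int, total_steps: int, warmdown_frac: float = 0.85) -> float:
    """Return an LR multiplier in [0, 1].

    Assumed shape (not confirmed by the PR): hold at 1.0, then decay
    linearly to 0.0 over the last `warmdown_frac` of training.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if warmdown_frac <= 0.0:
        return 1.0  # no warmdown phase at all
    progress = min(max(step / total_steps, 0.0), 1.0)
    warmdown_start = 1.0 - warmdown_frac
    if progress < warmdown_start:
        return 1.0
    # Linear ramp: 1.0 at the start of warmdown, 0.0 at the final step.
    return max(0.0, (1.0 - progress) / warmdown_frac)


if __name__ == "__main__":
    for s in (0, 150, 575, 1000):
        print(s, round(warmdown_lr_scale(s, 1000), 4))
```

With `warmdown_frac` at 0.85, only the first 15% of steps run at the peak rate under this assumed shape.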

Regularization

weight decay

parameters: {"schedule":"wd_strong","low_factor":0.5,"high_factor":1.75}
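
The `low_factor` and `high_factor` values read as multipliers on the base Muon weight decay of 0.5. The sketch below illustrates one possible interpretation: a linear ramp from `low_factor` to `high_factor` over training. The ramp shape and direction are assumptions, since the summary lists only the two factors, and `scheduled_weight_decay` is a hypothetical name.

```python
def scheduled_weight_decay(
    step: int,
    total_steps: int,
    base_wd: float = 0.5,
    low_factor: float = 0.5,
    high_factor: float = 1.75,
) -> float:
    """Return the effective weight decay at `step`.

    Assumption: the multiplier ramps linearly from `low_factor` at step 0
    to `high_factor` at the final step. The real 'wd_strong' curve may differ.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    factor = low_factor + (high_factor - low_factor) * progress
    return base_wd * factor


if __name__ == "__main__":
    for s in (0, 500, 1000):
        print(s, scheduled_weight_decay(s, 1000))
```

Whatever the actual curve, the two factors bound the effective weight decay between 0.5 × 0.5 = 0.25 and 0.5 × 1.75 = 0.875.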

Test-Time Training

LoRA TTT

parameters: {"phased":true,"num_phases":3,"rank":80,"prefix_docs":2500}

Compression

custom

level: null

Evaluation

phased TTT eval

parameters: {"phases":3}

Other

other

GPTQ Hessians are averaged across all ranks during calibration.

parameters: {"all_reduce":true}
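
The idea behind cross-rank Hessian averaging can be sketched without a distributed runtime. GPTQ builds a per-layer Hessian proportional to XᵀX from calibration activations; if each rank sees a different calibration shard, averaging the per-rank Hessians gives every rank the same, lower-variance estimate. The code below simulates ranks as plain lists; in a real run the averaging step would be an all-reduce, and all function names here are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def local_hessian(acts: Matrix) -> Matrix:
    """Mean of outer products x xᵀ over this rank's calibration rows."""
    n, d = len(acts), len(acts[0])
    h = [[0.0] * d for _ in range(d)]
    for x in acts:
        for i in range(d):
            for j in range(d):
                h[i][j] += x[i] * x[j]
    return [[v / n for v in row] for row in h]


def average_across_ranks(hessians: List[Matrix]) -> Matrix:
    """Stand-in for an all-reduce with averaging: elementwise mean over ranks."""
    world, d = len(hessians), len(hessians[0])
    return [
        [sum(h[i][j] for h in hessians) / world for j in range(d)]
        for i in range(d)
    ]


if __name__ == "__main__":
    rank0 = [[1.0, 2.0], [3.0, 4.0]]   # calibration shard on rank 0
    rank1 = [[0.0, 1.0], [2.0, 0.0]]   # calibration shard on rank 1
    avg = average_across_ranks([local_hessian(rank0), local_hessian(rank1)])
    print(avg)
```

When every rank holds the same number of calibration rows, this average equals the Hessian computed over the pooled data, which is why the result does not depend on how the shards were split.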

other

Per-group lrzip+brotli compression ported into the PR #1851 graph to fit under the 16 MB cap.

parameters: {"compressor":"pergroup"}
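
A rough illustration of per-group compression follows. The PR uses lrzip and brotli, neither of which ships with Python, so this sketch swaps in the standard-library `lzma` and `zlib` codecs as stand-ins. The grouping, the "pick the smaller output per group" rule, and every name here are assumptions for illustration, not the PR's actual compressor.

```python
import lzma
import zlib
from typing import Dict, Tuple

# Stand-in codecs (assumption): the PR itself uses lrzip + brotli.
CODECS = {
    "lzma": (lzma.compress, lzma.decompress),
    "zlib": (lambda raw: zlib.compress(raw, 9), zlib.decompress),
}


def compress_groups(groups: Dict[str, bytes]) -> Dict[str, Tuple[str, bytes]]:
    """Compress each group independently, keeping the smallest codec output."""
    packed = {}
    for name, raw in groups.items():
        candidates = [(codec, fns[0](raw)) for codec, fns in CODECS.items()]
        packed[name] = min(candidates, key=lambda pair: len(pair[1]))
    return packed


def decompress_groups(packed: Dict[str, Tuple[str, bytes]]) -> Dict[str, bytes]:
    """Invert compress_groups using the codec tag stored with each group."""
    return {name: CODECS[codec][1](blob) for name, (codec, blob) in packed.items()}


if __name__ == "__main__":
    # Hypothetical groups; real groups would hold serialized quantized tensors.
    groups = {"embed": bytes(range(256)) * 64, "attn": b"\x00" * 4096}
    packed = compress_groups(groups)
    for name, (codec, blob) in packed.items():
        print(name, codec, len(groups[name]), "->", len(blob))
```

Compressing groups separately lets each one use whichever codec suits its byte statistics, which can matter when the artifact sits close to a hard size cap.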

Novel Contributions

PR #1851 graph preserved while importing PR #1855's 9-hparam stack
Stronger Muon weight-decay schedule ('wd_strong')
GPTQ all-rank Hessian averaging
Port of PR #1855 pergroup lrzip+brotli compressor into the PR #1851 graph
Valid-size recovery of a previously over-cap run with nearly identical val_bpb