PR #1950
openRecord: Compliant PR #1934 Reproduction (GPTQ_RESERVE=5.5) — val_bpb 1.06003 (3-seed)
by Christopher-Lee-McClendon
val_bpb
1.0600
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,974,305 B
Training Techniques
Quantization
GPTQ
bits: 6
scope: weights
GPTQ
bits: 7
scope: embeddings
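The two quantization entries above (6-bit weights, 7-bit embeddings, with group=64 appearing later in the LQER settings) can be illustrated with a minimal per-group uniform quantizer. This is a round-to-nearest sketch only; actual GPTQ additionally compensates rounding error using second-order (Hessian) information collected from calibration data, which this record's reserve-time change concerns.

```python
import numpy as np

def quantize_groups(w, bits=6, group=64):
    """Per-group uniform quantization to a b-bit grid (bits=6 mirrors the
    weight setting above; real GPTQ also propagates rounding error with
    Hessian information -- this sketch is round-to-nearest only)."""
    levels = 2 ** bits - 1
    w = w.reshape(-1, group)
    lo = w.min(axis=1, keepdims=True)
    scale = (w.max(axis=1, keepdims=True) - lo) / levels
    scale[scale == 0] = 1.0                     # guard constant groups
    q = np.round((w - lo) / scale)              # integer codes
    return (q * scale + lo).reshape(-1)         # dequantized weights
```

Each group stores its own `lo`/`scale`, so outliers in one group do not widen the grid for the whole tensor.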
Architecture
U-Net skip connections
Adds U-Net style skip connections to the transformer.
parameters: {"layers":11}
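The U-Net skip scheme above can be sketched as a stack of saved first-half activations added back into the mirrored second-half layers. With the recipe's 11 layers the middle layer is unpaired; the exact pairing convention here is an illustrative assumption, not necessarily the PR's.

```python
def unet_transformer_forward(x, layers):
    """U-Net-style skips over a transformer stack: outputs of the first
    half are pushed on a stack and added to the inputs of the mirrored
    second-half layers (pairing convention is an assumption)."""
    n = len(layers)
    half = n // 2
    skips = []
    for i, layer in enumerate(layers):
        if i >= n - half:          # second half: pop the matching skip
            x = x + skips.pop()
        x = layer(x)
        if i < half:               # first half: save the activation
            skips.append(x)
    return x
```

Popping in reverse order pairs the outermost early layer with the outermost late layer, as in a convolutional U-Net.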
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
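The GQA setting above (heads=8, kv_heads=4) means each pair of query heads shares one KV head. A minimal numpy sketch of that sharing:

```python
import numpy as np

def gqa_attention(q, k, v, heads=8, kv_heads=4):
    """Grouped query attention: heads // kv_heads query heads share each
    KV head (heads=8, kv_heads=4 mirror the recipe; the function itself
    is an illustrative sketch). q: (heads, T, d); k, v: (kv_heads, T, d)."""
    T, d = q.shape[1], q.shape[2]
    group = heads // kv_heads              # 2 query heads per KV head
    out = np.empty_like(q)
    for h in range(heads):
        kv = h // group                    # index of the shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)   # stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out
```

Halving the KV heads halves the KV-cache size while keeping the full count of query projections.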
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":16}
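Partial RoPE as configured above rotates only the first 16 dimensions of each head and passes the rest through unchanged. A sketch, assuming a half-split pairing convention (the PR's exact pairing may differ):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first rot_dims of the head
    dimension only (rot_dims=16 mirrors the recipe; the half-split pairing
    is one plausible convention). x: (T, head_dim)."""
    T = x.shape[0]
    half = rot_dims // 2
    pos = np.arange(T)[:, None]
    freqs = pos / base ** (np.arange(half)[None, :] / half)
    cos, sin = np.cos(freqs), np.sin(freqs)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

Leaving most dimensions unrotated gives the model position-free channels, which some speedrun recipes find helps with long-range content matching.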
depth recurrence
Reuses layers in a recurrent loop over a subset of layers.
parameters: {"loop_layers":[3,5],"num_loops":2}
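The depth-recurrence parameters above (loop_layers=[3,5], num_loops=2) can be read as an expanded layer schedule. The inclusive-span interpretation below is an assumption; the sketch only shows which layer indices would execute, in order:

```python
def layer_schedule(num_layers=11, loop_layers=(3, 5), num_loops=2):
    """Expand a depth-recurrent schedule: the block of layers
    loop_layers[0]..loop_layers[1] is run num_loops times with shared
    weights (values from the recipe; inclusive span is an assumption)."""
    lo, hi = loop_layers
    order = list(range(lo))                        # layers before the loop
    order += list(range(lo, hi + 1)) * num_loops   # recurrent block, reused
    order += list(range(hi + 1, num_layers))       # layers after the loop
    return order
```

Weight reuse buys effective depth (14 layer applications here) at the parameter cost of 11 layers.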
SmearGate
Adds a smear gate mechanism for attention/control.
parameters: {"window":12}
CaseOps
Uses CaseOps SP8192 bijective case transform.
parameters: {"sp":8192}
LQER
Applies asymmetric low-rank error correction.
parameters: {"rank":4,"top_k":3,"group":64}
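The core of LQER is storing the quantization error in low-rank factors alongside the quantized weight. A plain-SVD sketch with the recipe's rank=4; full LQER weights the decomposition asymmetrically by activation statistics, which is omitted here:

```python
import numpy as np

def lqer_correction(w, w_q, rank=4):
    """Low-rank correction of quantization error: factor E = w - w_q with
    a truncated SVD so the model stores w_q plus two thin matrices
    (rank=4 mirrors the recipe; LQER's activation weighting is omitted)."""
    e = w - w_q
    u, s, vt = np.linalg.svd(e, full_matrices=False)
    a = u[:, :rank] * s[:rank]         # (out_features, rank)
    b = vt[:rank]                      # (rank, in_features)
    return a, b                        # corrected weight: w_q + a @ b
```

Because truncated SVD is the best rank-r approximation in Frobenius norm, the corrected weight is never worse than the bare quantized one.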
sparse attention gate
Uses a sparse attention gating mechanism.
parameters: null
fused CE
Uses fused cross-entropy for training.
parameters: null
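The memory-saving idea behind fused cross-entropy kernels is to never materialize the full (tokens × vocab) logit matrix. A numpy sketch of the math via a chunked running logsumexp (this shows the computation, not the fused GPU kernel itself):

```python
import numpy as np

def chunked_cross_entropy(hidden, w_out, targets, chunk=1024):
    """Cross-entropy over vocab chunks: logits are produced chunk by chunk
    and folded into a running stable logsumexp, so the full (T, V) logit
    matrix never exists at once."""
    T, V = hidden.shape[0], w_out.shape[1]
    m = np.full(T, -np.inf)            # running max per token
    s = np.zeros(T)                    # running sum of exp(logit - m)
    tgt_logit = np.empty(T)
    for start in range(0, V, chunk):
        logits = hidden @ w_out[:, start:start + chunk]    # (T, chunk)
        new_m = np.maximum(m, logits.max(axis=1))
        s = s * np.exp(m - new_m) + np.exp(logits - new_m[:, None]).sum(axis=1)
        m = new_m
        in_chunk = (targets >= start) & (targets < start + chunk)
        tgt_logit[in_chunk] = logits[in_chunk, targets[in_chunk] - start]
    return float(np.mean(m + np.log(s) - tgt_logit))       # mean NLL
```

Peak activation memory drops from O(T·V) to O(T·chunk), which matters when the vocab dominates the compute budget.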
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2000,"warm_start_a":1}
Compression
per-group lrzip
level: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam_for_scalars":true}
Regularization
weight decay
parameters: {"embed_wd":0.06}
LR Schedule
warmdown
parameters: null
Novel Contributions
- Compliance reproduction of PR #1934 with GPTQ reserve increased to 5.5 seconds
- Ensures GPTQ Hessian collection completes within the 600 s training budget
- Demonstrates near-identical performance to PR #1934 while satisfying timing compliance
- Uses per-group lrzip compression and tightened clip sigmas in the reproduced recipe