PR #1950
openRecord: Compliant PR #1934 Reproduction (GPTQ_RESERVE=5.5) — val_bpb 1.06003 (3-seed)
by Christopher-Lee-McClendon
val_bpb
1.0600
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,974,305 B
Training Techniques
Quantization
GPTQ
bits: 6
scope: weights
GPTQ
bits: 7
scope: embeddings
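The two quantization entries above (6-bit weights, 7-bit embeddings, with group=64 appearing later in the LQER settings) can be illustrated with a minimal per-group uniform quantizer. This is a round-to-nearest sketch only; actual GPTQ additionally compensates rounding error using second-order (Hessian) information collected from calibration data, which this record's reserve-time change concerns.

```python
import numpy as np

def quantize_groups(w, bits=6, group=64):
    """Per-group uniform quantization to a b-bit grid (bits=6 mirrors the
    weight setting above; real GPTQ also propagates rounding error with
    Hessian information -- this sketch is round-to-nearest only)."""
    levels = 2 ** bits - 1
    w = w.reshape(-1, group)
    lo = w.min(axis=1, keepdims=True)
    scale = (w.max(axis=1, keepdims=True) - lo) / levels
    scale[scale == 0] = 1.0                     # guard constant groups
    q = np.round((w - lo) / scale)              # integer codes
    return (q * scale + lo).reshape(-1)         # dequantized weights
```

Each group stores its own `lo`/`scale`, so outliers in one group do not widen the grid for the whole tensor.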
Architecture
U-Net skip connections
Adds U-Net style skip connections to the transformer.
parameters: {"layers":11}
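The U-Net skip scheme above can be sketched as a stack of saved first-half activations added back into the mirrored second-half layers. With the recipe's 11 layers the middle layer is unpaired; the exact pairing convention here is an illustrative assumption, not necessarily the PR's.

```python
def unet_transformer_forward(x, layers):
    """U-Net-style skips over a transformer stack: outputs of the first
    half are pushed on a stack and added to the inputs of the mirrored
    second-half layers (pairing convention is an assumption)."""
    n = len(layers)
    half = n // 2
    skips = []
    for i, layer in enumerate(layers):
        if i >= n - half:          # second half: pop the matching skip
            x = x + skips.pop()
        x = layer(x)
        if i < half:               # first half: save the activation
            skips.append(x)
    return x
```

Popping in reverse order pairs the outermost early layer with the outermost late layer, as in a convolutional U-Net.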
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
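The GQA setting above (heads=8, kv_heads=4) means each pair of query heads shares one KV head. A minimal numpy sketch of that sharing:

```python
import numpy as np

def gqa_attention(q, k, v, heads=8, kv_heads=4):
    """Grouped query attention: heads // kv_heads query heads share each
    KV head (heads=8, kv_heads=4 mirror the recipe; the function itself
    is an illustrative sketch). q: (heads, T, d); k, v: (kv_heads, T, d)."""
    T, d = q.shape[1], q.shape[2]
    group = heads // kv_heads              # 2 query heads per KV head
    out = np.empty_like(q)
    for h in range(heads):
        kv = h // group                    # index of the shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)   # stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out
```

Halving the KV heads halves the KV-cache size while keeping the full count of query projections.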
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":16}
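Partial RoPE as configured above rotates only the first 16 dimensions of each head and passes the rest through unchanged. A sketch, assuming a half-split pairing convention (the PR's exact pairing may differ):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first rot_dims of the head
    dimension only (rot_dims=16 mirrors the recipe; the half-split pairing
    is one plausible convention). x: (T, head_dim)."""
    T = x.shape[0]
    half = rot_dims // 2
    pos = np.arange(T)[:, None]
    freqs = pos / base ** (np.arange(half)[None, :] / half)
    cos, sin = np.cos(freqs), np.sin(freqs)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

Leaving most dimensions unrotated gives the model position-free channels, which some speedrun recipes find helps with long-range content matching.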
depth recurrence
Reuses layers in a recurrent loop over a subset of layers.
parameters: {"loop_layers":[3,5],"num_loops":2}
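The depth-recurrence parameters above (loop_layers=[3,5], num_loops=2) can be read as an expanded layer schedule. The inclusive-span interpretation below is an assumption; the sketch only shows which layer indices would execute, in order:

```python
def layer_schedule(num_layers=11, loop_layers=(3, 5), num_loops=2):
    """Expand a depth-recurrent schedule: the block of layers
    loop_layers[0]..loop_layers[1] is run num_loops times with shared
    weights (values from the recipe; inclusive span is an assumption)."""
    lo, hi = loop_layers
    order = list(range(lo))                        # layers before the loop
    order += list(range(lo, hi + 1)) * num_loops   # recurrent block, reused
    order += list(range(hi + 1, num_layers))       # layers after the loop
    return order
```

Weight reuse buys effective depth (14 layer applications here) at the parameter cost of 11 layers.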
SmearGate
Adds a smear gate mechanism for attention/control.
parameters: {"window":12}
CaseOps
Uses CaseOps SP8192 bijective case transform.
parameters: {"sp":8192}
LQER
Applies asymmetric low-rank error correction.
parameters: {"rank":4,"top_k":3,"group":64}
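The core of LQER is storing the quantization error in low-rank factors alongside the quantized weight. A plain-SVD sketch with the recipe's rank=4; full LQER weights the decomposition asymmetrically by activation statistics, which is omitted here:

```python
import numpy as np

def lqer_correction(w, w_q, rank=4):
    """Low-rank correction of quantization error: factor E = w - w_q with
    a truncated SVD so the model stores w_q plus two thin matrices
    (rank=4 mirrors the recipe; LQER's activation weighting is omitted)."""
    e = w - w_q
    u, s, vt = np.linalg.svd(e, full_matrices=False)
    a = u[:, :rank] * s[:rank]         # (out_features, rank)
    b = vt[:rank]                      # (rank, in_features)
    return a, b                        # corrected weight: w_q + a @ b
```

Because truncated SVD is the best rank-r approximation in Frobenius norm, the corrected weight is never worse than the bare quantized one.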
sparse attention gate
Uses a sparse attention gating mechanism.
parameters: null
fused CE
Uses fused cross-entropy for training.
parameters: null
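The memory-saving idea behind fused cross-entropy kernels is to never materialize the full (tokens × vocab) logit matrix. A numpy sketch of the math via a chunked running logsumexp (this shows the computation, not the fused GPU kernel itself):

```python
import numpy as np

def chunked_cross_entropy(hidden, w_out, targets, chunk=1024):
    """Cross-entropy over vocab chunks: logits are produced chunk by chunk
    and folded into a running stable logsumexp, so the full (T, V) logit
    matrix never exists at once."""
    T, V = hidden.shape[0], w_out.shape[1]
    m = np.full(T, -np.inf)            # running max per token
    s = np.zeros(T)                    # running sum of exp(logit - m)
    tgt_logit = np.empty(T)
    for start in range(0, V, chunk):
        logits = hidden @ w_out[:, start:start + chunk]    # (T, chunk)
        new_m = np.maximum(m, logits.max(axis=1))
        s = s * np.exp(m - new_m) + np.exp(logits - new_m[:, None]).sum(axis=1)
        m = new_m
        in_chunk = (targets >= start) & (targets < start + chunk)
        tgt_logit[in_chunk] = logits[in_chunk, targets[in_chunk] - start]
    return float(np.mean(m + np.log(s) - tgt_logit))       # mean NLL
```

Peak activation memory drops from O(T·V) to O(T·chunk), which matters when the vocab dominates the compute budget.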
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2000,"warm_start_a":1}
Compression
per-group lrzip
level: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam_for_scalars":true}
Regularization
weight decay
parameters: {"embed_wd":0.06}
LR Schedule
warmdown
parameters: null
Novel Contributions
- Compliance reproduction of PR #1934 with GPTQ reserve increased to 5.5 seconds
- Ensures GPTQ Hessian collection completes within the 600 s training budget
- Demonstrates near-identical performance to PR #1934 while satisfying timing compliance
- Uses per-group lrzip compression and tightened clip sigmas in the reproduced recipe