PR #2011 (open)
Non-record: Cross-Base Regularizer Transferability — methodological study (20+ cells, 10 figures)
by BharathSShankar
val_bpb: 1.0750
Architecture: Transformer
Optimizer: AdamW
Artifact Size: —
Training Techniques
Architecture
- weight tying: uses tied embeddings / tied output weights in the model stack (parameters: null)
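Weight tying can be sketched in a few lines; the names below (`W_E`, `embed`, `logits`) are illustrative, not from the record:

```python
import numpy as np

# Weight tying sketch: a single matrix W_E serves both as the input
# embedding (token id -> vector) and, transposed, as the output head
# (hidden vector -> vocabulary logits), halving embedding parameters.
rng = np.random.default_rng(0)
vocab_size, d_model = 50, 8
W_E = rng.normal(size=(vocab_size, d_model))

def embed(token_ids):
    return W_E[token_ids]               # (n,) -> (n, d_model)

def logits(hidden):
    return hidden @ W_E.T               # reuses W_E: no separate unembedding

h = embed(np.array([3, 7]))
out = logits(h)
```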
- depth recurrence: includes a 3-layer recurrence variant in the companion record lineage (layers: 3)
Regularization
- SimCTG (lambda: 0.3, margin: 0.4)
- QAHSP (lambda: 0.3)
- ES (lambda: 0.05)
- AOS (lambda: 0.005)
- HSU (lambda: 0.1)
- WBC (lambda: 0.005)
- WOP (lambda: 0.5)
- PCS (lambda: 0.005)
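Of the acronyms above, SimCTG is the published contrastive-token objective; a minimal sketch of its margin loss, using the lambda 0.3 and margin 0.4 from this record (the cross-entropy value and tensor shapes are illustrative):

```python
import numpy as np

def simctg_loss(hidden, margin=0.4):
    """SimCTG-style contrastive token loss (sketch): hinge-penalize pairs
    of distinct token representations whose cosine similarity is too high,
    pushing same-sequence token states apart (anti-anisotropy)."""
    h = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    sim = h @ h.T                               # pairwise cosine similarities
    n = sim.shape[0]
    # s(h_i, h_i) = 1 after normalization, so the hinge is margin - 1 + s_ij.
    hinge = np.maximum(0.0, margin - np.diag(sim)[:, None] + sim)
    off_diag = ~np.eye(n, dtype=bool)
    return hinge[off_diag].mean()

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))                    # 8 token states, 16 dims
total_loss = 2.0 + 0.3 * simctg_loss(h)         # 2.0 is a stand-in CE value
```

Identical token states give a loss of exactly `margin`, the worst case the hinge allows.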
Evaluation
- sliding window eval (stride: 64)
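Sliding-window evaluation with stride 64 typically scores only the last `stride` tokens of each window, so every scored token sees near-full left context. A sketch of the window schedule (`block_size` is an assumed value, not in the record):

```python
def sliding_windows(n_tokens, block_size=256, stride=64):
    """Return (start, end, n_scored) triples: each window spans up to
    block_size tokens of context, but only the final n_scored (at most
    `stride`) positions count toward the loss, so every token is scored
    exactly once."""
    windows, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - block_size)
        windows.append((start, end, end - pos))
        pos = end
    return windows

schedule = sliding_windows(1000)
```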
Test-Time Training
- full TTT (learning_rate: null)
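Full test-time training updates all parameters on the evaluation stream itself. The record leaves the learning rate null; the toy one-parameter model and the lr below are purely illustrative of the loop structure:

```python
class ToyModel:
    """Stand-in 'model' with one parameter: predicts a constant mu."""
    def __init__(self):
        self.mu = 0.0
    def loss(self, chunk):
        return sum((x - self.mu) ** 2 for x in chunk) / len(chunk)
    def step(self, chunk, lr):
        grad = sum(2 * (self.mu - x) for x in chunk) / len(chunk)
        self.mu -= lr * grad

def ttt_evaluate(model, chunks, lr=0.5):
    """Full TTT loop: score each chunk with the current weights, then take
    a gradient step on that same chunk before moving on, so later chunks
    are scored by an adapted model."""
    losses = []
    for chunk in chunks:
        losses.append(model.loss(chunk))
        model.step(chunk, lr)
    return losses

losses = ttt_evaluate(ToyModel(), [[1.0, 1.2], [1.1, 0.9], [1.0, 1.0]])
```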
Quantization
- GPTQ-lite (bits: 6, scope: all)
- AWQ-lite (bits: 4, scope: per-channel)
- int4/int6/int8 (bits: null, scope: per-tensor/per-row)
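The int4/int6/int8 cells use symmetric uniform quantization with either one scale for the whole tensor or one scale per row; a fake-quant (quantize-dequantize) sketch of both scopes:

```python
import numpy as np

def fake_quantize(w, bits=4, per_row=False):
    """Symmetric uniform quantize-dequantize. per_row=False -> one scale
    for the whole tensor (per-tensor); per_row=True -> one scale per
    output row, matching the per-tensor/per-row scopes above."""
    qmax = 2 ** (bits - 1) - 1
    axis = 1 if per_row else None
    amax = np.max(np.abs(w), axis=axis, keepdims=per_row)
    scale = np.where(amax == 0, 1.0, amax / qmax)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.linspace(-1.0, 1.0, 16).reshape(4, 4)
err8 = np.abs(fake_quantize(w, bits=8) - w).max()
err4 = np.abs(fake_quantize(w, bits=4) - w).max()
```

Per-row scales shrink the error whenever row magnitudes differ, at the cost of storing one scale per row instead of one per tensor.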
Compression
- lzma (level: null)
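Artifact compression with lzma needs only the Python standard library; the checkpoint dict and preset below are illustrative (the record leaves the level unspecified):

```python
import lzma
import pickle
import numpy as np

# Hypothetical checkpoint: serialize the weight dict, then lzma-compress it.
weights = {"wte": np.zeros((64, 32), dtype=np.float16)}
raw = pickle.dumps(weights)
packed = lzma.compress(raw, preset=9)       # preset chosen for illustration
restored = pickle.loads(lzma.decompress(packed))
ratio = len(packed) / len(raw)              # < 1.0 when compression helps
```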
LR Schedule
- cosine decay (parameters: null)
- warmdown (warmdown_frac: 0.85)
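Both schedules are a few lines each. How warmdown_frac = 0.85 is interpreted (here: the fraction of the run spent in the linear decay) is an assumption; the record gives only the number:

```python
import math

def cosine_decay_lr(step, total_steps, base_lr=1.0):
    """Cosine decay from base_lr at step 0 to 0 at total_steps."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_frac=0.85):
    """Constant LR, then a linear 'warmdown' to 0 over the final
    warmdown_frac of training."""
    warmdown_steps = int(total_steps * warmdown_frac)
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / max(1, warmdown_steps)
```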
Optimizer
- AdamW (weight_decay: null, momentum: null, beta2: 0.99, grad_clip_norm: 0.3)
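The non-default choices (beta2 = 0.99, grad_clip_norm = 0.3) slot into a standard AdamW update as follows; lr, beta1, eps, and weight_decay below are illustrative, since the record leaves weight_decay null:

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=3e-4, betas=(0.9, 0.99), eps=1e-8,
               weight_decay=0.0, grad_clip_norm=0.3):
    """One AdamW step with global-norm gradient clipping applied first.
    beta2=0.99 and grad_clip_norm=0.3 are from the record; the remaining
    hyperparameters are illustrative defaults."""
    norm = np.linalg.norm(g)
    if norm > grad_clip_norm:
        g = g * (grad_clip_norm / norm)       # clip gradient to norm 0.3
    b1, b2 = betas
    m = b1 * m + (1 - b1) * g                 # first-moment EMA
    v = b2 * v + (1 - b2) * g * g             # second-moment EMA
    m_hat = m / (1 - b1 ** t)                 # bias correction, step t >= 1
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: applied to p directly, not through the grads.
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

p = np.ones(3)
p1, m1, v1 = adamw_step(p, np.full(3, 10.0), np.zeros(3), np.zeros(3), t=1)
```

A lower beta2 (0.99 vs the usual 0.999) makes the second-moment estimate react faster to gradient-scale shifts, which pairs naturally with the tight 0.3 clip norm.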
Novel Contributions
- Cross-base study showing regularizer benefit does not transfer between two training bases.
- Measurement of a sign flip for QAHSP and ES between Base A and Base B.
- Pipeline-stage attribution showing that quantization cost is approximately regularizer-independent on Base A.
- Analysis of regularizer × quantization interactions across multiple quantization schemes.
- Study of real (not simulated) hidden-state quantization distortion on models trained on Base A.
- Mechanistic checks using SVD spectra, hidden-state trajectories, and CKA, showing that the regularizers leave a fingerprint upstream of quantization.
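Of the three mechanistic probes, linear CKA is the most compact to state; a sketch comparing two activation matrices (rows = examples, columns = features):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices X (n x d1) and Y (n x d2):
    invariant to orthogonal transforms and isotropic scaling, so it
    compares representational geometry rather than raw coordinates."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    dot = np.linalg.norm(Y.T @ X, "fro") ** 2
    return dot / (np.linalg.norm(X.T @ X, "fro") *
                  np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))    # random orthogonal matrix
```

Because CKA returns 1.0 for any rotation of the same representation, a CKA gap between regularized and baseline hidden states is evidence of a genuine geometric difference, not a coordinate change.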