PR #1924
closed
Record: PR #1855 base + Smear + LQER + LogitCalib + Phased TTT — val_bpb 1.06080 (3-seed)
by dexhunter
val_bpb
1.0608
Architecture
Transformer
Optimizer
—
Artifact Size
~15.80 MB
Training Techniques
Quantization
GPTQ
bits: null
scope: all
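GPTQ quantizes a weight matrix column by column, compensating each column's rounding error by updating the not-yet-quantized columns through the inverse Hessian of the layer inputs. A minimal sketch under simplifying assumptions (symmetric 4-bit grid as a stand-in, since `bits` is unspecified here; dampened Hessian; no blocking or Cholesky tricks):

```python
import numpy as np

def gptq_quantize(W, X, bits=4, damp=0.01):
    """Quantize W (out x in) column by column, GPTQ-style.

    X: calibration inputs (samples x in). The rounding error on each
    quantized column is propagated into the remaining columns via the
    inverse of the input Hessian H = X^T X.
    """
    W = W.astype(np.float64).copy()
    n = W.shape[1]
    H = X.T @ X
    H = H + damp * np.mean(np.diag(H)) * np.eye(n)   # dampening for stability
    Hinv = np.linalg.inv(H)
    # symmetric uniform grid, one scale per output row
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    Q = np.zeros_like(W)
    for j in range(n):
        q = np.clip(np.round(W[:, j:j+1] / scale), -qmax, qmax) * scale
        Q[:, j:j+1] = q
        err = (W[:, j:j+1] - q) / Hinv[j, j]
        # spread the rounding error onto the not-yet-quantized columns
        W[:, j+1:] -= err @ Hinv[j:j+1, j+1:]
    return Q
```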
Architecture
SmearGate
BOS-masked smear gate applied in the attention path.
parameters: null
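The exact SmearGate formulation isn't spelled out on this card; one common form (a sketch, not necessarily this PR's implementation) blends each key or value vector with its predecessor through a learned sigmoid gate, zeroing the blend at BOS positions so nothing smears across document boundaries:

```python
import numpy as np

def smear(k, gate_logit, bos_mask):
    """Smear vectors toward their predecessor: k'_t = k_t + g_t * k_{t-1}.

    k:          (seq, dim) key (or value) vectors
    gate_logit: (seq,) learned per-position gate logits (assumed shape)
    bos_mask:   (seq,) True where a document starts (BOS)
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))   # sigmoid gate in (0, 1)
    g = np.where(bos_mask, 0.0, g)          # never smear across a BOS boundary
    prev = np.vstack([np.zeros_like(k[:1]), k[:-1]])
    return k + g[:, None] * prev
```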
depth recurrence
Triple recurrence / NUM_LOOPS=2 depth recurrence inherited from the PR #1855 family.
parameters: {"num_loops":2}
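With NUM_LOOPS=2, the same block stack is applied twice per forward pass, trading parameter count for effective depth. A minimal sketch, with the block internals elided to stand-in callables:

```python
def recurrent_forward(x, blocks, num_loops=2):
    """Apply a shared stack of blocks num_loops times (depth recurrence)."""
    for _ in range(num_loops):
        for block in blocks:
            x = block(x)
    return x
```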
Gated Attention
Attention gating mechanism used in the lineage.
parameters: null
KV head count
Grouped-query style head configuration with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
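With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. A sketch of the score computation via KV-head repetition (single sequence, numpy):

```python
import numpy as np

def gqa_scores(q, k):
    """Grouped-query attention scores.

    q: (num_heads, seq, head_dim), k: (num_kv_heads, seq, head_dim)
    Each KV head serves num_heads // num_kv_heads query heads.
    """
    num_heads, num_kv = q.shape[0], k.shape[0]
    group = num_heads // num_kv
    k_rep = np.repeat(k, group, axis=0)          # (num_heads, seq, head_dim)
    return np.einsum('hqd,hkd->hqk', q, k_rep) / np.sqrt(q.shape[-1])
```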
RoPE
Rotary positional embeddings used in the model.
parameters: {"rope_base":10000,"rope_dims":16}
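With rope_dims=16, only the first 16 dimensions of each head are rotated and the rest pass through (partial RoPE). A sketch in the half-split style (the PR's exact pairing convention is an assumption):

```python
import numpy as np

def rope(x, base=10000, rope_dims=16):
    """Rotary embedding on the first rope_dims dims of x (seq, head_dim).

    Dims beyond rope_dims pass through unrotated (partial RoPE).
    """
    seq = x.shape[0]
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)         # (half,)
    ang = np.arange(seq)[:, None] * freqs[None, :]    # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, rope_dims:]], axis=1)
```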
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2500}
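The card gives only the phase count and the prefix-document budget; a hedged skeleton of what "score-first, phased" test-time training could look like, with the scoring and adaptation steps left as caller-supplied stubs (assumptions, not this PR's code):

```python
def phased_ttt(model, docs, score, adapt, phases=3, prefix_docs=2500):
    """Score-first phased TTT skeleton.

    Split the first prefix_docs documents into `phases` chunks; score each
    chunk with the current model before briefly adapting the model on it,
    so every document is scored prior to being trained on.
    """
    prefix = docs[:prefix_docs]
    chunk = max(1, len(prefix) // phases)
    losses = []
    for p in range(phases):
        part = prefix[p * chunk:(p + 1) * chunk]
        losses.extend(score(model, d) for d in part)   # score first...
        model = adapt(model, part)                     # ...then adapt on it
    return model, losses
```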
Evaluation
sliding window eval
parameters: {"stride":64,"eval_length":2048}
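With stride 64 and window 2048, each evaluation window re-reads mostly overlapping context but only its final 64 tokens count toward the loss, giving nearly full left context to every scored token. A sketch of the window bookkeeping:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) triples for sliding-window eval.

    Tokens in [start, score_from) are context only; tokens in
    [score_from, end) contribute to the reported loss.
    """
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        yield start, end, pos
        pos = end
```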
Regularization
logit softcap
parameters: {"logit_softcap":30}
logit calibration
parameters: {"scale":true,"categories":14}
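Per the novel-contributions list, the calibration is a static affine correction applied per token category (14 categories) after quantization. A sketch of the apply step only; the category assignment and the fitting on the first 100K train tokens are not shown, and the bias term is an assumption beyond the card's confirmed scale:

```python
import numpy as np

def calibrate_logits(logits, token_cat, scale, bias):
    """Static affine per-category logit correction.

    logits:       (vocab,) raw logits after quantization
    token_cat:    (vocab,) int category id in [0, 14) per vocab token
    scale, bias:  (14,) affine parameters fitted once on train tokens
    """
    return logits * scale[token_cat] + bias[token_cat]
```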
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Novel Contributions
- Logit calibration: a static affine per-token-category correction fitted on the first 100K train tokens after GPTQ.
- Combination of SmearGate, LQER asymmetric rank-4 correction, and phased TTT on the PR #1855 family.
- Train-tokens-only post-quantization calibration that preserves the full 8192-vocab softmax.
- Record-setting 3-seed mean validation BPB of 1.06080088.
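The LQER correction mentioned above approximates the quantization error W - Q(W) with a low-rank factorization (rank 4 here) that is added back at inference. A sketch via plain truncated SVD; the PR's "asymmetric" error weighting is not reproduced:

```python
import numpy as np

def lqer_correction(W, Q, rank=4):
    """Rank-`rank` SVD approximation of the quantization error W - Q.

    Returns factors A (out x rank) and B (rank x in) so the corrected
    effective weight is Q + A @ B.
    """
    U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
    A = U[:, :rank] * S[:rank]
    B = Vt[:rank]
    return A, B
```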