PR #1924
closed
Record: PR #1855 base + Smear + LQER + LogitCalib + Phased TTT — val_bpb 1.06080 (3-seed)
by dexhunter
val_bpb
1.0608
Architecture
Transformer
Optimizer
—
Artifact Size
~15.80 MB
Training Techniques
Quantization
GPTQ
bits: null
scope: all
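GPTQ quantizes a weight matrix column by column, compensating each column's rounding error by updating the not-yet-quantized columns through the inverse Hessian of the layer inputs. A minimal sketch under simplifying assumptions (symmetric 4-bit grid as a stand-in, since `bits` is unspecified here; dampened Hessian; no blocking or Cholesky tricks):

```python
import numpy as np

def gptq_quantize(W, X, bits=4, damp=0.01):
    """Quantize W (out x in) column by column, GPTQ-style.

    X: calibration inputs (samples x in). The rounding error on each
    quantized column is propagated into the remaining columns via the
    inverse of the input Hessian H = X^T X.
    """
    W = W.astype(np.float64).copy()
    n = W.shape[1]
    H = X.T @ X
    H = H + damp * np.mean(np.diag(H)) * np.eye(n)   # dampening for stability
    Hinv = np.linalg.inv(H)
    # symmetric uniform grid, one scale per output row
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    Q = np.zeros_like(W)
    for j in range(n):
        q = np.clip(np.round(W[:, j:j+1] / scale), -qmax, qmax) * scale
        Q[:, j:j+1] = q
        err = (W[:, j:j+1] - q) / Hinv[j, j]
        # spread the rounding error onto the not-yet-quantized columns
        W[:, j+1:] -= err @ Hinv[j:j+1, j+1:]
    return Q
```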
Architecture
SmearGate
BOS-masked smear gate applied in the attention path.
parameters: null
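The exact SmearGate formulation isn't spelled out on this card; one common form (a sketch, not necessarily this PR's implementation) blends each key or value vector with its predecessor through a learned sigmoid gate, zeroing the blend at BOS positions so nothing smears across document boundaries:

```python
import numpy as np

def smear(k, gate_logit, bos_mask):
    """Smear vectors toward their predecessor: k'_t = k_t + g_t * k_{t-1}.

    k:          (seq, dim) key (or value) vectors
    gate_logit: (seq,) learned per-position gate logits (assumed shape)
    bos_mask:   (seq,) True where a document starts (BOS)
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))   # sigmoid gate in (0, 1)
    g = np.where(bos_mask, 0.0, g)          # never smear across a BOS boundary
    prev = np.vstack([np.zeros_like(k[:1]), k[:-1]])
    return k + g[:, None] * prev
```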
depth recurrence
Triple recurrence / NUM_LOOPS=2 depth recurrence inherited from the PR #1855 family.
parameters: {"num_loops":2}
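With NUM_LOOPS=2, the same block stack is applied twice per forward pass, trading parameter count for effective depth. A minimal sketch, with the block internals elided to stand-in callables:

```python
def recurrent_forward(x, blocks, num_loops=2):
    """Apply a shared stack of blocks num_loops times (depth recurrence)."""
    for _ in range(num_loops):
        for block in blocks:
            x = block(x)
    return x
```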
Gated Attention
Attention gating mechanism used in the lineage.
parameters: null
KV head count
Grouped-query style head configuration with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
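With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. A sketch of the score computation via KV-head repetition (single sequence, numpy):

```python
import numpy as np

def gqa_scores(q, k):
    """Grouped-query attention scores.

    q: (num_heads, seq, head_dim), k: (num_kv_heads, seq, head_dim)
    Each KV head serves num_heads // num_kv_heads query heads.
    """
    num_heads, num_kv = q.shape[0], k.shape[0]
    group = num_heads // num_kv
    k_rep = np.repeat(k, group, axis=0)          # (num_heads, seq, head_dim)
    return np.einsum('hqd,hkd->hqk', q, k_rep) / np.sqrt(q.shape[-1])
```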
RoPE
Rotary positional embeddings used in the model.
parameters: {"rope_base":10000,"rope_dims":16}
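With rope_dims=16, only the first 16 dimensions of each head are rotated and the rest pass through (partial RoPE). A sketch in the half-split style (the PR's exact pairing convention is an assumption):

```python
import numpy as np

def rope(x, base=10000, rope_dims=16):
    """Rotary embedding on the first rope_dims dims of x (seq, head_dim).

    Dims beyond rope_dims pass through unrotated (partial RoPE).
    """
    seq = x.shape[0]
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)         # (half,)
    ang = np.arange(seq)[:, None] * freqs[None, :]    # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, rope_dims:]], axis=1)
```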
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2500}
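The card gives only the phase count and the prefix-document budget; a hedged skeleton of what "score-first, phased" test-time training could look like, with the scoring and adaptation steps left as caller-supplied stubs (assumptions, not this PR's code):

```python
def phased_ttt(model, docs, score, adapt, phases=3, prefix_docs=2500):
    """Score-first phased TTT skeleton.

    Split the first prefix_docs documents into `phases` chunks; score each
    chunk with the current model before briefly adapting the model on it,
    so every document is scored prior to being trained on.
    """
    prefix = docs[:prefix_docs]
    chunk = max(1, len(prefix) // phases)
    losses = []
    for p in range(phases):
        part = prefix[p * chunk:(p + 1) * chunk]
        losses.extend(score(model, d) for d in part)   # score first...
        model = adapt(model, part)                     # ...then adapt on it
    return model, losses
```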
Evaluation
sliding window eval
parameters: {"stride":64,"eval_length":2048}
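With stride 64 and window 2048, each evaluation window re-reads mostly overlapping context but only its final 64 tokens count toward the loss, giving nearly full left context to every scored token. A sketch of the window bookkeeping:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) triples for sliding-window eval.

    Tokens in [start, score_from) are context only; tokens in
    [score_from, end) contribute to the reported loss.
    """
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        yield start, end, pos
        pos = end
```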
Regularization
logit softcap
parameters: {"logit_softcap":30}
logit calibration
parameters: {"scale":true,"categories":14}
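Per the novel-contributions list, the calibration is a static affine correction applied per token category (14 categories) after quantization. A sketch of the apply step only; the category assignment and the fitting on the first 100K train tokens are not shown, and the bias term is an assumption beyond the card's confirmed scale:

```python
import numpy as np

def calibrate_logits(logits, token_cat, scale, bias):
    """Static affine per-category logit correction.

    logits:       (vocab,) raw logits after quantization
    token_cat:    (vocab,) int category id in [0, 14) per vocab token
    scale, bias:  (14,) affine parameters fitted once on train tokens
    """
    return logits * scale[token_cat] + bias[token_cat]
```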
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Novel Contributions
- Logit calibration: a static affine per-token-category correction fitted on the first 100K train tokens after GPTQ.
- Combination of SmearGate, LQER asymmetric rank-4 correction, and phased TTT on the PR #1855 family.
- Train-tokens-only post-quantization calibration that preserves the full 8192-vocab softmax.
- Record-setting 3-seed mean validation BPB of 1.06080088.
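The LQER correction mentioned above approximates the quantization error W - Q(W) with a low-rank factorization (rank 4 here) that is added back at inference. A sketch via plain truncated SVD; the PR's "asymmetric" error weighting is not reproduced:

```python
import numpy as np

def lqer_correction(W, Q, rank=4):
    """Rank-`rank` SVD approximation of the quantization error W - Q.

    Returns factors A (out x rank) and B (rank x in) so the corrected
    effective weight is Q + A @ B.
    """
    U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
    A = U[:, :rank] * S[:rank]
    B = Vt[:rank]
    return A, B
```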