PR #1923

Status: open

Record: SP8192 #1855 Base + Asymmetric Logit Rescale — val_bpb 1.06577 (3-seed mean)

by jorge-asenjo
val_bpb: 1.0658
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15,903,139 B avg

Training Techniques

Architecture
SmearGate
BOS-fixed smear gate used in the hidden forward and TTT paths to prevent cross-document leakage.
parameters: null
Gated Attention
Sparse/gated attention path applied to attention outputs.
parameters: {"window":12}
Weight Tying
SP-8192 tokenizer stack includes tied/consistent embedding handling as part of the inherited baseline.
parameters: null
Quantization
mixed int4
bits: 4
scope: LQER asymmetric weights
mixed int7
bits: 7
scope: embeddings
Test-Time Training
score-first TTT
parameters: {"phased":true,"prefix_docs":2500,"num_phases":3}
Regularization
logit softcap
parameters: {"asymmetric":true,"softcap_pos_init":30,"softcap_neg_init":30}
Optimizer
SGD
weight_decay: 0.5
momentum: 0.9
other_params: {"global_ttt_momentum":0.9,"beta2":0.99,"ttt_beta2":0.99}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85,"warmup_steps":20}
Compression
lrzip
level: null
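As a rough illustration of the warmdown schedule listed above (warmdown_frac 0.85, warmup_steps 20), here is a minimal sketch. The exact curve is an assumption — linear warmup, a constant plateau, then a linear decay to zero over the final 85% of steps — and the function name is illustrative:

```python
def lr_multiplier(step, total_steps, warmup_steps=20, warmdown_frac=0.85):
    """Assumed warmup → plateau → linear-warmdown LR multiplier in [0, 1]."""
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmup_steps:
        # linear warmup from 1/warmup_steps up to 1.0
        return (step + 1) / warmup_steps
    if step < warmdown_start:
        return 1.0
    # linear decay from 1.0 to 0.0 across the warmdown window
    return (total_steps - step) / (total_steps - warmdown_start)
```

With these defaults and 1000 total steps, the multiplier ramps up over the first 20 steps, holds at 1.0 until step 150, then decays linearly over the remaining 850 steps.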

Novel Contributions

  • Asymmetric logit rescale on the eval path using separate positive and negative softcaps.
  • Two learnable scalars, softcap_pos and softcap_neg, replace a single logit_softcap in forward_logits and forward_ttt.
  • Starting both softcaps at 30.0 makes the eval path initially identical to the baseline's single symmetric softcap.
  • Training/fused softcapped CE path is left unchanged, preserving train-time numerics from the #1855 baseline.
  • The new scalars are serialized in the passthrough float16 list with only 8 bytes of artifact cost.
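The asymmetric rescale above can be sketched as follows. This is a minimal illustration, not the record's actual code: the function name is hypothetical, and it applies the cap elementwise to a single logit. With softcap_pos == softcap_neg == 30.0 it reduces to the usual single softcap, cap * tanh(x / cap):

```python
import math

def asymmetric_softcap(logit, softcap_pos=30.0, softcap_neg=30.0):
    """Cap positive logits with softcap_pos and negative logits with
    softcap_neg; each side saturates smoothly at its own cap."""
    cap = softcap_pos if logit >= 0 else softcap_neg
    return cap * math.tanh(logit / cap)
```

Because both scalars are learnable, the eval path can drift away from the symmetric baseline during training while starting from exactly the baseline's behavior.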