PR #1923

Status: open

Record: SP8192 #1855 Base + Asymmetric Logit Rescale — val_bpb 1.06577 (3-seed mean)

by jorge-asenjo
val_bpb: 1.0658
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15,903,139 B avg

Training Techniques

Architecture
SmearGate
BOS-fixed smear gate used in the hidden forward and TTT paths to prevent cross-document leakage.
parameters: null
Gated Attention
Sparse/gated attention path applied to attention outputs.
parameters: {"window":12}
Weight Tying
SP-8192 tokenizer stack includes tied/consistent embedding handling as part of the inherited baseline.
parameters: null
Quantization
mixed int4
bits: 4
scope: LQER asymmetric weights
mixed int7
bits: 7
scope: embeddings
Test-Time Training
score-first TTT
parameters: {"phased":true,"prefix_docs":2500,"num_phases":3}
Regularization
logit softcap
parameters: {"asymmetric":true,"softcap_pos_init":30,"softcap_neg_init":30}
Optimizer
SGD
weight_decay: 0.5
momentum: 0.9
other_params: {"global_ttt_momentum":0.9,"beta2":0.99,"ttt_beta2":0.99}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85,"warmup_steps":20}
Compression
lrzip
level: null
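As a rough illustration of the warmdown schedule listed above (warmdown_frac 0.85, warmup_steps 20), here is a minimal sketch. The exact curve is an assumption — linear warmup, a constant plateau, then a linear decay to zero over the final 85% of steps — and the function name is illustrative:

```python
def lr_multiplier(step, total_steps, warmup_steps=20, warmdown_frac=0.85):
    """Assumed warmup → plateau → linear-warmdown LR multiplier in [0, 1]."""
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmup_steps:
        # linear warmup from 1/warmup_steps up to 1.0
        return (step + 1) / warmup_steps
    if step < warmdown_start:
        return 1.0
    # linear decay from 1.0 to 0.0 across the warmdown window
    return (total_steps - step) / (total_steps - warmdown_start)
```

With these defaults and 1000 total steps, the multiplier ramps up over the first 20 steps, holds at 1.0 until step 150, then decays linearly over the remaining 850 steps.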

Novel Contributions

  • Asymmetric logit rescale on the eval path using separate positive and negative softcaps.
  • Two learnable scalars, softcap_pos and softcap_neg, replace a single logit_softcap in forward_logits and forward_ttt.
  • Starting both softcaps at 30.0 makes the eval path initially identical to the baseline's single symmetric softcap.
  • Training/fused softcapped CE path is left unchanged, preserving train-time numerics from the #1855 baseline.
  • The new scalars are serialized in the passthrough float16 list with only 8 bytes of artifact cost.
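The asymmetric rescale above can be sketched as follows. This is a minimal illustration, not the record's actual code: the function name is hypothetical, and it applies the cap elementwise to a single logit. With softcap_pos == softcap_neg == 30.0 it reduces to the usual single softcap, cap * tanh(x / cap):

```python
import math

def asymmetric_softcap(logit, softcap_pos=30.0, softcap_neg=30.0):
    """Cap positive logits with softcap_pos and negative logits with
    softcap_neg; each side saturates smoothly at its own cap."""
    cap = softcap_pos if logit >= 0 else softcap_neg
    return cap * math.tanh(logit / cap)
```

Because both scalars are learnable, the eval path can drift away from the symmetric baseline during training while starting from exactly the baseline's behavior.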