PR #2080

open

Record: BIJEPAX-lite JEPA + SP8192 CaseOps PPM — val_bpb 0.97271

by NewyorkDevView on GitHub
val_bpb: 0.9727
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,999,539 bytes

Training Techniques

Architecture
SmearGate
BOS masking and attention-output gating for cross-boundary safety in packed-document sequences.
parameters: null
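A minimal sketch of the packed-document masking idea (not the submission's actual SmearGate code; `packed_doc_mask` and its arguments are illustrative): each token attends only causally and only within its own document segment, with segments delimited by BOS positions.

```python
import numpy as np

def packed_doc_mask(bos_positions, seq_len):
    """Boolean attention mask for a packed sequence: position i may
    attend to position j only if j <= i (causal) and both positions
    lie in the same document segment (no cross-boundary attention).
    bos_positions: sorted indices where a new document (BOS) starts."""
    # doc_id[i] = index of the document segment containing position i
    doc_id = np.searchsorted(np.asarray(bos_positions),
                             np.arange(seq_len), side="right")
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    same_doc = doc_id[:, None] == doc_id[None, :]
    return causal & same_doc

mask = packed_doc_mask([0, 3], 6)  # two packed documents: bytes 0-2 and 3-5
```

Within a document the mask is ordinary causal attention; across a boundary it is all False, which is the "cross-boundary safety" property named above.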
depth recurrence
The SP8192 CaseOps + depth-recurrence lineage is inherited from the base stack.
parameters: null
weight tying
The inherited SP8192 tokenizer and compact-GPT lineage include tied/shared embedding components.
parameters: null
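Tied embeddings reuse a single vocab-by-dim table for both the input lookup and the output projection, which matters under a hard artifact-size cap. A hedged sketch (the toy width and initialization scale are assumptions; only the 8192 vocab comes from the record):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 8192, 64                          # SP8192-sized vocab, toy width
E = rng.standard_normal((vocab, dim)) * 0.02   # one shared table

def embed(token_ids):
    return E[token_ids]          # input embedding: rows of E

def output_logits(hidden):
    return hidden @ E.T          # output projection: the same E, transposed

h = embed(np.array([1, 2, 3]))
logits = output_logits(h)
```

Tying removes a second vocab-by-dim matrix from the serialized model, one of the standard tricks for staying under a byte budget like the 16,000,000-byte cap here.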
Evaluation
sliding window eval
parameters: {"stride":64}
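The stride-64 sliding evaluation can be pictured as overlapping context windows in which only the newest bytes of each window are scored, so every byte is scored exactly once with long left context. A generic sketch (the window size of 256 and the helper name are assumptions; only `stride=64` comes from the record):

```python
def sliding_window_spans(n, window=256, stride=64):
    """Return (ctx_start, score_lo, score_hi) triples: each byte in
    [score_lo, score_hi) is scored exactly once, conditioned on
    context starting at ctx_start; windows advance by `stride`."""
    spans, prev_end, begin = [], 0, 0
    while prev_end < n:
        end = min(begin + window, n)
        spans.append((begin, prev_end, end))
        prev_end = end
        begin += stride
    return spans

spans = sliding_window_spans(1024)
```

The first window scores its whole span; each later window scores only the `stride` new bytes, trading extra forward passes for near-maximal context per scored byte.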
Test-Time Training
TTT
parameters: {"enabled":false}
Regularization
layerwise LN scale
parameters: {"predictor_heads":true}
Other
other
Training-only bidirectional hop-4 hidden-state prediction auxiliary objective with cosine loss and separate predictor heads, removed from the serialized artifact.
parameters: {"fwd_hops":4,"bwd_hops":4,"weight":0.01,"start_frac":0.35,"end_frac":0.8}
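A hedged numpy sketch of the auxiliary objective as described: predictor heads (plain linear maps here; the record says they are LayerNorm-stabilized, which is omitted) regress the hidden state 4 positions ahead and 4 positions behind under a cosine loss, scaled by 0.01 and active only for the 0.35–0.8 fraction of training. Function and variable names are illustrative.

```python
import numpy as np

FWD_HOPS, BWD_HOPS = 4, 4
WEIGHT, START_FRAC, END_FRAC = 0.01, 0.35, 0.8

def cosine_loss(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) over positions."""
    num = (pred * target).sum(-1)
    den = (np.linalg.norm(pred, axis=-1)
           * np.linalg.norm(target, axis=-1) + eps)
    return float(np.mean(1.0 - num / den))

def bijepax_aux_loss(hidden, W_fwd, W_bwd, step, total_steps):
    """Training-only regularizer: a forward head predicts h[t+4] from
    h[t], a backward head predicts h[t-4] from h[t]; it returns zero
    outside the scheduled window and is never serialized."""
    frac = step / total_steps
    if not (START_FRAC <= frac < END_FRAC):
        return 0.0
    fwd = cosine_loss(hidden[:-FWD_HOPS] @ W_fwd, hidden[FWD_HOPS:])
    bwd = cosine_loss(hidden[BWD_HOPS:] @ W_bwd, hidden[:-BWD_HOPS])
    return WEIGHT * (fwd + bwd)
```

In a JEPA-style setup the targets would typically be stop-gradient copies of the hidden states so the auxiliary loss only trains the heads and the representations feeding them; that detail is not stated in the record and is an assumption here.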
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"separate_optimizer_for_auxiliary_module":true,"aux_lr":0.001}
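The record specifies AdamW with a separate optimizer instance for the auxiliary module at lr 0.001. A minimal hand-rolled sketch of the two-instance setup (the base lr of 3e-4 and the betas are assumptions, not from the record):

```python
import numpy as np

class AdamW:
    """Minimal AdamW step (decoupled weight decay), for illustration."""
    def __init__(self, lr, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.0):
        self.lr, (self.b1, self.b2) = lr, betas
        self.eps, self.wd = eps, weight_decay
        self.t, self.m, self.v = 0, None, None

    def step(self, p, g):
        if self.m is None:
            self.m, self.v = np.zeros_like(p), np.zeros_like(p)
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        # decoupled weight decay: applied to p directly, not folded into g
        return p - self.lr * (m_hat / (np.sqrt(v_hat) + self.eps)
                              + self.wd * p)

base_opt = AdamW(lr=3e-4)   # base model parameters (lr assumed)
aux_opt = AdamW(lr=1e-3)    # predictor heads, per other_params["aux_lr"]
```

Keeping the auxiliary heads on their own optimizer means their moment buffers and learning rate never touch the base model's optimizer state, so the heads can be dropped cleanly before serialization.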

Novel Contributions

  • BIJEPAX-lite training-only auxiliary regularizer on top of the SP8192 CaseOps + per-group compression + PPM sliding stack
  • Bidirectional hop-4 hidden-state prediction objective with cosine embedding-space loss
  • LayerNorm-stabilized predictor heads trained only during a middle portion of the schedule
  • Predictor heads are not serialized; final scoring uses the quantized base model with causal PPM sliding evaluation
  • Three-seed submission package with all runs within the 16,000,000-byte artifact cap and the 600 s runtime cap