PR #2080

open

Record: BIJEPAX-lite JEPA + SP8192 CaseOps PPM — val_bpb 0.97271

by NewyorkDevView on GitHub
val_bpb: 0.9727
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,999,539 bytes

Training Techniques

Architecture
SmearGate
BOS masking and attention-output gating for cross-boundary safety in packed-document sequences.
parameters: null
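A minimal sketch of the packed-document masking idea (not the submission's actual SmearGate code; `packed_doc_mask` and its arguments are illustrative): each token attends only causally and only within its own document segment, with segments delimited by BOS positions.

```python
import numpy as np

def packed_doc_mask(bos_positions, seq_len):
    """Boolean attention mask for a packed sequence: position i may
    attend to position j only if j <= i (causal) and both positions
    lie in the same document segment (no cross-boundary attention).
    bos_positions: sorted indices where a new document (BOS) starts."""
    # doc_id[i] = index of the document segment containing position i
    doc_id = np.searchsorted(np.asarray(bos_positions),
                             np.arange(seq_len), side="right")
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    same_doc = doc_id[:, None] == doc_id[None, :]
    return causal & same_doc

mask = packed_doc_mask([0, 3], 6)  # two packed documents: bytes 0-2 and 3-5
```

Within a document the mask is ordinary causal attention; across a boundary it is all False, which is the "cross-boundary safety" property named above.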
depth recurrence
The SP8192 CaseOps + depth-recurrence lineage is inherited from the base stack.
parameters: null
weight tying
The inherited SP8192 tokenizer and compact-GPT lineage include tied/shared embedding components.
parameters: null
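Tied embeddings reuse a single vocab-by-dim table for both the input lookup and the output projection, which matters under a hard artifact-size cap. A hedged sketch (the toy width and initialization scale are assumptions; only the 8192 vocab comes from the record):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 8192, 64                          # SP8192-sized vocab, toy width
E = rng.standard_normal((vocab, dim)) * 0.02   # one shared table

def embed(token_ids):
    return E[token_ids]          # input embedding: rows of E

def output_logits(hidden):
    return hidden @ E.T          # output projection: the same E, transposed

h = embed(np.array([1, 2, 3]))
logits = output_logits(h)
```

Tying removes a second vocab-by-dim matrix from the serialized model, one of the standard tricks for staying under a byte budget like the 16,000,000-byte cap here.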
Evaluation
sliding window eval
parameters: {"stride":64}
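The stride-64 sliding evaluation can be pictured as overlapping context windows in which only the newest bytes of each window are scored, so every byte is scored exactly once with long left context. A generic sketch (the window size of 256 and the helper name are assumptions; only `stride=64` comes from the record):

```python
def sliding_window_spans(n, window=256, stride=64):
    """Return (ctx_start, score_lo, score_hi) triples: each byte in
    [score_lo, score_hi) is scored exactly once, conditioned on
    context starting at ctx_start; windows advance by `stride`."""
    spans, prev_end, begin = [], 0, 0
    while prev_end < n:
        end = min(begin + window, n)
        spans.append((begin, prev_end, end))
        prev_end = end
        begin += stride
    return spans

spans = sliding_window_spans(1024)
```

The first window scores its whole span; each later window scores only the `stride` new bytes, trading extra forward passes for near-maximal context per scored byte.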
Test-Time Training
TTT
parameters: {"enabled":false}
Regularization
layerwise LN scale
parameters: {"predictor_heads":true}
Other
other
Training-only bidirectional hop-4 hidden-state prediction auxiliary objective with cosine loss and separate predictor heads, removed from the serialized artifact.
parameters: {"fwd_hops":4,"bwd_hops":4,"weight":0.01,"start_frac":0.35,"end_frac":0.8}
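A hedged numpy sketch of the auxiliary objective as described: predictor heads (plain linear maps here; the record says they are LayerNorm-stabilized, which is omitted) regress the hidden state 4 positions ahead and 4 positions behind under a cosine loss, scaled by 0.01 and active only for the 0.35–0.8 fraction of training. Function and variable names are illustrative.

```python
import numpy as np

FWD_HOPS, BWD_HOPS = 4, 4
WEIGHT, START_FRAC, END_FRAC = 0.01, 0.35, 0.8

def cosine_loss(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) over positions."""
    num = (pred * target).sum(-1)
    den = (np.linalg.norm(pred, axis=-1)
           * np.linalg.norm(target, axis=-1) + eps)
    return float(np.mean(1.0 - num / den))

def bijepax_aux_loss(hidden, W_fwd, W_bwd, step, total_steps):
    """Training-only regularizer: a forward head predicts h[t+4] from
    h[t], a backward head predicts h[t-4] from h[t]; it returns zero
    outside the scheduled window and is never serialized."""
    frac = step / total_steps
    if not (START_FRAC <= frac < END_FRAC):
        return 0.0
    fwd = cosine_loss(hidden[:-FWD_HOPS] @ W_fwd, hidden[FWD_HOPS:])
    bwd = cosine_loss(hidden[BWD_HOPS:] @ W_bwd, hidden[:-BWD_HOPS])
    return WEIGHT * (fwd + bwd)
```

In a JEPA-style setup the targets would typically be stop-gradient copies of the hidden states so the auxiliary loss only trains the heads and the representations feeding them; that detail is not stated in the record and is an assumption here.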
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"separate_optimizer_for_auxiliary_module":true,"aux_lr":0.001}
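The record specifies AdamW with a separate optimizer instance for the auxiliary module at lr 0.001. A minimal hand-rolled sketch of the two-instance setup (the base lr of 3e-4 and the betas are assumptions, not from the record):

```python
import numpy as np

class AdamW:
    """Minimal AdamW step (decoupled weight decay), for illustration."""
    def __init__(self, lr, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.0):
        self.lr, (self.b1, self.b2) = lr, betas
        self.eps, self.wd = eps, weight_decay
        self.t, self.m, self.v = 0, None, None

    def step(self, p, g):
        if self.m is None:
            self.m, self.v = np.zeros_like(p), np.zeros_like(p)
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        # decoupled weight decay: applied to p directly, not folded into g
        return p - self.lr * (m_hat / (np.sqrt(v_hat) + self.eps)
                              + self.wd * p)

base_opt = AdamW(lr=3e-4)   # base model parameters (lr assumed)
aux_opt = AdamW(lr=1e-3)    # predictor heads, per other_params["aux_lr"]
```

Keeping the auxiliary heads on their own optimizer means their moment buffers and learning rate never touch the base model's optimizer state, so the heads can be dropped cleanly before serialization.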

Novel Contributions

  • BIJEPAX-lite training-only auxiliary regularizer on top of the SP8192 CaseOps + per-group compression + PPM sliding stack
  • Bidirectional hop-4 hidden-state prediction objective with cosine embedding-space loss
  • LayerNorm-stabilized predictor heads trained only during a middle portion of the schedule
  • Predictor heads are not serialized; final scoring uses the quantized base model with causal PPM sliding evaluation
  • Three-seed submission package with all runs within the 16,000,000-byte artifact cap and the 600 s runtime cap