PR #2080
openRecord: BIJEPAX-lite JEPA + SP8192 CaseOps PPM — val_bpb 0.97271
by NewyorkDevView on GitHub
val_bpb: 0.9727
Architecture: Transformer
Optimizer: —
Artifact Size: 15,999,539 bytes
Training Techniques
Architecture
SmearGate
BOS masking / attention output gating that prevents attention from crossing document boundaries in packed sequences (a mask sketch follows this subsection).
parameters: null
depth recurrence
The SP8192 CaseOps + depth-recurrence lineage is inherited from the base stack.
parameters: null
weight tying
The inherited SP8192 tokenizer and compact GPT lineage include tied/shared input/output embedding components.
parameters: null
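A minimal sketch of the packed-document BOS-masking idea, assuming each packed document starts with a BOS token; the function name, `bos_id`, and the mask construction are illustrative, and SmearGate's attention output gating itself is not reproduced here:

```python
import torch

def packed_causal_mask(tokens: torch.Tensor, bos_id: int) -> torch.Tensor:
    """tokens: (T,) packed token ids, each document starting with BOS.
    Returns a (T, T) boolean mask where True means position i may attend
    to position j: causal AND within the same document."""
    T = tokens.shape[0]
    doc_id = (tokens == bos_id).cumsum(dim=0)              # document index per position
    same_doc = doc_id.unsqueeze(1) == doc_id.unsqueeze(0)  # (T, T) same-document pairs
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=tokens.device))
    return causal & same_doc

# Usage: pass as attn_mask (True = attend) to F.scaled_dot_product_attention.
toks = torch.tensor([1, 5, 6, 1, 7, 8, 9])  # bos_id=1 -> two packed documents
mask = packed_causal_mask(toks, bos_id=1)
```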
Evaluation
sliding window eval
parameters: {"stride":64}
Test-Time Training
TTT
parameters: {"enabled":false}
Regularization
layerwise LN scale
LayerNorm scaling applied to the auxiliary predictor heads for training stability.
parameters: {"predictor_heads":true}
Other
other
Training-only bidirectional hop-4 hidden-state prediction auxiliary objective with a cosine loss and separate predictor heads; the predictor heads are removed from the serialized artifact.
parameters: {"fwd_hops":4,"bwd_hops":4,"weight":0.01,"start_frac":0.35,"end_frac":0.8}
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"separate_optimizer_for_auxiliary_module":true,"aux_lr":0.001}
Novel Contributions
- BIJEPAX-lite training-only auxiliary regularizer on top of the SP8192 CaseOps + per-group compression + PPM sliding stack
- Bidirectional hop-4 hidden-state prediction objective with cosine embedding-space loss
- LayerNorm-stabilized predictor heads trained only during a middle portion of the schedule
- Predictor heads are not serialized; final scoring uses the quantized base model with causal PPM sliding evaluation
- Three-seed submission package, with all runs under the 16,000,000-byte artifact cap and the 600 s runtime cap