PR #1922 (open)

JEPA Implementation Path: Add Non-Record 10-Minute SP8192 BPE Submission with Self-Contained Data Setup

by divagr18
val_bpb: 1.1875
Architecture: Transformer
Optimizer:
Artifact Size: 15,281,273 bytes

Training Techniques

Architecture
  • weight tying: tied input and output embeddings (parameters: null)
  • GQA: grouped-query attention with fewer KV heads than query heads (parameters: {"num_heads":8,"num_kv_heads":4})
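A minimal sketch of the grouped-query attention head mapping implied by the listed parameters (8 query heads, 4 KV heads), where each KV head is shared by `num_heads // num_kv_heads` query heads. Function and variable names here are illustrative, not taken from the PR:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: q has more heads than k/v; each KV head
    serves a contiguous group of query heads.
    q: (num_heads, T, d), k and v: (num_kv_heads, T, d)."""
    num_heads, T, d = q.shape
    num_kv_heads = k.shape[0]
    group = num_heads // num_kv_heads  # query heads per KV head (8 // 4 = 2)
    # Repeat each KV head so shapes line up with the query heads.
    k_rep = np.repeat(k, group, axis=0)  # (num_heads, T, d)
    v_rep = np.repeat(v, group, axis=0)
    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)  # (num_heads, T, T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ v_rep                              # (num_heads, T, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))
k = rng.standard_normal((4, 4, 16))
v = rng.standard_normal((4, 4, 16))
out = gqa_attention(q, k, v)
print(out.shape)  # (8, 4, 16)
```

Halving the KV heads halves the KV-cache size while keeping the full query head count, which is the usual motivation for GQA.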
Sequence Length
  • sequence_length: train_length 1024, eval_length null
Evaluation
  • sliding window eval (parameters: {"stride_frac":0.5})
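One common way to realize sliding-window evaluation with `stride_frac=0.5`: slide a context window by half its length and score each token exactly once, in the window where it has the most context. This is a hedged sketch of the general technique, not the PR's exact evaluation loop:

```python
def sliding_window_spans(seq_len, window, stride_frac=0.5):
    """Return (begin, end, score_from) spans: the model sees tokens
    [begin:end] as context, but only tokens [score_from:end] contribute
    to the loss, so overlapped tokens are never double-counted."""
    stride = max(1, int(window * stride_frac))
    spans, prev_end = [], 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        spans.append((begin, end, prev_end))  # score only unseen tokens
        prev_end = end
        if end == seq_len:
            break
    return spans

spans = sliding_window_spans(seq_len=10, window=4, stride_frac=0.5)
print(spans)  # [(0, 4, 0), (2, 6, 4), (4, 8, 6), (6, 10, 8)]
```

Every token is scored exactly once (the scored span lengths sum to `seq_len`), while later tokens get up to a half-window of extra left context compared with non-overlapping chunking.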
Quantization
  • GPTQ: bits 8, scope all
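For context, an 8-bit quantization round-trip of the kind the artifact relies on can be sketched as plain symmetric per-channel int8; note that GPTQ itself additionally compensates rounding error with second-order weight statistics, which this simplified sketch omits:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-output-channel int8 quantization (plain round-to-nearest;
    GPTQ's error compensation is intentionally omitted here)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print(q.dtype)  # int8
```

The round-trip error per element is bounded by half the channel's scale, which is what makes an int8 artifact "roundtrip-compatible" with a float checkpoint up to quantization noise.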
Compression
  • zstd (level: null)
Weight Averaging
  • EMA (parameters: null)
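The EMA update itself is a one-liner per parameter; the decay value below is illustrative, since the PR lists no parameters:

```python
def ema_update(avg, params, decay=0.999):
    """One EMA step: avg <- decay * avg + (1 - decay) * params.
    decay=0.999 is an assumed value, not taken from the PR."""
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}

avg = {"w": 1.0}
params = {"w": 0.0}
avg = ema_update(avg, params, decay=0.9)
print(avg["w"])  # 0.9
```

The same averaged copy can double as the "EMA teacher" referenced under the JEPA entry: it lags the online weights, giving a slowly moving target for distillation.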
Other
  • JEPA-style latent prediction over repeated mid-depth states, with gated residual integration and EMA-teacher distillation scheduling (parameters: {"jepa_enabled":true,"distill_enabled":true,"apply_every":null,"delayed_activation":true})
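The "gated residual integration" piece can be sketched as blending a predicted latent back into the residual stream through a learned sigmoid gate. This is a hypothetical reading of the description; the PR's exact wiring, the predictor architecture, and the `apply_every`/`delayed_activation` scheduling are not modeled here:

```python
import numpy as np

def gated_residual(h, h_pred, gate_logit):
    """Fold a predicted latent back into the residual stream with a
    learned gate in (0, 1). h and h_pred: (T, d); gate_logit: scalar or
    per-dimension learned parameter (hypothetical names throughout)."""
    gate = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid
    return h + gate * h_pred

h = np.ones((2, 4))            # current mid-depth hidden state
h_pred = np.full((2, 4), 2.0)  # predictor output targeting the EMA teacher
out = gated_residual(h, h_pred, gate_logit=0.0)  # gate = 0.5
print(out[0, 0])  # 2.0
```

Initializing `gate_logit` strongly negative would make the gate start near zero, one plausible mechanism for the "delayed activation" flag, though that is an assumption rather than anything stated in the PR.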

Novel Contributions

  • JEPA-style latent predictive objective over repeated mid-depth states
  • EMA-teacher distillation with delayed activation
  • Gated residual integration of predicted latent states
  • Self-contained record-local SP8192 dataset setup script
  • Roundtrip-compatible int8 + zstd submission artifact under 16MB