PR #1685

open

Non-record: JEPA Hybrid — first latent-prediction LM (1.7622 BPB, 7.5MB)

by butbutt42
val_bpb
1.7622
Architecture
Hybrid
Optimizer
Artifact Size
7.5MB
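
For readers unfamiliar with the val_bpb metric, the sketch below shows the usual conversion from summed cross-entropy to bits per byte. It assumes the standard definition (total negative log-likelihood in nats, divided by ln 2 and by the number of raw bytes in the validation text). The function name and arguments are hypothetical and are not taken from this PR's code.

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    total_nll_nats: sum of per-token cross-entropy over the validation set.
    total_bytes:    number of raw bytes covered by those tokens.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)
```

Because the denominator counts bytes rather than tokens, the metric stays comparable across submissions that use different tokenizers.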

Training Techniques

Architecture
Hybrid
Combines JEPA and autoregressive (AR) language modeling in alternating training steps: the JEPA objective predicts masked token representations in latent space, while the AR objective predicts next tokens.
parameters: {"layers":9,"dimensions":384,"heads":6,"predictor_mlp_layers":2}
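
A minimal sketch of what a latent-space objective over masked positions can look like. The PR summary does not state which distance is used, so mean squared error here is an assumption, and `masked_latent_loss` is a hypothetical name. Vectors are plain Python lists to keep the example dependency-free.

```python
def masked_latent_loss(predicted, target, mask):
    """Mean squared error between predicted and target latents,
    averaged over the elements of masked positions only.

    predicted, target: lists of equal-length float vectors, one per token.
    mask:              list of booleans, True where the token was masked.
    """
    total, count = 0.0, 0
    for p_vec, t_vec, is_masked in zip(predicted, target, mask):
        if not is_masked:
            continue  # unmasked positions do not contribute to the loss
        total += sum((p - t) ** 2 for p, t in zip(p_vec, t_vec))
        count += len(p_vec)
    return total / count if count else 0.0
```

Unlike the AR objective, this loss compares vectors rather than scoring token probabilities, so on its own it says nothing directly about BPB.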
Weight Averaging
EMA
parameters: {"decay":0.996}
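
The EMA entry above can be illustrated with a short sketch. The summary does not say which weights are averaged or how often, so the per-step update over a flat list of floats is an assumption, and both function names are hypothetical.

```python
import math

def ema_update(avg, current, decay=0.996):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * a + (1.0 - decay) * c for a, c in zip(avg, current)]

def ema_half_life(decay):
    """Number of updates after which an old value's contribution halves."""
    return math.log(0.5) / math.log(decay)
```

With a decay of 0.996 the half-life works out to roughly 173 updates, so the averaged weights mostly reflect the last few hundred steps.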
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
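
A sketch of how int8 quantization followed by zlib compression can produce a compact artifact. The summary gives only "int8, scope: all" and "zlib, level: null", so the symmetric per-tensor scaling, the byte layout, and the use of zlib's default level are all assumptions. Every function name here is hypothetical.

```python
import struct
import zlib

def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> (scale, ints in [-127, 127])."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, q

def dequantize_int8(scale, q):
    """Recover approximate float weights from the quantized values."""
    return [scale * v for v in q]

def pack(scale, q):
    """Serialize as a float32 scale plus one signed byte per weight, then zlib-compress."""
    raw = struct.pack("<f", scale) + struct.pack(f"<{len(q)}b", *q)
    return zlib.compress(raw)  # assumed: default compression level

def unpack(blob):
    """Inverse of pack(): returns (scale, list of int8 values)."""
    raw = zlib.decompress(blob)
    (scale,) = struct.unpack_from("<f", raw, 0)
    q = list(struct.unpack_from(f"<{len(raw) - 4}b", raw, 4))
    return scale, q
```

In this scheme each weight is off by at most about half a quantization step (`scale / 2`) after the round trip, and the zlib stage itself is lossless.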
Regularization
weight decay
parameters: null
Sequence Length
sequence_length
train_length: null
eval_length: null

Novel Contributions

  • First JEPA-based submission to Parameter Golf
  • Hybrid alternating JEPA and autoregressive training
  • Pure AR fine-tuning for the final 30% of training to align with BPB evaluation
  • Demonstration that latent-space prediction can be trained within the competition constraints
  • Detailed crash analysis and fixes for OOM, inference_mode/torch.compile issues, and disk exhaustion
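
The training schedule described in the contributions above (alternating JEPA and AR steps, then pure AR for the final 30%) can be sketched as a step-to-objective function. The strict one-to-one alternation and the choice of JEPA on even steps are assumptions, since the summary says only "alternating". `objective_for_step` is a hypothetical name.

```python
def objective_for_step(step, total_steps, ar_only_fraction=0.30):
    """Return "jepa" or "ar" for a given zero-based training step.

    The last `ar_only_fraction` of steps are pure AR; earlier steps
    alternate, with JEPA on even steps and AR on odd steps (assumed).
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    ar_only_steps = round(total_steps * ar_only_fraction)
    ar_only_start = total_steps - ar_only_steps
    if step >= ar_only_start:
        return "ar"
    return "jepa" if step % 2 == 0 else "ar"

# Example: a 10-step run ends with three consecutive AR-only steps.
schedule = [objective_for_step(s, 10) for s in range(10)]
```

Ending on AR-only steps means the final weights are tuned on the same next-token objective that the BPB evaluation measures.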