PR #1685
open
Non-record: JEPA Hybrid — first latent-prediction LM (1.7622 BPB, 7.5MB)
by butbutt42
val_bpb
1.7622
Architecture
Hybrid
Optimizer
—
Artifact Size
7.5MB
Training Techniques
Architecture
Hybrid
Combines JEPA and autoregressive (AR) language modeling on alternating training steps: the JEPA objective predicts masked token representations in latent space, while the AR objective predicts next tokens.
parameters: {"layers":9,"dimensions":384,"heads":6,"predictor_mlp_layers":2}
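The two objectives described above can be sketched in plain Python. This is a minimal illustration, not the PR's code: the function names are hypothetical, mean squared error as the latent loss is an assumption (the summary does not name the loss), and so is the even/odd alternation rule.

```python
import math

def jepa_latent_loss(predicted, target, masked_positions):
    """MSE between predicted and target latents, at masked positions only.

    ASSUMPTION: MSE is used for illustration; the PR summary does not say
    which distance the JEPA objective uses.
    """
    total, count = 0.0, 0
    for pos in masked_positions:
        for p, t in zip(predicted[pos], target[pos]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0

def ar_next_token_loss(logits, next_tokens):
    """Average cross-entropy (in nats) of the true next token under softmax(logits)."""
    total = 0.0
    for row, tok in zip(logits, next_tokens):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[tok]
    return total / len(next_tokens)

def objective_for_step(step):
    """ASSUMPTION: strict alternation, JEPA on even steps and AR on odd steps."""
    return "jepa" if step % 2 == 0 else "ar"
```

The JEPA loss compares vectors and never touches the vocabulary, while the AR loss is the quantity that a bits-per-byte metric is derived from.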
Weight Averaging
EMA
parameters: {"decay":0.996}
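For reference, an exponential moving average with this decay can be sketched as follows. The function names are hypothetical, and the summary does not say whether the averaged weights serve as a JEPA target encoder, as the exported model, or both.

```python
import math

def ema_update(shadow, weights, decay=0.996):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * weights."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]

def ema_half_life(decay=0.996):
    """Number of steps after which an old contribution's weight has halved."""
    return math.log(0.5) / math.log(decay)
```

With a decay of 0.996 the half-life is about 173 steps, so the averaged weights follow the live weights fairly closely while smoothing step-to-step noise.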
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
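The quantize-then-compress pipeline listed above might look roughly like this. It is a sketch under stated assumptions: symmetric per-tensor scaling is assumed (the summary only says int8 over all weights), the helper names are hypothetical, and because the compression level is listed as null, zlib's default level is used here.

```python
import zlib

def quantize_int8(weights):
    """ASSUMPTION: symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats; the error per weight is at most scale / 2."""
    return [v * scale for v in q]

def pack_and_compress(q, level=-1):
    """Store each int8 as one byte (two's complement), then zlib-compress.

    level=-1 selects zlib's default compression level.
    """
    raw = bytes(v & 0xFF for v in q)
    return zlib.compress(raw, level)

def decompress_and_unpack(blob):
    """Invert pack_and_compress, mapping bytes 128..255 back to negative values."""
    raw = zlib.decompress(blob)
    return [b - 256 if b > 127 else b for b in raw]
```

The round trip through zlib is lossless; the only loss is the rounding in `quantize_int8`, and the scale must be stored alongside the compressed bytes to dequantize.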
Regularization
weight decay
parameters: null
Sequence Length
sequence_length
train_length: null
eval_length: null
Novel Contributions
- First JEPA-based submission to Parameter Golf
- Hybrid alternating JEPA and autoregressive training
- Pure AR fine-tuning for the final 30% of training to align with BPB evaluation
- Demonstration that latent-space prediction can be trained within the competition constraints
- Detailed crash analysis and fixes for OOM, inference_mode/torch.compile issues, and disk exhaustion
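
The "final 30% pure AR" schedule and its link to the BPB metric can be illustrated with a short sketch. The function names and the step-based cutoff are assumptions; the summary does not say whether the 30% is measured in steps, tokens, or wall-clock time.

```python
import math

def training_phase(step, total_steps, ar_only_fraction=0.30):
    """Return 'hybrid' early in training and 'ar_only' for the final fraction.

    ASSUMPTION: the cutoff is measured in optimizer steps.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    ar_only_start = total_steps - int(round(total_steps * ar_only_fraction))
    return "ar_only" if step >= ar_only_start else "hybrid"

def bits_per_byte(total_nll_nats, num_bytes):
    """Convert a summed next-token negative log-likelihood (nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * num_bytes)
```

Because BPB is computed from next-token log-likelihood alone, ending on the AR objective optimizes the same quantity that evaluation measures.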