PR #1480 (open)
[Non-Record] JEPA Baseline — LLM-JEPA pretraining — 1.2699 bpb
by IshiPareek
val_bpb: 1.2699
Architecture: Transformer
Optimizer: —
Artifact Size: 135 MB
Training Techniques

Architecture
- JEPA: Joint Embedding Predictive Architecture for language model pretraining; predicts embeddings instead of tokens using context and target encoders.
- EMA: the target encoder is an EMA copy of the context encoder and receives no gradients (see the sketch after this list).

Regularization
- Weight decay
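A minimal sketch of the dual-encoder setup described above, assuming a PyTorch-style implementation; the class name `JEPAEncoders` and the decay value `0.999` are illustrative, not taken from the PR's code.

```python
import copy
import torch
import torch.nn as nn


class JEPAEncoders(nn.Module):
    """Context encoder trained by gradient descent; target encoder is a
    frozen EMA copy of it, as described in the techniques above."""

    def __init__(self, context_encoder: nn.Module, ema_decay: float = 0.999):
        super().__init__()
        self.context_encoder = context_encoder           # updated by the optimizer
        self.target_encoder = copy.deepcopy(context_encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)                      # target gets no gradients
        # ema_decay = 0.999 is an assumption; the PR does not state a value
        self.ema_decay = ema_decay

    @torch.no_grad()
    def update_target(self):
        # target <- decay * target + (1 - decay) * context, called after each optimizer step
        for p_t, p_c in zip(self.target_encoder.parameters(),
                            self.context_encoder.parameters()):
            p_t.lerp_(p_c, 1.0 - self.ema_decay)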
Novel Contributions
- JEPA-based language model pretraining from scratch
- Predicting embeddings instead of tokens
- Dual-encoder setup with a gradient-trained context encoder and EMA target encoder
- Combined loss: cross-entropy plus 0.1 × MSE embedding-prediction loss (see the sketch below)
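The combined objective in the last item could look like the following sketch; only the CE + 0.1 × MSE weighting comes from the PR description, while the argument names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F


def jepa_loss(logits, targets, pred_embeddings, target_embeddings, mse_weight=0.1):
    # Standard next-token cross-entropy on the context encoder's logits
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # MSE between predicted embeddings and the detached EMA target-encoder embeddings
    mse = F.mse_loss(pred_embeddings, target_embeddings.detach())
    return ce + mse_weight * mse
```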