PR #1480

open

[Non-Record] JEPA Baseline — LLM-JEPA pretraining — 1.2699 bpb

by IshiPareek
val_bpb: 1.2699
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: 135 MB

Training Techniques

  • Architecture: JEPA (Joint Embedding Predictive Architecture) for language model pretraining; predicts embeddings instead of tokens using context and target encoders.
  • EMA: the target encoder is an EMA copy of the context encoder and receives no gradients (see the sketch after this list).
  • Regularization: weight decay.
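A minimal sketch of the EMA scheme above, assuming PyTorch; the decay coefficient is an assumption, since the PR does not state one:

```python
import copy
import torch

def make_target_encoder(context_encoder: torch.nn.Module) -> torch.nn.Module:
    # The target encoder starts as a frozen copy of the context encoder.
    target = copy.deepcopy(context_encoder)
    for p in target.parameters():
        p.requires_grad_(False)
    return target

@torch.no_grad()
def ema_update(context_encoder, target_encoder, decay=0.999):
    # target <- decay * target + (1 - decay) * context
    # decay=0.999 is an assumed value, not taken from the PR.
    for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.mul_(decay).add_(p_c, alpha=1.0 - decay)
```

In a training loop, `ema_update` would typically run once after each optimizer step.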

Novel Contributions

  • JEPA-based language model pretraining from scratch
  • Predicting embeddings instead of tokens
  • Dual-encoder setup with a gradient-trained context encoder and EMA target encoder
  • Combined CE + 0.1*MSE embedding prediction loss (sketched below)
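A minimal sketch of the combined objective under stated assumptions: PyTorch, a hypothetical `predictor` head and `lm_head`, and both encoders seeing the same token sequence (the PR does not spell out the context/target split):

```python
import torch
import torch.nn.functional as F

def jepa_train_step(context_encoder, target_encoder, predictor, lm_head,
                    tokens, next_tokens):
    # Context path: gradient-trained encoder; hidden states feed both the
    # next-token CE loss and the embedding-prediction MSE loss.
    hidden = context_encoder(tokens)                    # (B, T, D)
    logits = lm_head(hidden)                            # (B, T, V)
    ce = F.cross_entropy(logits.flatten(0, 1), next_tokens.flatten())

    # Target path: EMA encoder, no gradients, supplies embedding targets.
    with torch.no_grad():
        target_emb = target_encoder(tokens)             # (B, T, D)

    pred_emb = predictor(hidden)                        # predict target embeddings
    mse = F.mse_loss(pred_emb, target_emb)

    return ce + 0.1 * mse                               # 0.1 weighting as stated in the PR
```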