PR #1116
Notable Non-Record: JEPA — 1.4447 BPB — Joint Embedding Predictive Architecture for LLMs
by gowtham0992
val_bpb: 1.4447
Architecture: Transformer
Optimizer: —
Artifact Size: 10.05 MB
Training Techniques

Quantization: GPTQ (bits: 6, scope: all)
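The card lists GPTQ at 6 bits across all weights. As a rough illustration of what a 6-bit weight format means, here is a minimal round-to-nearest uniform quantizer in NumPy. This is a simplified sketch, not the GPTQ algorithm itself: GPTQ additionally corrects rounding error column by column using second-order (Hessian) statistics from calibration data. Function names here are my own.

```python
import numpy as np

def quantize_6bit(w):
    """Uniform symmetric 6-bit round-to-nearest quantization.
    Simplification for illustration only; GPTQ proper also applies
    Hessian-based error compensation from calibration data."""
    qmax = 2 ** (6 - 1) - 1                  # 31 for signed 6-bit
    scale = np.abs(w).max() / qmax           # per-tensor scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map 6-bit integer codes back to float weights."""
    return q.astype(np.float32) * scale
```

With a per-tensor scale, the reconstruction error of any weight is bounded by half a quantization step (scale / 2).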
Evaluation: sliding window eval (parameters: null)
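Sliding-window evaluation slides a fixed-size context window over the held-out sequence with some stride and scores each token exactly once, using the longest available left context, rather than resetting the context at hard chunk boundaries. A minimal sketch (window and stride values are illustrative; the card gives no parameters); for a byte-level vocabulary, bits per token equals bits per byte (BPB), the metric reported above:

```python
import math

def sliding_window_bpb(logprob_fn, tokens, window=8, stride=4):
    """Score each token once under a sliding context window.
    logprob_fn(context, token) -> natural-log P(token | context).
    Returns average bits per token; stride must be <= window."""
    total_bits, n_scored = 0.0, 0
    pos = 0  # index of the first not-yet-scored token
    for start in range(0, len(tokens), stride):
        ctx = tokens[start:start + window]
        # score only tokens in this window that were not scored before;
        # the very first token has no context and is never scored
        for i in range(max(pos - start, 1), len(ctx)):
            total_bits += -logprob_fn(ctx[:i], ctx[i]) / math.log(2)
            n_scored += 1
        pos = start + len(ctx)
        if pos >= len(tokens):
            break
    return total_bits / n_scored
```

For example, a uniform model over a 256-symbol (byte) vocabulary scores exactly 8.0 bits per byte under this evaluation, which is a quick sanity check.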
Architecture
Transformer
Standard 11-layer causal Transformer used as the base model; no architectural change beyond the training objective.
parameters: {"layers":11}
Other
JEPA auxiliary training objective using two randomly masked token views of the same sequence, mean-pooled embeddings, cosine similarity, and an asymmetric stop-gradient.
parameters: {"lambda":1,"mask_rate":0.15,"views":2,"forward_passes_per_step":3}
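The objective described above can be sketched as follows, using the card's parameters (mask_rate 0.15, 2 views, lambda 1; the three forward passes per step are the causal LM pass plus one pass per masked view). This is a NumPy sketch under my own assumptions: the function names and the choice of mask token id are hypothetical, and the stop-gradient is only indicated in a comment, since it matters only under an autograd framework (where the target branch would be detached).

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(tokens, mask_rate, mask_id, rng):
    """Randomly replace tokens with mask_id to form one corrupted view."""
    out = tokens.copy()
    out[rng.random(tokens.shape) < mask_rate] = mask_id
    return out

def mean_pool(hidden):
    """Mean-pool per-token hidden states (T, D) into one embedding (D,)."""
    return hidden.mean(axis=0)

def cosine(a, b, eps=1e-8):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def jepa_aux_loss(embed_fn, tokens, mask_rate=0.15, mask_id=0, lam=1.0, rng=rng):
    """JEPA auxiliary loss: 1 - cosine similarity between mean-pooled
    embeddings of two masked views of the same sequence, scaled by lam.
    In a real autograd setup the second view's embedding is wrapped in a
    stop-gradient (e.g. .detach()) so only one branch receives gradients
    (the asymmetric update); mask_id=0 is a hypothetical reserved token."""
    view_a = mask_tokens(tokens, mask_rate, mask_id, rng)
    view_b = mask_tokens(tokens, mask_rate, mask_id, rng)
    z_a = mean_pool(embed_fn(view_a))
    z_b = mean_pool(embed_fn(view_b))  # stop-gradient branch in training
    return lam * (1.0 - cosine(z_a, z_b))
```

This term is added to the usual causal LM loss with weight lambda = 1; when the mask rate is zero the two views coincide and the loss collapses to (numerically) zero, which is a useful sanity check.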
Novel Contributions
- LLM-JEPA auxiliary loss for language model training
- Random token masking to create two corrupted views of the same sequence
- Cosine-similarity embedding matching with asymmetric stop-gradient
- Mean-pooled sequence embeddings for JEPA loss
- Sliding window evaluation