PR #1116

open

Notable Non-Record: JEPA — 1.4447 BPB — Joint Embedding Predictive Architecture for LLMs

by gowtham0992
val_bpb
1.4447
Architecture
Transformer
Optimizer
Artifact Size
10.05 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
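
To make the "bits: 6" entry concrete, here is a toy round-to-nearest 6-bit quantizer in plain Python. This is a simplified stand-in, not GPTQ itself: GPTQ additionally compensates each rounding error using second-order (Hessian) information, but the 64-level storage format it produces is the same idea. The per-row min/max scaling below is an assumption for illustration.

```python
def quantize_6bit(weights):
    """Round-to-nearest 6-bit quantization of one weight row.

    Toy stand-in for GPTQ: maps each weight onto one of 2**6 = 64 evenly
    spaced levels between the row's min and max, then dequantizes back.
    """
    levels = 2 ** 6 - 1                       # 63 steps between min and max
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0         # avoid div-by-zero on flat rows
    q = [round((w - lo) / scale) for w in weights]   # ints in [0, 63]
    return [lo + qi * scale for qi in q]             # dequantized floats
```

Each dequantized weight lands within half a quantization step of the original, which is the error budget GPTQ then shrinks further with its error-compensation pass.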
Evaluation
sliding window eval
parameters: null
Architecture
Transformer
Standard 11-layer causal Transformer used as the base model; no architectural change beyond the training objective.
parameters: {"layers":11}
Other
other
JEPA auxiliary training objective using two randomly masked token views of the same sequence, mean-pooled embeddings, cosine similarity, and asymmetric stop-gradient.
parameters: {"lambda":1,"mask_rate":0.15,"views":2,"forward_passes_per_step":3}
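
The objective above can be sketched in plain Python. This is a minimal sketch, not the PR's code: `MASK_ID`, the list-based embeddings, and the function names are assumptions; the constants come from the listed parameters (lambda = 1, mask_rate = 0.15, views = 2). Three forward passes per step is consistent with one standard next-token pass plus one pass per masked view.

```python
import math
import random

MASK_ID = 0          # hypothetical mask-token id
MASK_RATE = 0.15     # mask_rate from the PR parameters
LAMBDA = 1.0         # lambda from the PR parameters

def mask_view(tokens, rng):
    """Return a corrupted copy with ~15% of tokens replaced by MASK_ID."""
    return [MASK_ID if rng.random() < MASK_RATE else t for t in tokens]

def mean_pool(hidden):
    """Mean-pool a [seq_len][dim] list of hidden states into one vector."""
    dim = len(hidden[0])
    return [sum(h[d] for h in hidden) / len(hidden) for d in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def jepa_aux_loss(emb_online, emb_target):
    """JEPA term: push the two views' pooled embeddings together.

    Asymmetric stop-gradient: in a real framework emb_target would be
    detached (e.g. tensor.detach()) so gradients flow only through the
    online view; plain floats carry no gradient here.
    """
    return LAMBDA * (1.0 - cosine(emb_online, emb_target))
```

In training, this term would be added to the usual cross-entropy loss, with each of the two masked views run through the same Transformer to produce the pooled embeddings.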

Novel Contributions

  • LLM-JEPA auxiliary loss for language model training
  • Random token masking to create two corrupted views of the same sequence
  • Cosine-similarity embedding matching with asymmetric stop-gradient
  • Mean-pooled sequence embeddings for JEPA loss
  • Sliding window evaluation
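
The last bullet, sliding window evaluation, can be illustrated with a toy sketch: score a long token stream in overlapping windows, counting only the tokens that are new in each window so every token gets left context. Assumptions, since the PR lists parameters as null: `CONTEXT`, `STRIDE`, `bytes_per_token`, and the uniform `token_nll` stand-in for the trained model are all hypothetical.

```python
import math

CONTEXT = 8   # toy context length (a real run uses the model's full context)
STRIDE = 4    # tokens newly scored per window

def token_nll(context, token):
    """Hypothetical per-token negative log-likelihood; a real eval would
    query the trained Transformer. Here: uniform over a 256-symbol vocab."""
    return math.log(256)

def sliding_window_bpb(tokens, bytes_per_token=1.0):
    """Score each token with up to CONTEXT - STRIDE tokens of left context,
    advancing STRIDE tokens at a time and counting only the new ones."""
    total_nll, counted = 0.0, 0
    for start in range(0, len(tokens), STRIDE):
        ctx_start = max(0, start + STRIDE - CONTEXT)
        for i in range(start, min(start + STRIDE, len(tokens))):
            total_nll += token_nll(tokens[ctx_start:i], tokens[i])
            counted += 1
    # bits per byte = nats per token / (ln 2 * bytes per token)
    return total_nll / (counted * math.log(2) * bytes_per_token)
```

With the uniform 256-way stand-in model and one byte per token this returns 8.0 bits per byte; the reported 1.4447 val_bpb comes from the actual trained model under the same kind of windowed accounting.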