PR #1116

open

Notable Non-Record: JEPA — 1.4447 BPB — Joint Embedding Predictive Architecture for LLMs

by gowtham0992
val_bpb
1.4447
Architecture
Transformer
Optimizer
Artifact Size
10.05 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
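
To make the "bits: 6" entry concrete, here is a toy round-to-nearest 6-bit quantizer in plain Python. This is a simplified stand-in, not GPTQ itself: GPTQ additionally compensates each rounding error using second-order (Hessian) information, but the 64-level storage format it produces is the same idea. The per-row min/max scaling below is an assumption for illustration.

```python
def quantize_6bit(weights):
    """Round-to-nearest 6-bit quantization of one weight row.

    Toy stand-in for GPTQ: maps each weight onto one of 2**6 = 64 evenly
    spaced levels between the row's min and max, then dequantizes back.
    """
    levels = 2 ** 6 - 1                       # 63 steps between min and max
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0         # avoid div-by-zero on flat rows
    q = [round((w - lo) / scale) for w in weights]   # ints in [0, 63]
    return [lo + qi * scale for qi in q]             # dequantized floats
```

Each dequantized weight lands within half a quantization step of the original, which is the error budget GPTQ then shrinks further with its error-compensation pass.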
Evaluation
sliding window eval
parameters: null
Architecture
Transformer
Standard 11-layer causal Transformer used as the base model; no architectural change beyond the training objective.
parameters: {"layers":11}
Other
other
JEPA auxiliary training objective using two randomly masked token views of the same sequence, mean-pooled embeddings, cosine similarity, and asymmetric stop-gradient.
parameters: {"lambda":1,"mask_rate":0.15,"views":2,"forward_passes_per_step":3}
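
The objective above can be sketched in plain Python. This is a minimal sketch, not the PR's code: `MASK_ID`, the list-based embeddings, and the function names are assumptions; the constants come from the listed parameters (lambda = 1, mask_rate = 0.15, views = 2). Three forward passes per step is consistent with one standard next-token pass plus one pass per masked view.

```python
import math
import random

MASK_ID = 0          # hypothetical mask-token id
MASK_RATE = 0.15     # mask_rate from the PR parameters
LAMBDA = 1.0         # lambda from the PR parameters

def mask_view(tokens, rng):
    """Return a corrupted copy with ~15% of tokens replaced by MASK_ID."""
    return [MASK_ID if rng.random() < MASK_RATE else t for t in tokens]

def mean_pool(hidden):
    """Mean-pool a [seq_len][dim] list of hidden states into one vector."""
    dim = len(hidden[0])
    return [sum(h[d] for h in hidden) / len(hidden) for d in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def jepa_aux_loss(emb_online, emb_target):
    """JEPA term: push the two views' pooled embeddings together.

    Asymmetric stop-gradient: in a real framework emb_target would be
    detached (e.g. tensor.detach()) so gradients flow only through the
    online view; plain floats carry no gradient here.
    """
    return LAMBDA * (1.0 - cosine(emb_online, emb_target))
```

In training, this term would be added to the usual cross-entropy loss, with each of the two masked views run through the same Transformer to produce the pooled embeddings.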

Novel Contributions

  • LLM-JEPA auxiliary loss for language model training
  • Random token masking to create two corrupted views of the same sequence
  • Cosine-similarity embedding matching with asymmetric stop-gradient
  • Mean-pooled sequence embeddings for JEPA loss
  • Sliding window evaluation
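
The last bullet, sliding window evaluation, can be illustrated with a toy sketch: score a long token stream in overlapping windows, counting only the tokens that are new in each window so every token gets left context. Assumptions, since the PR lists parameters as null: `CONTEXT`, `STRIDE`, `bytes_per_token`, and the uniform `token_nll` stand-in for the trained model are all hypothetical.

```python
import math

CONTEXT = 8   # toy context length (a real run uses the model's full context)
STRIDE = 4    # tokens newly scored per window

def token_nll(context, token):
    """Hypothetical per-token negative log-likelihood; a real eval would
    query the trained Transformer. Here: uniform over a 256-symbol vocab."""
    return math.log(256)

def sliding_window_bpb(tokens, bytes_per_token=1.0):
    """Score each token with up to CONTEXT - STRIDE tokens of left context,
    advancing STRIDE tokens at a time and counting only the new ones."""
    total_nll, counted = 0.0, 0
    for start in range(0, len(tokens), STRIDE):
        ctx_start = max(0, start + STRIDE - CONTEXT)
        for i in range(start, min(start + STRIDE, len(tokens))):
            total_nll += token_nll(tokens[ctx_start:i], tokens[i])
            counted += 1
    # bits per byte = nats per token / (ln 2 * bytes per token)
    return total_nll / (counted * math.log(2) * bytes_per_token)
```

With the uniform 256-way stand-in model and one byte per token this returns 8.0 bits per byte; the reported 1.4447 val_bpb comes from the actual trained model under the same kind of windowed accounting.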