PR #1654

open

[Non-Record] Modified LLM-JEPA pretraining from scratch — 1.2699 bpb; add int6 quantization + LZMA

by IshiPareek
val_bpb: 1.2699
Architecture: Transformer
Optimizer:
Artifact Size: ~73MB

Training Techniques

Architecture
  • EMA: target encoder is an exponential moving average copy of the context encoder, updated every step. Parameters: {"decay": 0.996}
  • MLP: small 2-layer predictor maps context embeddings to predicted target embeddings. Parameters: {"layers": 2}
  • Other: JEPA-style dual-encoder pretraining with context encoder, target encoder, and embedding-space prediction loss. Parameters: none
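A minimal NumPy sketch of the JEPA-style setup described above: an EMA update of the target encoder (decay 0.996, as listed) and an MLP predictor trained with an embedding-space loss. Shapes, weight names, and the loss weighting are hypothetical, not taken from the PR:

```python
import numpy as np

def ema_update(target_params, context_params, decay=0.996):
    """EMA target-encoder update: target <- decay * target + (1 - decay) * context."""
    for t, c in zip(target_params, context_params):
        t *= decay
        t += (1.0 - decay) * c

def predictor(z, w1, w2):
    """2-layer MLP head mapping context embeddings to predicted target embeddings."""
    return np.maximum(z @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
# toy stand-ins for encoder weights and embeddings (shapes are hypothetical)
context_w = [np.ones((4, 4))]
target_w = [np.zeros((4, 4))]
ema_update(target_w, context_w)  # one step: 0.996 * 0 + 0.004 * 1 = 0.004

z_ctx = rng.normal(size=(8, 16))  # context-encoder output
z_tgt = rng.normal(size=(8, 16))  # EMA target-encoder output (stop-gradient in practice)
w1, w2 = 0.1 * rng.normal(size=(16, 32)), 0.1 * rng.normal(size=(32, 16))

# embedding-space prediction loss; the PR combines this with token cross-entropy,
# e.g. total = ce_loss + lam * jepa_loss (lam is a hypothetical weight)
jepa_loss = np.mean((predictor(z_ctx, w1, w2) - z_tgt) ** 2)
```

In practice the target encoder receives no gradients; only the context encoder and predictor are trained, while the EMA step above keeps the target encoder a slow-moving copy.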
Quantization
  • int6 (bits: 6, scope: all)
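A sketch of symmetric per-tensor int6 quantization, one plausible reading of the "int6, scope: all" entry above; the signed 6-bit range [-32, 31] and per-tensor scaling are assumptions, and values are stored in int8 containers for convenience:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor 6-bit quantization: codes lie in [-32, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)  # int8 container, 6-bit range
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.linspace(-1.0, 1.0, 7, dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)  # round-trip error is at most half a quantization step
```

With only 64 code values, the quantized weights also carry at most 6 bits of entropy per stored byte, which is what makes the subsequent LZMA pass effective.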
Compression
  • LZMA (level: not specified)
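A round-trip sketch of the LZMA pass using Python's standard-library `lzma` module. The entry does not specify a compression level, so `preset=9` is an assumption, and the byte layout of the artifact is hypothetical:

```python
import lzma

import numpy as np

# hypothetical quantized weights packed as raw int8 bytes
rng = np.random.default_rng(0)
q = np.clip(np.round(10.0 * rng.normal(size=4096)), -32, 31).astype(np.int8)
raw = q.tobytes()

compressed = lzma.compress(raw, preset=9)  # preset is an assumption; entry says level: null
restored = np.frombuffer(lzma.decompress(compressed), dtype=np.int8)
```

Because lossless decompression restores the quantized codes exactly, the LZMA stage shrinks the artifact without adding any further accuracy loss on top of the quantization error.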

Novel Contributions

  • Modified LLM-JEPA pretraining from scratch for the Parameter Golf challenge
  • Dual-encoder JEPA setup with EMA target encoder and predictor head
  • Embedding-space auxiliary loss combined with cross-entropy loss
  • int6 quantization to reduce artifact size
  • LZMA compression for additional artifact size reduction