PR #1654 (open)
[Non-Record] Modified LLM-JEPA pretraining from scratch — 1.2699 bpb
by IshiPareek
val_bpb: 1.2699
Architecture: Transformer
Optimizer: —
Artifact Size: ~73 MB
Training Techniques
Architecture
EMA
Target encoder is an exponential moving average copy of the context encoder, updated every step.
parameters: {"decay":0.996}
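The EMA update above can be sketched in a few lines. This is a minimal illustration, not the PR's code: parameters are modeled as flat lists of floats, and the function name is illustrative; only decay=0.996 and the every-step update come from the listing.

```python
def ema_update(target_params, context_params, decay=0.996):
    """In-place EMA step: target <- decay * target + (1 - decay) * context.

    Run once per training step so the target encoder trails the
    context encoder. decay=0.996 is the value stated in the PR.
    """
    for i, (t, c) in enumerate(zip(target_params, context_params)):
        target_params[i] = decay * t + (1.0 - decay) * c
    return target_params
```

In practice this runs under no-grad on every parameter tensor of the target encoder; the target encoder itself is never updated by the optimizer.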
MLP
Small 2-layer predictor maps context embeddings to predicted target embeddings.
parameters: {"layers":2}
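A 2-layer predictor of this kind reduces to Linear → nonlinearity → Linear. The sketch below assumes a tanh activation and plain-list weights for illustration; the PR states only that the predictor has 2 layers.

```python
import math

def linear(x, w, b):
    """Dense layer: w is a list of rows (out_dim x in_dim), b a bias list."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def predictor(x, w1, b1, w2, b2):
    """2-layer MLP mapping a context embedding to a predicted
    target embedding. Activation choice (tanh) is an assumption."""
    h = [math.tanh(v) for v in linear(x, w1, b1)]
    return linear(h, w2, b2)
```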
other
JEPA-style dual-encoder pretraining with context encoder, target encoder, and embedding-space prediction loss.
parameters: null
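The embedding-space prediction loss can be sketched as follows. MSE is an assumed choice here (the PR says only "embedding-space prediction loss"), and `predict` stands in for the predictor head; the target embedding comes from the EMA copy, which receives no gradients.

```python
def jepa_embedding_loss(ctx_emb, tgt_emb, predict):
    """Predict the target-encoder embedding from the context-encoder
    embedding and score the prediction with mean-squared error."""
    pred = predict(ctx_emb)
    return sum((p - t) ** 2 for p, t in zip(pred, tgt_emb)) / len(pred)
```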
Quantization
int6
bits: 6
scope: all
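The listing states only bits=6 and scope=all, so the concrete scheme below (symmetric, per-tensor) is an assumption. A signed 6-bit symmetric range covers integers in [-31, 31]:

```python
def quantize_int6(values):
    """Symmetric per-tensor int6 quantization: scale floats into
    the signed 6-bit range [-31, 31]. Scheme is an assumption;
    the PR specifies only 6-bit quantization over all weights."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 31.0
    q = [max(-31, min(31, round(v / scale))) for v in values]
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate floats from int6 codes."""
    return [qi * scale for qi in q]
```

Per-channel scales usually recover more accuracy than a single per-tensor scale, at a small storage cost.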
Compression
lzma
level: null
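With the level unspecified, LZMA over the quantized weights can be sketched with the stdlib `lzma` module. For simplicity this stores one byte per int6 code rather than bit-packing (real 6-bit packing would save roughly another 25%), and the fp32 scale header is an illustrative serialization choice:

```python
import lzma
import struct

def compress_artifact(q_values, scale):
    """Serialize a quantized tensor (fp32 scale + one offset byte per
    int6 code) and LZMA-compress it. preset=9 is an assumed setting;
    the PR leaves the compression level unspecified."""
    raw = struct.pack("f", scale) + bytes((v + 32) & 0x3F for v in q_values)
    return lzma.compress(raw, preset=9)
```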
Novel Contributions
- Modified LLM-JEPA pretraining from scratch for the Parameter Golf challenge
- Dual-encoder JEPA setup with EMA target encoder and predictor head
- Embedding-space auxiliary loss combined with cross-entropy loss
- int6 quantization to reduce artifact size
- LZMA compression for additional artifact size reduction
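The combination of cross-entropy with the embedding-space auxiliary loss reduces to a weighted sum. The weight `lam` below is an assumption; the PR states only that the two losses are combined.

```python
import math

def cross_entropy(logits, target_idx):
    """Softmax cross-entropy for a single position, computed via
    a numerically stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return lse - logits[target_idx]

def total_loss(logits, target_idx, pred_emb, target_emb, lam=1.0):
    """CE + lam * embedding-space MSE. The weighting lam is an
    assumed hyperparameter, not stated in the PR."""
    mse = sum((p - t) ** 2 for p, t in zip(pred_emb, target_emb)) / len(pred_emb)
    return cross_entropy(logits, target_idx) + lam * mse
```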