PR #1922 (open)

JEPA Implementation Path: Add Non-Record 10-Minute SP8192 BPE Submission with Self-Contained Data Setup

by divagr18
val_bpb: 1.1875
Architecture: Transformer
Optimizer:
Artifact Size: 15,281,273 bytes

Training Techniques

Architecture
  • weight tying: tied input and output embeddings (parameters: null)
  • GQA: grouped-query attention with fewer KV heads than query heads (parameters: {"num_heads":8,"num_kv_heads":4})
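A minimal sketch of the grouped-query attention head mapping implied by the listed parameters (8 query heads, 4 KV heads), where each KV head is shared by `num_heads // num_kv_heads` query heads. Function and variable names here are illustrative, not taken from the PR:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: q has more heads than k/v; each KV head
    serves a contiguous group of query heads.
    q: (num_heads, T, d), k and v: (num_kv_heads, T, d)."""
    num_heads, T, d = q.shape
    num_kv_heads = k.shape[0]
    group = num_heads // num_kv_heads  # query heads per KV head (8 // 4 = 2)
    # Repeat each KV head so shapes line up with the query heads.
    k_rep = np.repeat(k, group, axis=0)  # (num_heads, T, d)
    v_rep = np.repeat(v, group, axis=0)
    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)  # (num_heads, T, T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ v_rep                              # (num_heads, T, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))
k = rng.standard_normal((4, 4, 16))
v = rng.standard_normal((4, 4, 16))
out = gqa_attention(q, k, v)
print(out.shape)  # (8, 4, 16)
```

Halving the KV heads halves the KV-cache size while keeping the full query head count, which is the usual motivation for GQA.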
Sequence Length
  • sequence_length: train_length 1024, eval_length null
Evaluation
  • sliding window eval (parameters: {"stride_frac":0.5})
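One common way to realize sliding-window evaluation with `stride_frac=0.5`: slide a context window by half its length and score each token exactly once, in the window where it has the most context. This is a hedged sketch of the general technique, not the PR's exact evaluation loop:

```python
def sliding_window_spans(seq_len, window, stride_frac=0.5):
    """Return (begin, end, score_from) spans: the model sees tokens
    [begin:end] as context, but only tokens [score_from:end] contribute
    to the loss, so overlapped tokens are never double-counted."""
    stride = max(1, int(window * stride_frac))
    spans, prev_end = [], 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        spans.append((begin, end, prev_end))  # score only unseen tokens
        prev_end = end
        if end == seq_len:
            break
    return spans

spans = sliding_window_spans(seq_len=10, window=4, stride_frac=0.5)
print(spans)  # [(0, 4, 0), (2, 6, 4), (4, 8, 6), (6, 10, 8)]
```

Every token is scored exactly once (the scored span lengths sum to `seq_len`), while later tokens get up to a half-window of extra left context compared with non-overlapping chunking.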
Quantization
  • GPTQ: bits 8, scope all
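For context, an 8-bit quantization round-trip of the kind the artifact relies on can be sketched as plain symmetric per-channel int8; note that GPTQ itself additionally compensates rounding error with second-order weight statistics, which this simplified sketch omits:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-output-channel int8 quantization (plain round-to-nearest;
    GPTQ's error compensation is intentionally omitted here)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print(q.dtype)  # int8
```

The round-trip error per element is bounded by half the channel's scale, which is what makes an int8 artifact "roundtrip-compatible" with a float checkpoint up to quantization noise.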
Compression
  • zstd (level: null)
Weight Averaging
  • EMA (parameters: null)
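The EMA update itself is a one-liner per parameter; the decay value below is illustrative, since the PR lists no parameters:

```python
def ema_update(avg, params, decay=0.999):
    """One EMA step: avg <- decay * avg + (1 - decay) * params.
    decay=0.999 is an assumed value, not taken from the PR."""
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}

avg = {"w": 1.0}
params = {"w": 0.0}
avg = ema_update(avg, params, decay=0.9)
print(avg["w"])  # 0.9
```

The same averaged copy can double as the "EMA teacher" referenced under the JEPA entry: it lags the online weights, giving a slowly moving target for distillation.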
Other
  • JEPA-style latent prediction over repeated mid-depth states, with gated residual integration and EMA-teacher distillation scheduling (parameters: {"jepa_enabled":true,"distill_enabled":true,"apply_every":null,"delayed_activation":true})
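The "gated residual integration" piece can be sketched as blending a predicted latent back into the residual stream through a learned sigmoid gate. This is a hypothetical reading of the description; the PR's exact wiring, the predictor architecture, and the `apply_every`/`delayed_activation` scheduling are not modeled here:

```python
import numpy as np

def gated_residual(h, h_pred, gate_logit):
    """Fold a predicted latent back into the residual stream with a
    learned gate in (0, 1). h and h_pred: (T, d); gate_logit: scalar or
    per-dimension learned parameter (hypothetical names throughout)."""
    gate = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid
    return h + gate * h_pred

h = np.ones((2, 4))            # current mid-depth hidden state
h_pred = np.full((2, 4), 2.0)  # predictor output targeting the EMA teacher
out = gated_residual(h, h_pred, gate_logit=0.0)  # gate = 0.5
print(out[0, 0])  # 2.0
```

Initializing `gate_logit` strongly negative would make the gate start near zero, one plausible mechanism for the "delayed activation" flag, though that is an assumption rather than anything stated in the PR.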

Novel Contributions

  • JEPA-style latent predictive objective over repeated mid-depth states
  • EMA-teacher distillation with delayed activation
  • Gated residual integration of predicted latent states
  • Self-contained record-local SP8192 dataset setup script
  • Roundtrip-compatible int8 + zstd submission artifact under 16MB