PR #1581
open
Non-record: JEPA v3 — span-masked I-JEPA + VICReg, val_bpb 1.2321
by aiejvn
val_bpb
1.2321
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Architecture
BigramHash
Bigram hash embedding with masked-position zeroing to prevent token identity leakage.
parameters: {"vocab_size":2048}
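A minimal sketch of the idea, not the PR's code. The hash function, the names `bigram_bucket` and `bigram_features`, and the choice to also zero the position right after a masked token (its bigram would contain the masked token) are all assumptions made for illustration.

```python
BIGRAM_VOCAB = 2048  # "vocab_size" from the listed parameters


def bigram_bucket(prev_tok: int, cur_tok: int, n_buckets: int = BIGRAM_VOCAB) -> int:
    # Hypothetical multiplicative hash; the PR's actual hash is not shown here.
    return ((prev_tok * 1000003) ^ cur_tok) % n_buckets


def bigram_features(tokens, masked, table):
    """Return one bigram embedding per position, zeroed wherever the
    bigram would reveal the identity of a masked token."""
    dim = len(table[0])
    out = []
    for i, tok in enumerate(tokens):
        prev_masked = i > 0 and masked[i - 1]
        if masked[i] or prev_masked:
            # Zero vector: no token identity can leak through the hash.
            out.append([0.0] * dim)
        else:
            prev = tokens[i - 1] if i > 0 else 0
            out.append(list(table[bigram_bucket(prev, tok)]))
    return out
```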
U-Net skip connections
11-layer U-Net Transformer with encoder-decoder skip connections.
parameters: {"layers":11,"dim":512,"encoder_layers":5,"decoder_layers":6}
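With 5 encoder layers and 6 decoder layers, one decoder layer has no encoder partner. The sketch below shows one plausible wiring, assuming last-in-first-out pairing with the final decoder layer left without a skip; the PR may wire it differently, and `unet_skip_plan` is a hypothetical helper.

```python
def unet_skip_plan(n_enc=5, n_dec=6):
    """Pair each decoder layer with the most recent unused encoder output.

    Returns a list of (decoder_layer_index, encoder_layer_index or None).
    """
    stack = list(range(n_enc))  # encoder outputs, in the order they were produced
    plan = []
    for d in range(n_dec):
        layer = n_enc + d
        # Pop the deepest remaining encoder output first (LIFO, U-Net style).
        source = stack.pop() if stack else None
        plan.append((layer, source))
    return plan
```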
GQA
Grouped query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
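As an illustration of the 8-to-4 head sharing, the sketch below maps each query head to the KV head it reads from. Contiguous grouping is assumed here because it is the common convention; `kv_head_for` is a hypothetical name.

```python
def kv_head_for(query_head: int, n_q: int = 8, n_kv: int = 4) -> int:
    """Index of the KV head shared by the given query head."""
    if n_q % n_kv != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    group_size = n_q // n_kv  # 2 query heads per KV head for 8/4
    return query_head // group_size
```

With these settings the key and value projections are half the size they would be under standard multi-head attention.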
LeakyReLU
LeakyReLU squared activation in the MLP.
parameters: {"slope":0.5}
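A scalar sketch of the activation, assuming "LeakyReLU squared" means squaring the LeakyReLU output (the function name is hypothetical):

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """LeakyReLU with the listed negative slope, followed by squaring."""
    y = x if x >= 0 else slope * x
    return y * y
```

Under this reading the output is never negative, and negative inputs contribute a quarter of what a positive input of the same magnitude would.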
RoPE
Rotary positional embeddings.
parameters: null
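Since no parameters are listed, the following is a generic sketch of rotary embeddings rather than the PR's variant. The base of 10000 and the interleaved pairing of dimensions are assumptions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out
```

The rotation preserves vector norms, and the dot product between a rotated query and a rotated key depends only on the difference of their positions.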
MLP3x
Transformer MLP expansion multiplier of 3.
parameters: {"mlp_mult":3}
Weight Averaging
EMA
parameters: {"decay":0.9999}
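The standard EMA update with the listed decay can be sketched as follows (weights shown as a flat list for simplicity; `ema_update` is a hypothetical name):

```python
def ema_update(ema, weights, decay=0.9999):
    """One step of exponential moving averaging over a flat list of weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]
```

A decay of 0.9999 corresponds to an averaging horizon of roughly 1 / (1 - 0.9999) = 10,000 steps.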
Compression
lzma
level: 9
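A sketch of level-9 compression using Python's standard `lzma` module. Whether the PR uses the default `.xz` container, and how the weights are serialized before compression, are not stated here; `pack` and `unpack` are hypothetical helpers.

```python
import lzma


def pack(blob: bytes) -> bytes:
    """Compress serialized weights at the highest standard preset."""
    return lzma.compress(blob, preset=9)


def unpack(packed: bytes) -> bytes:
    """Inverse of pack()."""
    return lzma.decompress(packed)
```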
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"matrix parameters"}
Adam
weight_decay: null
momentum: null
other_params: {"used_for":"scalar parameters"}
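The split between the two optimizers can be sketched as a routing rule on parameter shape. The rule below (2-D tensors to Muon, everything else to Adam) is an assumption based on the "matrix" and "scalar" labels above, and the parameter names in the usage note are made up.

```python
def split_for_optimizers(param_shapes):
    """Route parameter names to ("muon", "adam") lists by tensor rank.

    `param_shapes` maps a parameter name to its shape tuple.
    """
    muon, adam = [], []
    for name, shape in param_shapes.items():
        # Assumed rule: only true matrices go to Muon.
        (muon if len(shape) == 2 else adam).append(name)
    return muon, adam
```

For example, `split_for_optimizers({"attn.w_q": (512, 512), "attn.gain": (512,)})` puts `attn.w_q` in the Muon list and `attn.gain` in the Adam list.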
Regularization
logit softcap
parameters: {"softcap":30}
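Logit softcapping is commonly implemented as a scaled tanh; the sketch below assumes that form, which the PR summary does not spell out.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    """Smoothly bound a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while large ones saturate near the cap instead of growing without bound.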
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Span-masked I-JEPA style auxiliary objective where target spans are replaced with a learned mask embedding in the context encoder, while the target encoder sees the full unmasked sequence.
parameters: {"num_spans":4,"span_len_mean":16,"span_len_min":4,"mask_ratio":0.06}
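The listed parameters are mutually consistent: 4 spans of mean length 16 over a 1024-token sequence cover about 64 / 1024 = 0.0625 of positions, which matches the stated mask ratio of 0.06. The sketch below samples such a mask and builds the two encoder inputs. The uniform length distribution, the allowance for overlapping spans, and the `MASK` sentinel standing in for the learned mask embedding are assumptions.

```python
import random

MASK = -1  # hypothetical sentinel; the model uses a learned mask embedding


def sample_span_mask(seq_len=1024, num_spans=4, span_len_mean=16,
                     span_len_min=4, seed=0):
    """Boolean mask with `num_spans` contiguous target spans (may overlap)."""
    rng = random.Random(seed)
    # Uniform on [min, 2*mean - min] has the requested mean (assumed distribution).
    span_len_max = 2 * span_len_mean - span_len_min
    masked = [False] * seq_len
    for _ in range(num_spans):
        length = rng.randint(span_len_min, span_len_max)
        start = rng.randrange(0, seq_len - length + 1)
        for i in range(start, start + length):
            masked[i] = True
    return masked


def encoder_inputs(tokens, masked):
    """Context encoder sees MASK at target positions; target encoder sees all."""
    context = [MASK if m else t for t, m in zip(tokens, masked)]
    target = list(tokens)
    return context, target
```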
other
VICReg anti-collapse regularization applied to predictor-side masked representations using variance hinge and covariance penalty.
parameters: {"var_weight":0.15,"cov_weight":0.02,"gamma":1}
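A sketch of the two VICReg terms named above, following the standard formulation (per-dimension standard deviation hinged at gamma, plus squared off-diagonal covariance). The `eps` value, the normalization by the feature dimension, and the function names are assumptions; the weights and gamma are the listed ones.

```python
import math


def vicreg_terms(z, gamma=1.0, eps=1e-4):
    """Return (variance_hinge, covariance_penalty) for a batch of row vectors."""
    n, d = len(z), len(z[0])
    mean = [sum(row[j] for row in z) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in z]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Hinge: penalize dimensions whose std falls below gamma.
    var_term = sum(max(0.0, gamma - math.sqrt(cov[j][j] + eps))
                   for j in range(d)) / d
    # Penalize correlation between different dimensions.
    cov_term = sum(cov[i][j] ** 2
                   for i in range(d) for j in range(d) if i != j) / d
    return var_term, cov_term


def vicreg_penalty(z, var_weight=0.15, cov_weight=0.02, gamma=1.0):
    """Weighted sum of the two terms using the listed weights."""
    var_term, cov_term = vicreg_terms(z, gamma)
    return var_weight * var_term + cov_weight * cov_term
```

A fully collapsed batch (all rows identical) gets the maximum variance penalty, while a spread-out, decorrelated batch gets none.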
Novel Contributions
- Span-masked I-JEPA objective where the context encoder cannot see target tokens
- Learned mask embedding for masked spans with bigram hash leakage prevention
- VICReg variance and covariance regularization on predictor-side masked representations
- Optimizer bug fix that routes JEPAPredictor and jepa_mask_emb into optimizer groups
- Demonstration that properly masked JEPA avoids the collapse seen in prior submissions
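The optimizer fix in the list above concerns parameters that were not assigned to any optimizer group, which would leave them untrained. A generic sanity check for that class of bug might look like the sketch below; only `JEPAPredictor` and `jepa_mask_emb` come from the PR summary, and the helper and dotted parameter names are hypothetical.

```python
def unoptimized_params(all_params, optimizer_groups):
    """Names of parameters that appear in no optimizer group."""
    covered = set()
    for group in optimizer_groups:
        covered.update(group)
    return sorted(p for p in all_params if p not in covered)
```

Running such a check before training and requiring an empty result would catch a module that was added to the model but never routed to Muon or Adam.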