PR #2142

open

records(non-record-16mb): JEPA-on-LM 14-run ablation (negative result)

val_bpb
1.2311
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Architecture
weight tying
Tied input and output embeddings in the baseline GPT backbone.
parameters: null
ReLU²
Uses relu_sq activation in the backbone.
parameters: null
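
For reference, a minimal sketch of the squared-ReLU activation named above. The function names `relu_sq` and `relu_sq_grad` are illustrative and are not taken from the PR's code; the sketch works on scalars to stay dependency-free.

```python
def relu_sq(x: float) -> float:
    """Squared ReLU: max(x, 0) ** 2."""
    r = x if x > 0.0 else 0.0
    return r * r


def relu_sq_grad(x: float) -> float:
    """Derivative of squared ReLU: 2 * max(x, 0)."""
    return 2.0 * x if x > 0.0 else 0.0


# Negative inputs are zeroed; positive inputs are squared.
print([relu_sq(v) for v in (-2.0, 0.0, 0.5, 3.0)])  # [0.0, 0.0, 0.25, 9.0]
```

Unlike plain ReLU, the gradient grows linearly with the pre-activation and is continuous at zero.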
KV head count
Uses 4 KV heads in the baseline backbone.
parameters: {"num_kv_heads":4}
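
If the backbone has more query heads than KV heads, `num_kv_heads: 4` implies grouped-query attention, where consecutive query heads share one key/value head. The sketch below shows only the head mapping. The query-head count of 8 is a hypothetical value chosen for illustration (the summary does not state it), and `kv_head_for_query` is an illustrative helper, not a function from the PR.

```python
def kv_head_for_query(q_head: int, num_heads: int, num_kv_heads: int = 4) -> int:
    """Return the index of the KV head that a given query head reads from."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    group_size = num_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


NUM_HEADS = 8  # hypothetical; not stated in the PR summary
mapping = [kv_head_for_query(q, NUM_HEADS) for q in range(NUM_HEADS)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

When `num_heads` equals `num_kv_heads` the mapping is the identity, which is ordinary multi-head attention.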
JEPA predictor MLP
JEPA variants add a small predictor MLP while keeping the same backbone shape across runs.
parameters: {"predictor_hidden_dim":64}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"scalars_optimizer":"Adam"}
LR Schedule
warmdown
parameters: {"warmup_steps":10,"warmdown_iters":1200,"schedule_type":"linear"}
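
One way to read these schedule parameters is as a multiplier on the base learning rate: a linear ramp over the first 10 steps, a flat plateau, then a linear decay to zero over the final 1200 iterations. The sketch below assumes a step-based schedule with a known total step count. `total_steps` and the value 5000 are hypothetical, and the PR's actual schedule may be computed differently (for example from wall-clock time).

```python
def lr_scale(step: int, total_steps: int,
             warmup_steps: int = 10, warmdown_iters: int = 1200) -> float:
    """Learning-rate multiplier in [0, 1] for a linear warmup/warmdown schedule."""
    if step < warmup_steps:
        # Linear warmup: reaches 1.0 on the last warmup step.
        return (step + 1) / warmup_steps
    warmdown_start = total_steps - warmdown_iters
    if step >= warmdown_start:
        # Linear warmdown: falls to 0.0 at total_steps.
        return max(total_steps - step, 0) / warmdown_iters
    return 1.0


TOTAL_STEPS = 5000  # hypothetical; not stated in the PR summary
for s in (0, 9, 2000, 4400, 5000):
    print(s, lr_scale(s, TOTAL_STEPS))
```

The multiplier would typically be applied to every optimizer group, both the Muon matrices and the Adam-managed scalars, though the summary does not say whether the PR does so.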
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Comprehensive 14-run JEPA ablation showing no improvement over baseline at this scale.
  • Best JEPA variant exactly ties the same-seed baseline at val_bpb 1.2311.
  • Demonstrates that lambda/alpha magnitude is the dominant factor in val_bpb degradation, with larger values hurting performance.
  • Shows that disabling VICReg variance regularization recovers exact parity with baseline.
  • Finds that token-decoder JEPA and injection variants degrade val_bpb even at small auxiliary weights.
  • Provides a param-count-clean comparison using only a small predictor MLP added to the same backbone.
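
The findings above can be read against a generic shape for the objective: language-modeling cross-entropy plus a weighted JEPA auxiliary term with an optional VICReg-style variance hinge. The following is a sketch under stated assumptions, not the PR's implementation. It assumes lambda acts as the auxiliary-loss weight, uses mean squared error as the prediction loss, and applies the variance hinge to the predictor outputs; none of these choices is confirmed by the summary. All names (`mse`, `variance_hinge`, `total_loss`, `lam`, `var_weight`) are illustrative.

```python
import math


def mse(pred, target):
    """Mean squared error over a batch of equal-length vectors."""
    n = sum(len(row) for row in pred)
    return sum((p - t) ** 2
               for pr, tr in zip(pred, target)
               for p, t in zip(pr, tr)) / n


def variance_hinge(batch, gamma=1.0, eps=1e-4):
    """VICReg-style variance term: mean over dims of max(0, gamma - std)."""
    n, d = len(batch), len(batch[0])
    total = 0.0
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)  # unbiased variance
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d


def total_loss(ce_loss, pred, target, lam, var_weight):
    """Cross-entropy plus lam * (prediction loss + var_weight * variance hinge)."""
    aux = mse(pred, target) + var_weight * variance_hinge(pred)
    return ce_loss + lam * aux


pred = [[0.0, 1.0], [2.0, 1.0]]    # toy predictor outputs (batch of 2, dim 2)
target = [[0.0, 1.0], [2.0, 1.0]]  # toy target embeddings
# With lam = 0 the objective reduces to the baseline cross-entropy.
print(total_loss(1.2, pred, target, lam=0.0, var_weight=1.0))  # 1.2
```

In this toy batch the second dimension is constant, so the variance hinge is nonzero even though the prediction error is zero. This shows how the regularizer can add loss, and therefore gradient, that is unrelated to next-token prediction unless `var_weight` is set to zero.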