PR #2142

open

records(non-record-16mb): JEPA-on-LM 14-run ablation (negative result)

val_bpb

1.2311

Architecture

Transformer

Optimizer

Muon

Artifact Size

—

Training Techniques

Architecture

weight tying

Tied input and output embeddings in the baseline GPT backbone.

parameters: null

ReLU²

Uses relu_sq activation in the backbone.

parameters: null
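
For reference, the squared-ReLU activation named above can be written in a few lines. This is a minimal scalar sketch; the function name `relu_sq` mirrors the identifier in the description, but the body is an illustration and not code from the PR.

```python
def relu_sq(x: float) -> float:
    """Squared ReLU: zero for negative inputs, x**2 otherwise."""
    return max(x, 0.0) ** 2


if __name__ == "__main__":
    for value in (-3.0, 0.0, 0.5, 2.0):
        print(value, relu_sq(value))
```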

KV head count

Uses 4 KV heads in the baseline backbone.

parameters: {"num_kv_heads":4}
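
With fewer KV heads than query heads, several query heads share one key/value head. The sketch below shows the usual grouped mapping. The query-head count of 8 is a hypothetical value chosen for illustration, since the record only states `num_kv_heads`, and `kv_head_for_query_head` is a made-up helper name.

```python
def kv_head_for_query_head(q_head: int, num_heads: int, num_kv_heads: int = 4) -> int:
    """Return the index of the KV head that a given query head reads from."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    if not 0 <= q_head < num_heads:
        raise ValueError("q_head out of range")
    group_size = num_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


if __name__ == "__main__":
    # Assumed: 8 query heads (not stated in the record), 4 KV heads (stated).
    print([kv_head_for_query_head(q, num_heads=8) for q in range(8)])
```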

JEPA predictor MLP

JEPA variants add a small predictor MLP while keeping the same backbone shape across runs.

parameters: {"predictor_hidden_dim":64}
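
To get a feel for how small such a predictor is, the sketch below counts parameters for a two-layer MLP with hidden width 64. The layer layout (model_dim to 64 and back to model_dim), the presence of biases, and the model width of 512 are all assumptions made for illustration; the record only gives `predictor_hidden_dim`.

```python
def predictor_mlp_params(model_dim: int, hidden_dim: int = 64, bias: bool = True) -> int:
    """Parameter count of an assumed model_dim -> hidden_dim -> model_dim MLP."""
    weights = model_dim * hidden_dim + hidden_dim * model_dim
    biases = (hidden_dim + model_dim) if bias else 0
    return weights + biases


if __name__ == "__main__":
    # model_dim=512 is a hypothetical width, not a value from the PR.
    print(predictor_mlp_params(512))
```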

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"scalars_optimizer":"Adam"}

LR Schedule

warmdown

parameters: {"warmup_steps":10,"warmdown_iters":1200,"schedule_type":"linear"}
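
One plausible reading of these schedule parameters is a short linear warmup, a flat plateau, and a linear decay to zero over the last 1200 iterations. The sketch below implements that reading as a learning-rate multiplier. The total step count of 5000, the linear shape of the warmup, and the function name `lr_scale` are assumptions, not values taken from the PR.

```python
def lr_scale(step: int, total_steps: int, warmup_steps: int = 10,
             warmdown_iters: int = 1200) -> float:
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Assumed linear warmup: ramps from 1/warmup_steps up to 1.0.
        return (step + 1) / warmup_steps
    warmdown_start = total_steps - warmdown_iters
    if step >= warmdown_start:
        # Linear warmdown: 1.0 at warmdown_start, 0.0 at total_steps.
        return max(total_steps - step, 0) / warmdown_iters
    return 1.0


if __name__ == "__main__":
    total = 5000  # hypothetical run length
    for s in (0, 9, 2000, 3800, 4400, 5000):
        print(s, lr_scale(s, total))
```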

Sequence Length

sequence_length

train_length: 1024

eval_length: null

A 14-run JEPA ablation that finds no improvement over the baseline at this scale.
The best JEPA variant exactly ties the same-seed baseline at val_bpb 1.2311.
The magnitude of the lambda/alpha auxiliary weights is the dominant factor: larger values hurt val_bpb.
Disabling VICReg variance regularization recovers exact parity with the baseline.
Token-decoder JEPA and injection variants degrade val_bpb even at small auxiliary weights.
The comparison is param-count-clean: the only addition to the shared backbone is a small predictor MLP.
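
To make the roles of the auxiliary weight and the variance regularizer concrete, here is a minimal sketch. The variance term follows the standard VICReg hinge on the per-dimension standard deviation. The way the terms are combined in `total_loss`, and the names `lam` and `var_weight`, are assumptions for illustration and may not match how the PR wires its lambda/alpha weights.

```python
import math


def vicreg_variance_term(embeddings, gamma: float = 1.0, eps: float = 1e-4) -> float:
    """Mean over dimensions of max(0, gamma - std), with std taken over the batch."""
    n, d = len(embeddings), len(embeddings[0])
    total = 0.0
    for j in range(d):
        column = [row[j] for row in embeddings]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / (n - 1)  # unbiased variance
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d


def total_loss(lm_loss: float, jepa_loss: float, variance_term: float,
               lam: float, var_weight: float) -> float:
    """Assumed combination: LM loss plus weighted auxiliary terms."""
    return lm_loss + lam * jepa_loss + var_weight * variance_term


if __name__ == "__main__":
    collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]   # no variance across the batch
    spread = [[-2.0, 0.0], [0.0, 3.0], [2.0, -3.0]]    # std above gamma in both dims
    print(vicreg_variance_term(collapsed), vicreg_variance_term(spread))
```

Under this sketch, setting both weights to zero reduces the objective to the plain LM loss, which is the limit in which any such auxiliary setup matches the baseline by construction.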