val_bpb: 1.4352
Architecture: Transformer
Optimizer: —
Artifact Size: 9.90 MB
Training Techniques

Regularization
- Spectral floor (parameters: {"lambda": 0.01, "eps": 0.01})
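The card gives only the name and the two hyperparameters, so the exact formulation is not stated. A minimal sketch of one plausible reading, a VICReg-style hinge that penalizes per-dimension standard deviation of hidden-state deltas falling below a floor `eps`, scaled by `lambda`:

```python
import math

def spectral_floor_penalty(deltas, lam=0.01, eps=0.01):
    """Hinge penalty pushing the per-dimension std of hidden-state
    deltas above a floor `eps`. The hinge form is an assumption; the
    card only lists {"lambda": 0.01, "eps": 0.01}."""
    n, dim = len(deltas), len(deltas[0])
    penalty = 0.0
    for d in range(dim):
        col = [row[d] for row in deltas]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        # Only dimensions whose std has collapsed below the floor contribute.
        penalty += max(0.0, eps - math.sqrt(var))
    return lam * penalty / dim

# The second dimension is constant (collapsed), so it is penalized:
deltas = [[0.5, 1.0], [-0.5, 1.0], [0.2, 1.0], [-0.2, 1.0]]
```

Under this reading, a batch whose every dimension varies above `eps` incurs zero penalty, which matches the "floor" framing: the term only activates to prevent dimensional collapse.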
Other
- Cosine-MSE latent prediction auxiliary loss with stop-gradient target and L2-normalized hidden states (parameters: {"alpha": 0.1, "layers": "2-5"})
Architecture
- KV head count: reduced key/value heads from 4 to 1 in MQA variants (parameters: {"kv_heads": 1})
- Value Residual: added value embeddings to the first and last layers in MQA variants
- MLP3x: increased the MLP width multiplier to 3 to use the freed parameters (parameters: {"multiplier": 3})
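A rough sketch of the parameter accounting behind these variants. The dimensions (d_model = 256, 4 query heads) and the 2x baseline MLP multiplier are illustrative assumptions, not stated in the card:

```python
def attn_params(d_model, n_heads, kv_heads):
    # Per-layer attention weights (no biases): Q and output projections
    # are d_model x d_model; K and V shrink with the number of KV heads.
    head_dim = d_model // n_heads
    return 2 * d_model * d_model + 2 * d_model * (kv_heads * head_dim)

def mlp_params(d_model, mult):
    # Two-matrix MLP: up-projection d -> mult*d, down-projection mult*d -> d.
    return 2 * d_model * mult * d_model

d = 256  # illustrative, not from the card
freed = attn_params(d, 4, 4) - attn_params(d, 4, 1)  # MHA -> MQA savings
extra = mlp_params(d, 3) - mlp_params(d, 2)  # 3x MLP vs an assumed 2x baseline
```

Whether the freed attention parameters fully fund the wider MLP depends on the actual dimensions and the baseline multiplier, which the card does not state.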
Quantization
- int8 (bits: 8, scope: all)
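The card specifies 8-bit quantization over all tensors but not the scheme. A minimal sketch assuming symmetric per-tensor int8 (per-channel scales or asymmetric zero-points would work similarly):

```python
def quantize_int8(w):
    """Symmetric per-tensor int8: scale by the max-abs value so
    everything lands in [-127, 127]. The per-tensor symmetric scheme
    is an assumption; the card only says bits: 8, scope: all."""
    scale = max(abs(x) for x in w) / 127 or 1.0  # avoid 0 scale for all-zero tensors
    return [round(x / scale) for x in w], scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

w = [0.1, -0.5, 0.25, 0.0]
q, scale = quantize_int8(w)  # ints in [-127, 127] plus one float scale
```

The round-trip error per element is bounded by half the scale, and storing one byte per weight plus a scale is what brings the artifact down toward the reported 9.90 MB before compression.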
Compression
- zlib (level: null)
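With level unspecified (null), zlib's default compression level presumably applies. A sketch of the final packaging step using Python's stdlib `zlib`; the payload here is a stand-in, not the actual serialized weights:

```python
import zlib

# Stand-in for the serialized int8 weight bytes (illustrative only).
payload = bytes(range(256)) * 64

# zlib.compress with no level argument uses the library default,
# matching the card's level: null.
blob = zlib.compress(payload)

assert zlib.decompress(blob) == payload  # compression is lossless
```

Because zlib is lossless, it shrinks the artifact without affecting val_bpb, unlike the int8 step, which is lossy.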
Novel Contributions
- JEPA-style auxiliary losses for next-token prediction in the parameter golf regime
- Spectral variance floor applied to hidden-state deltas to prevent dimensional collapse
- Cosine-MSE latent prediction auxiliary head with stop-gradient target
- Experimental analysis showing torch.compile confounded an apparent improvement
- MQA plus value embeddings and 3x MLP architectural variants evaluated as alternatives