PR #1556

open

Non-Record: JEPA-NTP Auxiliary Losses (Negative Result)

by sidhanth97
val_bpb
1.4352
Architecture
Transformer
Optimizer
Artifact Size
9.90 MB

Training Techniques

Regularization
spectral floor
parameters: {"lambda":0.01,"eps":0.01}
Other
other
cosine-MSE latent prediction auxiliary loss with stop-gradient target and L2-normalized hidden states
parameters: {"alpha":0.1,"layers":"2-5"}
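The auxiliary loss named above can be sketched as follows. This is a minimal illustration, not the PR's code: it assumes "cosine-MSE" means the squared error between L2-normalized vectors (which equals 2 - 2·cos), and that the target branch is detached to implement the stop-gradient. The predictor head, the tensor shapes, and the pairing of position t with position t+1 are hypothetical, and the "layers": "2-5" setting is not modeled.

```python
import torch
import torch.nn.functional as F


def cosine_mse_latent_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """MSE between L2-normalized vectors; per position this equals 2 - 2*cos(pred, target)."""
    pred_n = F.normalize(pred, dim=-1)
    # Stop-gradient: the target branch receives no gradient.
    target_n = F.normalize(target.detach(), dim=-1)
    return (pred_n - target_n).pow(2).sum(dim=-1).mean()


# Hypothetical shapes: batch 2, sequence 8, width 16.
torch.manual_seed(0)
hidden = torch.randn(2, 8, 16, requires_grad=True)
predictor = torch.nn.Linear(16, 16)  # hypothetical auxiliary head

# Assumed pairing: predict the hidden state at position t+1 from position t.
pred = predictor(hidden[:, :-1])
target = hidden[:, 1:]

alpha = 0.1  # weight taken from the listed parameters
aux_loss = alpha * cosine_mse_latent_loss(pred, target)
aux_loss.backward()
```

In a training loop the term would presumably be added to the next-token cross-entropy loss; how the PR combines them is not stated here.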
Architecture
KV head count
Reduced key/value heads from 4 to 1 in MQA variants
parameters: {"kv_heads":1}
Value Residual
Added value embeddings to first/last layers in MQA variants
parameters: null
MLP3x
Increased MLP width multiplier to use freed parameters
parameters: {"multiplier":3}
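To make the "freed parameters" idea concrete, the sketch below counts key/value projection weights for one attention layer before and after reducing the KV heads from 4 to 1. The model width and head dimension are hypothetical placeholders, not values from the PR, and biases are ignored.

```python
def kv_projection_params(d_model: int, head_dim: int, kv_heads: int) -> int:
    """Weights in the key and value projections of one attention layer (no biases)."""
    return 2 * d_model * head_dim * kv_heads


# Hypothetical dimensions, not taken from the PR.
d_model, head_dim = 512, 64

before = kv_projection_params(d_model, head_dim, kv_heads=4)
after = kv_projection_params(d_model, head_dim, kv_heads=1)
freed_per_layer = before - after
print(f"freed per layer: {freed_per_layer} weights")
```

Under these assumed dimensions, the freed weights per layer are what a wider MLP could reuse.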
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
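The int8-plus-zlib pipeline listed above can be illustrated with a small standard-library sketch. The quantization scheme (symmetric, one scale per tensor) is an assumption, since the card only states 8 bits over all weights. zlib's default level is used because the card lists no level. The toy weights are made up.

```python
import zlib
from array import array


def quantize_int8(weights: list[float]) -> tuple[bytes, float]:
    """Symmetric per-tensor int8 quantization (an assumed scheme)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric int8 range.
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q.tobytes(), scale


def dequantize_int8(payload: bytes, scale: float) -> list[float]:
    """Map stored int8 values back to floats using the saved scale."""
    q = array("b")
    q.frombytes(payload)
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0, 0.75] * 200  # toy tensor, not real model weights
payload, scale = quantize_int8(weights)
compressed = zlib.compress(payload)  # default compression level
restored = dequantize_int8(zlib.decompress(compressed), scale)
```

The round trip loses at most half a quantization step per weight; the compression ratio on real weights will differ from this repetitive toy input.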

Novel Contributions

  • JEPA-style auxiliary losses for next-token prediction in the parameter golf regime
  • Spectral variance floor applied to hidden-state deltas to prevent dimensional collapse
  • Cosine-MSE latent prediction auxiliary head with stop-gradient target
  • Experimental analysis showing torch.compile confounded an apparent improvement
  • MQA plus value embeddings and 3x MLP architectural variants evaluated as alternatives
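The spectral variance floor in the list above is not defined on this page, so the following is only one plausible reading: a hinge penalty on eigenvalues of the covariance of hidden-state deltas that fall below eps, scaled by lambda. The function name, the hinge form, and the way deltas are obtained are all assumptions. Only lambda = 0.01 and eps = 0.01 come from the listed parameters.

```python
import torch


def spectral_floor_penalty(deltas: torch.Tensor, eps: float = 0.01) -> torch.Tensor:
    """Hinge penalty on covariance eigenvalues below eps (assumed form).

    deltas: (n_samples, dim) hidden-state differences, however they are formed.
    """
    centered = deltas - deltas.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / (deltas.shape[0] - 1)
    eigvals = torch.linalg.eigvalsh(cov)  # real eigenvalues of the symmetric matrix
    return torch.relu(eps - eigvals).sum()


torch.manual_seed(0)
lam = 0.01  # lambda from the listed parameters

# Full-rank deltas: every direction has variance well above eps, so no penalty.
healthy = torch.randn(256, 8)

# Collapsed deltas: all variance lies in one dimension, so seven eigenvalues sit near zero.
collapsed = torch.zeros(256, 8)
collapsed[:, 0] = torch.randn(256)

reg_healthy = lam * spectral_floor_penalty(healthy)
reg_collapsed = lam * spectral_floor_penalty(collapsed)
```

Under this reading the penalty is zero while every direction keeps variance above eps, and grows as directions collapse, which matches the stated goal of preventing dimensional collapse.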