val_bpb: 1.3355
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~16.5 MB
Training Techniques
Architecture
U-Net skip connections
Uses a U-Net style encoder-decoder transformer: each encoder block's output is stored and added back at the matching decoder block.
parameters: {"layers":10,"encoder_layers":5,"decoder_layers":5,"dimensions":512}
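A minimal sketch of the skip-connection wiring, using the layer counts and dimension from the parameters above. The `block` function here is a stand-in residual map, not a real attention+MLP block, and the last-in/first-out pairing of encoder outputs to decoder inputs is the standard U-Net convention, assumed rather than stated:

```python
import numpy as np

rng = np.random.default_rng(0)

def block(x, W):
    # Stand-in for a transformer block (attention + MLP): a bounded residual map.
    return x + np.tanh(x @ W)

d, T, n_enc = 512, 8, 5
Ws_enc = [rng.normal(size=(d, d)) * 0.02 for _ in range(n_enc)]
Ws_dec = [rng.normal(size=(d, d)) * 0.02 for _ in range(n_enc)]

x = rng.normal(size=(T, d))
skips = []
for W in Ws_enc:                      # encoder half: store each block's output
    x = block(x, W)
    skips.append(x)
for W in Ws_dec:                      # decoder half: add back the matching
    x = block(x + skips.pop(), W)     # encoder output (last-in, first-out)
```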
Squared LeakyReLU
Uses squared LeakyReLU as the activation: LeakyReLU(x; negative_slope=0.5)^2.
parameters: {"negative_slope":0.5}
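The activation is small enough to write out directly; the only parameter from the entry is the negative slope of 0.5:

```python
import numpy as np

def squared_leaky_relu(x, negative_slope=0.5):
    # LeakyReLU keeps a scaled-down copy of negative inputs,
    # then squaring makes the output nonnegative and smooth at 0.
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```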
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
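A sketch of the grouped-query attention pattern with the head counts above: each of the 4 KV heads is shared by 2 query heads. Input/output projections and the causal mask are omitted for brevity, so this shows only the head-sharing mechanism:

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    # q: (T, n_heads, d); k, v: (T, n_kv_heads, d)
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)   # each KV head serves `group` query heads
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)   # (heads, T, T)
    scores -= scores.max(axis=-1, keepdims=True)            # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v)                  # back to (T, heads, d)
```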
RoPE
Uses rotary positional embeddings.
parameters: null
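Since `parameters` is null, the RoPE base of 10000 below is the conventional default, not a value taken from this entry. The sketch rotates consecutive feature pairs by position-dependent angles:

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (T, d) with even d; rotate each consecutive feature pair
    # by an angle that grows with position and shrinks with frequency index.
    T, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) per-pair frequencies
    theta = np.outer(np.arange(T), inv_freq)       # (T, d/2) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is only rotated, position 0 is unchanged and per-row norms are preserved.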
weight tying
Ties the token embedding matrix to the output projection and applies a softcap to the logits.
parameters: null
JEPA bottleneck
Adds a 256-dim latent projection and next-position latent predictor at the bottleneck for JEPA-style training.
parameters: {"jepa_dim":256}
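The bottleneck components can be sketched with the dimensions above (512-dim model, 256-dim latent). The random projection and predictor weights here are placeholders for learned parameters, and a linear predictor is an assumption; the entry only specifies the latent dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, jepa_dim, T = 512, 256, 16
W_proj = rng.normal(size=(d_model, jepa_dim)) / np.sqrt(d_model)   # latent projection
W_pred = rng.normal(size=(jepa_dim, jepa_dim)) / np.sqrt(jepa_dim) # next-position predictor

h = rng.normal(size=(T, d_model))   # bottleneck hidden states
z = h @ W_proj                      # (T, 256) latents
z_pred = z[:-1] @ W_pred            # predict the latent at position t+1 from position t
z_target = z[1:]                    # target latents (stop-gradient during training)
mse = np.mean((z_pred - z_target) ** 2)
```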
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_orthogonalization":true}
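A sketch of the Newton-Schulz orthogonalization step that Muon applies to gradient matrices, driving the update toward an (approximately) orthogonal matrix with the same "direction" as the input. The quintic coefficients below are those of the public Muon reference implementation, an assumption here since the entry only flags that the iteration is used:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    # Quintic Newton-Schulz iteration: pushes all singular values of G
    # toward 1 while keeping its singular vectors (approximates U V^T).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```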
Quantization
int8
bits: 8
scope: artifact
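With `scope: artifact`, the weights are presumably stored as int8 plus a scale and dequantized on load. Symmetric per-tensor scaling is an assumption; the entry only specifies 8 bits:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8 quantization: store int8 codes plus one
    # float scale, reconstruct with w ~= q * scale.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Round-trip error is bounded by half a quantization step (scale / 2).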
Compression
zlib
level: null
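The compression stage is a standard zlib round trip over the serialized quantized weights. The compression level is unspecified in the entry (`level: null`); level 9 below is an arbitrary choice for illustration:

```python
import zlib

import numpy as np

# Stand-in for serialized int8 weights: low-variance values compress well
# because most bytes fall in a narrow range.
rng = np.random.default_rng(0)
q = np.clip(np.round(rng.normal(size=4096) * 5), -127, 127).astype(np.int8)

raw = q.tobytes()
packed = zlib.compress(raw, level=9)
restored = zlib.decompress(packed)
```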
Regularization
logit softcap
parameters: null
SIGReg
parameters: {"weight":0.03,"num_proj":128,"subsample":4096}
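A heavily simplified stand-in for SIGReg with the parameters above: project a subsample of latents onto random unit directions and penalize each 1D marginal's deviation from N(0, 1). This sketch uses only the first two moments as the per-direction statistic, which captures the mechanism (random projections plus an isotropic-Gaussian target, discouraging collapse) but is not necessarily the exact test statistic SIGReg uses:

```python
import numpy as np

def sigreg_penalty(z, num_proj=128, subsample=4096, rng=None):
    # z: (N, d) latent embeddings.
    rng = rng or np.random.default_rng(0)
    if z.shape[0] > subsample:                       # subsample for cost
        z = z[rng.choice(z.shape[0], subsample, replace=False)]
    dirs = rng.normal(size=(z.shape[1], num_proj))   # random directions
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    p = z @ dirs                                     # (N, num_proj) 1D projections
    mean_err = p.mean(axis=0) ** 2                   # target mean 0
    var_err = (p.var(axis=0) - 1.0) ** 2             # target variance 1
    return float(np.mean(mean_err + var_err))
```

Collapsed latents (all identical) score a large penalty, while unit-Gaussian latents score near zero.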
Other
other
Adds a JEPA auxiliary loss on the latent embeddings: MSE between predicted latents and stop-gradient targets.
parameters: {"jepa_alpha":0.1}
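The combination of the cross-entropy objective with the weighted JEPA term can be sketched as below. The stop-gradient is a no-op in numpy; in the real training graph the target branch simply receives no gradient:

```python
import numpy as np

def total_loss(ce, z_pred, z_target, jepa_alpha=0.1):
    # ce: scalar cross-entropy loss; z_pred, z_target: (T, jepa_dim) latents.
    z_target = z_target.copy()   # stands in for stop_gradient(z_target)
    jepa_mse = np.mean((z_pred - z_target) ** 2)
    return ce + jepa_alpha * jepa_mse
```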
Novel Contributions
- First application of JEPA to text compression / language model parameter golf
- Adds a JEPA-style latent projection and next-position predictor at the transformer bottleneck
- Introduces SIGReg to prevent latent collapse in a text setting
- Claims zero inference overhead from JEPA components because they are used only during training
- Combines CE with auxiliary JEPA MSE and SIGReg losses