val_bpb: 1.3355
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~16.5 MB
Training Techniques
Architecture
U-Net skip connections
Uses a U-Net style encoder-decoder transformer: each encoder block's output is stored and added back at the matching decoder block.
parameters: {"layers":10,"encoder_layers":5,"decoder_layers":5,"dimensions":512}
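A minimal sketch of the skip-connection wiring, using the layer counts and dimension from the parameters above. The `block` function here is a stand-in residual map, not a real attention+MLP block, and the last-in/first-out pairing of encoder outputs to decoder inputs is the standard U-Net convention, assumed rather than stated:

```python
import numpy as np

rng = np.random.default_rng(0)

def block(x, W):
    # Stand-in for a transformer block (attention + MLP): a bounded residual map.
    return x + np.tanh(x @ W)

d, T, n_enc = 512, 8, 5
Ws_enc = [rng.normal(size=(d, d)) * 0.02 for _ in range(n_enc)]
Ws_dec = [rng.normal(size=(d, d)) * 0.02 for _ in range(n_enc)]

x = rng.normal(size=(T, d))
skips = []
for W in Ws_enc:                      # encoder half: store each block's output
    x = block(x, W)
    skips.append(x)
for W in Ws_dec:                      # decoder half: add back the matching
    x = block(x + skips.pop(), W)     # encoder output (last-in, first-out)
```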
Squared LeakyReLU
Uses squared LeakyReLU as the activation: LeakyReLU(x; negative_slope=0.5)^2.
parameters: {"negative_slope":0.5}
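The activation is small enough to write out directly; the only parameter from the entry is the negative slope of 0.5:

```python
import numpy as np

def squared_leaky_relu(x, negative_slope=0.5):
    # LeakyReLU keeps a scaled-down copy of negative inputs,
    # then squaring makes the output nonnegative and smooth at 0.
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```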
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
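A sketch of the grouped-query attention pattern with the head counts above: each of the 4 KV heads is shared by 2 query heads. Input/output projections and the causal mask are omitted for brevity, so this shows only the head-sharing mechanism:

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    # q: (T, n_heads, d); k, v: (T, n_kv_heads, d)
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)   # each KV head serves `group` query heads
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)   # (heads, T, T)
    scores -= scores.max(axis=-1, keepdims=True)            # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v)                  # back to (T, heads, d)
```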
RoPE
Uses rotary positional embeddings.
parameters: null
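Since `parameters` is null, the RoPE base of 10000 below is the conventional default, not a value taken from this entry. The sketch rotates consecutive feature pairs by position-dependent angles:

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (T, d) with even d; rotate each consecutive feature pair
    # by an angle that grows with position and shrinks with frequency index.
    T, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) per-pair frequencies
    theta = np.outer(np.arange(T), inv_freq)       # (T, d/2) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is only rotated, position 0 is unchanged and per-row norms are preserved.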
weight tying
Ties the token embedding matrix to the output projection and applies a softcap to the logits.
parameters: null
JEPA bottleneck
Adds a 256-dim latent projection and next-position latent predictor at the bottleneck for JEPA-style training.
parameters: {"jepa_dim":256}
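The bottleneck components can be sketched with the dimensions above (512-dim model, 256-dim latent). The random projection and predictor weights here are placeholders for learned parameters, and a linear predictor is an assumption; the entry only specifies the latent dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, jepa_dim, T = 512, 256, 16
W_proj = rng.normal(size=(d_model, jepa_dim)) / np.sqrt(d_model)   # latent projection
W_pred = rng.normal(size=(jepa_dim, jepa_dim)) / np.sqrt(jepa_dim) # next-position predictor

h = rng.normal(size=(T, d_model))   # bottleneck hidden states
z = h @ W_proj                      # (T, 256) latents
z_pred = z[:-1] @ W_pred            # predict the latent at position t+1 from position t
z_target = z[1:]                    # target latents (stop-gradient during training)
mse = np.mean((z_pred - z_target) ** 2)
```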
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_orthogonalization":true}
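A sketch of the Newton-Schulz orthogonalization step that Muon applies to gradient matrices, driving the update toward an (approximately) orthogonal matrix with the same "direction" as the input. The quintic coefficients below are those of the public Muon reference implementation, an assumption here since the entry only flags that the iteration is used:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    # Quintic Newton-Schulz iteration: pushes all singular values of G
    # toward 1 while keeping its singular vectors (approximates U V^T).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```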
Quantization
int8
bits: 8
scope: artifact
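With `scope: artifact`, the weights are presumably stored as int8 plus a scale and dequantized on load. Symmetric per-tensor scaling is an assumption; the entry only specifies 8 bits:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8 quantization: store int8 codes plus one
    # float scale, reconstruct with w ~= q * scale.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Round-trip error is bounded by half a quantization step (scale / 2).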
Compression
zlib
level: null
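The compression stage is a standard zlib round trip over the serialized quantized weights. The compression level is unspecified in the entry (`level: null`); level 9 below is an arbitrary choice for illustration:

```python
import zlib

import numpy as np

# Stand-in for serialized int8 weights: low-variance values compress well
# because most bytes fall in a narrow range.
rng = np.random.default_rng(0)
q = np.clip(np.round(rng.normal(size=4096) * 5), -127, 127).astype(np.int8)

raw = q.tobytes()
packed = zlib.compress(raw, level=9)
restored = zlib.decompress(packed)
```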
Regularization
logit softcap
parameters: null
SIGReg
parameters: {"weight":0.03,"num_proj":128,"subsample":4096}
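A heavily simplified stand-in for SIGReg with the parameters above: project a subsample of latents onto random unit directions and penalize each 1D marginal's deviation from N(0, 1). This sketch uses only the first two moments as the per-direction statistic, which captures the mechanism (random projections plus an isotropic-Gaussian target, discouraging collapse) but is not necessarily the exact test statistic SIGReg uses:

```python
import numpy as np

def sigreg_penalty(z, num_proj=128, subsample=4096, rng=None):
    # z: (N, d) latent embeddings.
    rng = rng or np.random.default_rng(0)
    if z.shape[0] > subsample:                       # subsample for cost
        z = z[rng.choice(z.shape[0], subsample, replace=False)]
    dirs = rng.normal(size=(z.shape[1], num_proj))   # random directions
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    p = z @ dirs                                     # (N, num_proj) 1D projections
    mean_err = p.mean(axis=0) ** 2                   # target mean 0
    var_err = (p.var(axis=0) - 1.0) ** 2             # target variance 1
    return float(np.mean(mean_err + var_err))
```

Collapsed latents (all identical) score a large penalty, while unit-Gaussian latents score near zero.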
Other
other
Adds a JEPA auxiliary loss on the latent embeddings: MSE between predicted latents and stop-gradient targets.
parameters: {"jepa_alpha":0.1}
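The combination of the cross-entropy objective with the weighted JEPA term can be sketched as below. The stop-gradient is a no-op in numpy; in the real training graph the target branch simply receives no gradient:

```python
import numpy as np

def total_loss(ce, z_pred, z_target, jepa_alpha=0.1):
    # ce: scalar cross-entropy loss; z_pred, z_target: (T, jepa_dim) latents.
    z_target = z_target.copy()   # stands in for stop_gradient(z_target)
    jepa_mse = np.mean((z_pred - z_target) ** 2)
    return ce + jepa_alpha * jepa_mse
```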
Novel Contributions
- First application of JEPA to text compression / language model parameter golf
- Adds a JEPA-style latent projection and next-position predictor at the transformer bottleneck
- Introduces SIGReg to prevent latent collapse in a text setting
- Claims zero inference overhead from JEPA components because they are used only during training
- Combines CE with auxiliary JEPA MSE and SIGReg losses