PR #903 (open)
[Notable Non-Record Submission] To JEPA or Not to JEPA: That Is Le Question (32.8M LeWorldModel Mamba2 Style Text Implementation - 1.2064 BPB)
by CiprianFlorin-Ifrim
val_bpb: 1.2064
Architecture: Mamba
Optimizer: Muon
Artifact Size: 15.75 MB
Training Techniques
Architecture
weight tying
Tied input embeddings and output head to reduce parameter count, especially for BPE vocabularies.
parameters: {"vocab_size":8192}
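Weight tying can be sketched as a single matrix serving both roles; the names and dimensions below are illustrative assumptions, not the submission's code:

```python
import numpy as np

# One shared matrix acts as the input embedding table AND the output
# (unembedding) head, roughly halving embedding parameter cost for the
# 8192-entry BPE vocabulary. d_model = 256 is an assumed illustration.
vocab_size, d_model = 8192, 256
rng = np.random.default_rng(0)
E = rng.normal(scale=0.02, size=(vocab_size, d_model))  # the tied weights

def embed(token_ids):
    """Input side: look up rows of the shared matrix."""
    return E[token_ids]

def unembed(hidden):
    """Output side: project hidden states back onto the same matrix."""
    return hidden @ E.T  # logits over the vocabulary

h = embed(np.array([1, 5, 42]))   # (3, d_model)
logits = unembed(h)               # (3, vocab_size)
```

Because both directions read the same array, any gradient update to the output head also moves the input embeddings, which is the point of the technique.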
U-Net skip connections
Encoder-decoder style skip connections with LIFO skip stack and residual mixing from the embedding output.
parameters: {"layers":10}
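The LIFO skip stack can be sketched as follows; the layer function, mixing coefficients, and 5/5 encoder-decoder split are assumptions for illustration only:

```python
import numpy as np

# Encoder half pushes each layer output onto a stack; decoder half pops
# them in LIFO order (layer i pairs with layer n-1-i) and mixes them in,
# plus a residual mix from the embedding output at the end.
n_layers, d = 10, 8
rng = np.random.default_rng(0)

def layer(x):
    # Stand-in for a Mamba-2 block; tanh keeps the sketch self-contained.
    return np.tanh(x)

x0 = rng.normal(size=(4, d))       # embedding output
x, skips = x0, []
for _ in range(n_layers // 2):     # encoder half: push
    x = layer(x)
    skips.append(x)
for _ in range(n_layers // 2):     # decoder half: pop (LIFO) and mix
    x = layer(x + 0.5 * skips.pop())
x = x + 0.1 * x0                   # residual mix from the embedding output
```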
ReLU²
Squared ReLU MLP activation used for channel mixing.
parameters: null
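The squared-ReLU channel-mixing MLP is a standard construction; the hidden width and weight shapes below are illustrative:

```python
import numpy as np

def relu2(x):
    # Squared ReLU: max(x, 0) ** 2
    return np.square(np.maximum(x, 0.0))

# Channel-mixing MLP sketch: up-project, ReLU^2, down-project.
d, hidden = 8, 32
rng = np.random.default_rng(0)
W_up = rng.normal(size=(d, hidden))
W_down = rng.normal(size=(hidden, d))

def mlp(x):
    return relu2(x @ W_up) @ W_down

y = mlp(np.ones((4, d)))
```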
Quantization
QAT
bits: 4
scope: large weights
FP8
bits: 8
scope: embeddings and medium matrices
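The QAT forward pass can be sketched as "fake quantization": weights are rounded onto the INT4 grid and immediately dequantized, so training sees the values the compressed artifact will actually contain. Symmetric per-tensor scaling is an assumption here; the submission's exact scheme (and its FP8 path) may differ:

```python
import numpy as np

def fake_quant(w, bits=4):
    # Quantize-dequantize: round onto a signed (bits)-bit grid, then map
    # back to floats. The straight-through estimator would pass gradients
    # through this op unchanged during training.
    qmax = 2 ** (bits - 1) - 1                      # 7 for INT4
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                 # values on the INT4 grid

w = np.random.default_rng(0).normal(size=(16, 16))
wq = fake_quant(w)
```

An INT4 tensor can take at most 16 distinct values, and round-to-nearest bounds the per-weight error by half a quantization step.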
Compression
brotli
level: null
Evaluation
sliding window eval
parameters: {"stride":16}
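Sliding-window evaluation with stride 16 can be sketched as follows: each window scores only its last `stride` tokens, so every token is predicted with close-to-full context. The scorer below is a hypothetical stand-in for a model forward pass, not the submission's code:

```python
import numpy as np

def sliding_window_nll(tokens, score_window, window=1024, stride=16):
    # Slide over the sequence; each step scores only the `end - start`
    # newest tokens, given up to `window` tokens of preceding context.
    nll, counted = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx_start = max(0, end - window)
        nll += score_window(tokens[ctx_start:end], n_new=end - start)
        counted += end - start
        if end == len(tokens):
            break
    return nll / counted  # mean NLL per token

# Dummy scorer charging 1 nat per new token (illustration only).
mean_nll = sliding_window_nll(list(range(100)),
                              lambda w, n_new: float(n_new))
```

The overlap costs extra forward passes (one per stride, not one per sequence) in exchange for a lower, fairer BPB.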
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
sequence_length
train_length: 8192
eval_length: 8192
Regularization
logit softcap
parameters: {"cap":15}
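Logit soft-capping with cap 15 is the usual tanh squash: smooth, bounded to (-15, 15), and near-identity for small logits. A minimal sketch:

```python
import numpy as np

def softcap(logits, cap=15.0):
    # Smoothly bounds logits to (-cap, cap); approximately linear near 0.
    return cap * np.tanh(logits / cap)

x = softcap(np.array([-100.0, 0.1, 100.0]))
```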
weight decay
parameters: null
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"beta1":0.9,"beta2":0.95,"embed_lr":0.01,"scalar_lr":0.01}
Novel Contributions
- Applies LeWorldModel-style JEPA with SIGReg to text language modeling
- Combines Mamba-2 SSM with U-Net skip connections for a non-attention architecture
- Uses multi-step latent prediction as an auxiliary training signal
- Employs mixed INT4/FP8 quantization-aware training from step 1
- Uses sliding-window evaluation to improve BPB on recurrent state models
- Uses tied embeddings and factored embedding projections to fit within the 16 MB artifact budget
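The multi-step latent prediction signal above can be sketched JEPA-style: a small predictor maps the latent at step t toward the encoder's latent at step t+k, with the target branch treated as a constant (stop-gradient). The predictor form, k = 4, and the cosine objective are all illustrative assumptions, not the submission's exact loss:

```python
import numpy as np

def cosine_loss(pred, target):
    # 1 - mean cosine similarity between predicted and target latents.
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return 1.0 - np.mean(np.sum(pred * target, axis=-1))

rng = np.random.default_rng(0)
latents = rng.normal(size=(128, 16))       # encoder latents over time
W = rng.normal(scale=0.1, size=(16, 16))   # linear predictor (assumed)

k = 4                                       # predict 4 steps ahead
pred = latents[:-k] @ W
target = latents[k:]                        # stop-grad: held constant
aux = cosine_loss(pred, target)             # auxiliary training signal
```

In training this auxiliary loss would be added, with some weight, to the usual next-token cross-entropy.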