PR #903 (open)
[Notable Non-Record Submission] To JEPA or Not to JEPA: That Is Le Question (32.8M LeWorldModel Mamba2 Style Text Implementation - 1.2064 BPB)
by CiprianFlorin-Ifrim
val_bpb: 1.2064
Architecture: Mamba
Optimizer: Muon
Artifact Size: 15.75 MB
Training Techniques
Architecture
weight tying
Tied input embeddings and output head to reduce parameter count, especially for BPE vocabularies.
parameters: {"vocab_size":8192}
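Weight tying can be sketched as a single matrix serving both roles; the names and dimensions below are illustrative assumptions, not the submission's code:

```python
import numpy as np

# One shared matrix acts as the input embedding table AND the output
# (unembedding) head, roughly halving embedding parameter cost for the
# 8192-entry BPE vocabulary. d_model = 256 is an assumed illustration.
vocab_size, d_model = 8192, 256
rng = np.random.default_rng(0)
E = rng.normal(scale=0.02, size=(vocab_size, d_model))  # the tied weights

def embed(token_ids):
    """Input side: look up rows of the shared matrix."""
    return E[token_ids]

def unembed(hidden):
    """Output side: project hidden states back onto the same matrix."""
    return hidden @ E.T  # logits over the vocabulary

h = embed(np.array([1, 5, 42]))   # (3, d_model)
logits = unembed(h)               # (3, vocab_size)
```

Because both directions read the same array, any gradient update to the output head also moves the input embeddings, which is the point of the technique.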
U-Net skip connections
Encoder-decoder style skip connections with LIFO skip stack and residual mixing from the embedding output.
parameters: {"layers":10}
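The LIFO skip stack can be sketched as follows; the layer function, mixing coefficients, and 5/5 encoder-decoder split are assumptions for illustration only:

```python
import numpy as np

# Encoder half pushes each layer output onto a stack; decoder half pops
# them in LIFO order (layer i pairs with layer n-1-i) and mixes them in,
# plus a residual mix from the embedding output at the end.
n_layers, d = 10, 8
rng = np.random.default_rng(0)

def layer(x):
    # Stand-in for a Mamba-2 block; tanh keeps the sketch self-contained.
    return np.tanh(x)

x0 = rng.normal(size=(4, d))       # embedding output
x, skips = x0, []
for _ in range(n_layers // 2):     # encoder half: push
    x = layer(x)
    skips.append(x)
for _ in range(n_layers // 2):     # decoder half: pop (LIFO) and mix
    x = layer(x + 0.5 * skips.pop())
x = x + 0.1 * x0                   # residual mix from the embedding output
```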
ReLU²
Squared ReLU MLP activation used for channel mixing.
parameters: null
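The squared-ReLU channel-mixing MLP is a standard construction; the hidden width and weight shapes below are illustrative:

```python
import numpy as np

def relu2(x):
    # Squared ReLU: max(x, 0) ** 2
    return np.square(np.maximum(x, 0.0))

# Channel-mixing MLP sketch: up-project, ReLU^2, down-project.
d, hidden = 8, 32
rng = np.random.default_rng(0)
W_up = rng.normal(size=(d, hidden))
W_down = rng.normal(size=(hidden, d))

def mlp(x):
    return relu2(x @ W_up) @ W_down

y = mlp(np.ones((4, d)))
```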
Quantization
QAT
bits: 4
scope: large weights
FP8
bits: 8
scope: embeddings and medium matrices
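The QAT forward pass can be sketched as "fake quantization": weights are rounded onto the INT4 grid and immediately dequantized, so training sees the values the compressed artifact will actually contain. Symmetric per-tensor scaling is an assumption here; the submission's exact scheme (and its FP8 path) may differ:

```python
import numpy as np

def fake_quant(w, bits=4):
    # Quantize-dequantize: round onto a signed (bits)-bit grid, then map
    # back to floats. The straight-through estimator would pass gradients
    # through this op unchanged during training.
    qmax = 2 ** (bits - 1) - 1                      # 7 for INT4
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                 # values on the INT4 grid

w = np.random.default_rng(0).normal(size=(16, 16))
wq = fake_quant(w)
```

An INT4 tensor can take at most 16 distinct values, and round-to-nearest bounds the per-weight error by half a quantization step.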
Compression
brotli
level: null
Evaluation
sliding window eval
parameters: {"stride":16}
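Sliding-window evaluation with stride 16 can be sketched as follows: each window scores only its last `stride` tokens, so every token is predicted with close-to-full context. The scorer below is a hypothetical stand-in for a model forward pass, not the submission's code:

```python
import numpy as np

def sliding_window_nll(tokens, score_window, window=1024, stride=16):
    # Slide over the sequence; each step scores only the `end - start`
    # newest tokens, given up to `window` tokens of preceding context.
    nll, counted = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx_start = max(0, end - window)
        nll += score_window(tokens[ctx_start:end], n_new=end - start)
        counted += end - start
        if end == len(tokens):
            break
    return nll / counted  # mean NLL per token

# Dummy scorer charging 1 nat per new token (illustration only).
mean_nll = sliding_window_nll(list(range(100)),
                              lambda w, n_new: float(n_new))
```

The overlap costs extra forward passes (one per stride, not one per sequence) in exchange for a lower, fairer BPB.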
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
sequence_length
train_length: 8192
eval_length: 8192
Regularization
logit softcap
parameters: {"cap":15}
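Logit soft-capping with cap 15 is the usual tanh squash: smooth, bounded to (-15, 15), and near-identity for small logits. A minimal sketch:

```python
import numpy as np

def softcap(logits, cap=15.0):
    # Smoothly bounds logits to (-cap, cap); approximately linear near 0.
    return cap * np.tanh(logits / cap)

x = softcap(np.array([-100.0, 0.1, 100.0]))
```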
weight decay
parameters: null
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"beta1":0.9,"beta2":0.95,"embed_lr":0.01,"scalar_lr":0.01}
Novel Contributions
- Applies LeWorldModel-style JEPA with SIGReg to text language modeling
- Combines Mamba-2 SSM with U-Net skip connections for a non-attention architecture
- Uses multi-step latent prediction as an auxiliary training signal
- Employs mixed INT4/FP8 quantization-aware training from step 1
- Uses sliding-window evaluation to improve BPB on recurrent state models
- Uses tied embeddings and factored embedding projections to fit within the 16 MB artifact budget
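The multi-step latent prediction signal above can be sketched JEPA-style: a small predictor maps the latent at step t toward the encoder's latent at step t+k, with the target branch treated as a constant (stop-gradient). The predictor form, k = 4, and the cosine objective are all illustrative assumptions, not the submission's exact loss:

```python
import numpy as np

def cosine_loss(pred, target):
    # 1 - mean cosine similarity between predicted and target latents.
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return 1.0 - np.mean(np.sum(pred * target, axis=-1))

rng = np.random.default_rng(0)
latents = rng.normal(size=(128, 16))       # encoder latents over time
W = rng.normal(scale=0.1, size=(16, 16))   # linear predictor (assumed)

k = 4                                       # predict 4 steps ahead
pred = latents[:-k] @ W
target = latents[k:]                        # stop-grad: held constant
aux = cosine_loss(pred, target)             # auxiliary training signal
```

In training this auxiliary loss would be added, with some weight, to the usual next-token cross-entropy.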