val_bpb: 1.1085
Architecture: Hybrid
Optimizer: AdamW
Artifact Size: 15,977,978 bytes
Training Techniques
Architecture
JEPA
Auxiliary joint-embedding predictive loss that predicts future hidden states in latent space across multiple horizons.
parameters: {"horizons":[1,2,4,8]}
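A minimal pure-Python sketch of this auxiliary loss under stated assumptions: an MSE objective in latent space and a per-horizon predictor (`predict` is a hypothetical stand-in; the real predictor and the stop-gradient target encoder are not specified above).

```python
# Toy multi-horizon JEPA-style auxiliary loss. Hidden states are plain
# Python vectors here; in a real model the targets would be detached
# (stop-gradient) hidden states from the same or an EMA encoder.

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def jepa_aux_loss(hidden, predict, horizons=(1, 2, 4, 8)):
    """hidden: list of per-position hidden-state vectors.
    predict(h_vec, horizon) -> predicted future hidden state."""
    total, count = 0.0, 0
    for h in horizons:
        for t in range(len(hidden) - h):
            total += mse(predict(hidden[t], h), hidden[t + h])
            count += 1
    return total / max(count, 1)

# Illustration only: hidden states follow a simple linear drift, and a
# matching "predictor" drives the loss to zero.
hidden = [[float(t), float(t) + 1.0] for t in range(10)]
loss = jepa_aux_loss(hidden, lambda v, h: [x + h for x in v])
```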
XSA
Cross-sequence attention applied to all 11 layers.
parameters: {"layers":11}
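One plausible reading of cross-sequence attention, sketched below as an assumption: the causal mask is built over the concatenation of several sequences in a group, so a token may also attend to earlier tokens of the other sequences, instead of being restricted to its own sequence by a block-diagonal mask.

```python
# Hedged sketch: causal mask over a concatenated group of sequences,
# so attention crosses sequence boundaries (vs. per-sequence masking).

def xsa_mask(seq_lens):
    n = sum(seq_lens)
    # Position j is visible to position i whenever j <= i, regardless
    # of which sequence j belongs to.
    return [[j <= i for j in range(n)] for i in range(n)]

mask = xsa_mask([3, 3])
# Token 3 (first token of the second sequence) sees all of sequence 1:
visible_to_3 = [j for j in range(6) if mask[3][j]]
```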
U-Net skip connections
Encoder-decoder skip connections in the 11-layer architecture.
parameters: {"layers":11}
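The usual U-Net pairing for an 11-layer stack can be written down directly; this is a hedged sketch (the exact wiring of the skips is not recorded above), with the middle layer left unpaired.

```python
# Skip-connection pairing in an 11-layer encoder-decoder stack: layer
# i's output is carried across to layer (L - 1 - i)'s input; the middle
# layer (layer 5) has no partner.

L_LAYERS = 11
pairs = [(i, L_LAYERS - 1 - i) for i in range(L_LAYERS // 2)]
```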
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
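With 8 query heads and 4 KV heads, each KV head serves a group of two query heads; the head-to-group mapping is just integer division:

```python
# GQA head mapping: heads // kv_heads query heads share each KV head.

HEADS, KV_HEADS = 8, 4
GROUP = HEADS // KV_HEADS  # 2 query heads per KV head

def kv_head_for(query_head):
    return query_head // GROUP

mapping = [kv_head_for(q) for q in range(HEADS)]
```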
BigramHash
Bigram hash embedding component.
parameters: {"size":2048}
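A hypothetical sketch of the indexing such a component would need: each (previous token, current token) pair is hashed into one of the 2,048 rows of an auxiliary embedding table (the hash function and how the row is combined with the token embedding are assumptions, not recorded above).

```python
import hashlib

# Map a token bigram to a row of a 2048-entry embedding table via a
# stable hash (SHA-256 here is an illustrative choice).

TABLE_SIZE = 2048

def bigram_slot(prev_tok, tok):
    key = f"{prev_tok}:{tok}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % TABLE_SIZE

slots = [bigram_slot(a, b) for a, b in [(5, 9), (9, 13), (13, 7)]]
```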
SmearGate
SmearGate gating mechanism.
parameters: null
Partial RoPE
Rotary position embeddings applied only to part of the head dimension.
parameters: {"dimensions":16}
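A sketch of partial RoPE with the recorded setting: only the first 16 dimensions of each head are rotated, the rest pass through unchanged (pairing and base frequency follow the standard RoPE convention, assumed here).

```python
import math

ROT_DIMS = 16  # only the first 16 dims of each head get rotated

def partial_rope(vec, pos, base=10000.0):
    out = list(vec)
    half = ROT_DIMS // 2
    for i in range(half):
        theta = pos / (base ** (2 * i / ROT_DIMS))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out  # dims >= ROT_DIMS are position-independent

head_dim = 32
v = [1.0] * head_dim
rotated = partial_rope(v, pos=3)
```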
LeakyReLU
LeakyReLU squared activation.
parameters: {"squared":true,"negative_slope":0.5}
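One reading of this activation, sketched under an explicit assumption: apply LeakyReLU with slope 0.5, square the result, and keep the sign of the input so the function stays monotonic (a plain square would fold negatives onto positives; which convention the run used is not recorded).

```python
# Sign-preserving "squared LeakyReLU" with negative_slope = 0.5
# (assumption: sign is kept, so negative inputs map to negative outputs).

def leaky_relu_sq(x, slope=0.5):
    y = x if x >= 0 else slope * x
    return y * y if x >= 0 else -(y * y)

vals = [leaky_relu_sq(x) for x in (-2.0, 0.0, 3.0)]
```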
Test-Time Training
Full test-time training (all weights adapted), run before quantization.
parameters: {"mode":"pre-quantization","epochs":3}
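The idea can be sketched with a toy one-parameter model: starting from the trained (EMA-averaged) weights, take a few epochs of gradient steps on the test stream's own next-token objective, then hand the adapted weights to quantization. Plain SGD with a numerical gradient stands in for AdamW here; everything below is illustrative.

```python
# Toy test-time training: adapt a single weight w on the evaluation
# stream itself for 3 "epochs" before (hypothetically) quantizing.

def loss(w, stream):
    # toy objective: predict each value from the previous one via w
    return sum((w * a - b) ** 2 for a, b in zip(stream, stream[1:]))

def grad(w, stream, eps=1e-6):
    return (loss(w + eps, stream) - loss(w - eps, stream)) / (2 * eps)

w = 0.0                        # stands in for the pre-TTT weights
stream = [1.0, 2.0, 4.0, 8.0]  # the test stream itself
for _ in range(3):             # epochs: 3
    w -= 0.01 * grad(w, stream)
# w has moved toward the stream's doubling pattern; quantization would
# then be applied to the adapted weights.
```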
Quantization
GPTQ
bits: 6
scope: all
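For reference, the 6-bit grid GPTQ quantizes onto looks like the simplified round-to-nearest below; real GPTQ additionally compensates each column's rounding error using second-order (Hessian) information, which this sketch deliberately omits.

```python
# Simplified symmetric 6-bit quantization (round-to-nearest). GPTQ
# itself processes weights column by column and redistributes rounding
# error via the inverse Hessian; only the target grid is shown here.

def quantize_6bit(weights):
    levels = 2 ** 6 // 2 - 1          # 31 positive levels
    scale = max(abs(w) for w in weights) / levels
    q = [round(w / scale) for w in weights]
    return [v * scale for v in q], scale

deq, scale = quantize_6bit([0.5, -1.0, 0.25, 0.8])
```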
Weight Averaging
EMA
parameters: null
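Since no parameters were recorded, here is a minimal EMA weight-averaging sketch with a placeholder decay of 0.99 (an assumption):

```python
# Exponential moving average of weights: the averaged copy trails the
# live training weights, smoothing out step-to-step noise.

def ema_update(avg, current, decay=0.99):
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, current)]

avg = [0.0, 0.0]
for step in range(1, 4):
    weights = [float(step), 2.0 * step]  # stand-in training weights
    avg = ema_update(avg, weights)
```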
Compression
lzma
level: 6
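Level 6 is the default preset of Python's stdlib `lzma` module; the round trip is lossless, which is what lets the artifact fit under the size limit without touching the weights themselves:

```python
import lzma

# Compress a byte payload at preset 6 and verify the round trip.
payload = b"model weights " * 1000
packed = lzma.compress(payload, preset=6)
restored = lzma.decompress(packed)
```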
Evaluation
sliding window eval
parameters: {"stride":64}
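A sketch of sliding-window evaluation with stride 64, under the assumption of a 256-token context window (the window size is not recorded above): each window sees full context, but only its last `stride` positions are newly scored, so every token is scored exactly once.

```python
# Enumerate (context_start, context_end, score_start, score_end) spans
# for sliding-window eval: windows advance by `stride`, and only the
# trailing `stride` tokens of each window are scored.

def windows(n_tokens, window=256, stride=64):
    spans = []
    start = 0
    while start < n_tokens:
        ctx_start = max(0, start + stride - window)
        end = min(start + stride, n_tokens)
        spans.append((ctx_start, end, start, end))
        start = end
    return spans

spans = windows(300)
scored = [(s, e) for _, _, s, e in spans]
```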
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"cosine_decay":true}
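For completeness, a single standard AdamW step in plain Python; the hyperparameters below are common defaults, not the run's recorded values, and the fixed `lr` here would follow a cosine-decay schedule over training as noted above.

```python
import math

# One AdamW update: Adam moment estimates plus decoupled weight decay.

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = b1 * m + (1 - b1) * g            # first-moment EMA
    v = b2 * v + (1 - b2) * g * g        # second-moment EMA
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adamw_step(w, g=0.5, m=m, v=v, t=1)
```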
Novel Contributions
- JEPA auxiliary training signal for language modeling
- AdamW test-time training applied before quantization on EMA-averaged weights
- Full Hessian-aware GPTQ quantization
- FlashAttention-3 for faster training
- LZMA compression to fit under the 16MB limit
- Cross-sequence attention on all 11 layers