PR #696

open

Add non-record JEPA byte-level encoder-decoder submission

by gravelBridgeView on GitHub
val_bpb: 1.2622
Architecture: JEPA encoder-decoder
Optimizer: SGD
Artifact Size: 15.7 MB

Training Techniques

Architecture
JEPA encoder-decoder
Uses a two-stage JEPA architecture with a depth-recurrent encoder and a causal decoder conditioned on encoder latents instead of a standard causal GPT.
parameters: {"encoder_layers":5,"encoder_repeats":2,"decoder_layers":7,"model_dim":480,"encoder_heads":6,"encoder_kv_heads":3,"decoder_heads":4,"patch_size":8,"latent_dim":192}
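The parameters above imply the following data-flow shapes. This is a hedged sketch, not the submission's code: it assumes the depth-recurrent encoder reuses one 5-layer stack with shared weights, and that each patch of 8 bytes is pooled into a single latent vector.

```python
# Illustrative shape walk-through of the two-stage JEPA pipeline
# (function names are hypothetical, not from the submission).
ENCODER_LAYERS, ENCODER_REPEATS = 5, 2   # depth-recurrent: same stack applied twice
MODEL_DIM, LATENT_DIM, PATCH_SIZE = 480, 192, 8

def encoder_depth(layers: int, repeats: int) -> int:
    """Effective depth of the depth-recurrent encoder, assuming the
    5-layer stack is unrolled `repeats` times with shared weights."""
    return layers * repeats

def latent_shape(seq_len: int) -> tuple:
    """Assumed patch pooling: every PATCH_SIZE bytes collapse into one
    latent of LATENT_DIM, so the decoder conditions on seq_len // 8
    latents rather than one state per byte."""
    return (seq_len // PATCH_SIZE, LATENT_DIM)

print(encoder_depth(ENCODER_LAYERS, ENCODER_REPEATS))  # 10 effective layers
print(latent_shape(2047))  # (255, 192) latents for a 2047-byte train window
```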
Quantization
int6
bits: 6
scope: all weights
STE QAT
bits: 6
scope: all weights
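A minimal sketch of symmetric int6 fake-quantization with a straight-through estimator (STE). The per-tensor scale and "optimal clip" rule are assumptions; the submission does not specify how the clip is chosen.

```python
# int6 fake-quant forward pass; under STE the backward pass would treat
# round() as identity so gradients flow to the full-precision weight.
BITS = 6
QMAX = 2 ** (BITS - 1) - 1  # 31 for signed int6

def fake_quant(w: float, scale: float) -> float:
    """Round w to the int6 grid defined by `scale`, clamping to
    [-32, 31] (the signed 6-bit range), then dequantize."""
    q = max(-QMAX - 1, min(QMAX, round(w / scale)))
    return q * scale

print(fake_quant(1.3, 0.25))    # snaps to the nearest grid point, 1.25
print(fake_quant(100.0, 0.25))  # clamps to the top of the grid, 7.75
```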
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002}
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs_per_chunk":2,"stride":256,"chunk_tokens":32768,"batch_seqs":32,"all_parameters_adapt":true}
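The TTT parameters above determine the per-chunk shapes, sketched below. How the 256-token stride interleaves adaptation and scoring is not specified in the submission, so this only derives what the chunk and batch sizes imply.

```python
# Hypothetical layout of the sliding-window TTT schedule: each
# 32768-token chunk is reshaped into a batch of 32 sequences and
# adapted for 2 epochs with all parameters trainable.
CHUNK_TOKENS, BATCH_SEQS, EPOCHS_PER_CHUNK = 32768, 32, 2

def chunk_layout(total_tokens: int) -> dict:
    seq_len = CHUNK_TOKENS // BATCH_SEQS  # 1024 tokens per sequence
    n_chunks = total_tokens // CHUNK_TOKENS
    return {"seq_len": seq_len,
            "n_chunks": n_chunks,
            "adapt_epochs_total": n_chunks * EPOCHS_PER_CHUNK}

print(chunk_layout(1_000_000))  # 30 chunks of 32 x 1024 tokens, 60 epochs
```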
Compression
lzma
level: 9
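LZMA at its maximum preset is what the entry above refers to; a round-trip sketch (the payload here is a stand-in, not the actual artifact):

```python
import lzma

# Compress a byte payload at LZMA preset 9 and verify the lossless
# round trip, as would be done for the submitted artifact.
payload = bytes(range(256)) * 64          # stand-in for serialized weights
packed = lzma.compress(payload, preset=9)
assert lzma.decompress(packed) == payload  # lossless round trip
print(len(payload), "->", len(packed))     # highly repetitive data shrinks a lot
```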
Evaluation
sliding window eval
parameters: {"stride":256}
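With a stride of 256, sliding-window evaluation scores each byte exactly once while giving it as much left context as the window allows. A sketch of the window bookkeeping (the window size here is illustrative; only the stride comes from the submission):

```python
def eval_windows(n_tokens: int, window: int, stride: int):
    """Yield (start, end, score_from) triples: the model reads
    tokens[start:end], but only positions score_from..end contribute
    to val_bpb, so no byte is scored twice."""
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(n_tokens, pos + stride)
        yield (start, end, pos)
        pos = end

windows = list(eval_windows(1024, 512, 256))
print(windows)  # 4 windows, each scoring 256 fresh positions
```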
Sequence Length
sequence_length
train_length: 2047
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3500}
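A common warmdown shape, assumed here since the submission only gives the iteration count: hold the base LR constant, then decay linearly to zero over the final `warmdown_iters` steps.

```python
# Assumed warmdown schedule: constant LR, then linear decay to zero
# over the last WARMDOWN_ITERS steps (the 0.002 base LR and 3500
# warmdown iterations are from the submission; the shape is assumed).
BASE_LR, WARMDOWN_ITERS = 0.002, 3500

def lr_at(step: int, total_steps: int) -> float:
    decay_start = total_steps - WARMDOWN_ITERS
    if step < decay_start:
        return BASE_LR
    return BASE_LR * (total_steps - step) / WARMDOWN_ITERS

print(lr_at(0, 10000))      # 0.002 (constant phase)
print(lr_at(10000, 10000))  # 0.0 (fully decayed)
```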
Regularization
SIGReg
parameters: {"applied_to":"latent projection / encoder outputs"}
Weight Averaging
EMA
parameters: {"decay":0.997}
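The EMA update itself is standard; only the 0.997 decay comes from the submission. A decay of 0.997 averages over an effective horizon of roughly 1/(1 - 0.997) ≈ 333 steps.

```python
# Per-step exponential moving average of a parameter, applied after
# each optimizer step; with decay 0.997 the average tracks roughly
# the last ~333 steps of training.
DECAY = 0.997

def ema_update(ema: float, param: float) -> float:
    return DECAY * ema + (1.0 - DECAY) * param

ema = 0.0
for _ in range(5000):
    ema = ema_update(ema, 1.0)  # converges toward the held-constant param
print(ema)
```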
Other
other
Byte-level tokenizer with vocab 260 and no BPE.
parameters: {"vocab_size":260}
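A vocab of 260 leaves room for 256 raw byte values plus 4 extra IDs. The sketch below assumes those extras are special tokens (e.g. BOS); the submission does not enumerate them.

```python
# Byte-level tokenizer sketch: IDs 0-255 are raw bytes, IDs 256-259
# are assumed special tokens (the BOS assignment below is hypothetical).
VOCAB_SIZE = 260
BOS = 256  # hypothetical special-token ID

def encode(text: str) -> list:
    return [BOS] + list(text.encode("utf-8"))

def decode(ids: list) -> str:
    return bytes(i for i in ids if i < 256).decode("utf-8")

print(encode("hi"))          # [256, 104, 105]
print(decode(encode("hi")))  # hi
```

No BPE merges means the sequence length equals the UTF-8 byte count, which is why the patch-based latent projection above matters for efficiency.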

Novel Contributions

  • JEPA encoder-decoder architecture as an alternative to standard causal GPT submissions
  • Pure byte-level tokenizer with vocab 260 and no BPE
  • Depth-recurrent encoder with patch-based latent projection
  • INT6 optimal-clip quantization with STE QAT during warmdown
  • Sliding-window test-time training over all parameters