PR #696

open

Add non-record JEPA byte-level encoder-decoder submission

by gravelBridgeView on GitHub
val_bpb: 1.2622
Architecture: JEPA encoder-decoder
Optimizer: SGD
Artifact Size: 15.7 MB

Training Techniques

Architecture
JEPA encoder-decoder
Uses a two-stage JEPA architecture with a depth-recurrent encoder and a causal decoder conditioned on encoder latents instead of a standard causal GPT.
parameters: {"encoder_layers":5,"encoder_repeats":2,"decoder_layers":7,"model_dim":480,"encoder_heads":6,"encoder_kv_heads":3,"decoder_heads":4,"patch_size":8,"latent_dim":192}
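The parameters above imply the following data-flow shapes. This is a hedged sketch, not the submission's code: it assumes the depth-recurrent encoder reuses one 5-layer stack with shared weights, and that each patch of 8 bytes is pooled into a single latent vector.

```python
# Illustrative shape walk-through of the two-stage JEPA pipeline
# (function names are hypothetical, not from the submission).
ENCODER_LAYERS, ENCODER_REPEATS = 5, 2   # depth-recurrent: same stack applied twice
MODEL_DIM, LATENT_DIM, PATCH_SIZE = 480, 192, 8

def encoder_depth(layers: int, repeats: int) -> int:
    """Effective depth of the depth-recurrent encoder, assuming the
    5-layer stack is unrolled `repeats` times with shared weights."""
    return layers * repeats

def latent_shape(seq_len: int) -> tuple:
    """Assumed patch pooling: every PATCH_SIZE bytes collapse into one
    latent of LATENT_DIM, so the decoder conditions on seq_len // 8
    latents rather than one state per byte."""
    return (seq_len // PATCH_SIZE, LATENT_DIM)

print(encoder_depth(ENCODER_LAYERS, ENCODER_REPEATS))  # 10 effective layers
print(latent_shape(2047))  # (255, 192) latents for a 2047-byte train window
```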
Quantization
int6
bits: 6
scope: all weights
STE QAT
bits: 6
scope: all weights
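A minimal sketch of symmetric int6 fake-quantization with a straight-through estimator (STE). The per-tensor scale and "optimal clip" rule are assumptions; the submission does not specify how the clip is chosen.

```python
# int6 fake-quant forward pass; under STE the backward pass would treat
# round() as identity so gradients flow to the full-precision weight.
BITS = 6
QMAX = 2 ** (BITS - 1) - 1  # 31 for signed int6

def fake_quant(w: float, scale: float) -> float:
    """Round w to the int6 grid defined by `scale`, clamping to
    [-32, 31] (the signed 6-bit range), then dequantize."""
    q = max(-QMAX - 1, min(QMAX, round(w / scale)))
    return q * scale

print(fake_quant(1.3, 0.25))    # snaps to the nearest grid point, 1.25
print(fake_quant(100.0, 0.25))  # clamps to the top of the grid, 7.75
```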
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002}
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs_per_chunk":2,"stride":256,"chunk_tokens":32768,"batch_seqs":32,"all_parameters_adapt":true}
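The TTT parameters above determine the per-chunk shapes, sketched below. How the 256-token stride interleaves adaptation and scoring is not specified in the submission, so this only derives what the chunk and batch sizes imply.

```python
# Hypothetical layout of the sliding-window TTT schedule: each
# 32768-token chunk is reshaped into a batch of 32 sequences and
# adapted for 2 epochs with all parameters trainable.
CHUNK_TOKENS, BATCH_SEQS, EPOCHS_PER_CHUNK = 32768, 32, 2

def chunk_layout(total_tokens: int) -> dict:
    seq_len = CHUNK_TOKENS // BATCH_SEQS  # 1024 tokens per sequence
    n_chunks = total_tokens // CHUNK_TOKENS
    return {"seq_len": seq_len,
            "n_chunks": n_chunks,
            "adapt_epochs_total": n_chunks * EPOCHS_PER_CHUNK}

print(chunk_layout(1_000_000))  # 30 chunks of 32 x 1024 tokens, 60 epochs
```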
Compression
lzma
level: 9
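LZMA at its maximum preset is what the entry above refers to; a round-trip sketch (the payload here is a stand-in, not the actual artifact):

```python
import lzma

# Compress a byte payload at LZMA preset 9 and verify the lossless
# round trip, as would be done for the submitted artifact.
payload = bytes(range(256)) * 64          # stand-in for serialized weights
packed = lzma.compress(payload, preset=9)
assert lzma.decompress(packed) == payload  # lossless round trip
print(len(payload), "->", len(packed))     # highly repetitive data shrinks a lot
```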
Evaluation
sliding window eval
parameters: {"stride":256}
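With a stride of 256, sliding-window evaluation scores each byte exactly once while giving it as much left context as the window allows. A sketch of the window bookkeeping (the window size here is illustrative; only the stride comes from the submission):

```python
def eval_windows(n_tokens: int, window: int, stride: int):
    """Yield (start, end, score_from) triples: the model reads
    tokens[start:end], but only positions score_from..end contribute
    to val_bpb, so no byte is scored twice."""
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(n_tokens, pos + stride)
        yield (start, end, pos)
        pos = end

windows = list(eval_windows(1024, 512, 256))
print(windows)  # 4 windows, each scoring 256 fresh positions
```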
Sequence Length
sequence_length
train_length: 2047
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3500}
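A common warmdown shape, assumed here since the submission only gives the iteration count: hold the base LR constant, then decay linearly to zero over the final `warmdown_iters` steps.

```python
# Assumed warmdown schedule: constant LR, then linear decay to zero
# over the last WARMDOWN_ITERS steps (the 0.002 base LR and 3500
# warmdown iterations are from the submission; the shape is assumed).
BASE_LR, WARMDOWN_ITERS = 0.002, 3500

def lr_at(step: int, total_steps: int) -> float:
    decay_start = total_steps - WARMDOWN_ITERS
    if step < decay_start:
        return BASE_LR
    return BASE_LR * (total_steps - step) / WARMDOWN_ITERS

print(lr_at(0, 10000))      # 0.002 (constant phase)
print(lr_at(10000, 10000))  # 0.0 (fully decayed)
```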
Regularization
SIGReg
parameters: {"applied_to":"latent projection / encoder outputs"}
Weight Averaging
EMA
parameters: {"decay":0.997}
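The EMA update itself is standard; only the 0.997 decay comes from the submission. A decay of 0.997 averages over an effective horizon of roughly 1/(1 - 0.997) ≈ 333 steps.

```python
# Per-step exponential moving average of a parameter, applied after
# each optimizer step; with decay 0.997 the average tracks roughly
# the last ~333 steps of training.
DECAY = 0.997

def ema_update(ema: float, param: float) -> float:
    return DECAY * ema + (1.0 - DECAY) * param

ema = 0.0
for _ in range(5000):
    ema = ema_update(ema, 1.0)  # converges toward the held-constant param
print(ema)
```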
Other
other
Byte-level tokenizer with vocab 260 and no BPE.
parameters: {"vocab_size":260}
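A vocab of 260 leaves room for 256 raw byte values plus 4 extra IDs. The sketch below assumes those extras are special tokens (e.g. BOS); the submission does not enumerate them.

```python
# Byte-level tokenizer sketch: IDs 0-255 are raw bytes, IDs 256-259
# are assumed special tokens (the BOS assignment below is hypothetical).
VOCAB_SIZE = 260
BOS = 256  # hypothetical special-token ID

def encode(text: str) -> list:
    return [BOS] + list(text.encode("utf-8"))

def decode(ids: list) -> str:
    return bytes(i for i in ids if i < 256).decode("utf-8")

print(encode("hi"))          # [256, 104, 105]
print(decode(encode("hi")))  # hi
```

No BPE merges means the sequence length equals the UTF-8 byte count, which is why the patch-based latent projection above matters for efficiency.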

Novel Contributions

  • JEPA encoder-decoder architecture as an alternative to standard causal GPT submissions
  • Pure byte-level tokenizer with vocab 260 and no BPE
  • Depth-recurrent encoder with patch-based latent projection
  • INT6 optimal-clip quantization with STE QAT during warmdown
  • Sliding-window test-time training over all parameters