PR #832

open

Non-record: Byte-level transformer + JEPA auxiliary loss (val_bpb: 1.1903)

by jfprincz
val_bpb: 1.1903
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.4 MB

Training Techniques

Architecture
Byte-level transformer
Autoregressive transformer operating directly on raw UTF-8 bytes with vocab size 260 and no tokenizer/BPE.
parameters: {"vocab_size":260,"layers":13,"dim":512,"num_heads":8,"num_kv_heads":4}
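A byte-level vocabulary needs no trained tokenizer: each UTF-8 byte is its own token ID. A minimal sketch, assuming the 4 IDs above 255 are special tokens (the PR does not state how the extra 260 − 256 slots are assigned; `BOS`/`EOS` below are hypothetical):

```python
# Byte-level "tokenization": token IDs 0-255 are raw UTF-8 bytes;
# IDs 256-259 are assumed special tokens (assignment is a guess).
BOS, EOS = 256, 257  # hypothetical special-token IDs

def encode(text: str) -> list[int]:
    return [BOS] + list(text.encode("utf-8")) + [EOS]

def decode(ids: list[int]) -> str:
    # Drop special tokens, reassemble bytes, decode leniently.
    return bytes(i for i in ids if i < 256).decode("utf-8", errors="replace")
```

Round-tripping any string works because multi-byte UTF-8 characters simply become several consecutive tokens.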
JEPA auxiliary loss
Auxiliary chunk-level latent prediction module added to the autoregressive transformer to improve validation BPB.
parameters: {"latent_dim":256,"proj_hidden":256,"chunk_size":8,"lambda_max":0.001}
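The chunk-level prediction idea can be sketched as: pool backbone hidden states into chunks of 8 positions, map them into a 256-d latent space, and predict each next chunk's latent from the current one. This is a minimal NumPy sketch under assumptions; the PR's `proj_hidden=256` suggests small MLPs where single linear maps are used here, and the real module would use learned weights and a ramped loss weight up to `lambda_max=0.001`:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, latent_dim, chunk_size = 512, 256, 8   # values from the PR's parameters
seq_len = 64

# Hypothetical linear stand-ins for the projector/predictor networks.
W_enc = rng.normal(size=(dim, latent_dim)) / np.sqrt(dim)
W_pred = rng.normal(size=(latent_dim, latent_dim)) / np.sqrt(latent_dim)

h = rng.normal(size=(seq_len, dim))                         # backbone hidden states
chunks = h.reshape(seq_len // chunk_size, chunk_size, dim).mean(axis=1)
z = chunks @ W_enc                                          # chunk-level latents

# Predict each next chunk's latent from the current one (MSE in latent space);
# this auxiliary loss would be added to the LM loss with a small lambda.
pred = z[:-1] @ W_pred
jepa_loss = float(np.mean((pred - z[1:]) ** 2))
```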
U-Net skips
Skip connections in the transformer backbone.
parameters: null
Partial RoPE
Rotary positional embeddings applied only to a subset of dimensions.
parameters: {"dimensions":16}
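Partial RoPE rotates only the first few dimensions of each attention head and passes the rest through unrotated. A sketch assuming the standard half-split pairing and base 10000 (the PR only gives `dimensions: 16`):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16, base: float = 10000.0) -> np.ndarray:
    """Apply RoPE to the first `rot_dims` dims of each head; leave the rest as-is.
    x: (seq_len, head_dim). Pairing scheme and base are standard-RoPE assumptions."""
    seq = x.shape[0]
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]        # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)  # 2-D rotations per pair
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```

Position 0 is left unchanged (all angles are zero), and the rotation preserves the norm of the rotated slice.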
XSA
Uses XSA in the last layers of the model.
parameters: {"last_layers":4}
BigramHash
Auxiliary bigram hashing component used in the stack.
parameters: {"vocab_size":4096,"dim":32}
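A hashed-bigram component maps each consecutive token pair into a fixed number of buckets and looks up a small embedding, giving the model cheap local-context features. A sketch; the mixing function below is hypothetical, since the PR only gives `vocab_size=4096` and `dim=32`:

```python
import numpy as np

VOCAB, DIM = 4096, 32                              # from the PR's parameters
rng = np.random.default_rng(0)
table = rng.normal(size=(VOCAB, DIM)) * 0.02       # learned in practice; random here

def bigram_hash_features(tokens: list[int]) -> np.ndarray:
    """One hashed-bigram embedding per position (zeros at position 0).
    The hash itself is a guess -- any cheap pair-mixing function works."""
    feats = np.zeros((len(tokens), DIM))
    for i in range(1, len(tokens)):
        bucket = (tokens[i - 1] * 31 + tokens[i]) % VOCAB  # hypothetical mix
        feats[i] = table[bucket]
    return feats
```

These features would typically be added to (or concatenated with) the token embeddings before the transformer stack.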
SmearGate
Custom gating mechanism included in the model stack.
parameters: null
weight tying
Tied input-embedding and output-projection weights, implied by the configuration.
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
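The momentum warmup can be read as a ramp from 0.92 to the final 0.99 over the first 1500 steps. A sketch assuming a linear ramp (the PR gives only the endpoints and step count):

```python
def muon_momentum(step: int,
                  start: float = 0.92,       # momentum_warmup_start
                  end: float = 0.99,         # final momentum
                  warmup_steps: int = 1500) -> float:
    """Momentum at a given step; linear ramp is an assumption."""
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```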
Weight Averaging
EMA
parameters: {"decay":0.997}
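EMA weight averaging keeps a shadow copy of the parameters updated as `avg ← decay·avg + (1−decay)·current` each step, and the shadow copy is used for evaluation. A minimal sketch over a parameter dict, with the PR's decay of 0.997:

```python
import numpy as np

def ema_update(avg: dict, current: dict, decay: float = 0.997) -> dict:
    """One EMA step: blend the running average toward the current weights."""
    return {k: decay * avg[k] + (1.0 - decay) * current[k] for k in avg}
```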
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":512}
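Sliding-window evaluation scores each byte exactly once while still giving it long context: the window slides by `stride` and only the newest `stride` positions contribute to the loss. A sketch assuming a 4096-byte window (eval_length is null in the PR) and a stand-in `nll_fn(context) -> per-token nats`:

```python
import math

def sliding_window_bpb(nll_fn, ids: list[int],
                       window: int = 4096, stride: int = 512) -> float:
    """Bits-per-byte over `ids`, scoring only the last `stride` tokens of each
    overlapping window so earlier tokens serve purely as context."""
    total_nats, total_bytes = 0.0, 0
    pos = 0
    while pos < len(ids):
        start = max(0, pos + stride - window)
        ctx = ids[start:pos + stride]
        nll = nll_fn(ctx)                       # per-token negative log-likelihoods
        new = min(stride, len(ids) - pos)       # tokens not yet scored
        total_nats += sum(nll[-new:])
        total_bytes += new
        pos += stride
    return total_nats / total_bytes / math.log(2)
```

With a model that assigns exactly one bit (ln 2 nats) per byte, this returns 1.0.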
Sequence Length
sequence_length
train_length: 4096
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"iterations":9000}
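A warmdown schedule holds the learning rate constant and then decays it to zero over the final `warmdown_iters` steps. A sketch assuming linear decay, the common "trapezoid" shape (the PR gives only the two step counts):

```python
def lr_scale(step: int, iterations: int = 9000, warmdown_iters: int = 3000) -> float:
    """LR multiplier: 1.0 until the warmdown starts, then linear decay to 0."""
    if step < iterations - warmdown_iters:
        return 1.0
    return max(0.0, (iterations - step) / warmdown_iters)
```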
Regularization
layerwise LN scale
parameters: {"enabled":true}
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Initialization
OrthoInit
Orthogonal initialization used as part of the training stack.
Other
other
SIGReg regularization using Epps-Pulley projections and knots to prevent latent collapse in the JEPA module.
parameters: {"projections":256,"knots":17,"weight":0.02}
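The anti-collapse idea can be sketched as: project the latents onto many random 1-D directions and penalize how far each projection's empirical characteristic function is from a standard normal's, evaluated at a fixed set of knots (an Epps-Pulley-style discrepancy). Knot placement, standardization, and normalization below are all assumptions; only `projections=256` and `knots=17` come from the PR:

```python
import numpy as np

def sigreg_penalty(z: np.ndarray, n_proj: int = 256, n_knots: int = 17,
                   seed: int = 0) -> float:
    """Collapse penalty: small when 1-D projections of z look Gaussian,
    large when the latents have collapsed to a point."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(z.shape[1], n_proj))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)      # unit directions
    z_std = (z - z.mean(0)) / (z.std(0) + 1e-8)              # per-dim standardize
    p = z_std @ dirs                                         # (n, n_proj) projections
    t = np.linspace(-3.0, 3.0, n_knots)                      # CF evaluation knots
    emp = np.exp(1j * p[:, :, None] * t).mean(axis=0)        # empirical CF per proj
    target = np.exp(-0.5 * t ** 2)                           # N(0,1) characteristic fn
    return float(np.mean(np.abs(emp - target) ** 2))
```

Collapsed latents (all points identical) give zero-variance projections whose characteristic function is constant 1, so the penalty is large; roughly Gaussian latents score near zero.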

Novel Contributions

  • Byte-level autoregressive transformer with no tokenizer, operating directly on raw UTF-8 bytes
  • Lightweight JEPA auxiliary loss for chunk-level latent prediction
  • Reported consistent BPB improvement from JEPA across seeds and evaluation methods
  • Combination of JEPA with an existing sp1024-style technique stack
  • Use of SIGReg to prevent latent collapse in the auxiliary representation space