PR #832

open

Non-record: Byte-level transformer + JEPA auxiliary loss (val_bpb: 1.1903)

by jfprincz
val_bpb: 1.1903
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.4 MB

Training Techniques

Architecture
Byte-level transformer
Autoregressive transformer operating directly on raw UTF-8 bytes with vocab size 260 and no tokenizer/BPE.
parameters: {"vocab_size":260,"layers":13,"dim":512,"num_heads":8,"num_kv_heads":4}
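A byte-level vocabulary needs no trained tokenizer: each UTF-8 byte is its own token ID. A minimal sketch, assuming the 4 IDs above 255 are special tokens (the PR does not state how the extra 260 − 256 slots are assigned; `BOS`/`EOS` below are hypothetical):

```python
# Byte-level "tokenization": token IDs 0-255 are raw UTF-8 bytes;
# IDs 256-259 are assumed special tokens (assignment is a guess).
BOS, EOS = 256, 257  # hypothetical special-token IDs

def encode(text: str) -> list[int]:
    return [BOS] + list(text.encode("utf-8")) + [EOS]

def decode(ids: list[int]) -> str:
    # Drop special tokens, reassemble bytes, decode leniently.
    return bytes(i for i in ids if i < 256).decode("utf-8", errors="replace")
```

Round-tripping any string works because multi-byte UTF-8 characters simply become several consecutive tokens.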
JEPA auxiliary loss
Auxiliary chunk-level latent prediction module added to the autoregressive transformer to improve validation BPB.
parameters: {"latent_dim":256,"proj_hidden":256,"chunk_size":8,"lambda_max":0.001}
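The chunk-level prediction idea can be sketched as: pool backbone hidden states into chunks of 8 positions, map them into a 256-d latent space, and predict each next chunk's latent from the current one. This is a minimal NumPy sketch under assumptions; the PR's `proj_hidden=256` suggests small MLPs where single linear maps are used here, and the real module would use learned weights and a ramped loss weight up to `lambda_max=0.001`:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, latent_dim, chunk_size = 512, 256, 8   # values from the PR's parameters
seq_len = 64

# Hypothetical linear stand-ins for the projector/predictor networks.
W_enc = rng.normal(size=(dim, latent_dim)) / np.sqrt(dim)
W_pred = rng.normal(size=(latent_dim, latent_dim)) / np.sqrt(latent_dim)

h = rng.normal(size=(seq_len, dim))                         # backbone hidden states
chunks = h.reshape(seq_len // chunk_size, chunk_size, dim).mean(axis=1)
z = chunks @ W_enc                                          # chunk-level latents

# Predict each next chunk's latent from the current one (MSE in latent space);
# this auxiliary loss would be added to the LM loss with a small lambda.
pred = z[:-1] @ W_pred
jepa_loss = float(np.mean((pred - z[1:]) ** 2))
```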
U-Net skips
Skip connections in the transformer backbone.
parameters: null
Partial RoPE
Rotary positional embeddings applied only to a subset of dimensions.
parameters: {"dimensions":16}
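Partial RoPE rotates only the first few dimensions of each attention head and passes the rest through unrotated. A sketch assuming the standard half-split pairing and base 10000 (the PR only gives `dimensions: 16`):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16, base: float = 10000.0) -> np.ndarray:
    """Apply RoPE to the first `rot_dims` dims of each head; leave the rest as-is.
    x: (seq_len, head_dim). Pairing scheme and base are standard-RoPE assumptions."""
    seq = x.shape[0]
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]        # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)  # 2-D rotations per pair
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```

Position 0 is left unchanged (all angles are zero), and the rotation preserves the norm of the rotated slice.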
XSA
Uses XSA in the last layers of the model.
parameters: {"last_layers":4}
BigramHash
Auxiliary bigram hashing component used in the stack.
parameters: {"vocab_size":4096,"dim":32}
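A hashed-bigram component maps each consecutive token pair into a fixed number of buckets and looks up a small embedding, giving the model cheap local-context features. A sketch; the mixing function below is hypothetical, since the PR only gives `vocab_size=4096` and `dim=32`:

```python
import numpy as np

VOCAB, DIM = 4096, 32                              # from the PR's parameters
rng = np.random.default_rng(0)
table = rng.normal(size=(VOCAB, DIM)) * 0.02       # learned in practice; random here

def bigram_hash_features(tokens: list[int]) -> np.ndarray:
    """One hashed-bigram embedding per position (zeros at position 0).
    The hash itself is a guess -- any cheap pair-mixing function works."""
    feats = np.zeros((len(tokens), DIM))
    for i in range(1, len(tokens)):
        bucket = (tokens[i - 1] * 31 + tokens[i]) % VOCAB  # hypothetical mix
        feats[i] = table[bucket]
    return feats
```

These features would typically be added to (or concatenated with) the token embeddings before the transformer stack.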
SmearGate
Custom gating mechanism included in the model stack.
parameters: null
weight tying
Tied input-embedding and output-projection weights, implied by the configuration.
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
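The momentum warmup can be read as a ramp from 0.92 to the final 0.99 over the first 1500 steps. A sketch assuming a linear ramp (the PR gives only the endpoints and step count):

```python
def muon_momentum(step: int,
                  start: float = 0.92,       # momentum_warmup_start
                  end: float = 0.99,         # final momentum
                  warmup_steps: int = 1500) -> float:
    """Momentum at a given step; linear ramp is an assumption."""
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```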
Weight Averaging
EMA
parameters: {"decay":0.997}
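EMA weight averaging keeps a shadow copy of the parameters updated as `avg ← decay·avg + (1−decay)·current` each step, and the shadow copy is used for evaluation. A minimal sketch over a parameter dict, with the PR's decay of 0.997:

```python
import numpy as np

def ema_update(avg: dict, current: dict, decay: float = 0.997) -> dict:
    """One EMA step: blend the running average toward the current weights."""
    return {k: decay * avg[k] + (1.0 - decay) * current[k] for k in avg}
```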
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":512}
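Sliding-window evaluation scores each byte exactly once while still giving it long context: the window slides by `stride` and only the newest `stride` positions contribute to the loss. A sketch assuming a 4096-byte window (eval_length is null in the PR) and a stand-in `nll_fn(context) -> per-token nats`:

```python
import math

def sliding_window_bpb(nll_fn, ids: list[int],
                       window: int = 4096, stride: int = 512) -> float:
    """Bits-per-byte over `ids`, scoring only the last `stride` tokens of each
    overlapping window so earlier tokens serve purely as context."""
    total_nats, total_bytes = 0.0, 0
    pos = 0
    while pos < len(ids):
        start = max(0, pos + stride - window)
        ctx = ids[start:pos + stride]
        nll = nll_fn(ctx)                       # per-token negative log-likelihoods
        new = min(stride, len(ids) - pos)       # tokens not yet scored
        total_nats += sum(nll[-new:])
        total_bytes += new
        pos += stride
    return total_nats / total_bytes / math.log(2)
```

With a model that assigns exactly one bit (ln 2 nats) per byte, this returns 1.0.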
Sequence Length
sequence_length
train_length: 4096
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"iterations":9000}
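A warmdown schedule holds the learning rate constant and then decays it to zero over the final `warmdown_iters` steps. A sketch assuming linear decay, the common "trapezoid" shape (the PR gives only the two step counts):

```python
def lr_scale(step: int, iterations: int = 9000, warmdown_iters: int = 3000) -> float:
    """LR multiplier: 1.0 until the warmdown starts, then linear decay to 0."""
    if step < iterations - warmdown_iters:
        return 1.0
    return max(0.0, (iterations - step) / warmdown_iters)
```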
Regularization
layerwise LN scale
parameters: {"enabled":true}
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Initialization
OrthoInit
Orthogonal initialization used as part of the training stack.
Other
other
SIGReg regularization using Epps-Pulley projections and knots to prevent latent collapse in the JEPA module.
parameters: {"projections":256,"knots":17,"weight":0.02}
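The anti-collapse idea can be sketched as: project the latents onto many random 1-D directions and penalize how far each projection's empirical characteristic function is from a standard normal's, evaluated at a fixed set of knots (an Epps-Pulley-style discrepancy). Knot placement, standardization, and normalization below are all assumptions; only `projections=256` and `knots=17` come from the PR:

```python
import numpy as np

def sigreg_penalty(z: np.ndarray, n_proj: int = 256, n_knots: int = 17,
                   seed: int = 0) -> float:
    """Collapse penalty: small when 1-D projections of z look Gaussian,
    large when the latents have collapsed to a point."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(z.shape[1], n_proj))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)      # unit directions
    z_std = (z - z.mean(0)) / (z.std(0) + 1e-8)              # per-dim standardize
    p = z_std @ dirs                                         # (n, n_proj) projections
    t = np.linspace(-3.0, 3.0, n_knots)                      # CF evaluation knots
    emp = np.exp(1j * p[:, :, None] * t).mean(axis=0)        # empirical CF per proj
    target = np.exp(-0.5 * t ** 2)                           # N(0,1) characteristic fn
    return float(np.mean(np.abs(emp - target) ** 2))
```

Collapsed latents (all points identical) give zero-variance projections whose characteristic function is constant 1, so the penalty is large; roughly Gaussian latents score near zero.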

Novel Contributions

  • Byte-level autoregressive transformer with no tokenizer, operating directly on raw UTF-8 bytes
  • Lightweight JEPA auxiliary loss for chunk-level latent prediction
  • Reported consistent BPB improvement from JEPA across seeds and evaluation methods
  • Combination of JEPA with an existing sp1024-style technique stack
  • Use of SIGReg to prevent latent collapse in the auxiliary representation space