PR #53

closed

1.1888 BPB via SP-4096 compression + stride-64 sliding window

by kshitizz36
val_bpb: 1.1888
Architecture: Encoder-decoder Transformer
Optimizer: Muon
Artifact Size: 15.68 MB

Training Techniques

Architecture
tied embeddings
Input and output embeddings are tied to reduce parameters and fit within the artifact budget.
parameters: {"tie_embeddings":1}
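A minimal numpy sketch of what weight tying means here, with shapes taken from the listed config (vocab 4096, width 512); this is an illustration, not the PR's actual code:

```python
import numpy as np

# One matrix serves both as the input embedding table and, transposed,
# as the output projection -- so the vocab x d_model table is stored once.
vocab_size, d_model = 4096, 512
rng = np.random.default_rng(0)
W = rng.standard_normal((vocab_size, d_model)).astype(np.float32)

def embed(token_ids):
    return W[token_ids]          # (T, d_model)

def logits(hidden):
    return hidden @ W.T          # (T, vocab_size)

h = embed(np.array([1, 2, 3]))
out = logits(h)
print(out.shape)                 # (3, 4096)
# Tying saves one 4096 x 512 matrix (~2.1M params, ~8.4 MB at fp32),
# which matters under a 16MB artifact budget.
```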
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
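With 8 query heads over 4 KV heads, each KV head is shared by a group of 2 query heads. A small numpy sketch of that sharing (shapes assumed for illustration):

```python
import numpy as np

# Grouped-query attention sketch: 8 query heads, 4 shared KV heads.
num_heads, num_kv_heads, head_dim, T = 8, 4, 64, 16
group = num_heads // num_kv_heads        # 2 query heads per KV head
rng = np.random.default_rng(0)

q = rng.standard_normal((num_heads, T, head_dim))
k = rng.standard_normal((num_kv_heads, T, head_dim))
v = rng.standard_normal((num_kv_heads, T, head_dim))

# Broadcast each KV head across its query-head group, then attend as usual.
k = np.repeat(k, group, axis=0)          # (8, T, head_dim)
v = np.repeat(v, group, axis=0)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
probs = np.exp(scores - scores.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
out = probs @ v                          # (8, T, head_dim)
```

Halving the KV heads halves the K/V projection weights, another small saving against the artifact budget.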
reduced depth
Reduced model depth to fit the larger vocabulary and embedding table within the 16MB limit.
parameters: {"layers":8}
Quantization
int8
bits: 8
scope: all
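A hedged sketch of symmetric per-tensor int8 quantization, one common reading of "int8, scope: all" (the PR does not specify its exact scheme):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization to int8."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()
# Round-to-nearest bounds the per-weight error by half a quantization step.
assert err <= s / 2 + 1e-6
```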
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
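Since the entry leaves Muon's hyperparameters unspecified, here is only a hedged sketch of the optimizer's core idea: momentum followed by approximate orthogonalization of each 2D update via a Newton-Schulz iteration (coefficients follow Keller Jordan's public Muon implementation, not necessarily this PR's):

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic-iteration coefficients
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                          # work with the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
grad = rng.standard_normal((512, 256))   # stand-in momentum-averaged gradient
update = newton_schulz(grad)
# Singular values of the update now cluster near 1 (approximately orthogonal).
s = np.linalg.svd(update, compute_uv=False)
```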
Evaluation
sliding window eval
parameters: {"stride":64}
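Stride-64 sliding-window evaluation re-runs the model over overlapping windows but scores each token only once, so almost every scored token sees near-maximal left context. A sketch of the span bookkeeping (window length is an assumption, taken to match train_length=1024; the demo uses smaller numbers):

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Return (begin, end, n_scored) spans covering all n_tokens once."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))  # score only new tokens
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(200, window=128, stride=64)
# Every token is scored exactly once across the windows.
assert sum(n for _, _, n in spans) == 200
```

The cost is roughly window/stride forward passes per token scored, traded for lower per-token loss and hence lower BPB.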
Compression
zlib
level: null
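Since zlib is lossless, this stage shrinks only the stored artifact, never the evaluated weights or the BPB. A sketch of the roundtrip on int8 weight bytes (the compression level is an assumption; the PR leaves it unspecified):

```python
import zlib
import numpy as np

# Stand-in int8 weight bytes; real quantized weights have lower entropy
# than this uniform noise and compress correspondingly better.
q = np.random.default_rng(0).integers(-128, 128, size=100_000, dtype=np.int8)

blob = zlib.compress(q.tobytes(), level=9)
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
assert np.array_equal(q, restored)       # lossless: bytes survive exactly
```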
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Used an SP-4096 tokenizer / dataset variant to improve compression ratio and reduce tokens per byte.
parameters: {"vocab_size":4096}
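The tokenizer helps through the BPB formula itself: bits per byte is (tokens / bytes) x (mean NLL in nats / ln 2), so a tokenizer that emits fewer tokens per byte lowers BPB even at the same per-token loss. A hedged arithmetic sketch with illustrative numbers (not the PR's measurements):

```python
import math

def bpb(n_tokens, n_bytes, mean_nll_nats):
    """Bits per byte from token count, byte count, and mean per-token NLL."""
    return (n_tokens / n_bytes) * mean_nll_nats / math.log(2)

# Same per-token loss, fewer tokens per byte -> strictly lower BPB.
worse = bpb(n_tokens=260, n_bytes=1000, mean_nll_nats=3.2)
better = bpb(n_tokens=240, n_bytes=1000, mean_nll_nats=3.2)
assert better < worse
```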
other
Disabled periodic validation during training to maximize training steps within the wallclock budget.
parameters: {"val_loss_every":0}

Novel Contributions

  • SP-4096 tokenizer with improved compression ratio
  • Stride-64 sliding window evaluation
  • Multiplicative stacking of gains: tokenizer compression lowers tokens per byte while longer evaluation context lowers loss per token, and the two multiply in the BPB formula
  • 8-layer 512-dim GQA encoder-decoder with skip connections
  • Post-quant int8+zlib roundtrip evaluation
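The last bullet's roundtrip can be sketched end to end: quantize to int8, zlib-compress the bytes (the stored artifact), then invert both steps and evaluate the dequantized weights. Shapes and helper logic here are illustrative, not the PR's code:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)  # stand-in weights

# Forward: symmetric int8 quantization, then lossless zlib compression.
scale = np.abs(w).max() / 127.0
q = np.round(w / scale).astype(np.int8)
artifact = zlib.compress(q.tobytes())     # bytes counted against the 16MB cap

# Reverse: decompress, dequantize, and evaluate w_hat instead of w.
q2 = np.frombuffer(zlib.decompress(artifact), dtype=np.int8).reshape(w.shape)
w_hat = q2.astype(np.float32) * scale
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

Evaluating the roundtripped weights (rather than the pre-quantization ones) ensures the reported 1.1888 val_bpb reflects the model as actually stored.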