PR #53

closed

1.1888 BPB via SP-4096 compression + stride-64 sliding window

by kshitizz36
val_bpb: 1.1888
Architecture: Encoder-decoder Transformer
Optimizer: Muon
Artifact Size: 15.68 MB

Training Techniques

Architecture
tied embeddings
Input and output embeddings are tied to reduce parameters and fit within the artifact budget.
parameters: {"tie_embeddings":1}
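A minimal numpy sketch of what weight tying means here, with shapes taken from the listed config (vocab 4096, width 512); this is an illustration, not the PR's actual code:

```python
import numpy as np

# One matrix serves both as the input embedding table and, transposed,
# as the output projection -- so the vocab x d_model table is stored once.
vocab_size, d_model = 4096, 512
rng = np.random.default_rng(0)
W = rng.standard_normal((vocab_size, d_model)).astype(np.float32)

def embed(token_ids):
    return W[token_ids]          # (T, d_model)

def logits(hidden):
    return hidden @ W.T          # (T, vocab_size)

h = embed(np.array([1, 2, 3]))
out = logits(h)
print(out.shape)                 # (3, 4096)
# Tying saves one 4096 x 512 matrix (~2.1M params, ~8.4 MB at fp32),
# which matters under a 16MB artifact budget.
```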
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
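With 8 query heads over 4 KV heads, each KV head is shared by a group of 2 query heads. A small numpy sketch of that sharing (shapes assumed for illustration):

```python
import numpy as np

# Grouped-query attention sketch: 8 query heads, 4 shared KV heads.
num_heads, num_kv_heads, head_dim, T = 8, 4, 64, 16
group = num_heads // num_kv_heads        # 2 query heads per KV head
rng = np.random.default_rng(0)

q = rng.standard_normal((num_heads, T, head_dim))
k = rng.standard_normal((num_kv_heads, T, head_dim))
v = rng.standard_normal((num_kv_heads, T, head_dim))

# Broadcast each KV head across its query-head group, then attend as usual.
k = np.repeat(k, group, axis=0)          # (8, T, head_dim)
v = np.repeat(v, group, axis=0)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
probs = np.exp(scores - scores.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
out = probs @ v                          # (8, T, head_dim)
```

Halving the KV heads halves the K/V projection weights, another small saving against the artifact budget.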
reduced depth
Reduced model depth to fit the larger vocabulary and embedding table within the 16MB limit.
parameters: {"layers":8}
Quantization
int8
bits: 8
scope: all
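A hedged sketch of symmetric per-tensor int8 quantization, one common reading of "int8, scope: all" (the PR does not specify its exact scheme):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization to int8."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()
# Round-to-nearest bounds the per-weight error by half a quantization step.
assert err <= s / 2 + 1e-6
```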
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
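Since the entry leaves Muon's hyperparameters unspecified, here is only a hedged sketch of the optimizer's core idea: momentum followed by approximate orthogonalization of each 2D update via a Newton-Schulz iteration (coefficients follow Keller Jordan's public Muon implementation, not necessarily this PR's):

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic-iteration coefficients
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                          # work with the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
grad = rng.standard_normal((512, 256))   # stand-in momentum-averaged gradient
update = newton_schulz(grad)
# Singular values of the update now cluster near 1 (approximately orthogonal).
s = np.linalg.svd(update, compute_uv=False)
```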
Evaluation
sliding window eval
parameters: {"stride":64}
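Stride-64 sliding-window evaluation re-runs the model over overlapping windows but scores each token only once, so almost every scored token sees near-maximal left context. A sketch of the span bookkeeping (window length is an assumption, taken to match train_length=1024; the demo uses smaller numbers):

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Return (begin, end, n_scored) spans covering all n_tokens once."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))  # score only new tokens
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(200, window=128, stride=64)
# Every token is scored exactly once across the windows.
assert sum(n for _, _, n in spans) == 200
```

The cost is roughly window/stride forward passes per token scored, traded for lower per-token loss and hence lower BPB.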
Compression
zlib
level: null
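Since zlib is lossless, this stage shrinks only the stored artifact, never the evaluated weights or the BPB. A sketch of the roundtrip on int8 weight bytes (the compression level is an assumption; the PR leaves it unspecified):

```python
import zlib
import numpy as np

# Stand-in int8 weight bytes; real quantized weights have lower entropy
# than this uniform noise and compress correspondingly better.
q = np.random.default_rng(0).integers(-128, 128, size=100_000, dtype=np.int8)

blob = zlib.compress(q.tobytes(), level=9)
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
assert np.array_equal(q, restored)       # lossless: bytes survive exactly
```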
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Used an SP-4096 tokenizer / dataset variant to improve compression ratio and reduce tokens per byte.
parameters: {"vocab_size":4096}
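The tokenizer helps through the BPB formula itself: bits per byte is (tokens / bytes) x (mean NLL in nats / ln 2), so a tokenizer that emits fewer tokens per byte lowers BPB even at the same per-token loss. A hedged arithmetic sketch with illustrative numbers (not the PR's measurements):

```python
import math

def bpb(n_tokens, n_bytes, mean_nll_nats):
    """Bits per byte from token count, byte count, and mean per-token NLL."""
    return (n_tokens / n_bytes) * mean_nll_nats / math.log(2)

# Same per-token loss, fewer tokens per byte -> strictly lower BPB.
worse = bpb(n_tokens=260, n_bytes=1000, mean_nll_nats=3.2)
better = bpb(n_tokens=240, n_bytes=1000, mean_nll_nats=3.2)
assert better < worse
```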
other
Disabled periodic validation during training to maximize training steps within the wallclock budget.
parameters: {"val_loss_every":0}

Novel Contributions

  • SP-4096 tokenizer with improved compression ratio
  • Stride-64 sliding window evaluation
  • Multiplicative stacking of gains: tokenizer compression lowers tokens per byte while longer evaluation context lowers loss per token, and the two multiply in the BPB formula
  • 8-layer 512-dim GQA encoder-decoder with skip connections
  • Post-quant int8+zlib roundtrip evaluation
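The last bullet's roundtrip can be sketched end to end: quantize to int8, zlib-compress the bytes (the stored artifact), then invert both steps and evaluate the dequantized weights. Shapes and helper logic here are illustrative, not the PR's code:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)  # stand-in weights

# Forward: symmetric int8 quantization, then lossless zlib compression.
scale = np.abs(w).max() / 127.0
q = np.round(w / scale).astype(np.int8)
artifact = zlib.compress(q.tobytes())     # bytes counted against the 16MB cap

# Reverse: decompress, dequantize, and evaluate w_hat instead of w.
q2 = np.frombuffer(zlib.decompress(artifact), dtype=np.int8).reshape(w.shape)
w_hat = q2.astype(np.float32) * scale
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

Evaluating the roundtripped weights (rather than the pre-quantization ones) ensures the reported 1.1888 val_bpb reflects the model as actually stored.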