PR #195

open

Add chasewebb 9x512 sp1024 baseline (val_bpb: 1.2355)

by chasewebb
val_bpb: 1.2355
Architecture: Transformer
Optimizer:
Artifact Size: 15.87 MB

Training Techniques

  • Quantization: int8 (bits: 8, scope: all)
  • Architecture: GQA. Grouped-query attention with 8 attention heads and 4 KV heads (see the attention sketch after this list). parameters: {"heads": 8, "kv_heads": 4}
  • Architecture: tied embeddings. Input and output embeddings are tied. parameters: null
  • Sequence Length: sequence_length (train_length: 1024, eval_length: null)
  • Compression: zlib (level: null)
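
The GQA setup above (8 query heads sharing 4 KV heads at the 512-dim model width) can be sketched roughly as below. This is an illustrative PyTorch module, not the PR's implementation; the module and projection names are assumptions.

```python
# Minimal grouped-query attention sketch for the reported config
# (dim=512, 8 query heads, 4 KV heads); names are illustrative only.
import torch
import torch.nn.functional as F
from torch import nn

class GQAttention(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=4):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each KV head is shared by n_heads // n_kv_heads query heads (here 2).
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(B, T, -1))
```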

Novel Contributions

  • 9-layer 512-dim transformer baseline
  • 1024-vocab SentencePiece BPE tokenizer
  • Grouped-query attention with 8 heads and 4 KV heads
  • Tied embeddings
  • Training on 80 FineWeb shards (~8B tokens)
  • int8 + zlib artifact packaging (see the packaging sketch below)
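
For the int8 + zlib packaging step, a minimal sketch under assumed choices (per-tensor symmetric scales over all weights, matching bits: 8 / scope: all above, with a torch.save payload compressed by zlib) could look like this. The helper names and on-disk layout are hypothetical, not the PR's actual format.

```python
# Hypothetical int8 quantization + zlib compression of a checkpoint.
import io
import zlib
import torch

def pack_checkpoint(state_dict, path, level=9):
    packed = {}
    for name, w in state_dict.items():
        w = w.float()
        # Per-tensor symmetric scale into the int8 range (assumed scheme).
        scale = w.abs().max().clamp(min=1e-8) / 127.0
        q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
        packed[name] = {"q": q, "scale": scale}
    buf = io.BytesIO()
    torch.save(packed, buf)
    with open(path, "wb") as f:
        f.write(zlib.compress(buf.getvalue(), level))

def unpack_checkpoint(path):
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    packed = torch.load(io.BytesIO(raw), weights_only=True)
    # Dequantize back to float; the reconstruction is approximate.
    return {n: p["q"].float() * p["scale"] for n, p in packed.items()}
```

The reported 15.87 MB artifact size would depend on the exact serialization and the zlib level actually used, which the metadata above leaves unspecified.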