PR #50

closed

Record: Sliding Window Eval (stride=64), val_bpb=1.1925

by mattqlf
val_bpb
1.1925
Architecture
Transformer
Optimizer
Artifact Size
15,874,829 bytes

Training Techniques

Evaluation
sliding window eval
parameters: {"stride":64,"batch_seqs":1024}
Architecture
weight tying
Tied input and output embeddings in the baseline architecture.
parameters: null
KV head count
Baseline Transformer uses fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
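The two architecture techniques above both shrink the parameter budget. A minimal sketch of the accounting, assuming no attention biases and a standard grouped-query layout (the function names and `d_model`/`vocab_size` values are illustrative, not from the repo):

```python
def attn_params(d_model, num_heads, num_kv_heads):
    """Parameter count of one GQA attention block (no biases).

    Q and the output projection keep full width; K and V are shared
    across head groups, so their projections shrink by a factor of
    num_heads / num_kv_heads.
    """
    head_dim = d_model // num_heads
    q = d_model * num_heads * head_dim           # full query projection
    kv = 2 * d_model * num_kv_heads * head_dim   # halved when kv_heads = heads / 2
    out = d_model * d_model
    return q + kv + out

def tied_embedding_savings(vocab_size, d_model):
    # Weight tying reuses one vocab x d_model matrix for both the
    # input embedding and the output head, saving one full copy.
    return vocab_size * d_model
```

With the card's settings (`num_heads=8`, `num_kv_heads=4`), the K/V projections are half the size they would be under standard multi-head attention.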
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
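The int8 quantization and zlib compression above are what keep the artifact under the size cap. A hypothetical sketch of such a pipeline, assuming symmetric per-tensor scales (the real submission may quantize per-channel or serialize scales differently, and its zlib level is unspecified):

```python
import zlib
import numpy as np

def pack_weights(w, level=9):
    """Symmetric per-tensor int8 quantization followed by zlib."""
    # Scale so the largest magnitude maps to 127; guard against
    # an all-zero tensor to avoid division by zero.
    scale = max(float(np.abs(w).max()), 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes(), level), scale

def unpack_weights(blob, scale, shape):
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return q.astype(np.float32).reshape(shape) * scale
```

Round-trip error is bounded by half the quantization step (`scale / 2`), and the packed blob is at most a quarter of the float32 size before zlib even helps.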
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024

Novel Contributions

  • Sliding window evaluation with stride 64, so tokens are scored with far more preceding context than disjoint-block evaluation provides
  • Improved validation BPB purely through the evaluation strategy, with no changes to training
  • Each validation token is scored exactly once with near-maximum context
  • Maintained artifact size under the 16MB cap while achieving a new record
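The scoring scheme in the bullets above can be sketched as follows. This is an illustrative reconstruction, not the submission's code: `log_prob_fn(context, token)` stands in for a model forward pass returning log P(token | context):

```python
import math

def sliding_window_nll(tokens, log_prob_fn, max_len=1024, stride=64):
    """Total negative log-likelihood with sliding-window scoring.

    Each token after the first is scored exactly once, conditioned on
    up to max_len - 1 preceding tokens. The window advances by `stride`,
    so after the first window only its last `stride` positions are newly
    scored, each with near-maximum context.
    """
    total_nll = 0.0
    prev_end = 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + max_len, len(tokens))
        for i in range(max(prev_end, begin + 1), end):
            total_nll += -log_prob_fn(tokens[begin:i], tokens[i])
        prev_end = end
        if end == len(tokens):
            break
    return total_nll
```

In practice this is one batched forward pass per window, reading logits only at the newly scored positions; bits-per-byte is then `total_nll / (ln 2 × num_bytes)`. A smaller stride raises compute cost (roughly `max_len / stride` passes over the data) in exchange for richer context per token.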