val_bpb: 1.1925
Architecture: Transformer
Optimizer: —
Artifact Size: 15,874,829 bytes
Training Techniques
Evaluation
sliding window eval
Scores each validation token exactly once with a sliding window, so nearly a full context window precedes every scored token.
parameters: {"stride": 64, "batch_seqs": 1024}
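The span bookkeeping behind stride-64 sliding-window evaluation can be sketched as follows (a minimal illustration; the function name and the 4096-token example are assumptions, not the submission's code). Each step advances by the stride and scores only the newly exposed tokens, so every token is scored exactly once while conditioning on up to a full window of prior context.

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Yield (start, end, n_scored) spans over a token sequence.

    Each span covers up to `window` tokens of context, but only the
    tokens not covered by an earlier span are actually scored.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        n_scored = end - prev_end  # score only tokens not yet scored
        spans.append((begin, end, n_scored))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(4096, window=1024, stride=64)
total_scored = sum(n for _, _, n in spans)
assert total_scored == 4096  # every token scored exactly once
```

After the first window, each step scores just `stride` tokens, each conditioned on roughly `window - stride` tokens of context; a smaller stride buys richer context at the cost of more forward passes.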
Architecture
weight tying
Tied input and output embeddings in the baseline architecture.
parameters: null
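Weight tying can be sketched in a few lines (a numpy stand-in, not the actual model code): a single matrix serves both as the input embedding table and, transposed, as the output vocabulary projection.

```python
import numpy as np

vocab, d_model = 100, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((vocab, d_model))  # one shared matrix

def embed(token_ids):
    # Input side: row lookup into the shared matrix.
    return W[token_ids]                    # (seq, d_model)

def logits(hidden):
    # Output side: the same matrix, transposed, projects to the vocab.
    return hidden @ W.T                    # (seq, vocab)

h = embed(np.array([1, 2, 3]))
out = logits(h)
assert out.shape == (3, vocab)
```

Tying eliminates the separate vocab-by-d_model output matrix, which is a meaningful saving under a hard artifact-size cap.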
KV head count
The baseline Transformer uses grouped-query attention: fewer KV heads than attention heads, with each KV head shared by a group of query heads.
parameters: {"num_heads": 8, "num_kv_heads": 4}
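The head bookkeeping implied by these parameters can be sketched with numpy (a toy illustration; the tensor shapes are assumptions). With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads, halving the K/V parameters and cache.

```python
import numpy as np

num_heads, num_kv_heads, head_dim, seq = 8, 4, 8, 5
group = num_heads // num_kv_heads            # 2 query heads per KV head
rng = np.random.default_rng(0)
k = rng.standard_normal((num_kv_heads, seq, head_dim))

# Expand KV heads so each query head attends to its group's shared keys.
k_expanded = np.repeat(k, group, axis=0)     # (num_heads, seq, head_dim)
assert k_expanded.shape == (num_heads, seq, head_dim)
# Query heads 0 and 1 share KV head 0, heads 2 and 3 share KV head 1, ...
assert np.array_equal(k_expanded[0], k_expanded[1])
```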
Quantization
int8
bits: 8
scope: all
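One common way to realize "int8, scope: all" is symmetric per-tensor quantization; the card does not specify the exact scheme, so the sketch below is an assumption for illustration.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization (illustrative scheme)."""
    scale = np.abs(w).max() / 127.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Rounding error is bounded by half a quantization step.
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

Storing one float scale per tensor plus int8 weights cuts weight storage roughly 4x versus float32, before any entropy coding.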
Compression
zlib
level: null
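The artifact packaging step likely looks like the following (an assumption: the card lists zlib with no level, so the stdlib default is shown; the zero-filled weight buffer is a stand-in, not real model data).

```python
import zlib
import numpy as np

weights = np.zeros(1024, dtype=np.int8)   # stand-in for quantized weights
raw = weights.tobytes()
packed = zlib.compress(raw)               # default compression level
restored = np.frombuffer(zlib.decompress(packed), dtype=np.int8)

assert np.array_equal(restored, weights)  # lossless round trip
assert len(packed) < len(raw)             # highly compressible stand-in
```

Lossless compression on top of int8 weights is what keeps the serialized artifact under the 16 MB cap without changing the model's predictions.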
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Novel Contributions
- Sliding window evaluation with stride 64 to score tokens using much richer context
- Improved validation BPB entirely through evaluation strategy rather than training changes
- Each validation token is scored exactly once with near-maximum context
- Maintained artifact size under the 16MB cap while achieving a new record