val_bpb: 1.1925
Architecture: Transformer
Optimizer: —
Artifact Size: 15,874,829 bytes
Training Techniques
Evaluation
sliding window eval
Scores each validation token exactly once with a sliding window, so nearly a full context window precedes every scored token.
parameters: {"stride": 64, "batch_seqs": 1024}
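The span bookkeeping behind stride-64 sliding-window evaluation can be sketched as follows (a minimal illustration; the function name and the 4096-token example are assumptions, not the submission's code). Each step advances by the stride and scores only the newly exposed tokens, so every token is scored exactly once while conditioning on up to a full window of prior context.

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Yield (start, end, n_scored) spans over a token sequence.

    Each span covers up to `window` tokens of context, but only the
    tokens not covered by an earlier span are actually scored.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        n_scored = end - prev_end  # score only tokens not yet scored
        spans.append((begin, end, n_scored))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(4096, window=1024, stride=64)
total_scored = sum(n for _, _, n in spans)
assert total_scored == 4096  # every token scored exactly once
```

After the first window, each step scores just `stride` tokens, each conditioned on roughly `window - stride` tokens of context; a smaller stride buys richer context at the cost of more forward passes.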
Architecture
weight tying
Tied input and output embeddings in the baseline architecture.
parameters: null
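Weight tying can be sketched in a few lines (a numpy stand-in, not the actual model code): a single matrix serves both as the input embedding table and, transposed, as the output vocabulary projection.

```python
import numpy as np

vocab, d_model = 100, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((vocab, d_model))  # one shared matrix

def embed(token_ids):
    # Input side: row lookup into the shared matrix.
    return W[token_ids]                    # (seq, d_model)

def logits(hidden):
    # Output side: the same matrix, transposed, projects to the vocab.
    return hidden @ W.T                    # (seq, vocab)

h = embed(np.array([1, 2, 3]))
out = logits(h)
assert out.shape == (3, vocab)
```

Tying eliminates the separate vocab-by-d_model output matrix, which is a meaningful saving under a hard artifact-size cap.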
KV head count
The baseline Transformer uses grouped-query attention: fewer KV heads than attention heads, with each KV head shared by a group of query heads.
parameters: {"num_heads": 8, "num_kv_heads": 4}
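The head bookkeeping implied by these parameters can be sketched with numpy (a toy illustration; the tensor shapes are assumptions). With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads, halving the K/V parameters and cache.

```python
import numpy as np

num_heads, num_kv_heads, head_dim, seq = 8, 4, 8, 5
group = num_heads // num_kv_heads            # 2 query heads per KV head
rng = np.random.default_rng(0)
k = rng.standard_normal((num_kv_heads, seq, head_dim))

# Expand KV heads so each query head attends to its group's shared keys.
k_expanded = np.repeat(k, group, axis=0)     # (num_heads, seq, head_dim)
assert k_expanded.shape == (num_heads, seq, head_dim)
# Query heads 0 and 1 share KV head 0, heads 2 and 3 share KV head 1, ...
assert np.array_equal(k_expanded[0], k_expanded[1])
```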
Quantization
int8
bits: 8
scope: all
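One common way to realize "int8, scope: all" is symmetric per-tensor quantization; the card does not specify the exact scheme, so the sketch below is an assumption for illustration.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization (illustrative scheme)."""
    scale = np.abs(w).max() / 127.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Rounding error is bounded by half a quantization step.
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

Storing one float scale per tensor plus int8 weights cuts weight storage roughly 4x versus float32, before any entropy coding.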
Compression
zlib
level: null
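The artifact packaging step likely looks like the following (an assumption: the card lists zlib with no level, so the stdlib default is shown; the zero-filled weight buffer is a stand-in, not real model data).

```python
import zlib
import numpy as np

weights = np.zeros(1024, dtype=np.int8)   # stand-in for quantized weights
raw = weights.tobytes()
packed = zlib.compress(raw)               # default compression level
restored = np.frombuffer(zlib.decompress(packed), dtype=np.int8)

assert np.array_equal(restored, weights)  # lossless round trip
assert len(packed) < len(raw)             # highly compressible stand-in
```

Lossless compression on top of int8 weights is what keeps the serialized artifact under the 16 MB cap without changing the model's predictions.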
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Novel Contributions
- Sliding window evaluation with stride 64 to score tokens using much richer context
- Improved validation BPB entirely through evaluation strategy rather than training changes
- Each validation token is scored exactly once with near-maximum context
- Maintained artifact size under the 16MB cap while achieving a new record