PR #49

RECORD (closed)

SOTA attempt (val_bpb=1.2064)

by spokane-way
val_bpb: 1.2058
Architecture: Transformer
Optimizer:
Artifact Size: 15867270 bytes

Training Techniques

Architecture
  • tied embeddings: input and output embeddings are tied (parameters: null)
  • KV head count: uses fewer KV heads than attention heads (parameters: {"num_heads":8,"num_kv_heads":4})

Sequence Length
  • sequence_length: train_length 2048, eval_length null

Compression
  • zlib: level null
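
The KV head reduction above is grouped-query attention: several query heads share one key/value head. A minimal NumPy sketch, using the head counts from this record's config (8 query heads, 4 KV heads); the head dimension, sequence length, and all variable names are illustrative assumptions, not the submission's code:

```python
import numpy as np

# Head counts taken from the record's config; head_dim and seq are made up.
num_heads, num_kv_heads, head_dim, seq = 8, 4, 16, 32
group = num_heads // num_kv_heads  # 2 query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((num_heads, seq, head_dim))
k = rng.standard_normal((num_kv_heads, seq, head_dim))
v = rng.standard_normal((num_kv_heads, seq, head_dim))

# Expand K/V so each group of query heads attends to the same KV head.
# This is the memory saving: only num_kv_heads KV projections are stored.
k_exp = np.repeat(k, group, axis=0)  # (8, seq, head_dim)
v_exp = np.repeat(v, group, axis=0)

# Standard scaled dot-product attention per query head.
scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_exp  # (8, seq, head_dim)
```

Halving the KV heads halves the KV cache at inference time while leaving the query-side capacity at 8 heads.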

Novel Contributions

  • Long-context training at sequence length 2048
  • Tied input/output embeddings
  • Reduced KV head count (8 attention heads, 4 KV heads)
  • Standalone record script with baked-in defaults
  • Int8 + zlib roundtrip serialization for the final submission artifact
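
The int8 + zlib roundtrip in the last bullet can be sketched as follows. This is a hedged illustration only: the function names, the symmetric per-tensor scale scheme, and the compression level are assumptions, not the submission's actual serialization code (the record lists the zlib level as null, i.e. unspecified):

```python
import zlib
import numpy as np

def pack(weights: np.ndarray, level: int = 9) -> tuple[bytes, float]:
    """Quantize float weights to int8, then zlib-compress the raw bytes."""
    # Symmetric per-tensor scale so the largest magnitude maps to 127.
    scale = float(np.abs(weights).max()) / 127.0 or 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes(), level), scale

def unpack(blob: bytes, scale: float, shape: tuple[int, ...]) -> np.ndarray:
    """Reverse path: decompress, reinterpret as int8, dequantize."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale

# Roundtrip demo on a dummy weight matrix.
w = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)
blob, scale = pack(w)
w2 = unpack(blob, scale, w.shape)
# Dequantization error is bounded by half a quantization step (scale / 2).
```

The quantization is lossy (bounded by half a step), while the zlib stage is lossless, so only `pack` affects the evaluated weights.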