PR #749

open

Add 11L 448x2 PairHash int8+zstd 10-minute submission record

by FyeJordy
val_bpb: 1.3684
Architecture: Transformer
Optimizer:
Artifact Size: 15,149,719 bytes

Training Techniques

Architecture
weight tying
Tied input and output embeddings.
parameters: null
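Weight tying shares a single matrix between the token embedding and the output projection, so the artifact stores that vocab × dim matrix once instead of twice. A minimal PyTorch sketch (the vocabulary size here is a placeholder, not taken from the PR; a real model runs transformer blocks between the two tied layers):

```python
import torch
import torch.nn as nn

class TiedLMHead(nn.Module):
    # Sketch of tied input/output embeddings; vocab_size is illustrative.
    def __init__(self, vocab_size=32768, dim=448):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight  # one tensor serves both roles

    def forward(self, tokens):
        h = self.wte(tokens)    # (batch, seq, dim); transformer blocks would go here
        return self.lm_head(h)  # logits over the vocabulary
```

Because the two modules hold the same tensor object, gradients from both the embedding lookup and the output projection accumulate into it during training.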
KV head count
Used grouped-query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
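With 8 query heads and 4 KV heads, the key/value projections (and KV cache) are half the size of the query projection; each KV head serves 2 query heads. A sketch of how grouped-query attention expands the KV heads at attention time, assuming model dim 448 from the title (head dim 448 / 8 = 56) — the exact module in train_gpt.py may differ:

```python
import torch
import torch.nn.functional as F

def gqa(x, wq, wk, wv, num_heads=8, num_kv_heads=4):
    # x: (batch, seq, dim); wk/wv project to num_kv_heads * head_dim columns
    B, T, D = x.shape
    hd = D // num_heads
    q = (x @ wq).view(B, T, num_heads, hd).transpose(1, 2)     # (B, H, T, hd)
    k = (x @ wk).view(B, T, num_kv_heads, hd).transpose(1, 2)  # (B, Hkv, T, hd)
    v = (x @ wv).view(B, T, num_kv_heads, hd).transpose(1, 2)
    group = num_heads // num_kv_heads
    k = k.repeat_interleave(group, dim=1)  # expand KV heads to match query heads
    v = v.repeat_interleave(group, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(B, T, D)
```

Note the storage saving shows up in wk/wv, which are (448, 224) here versus (448, 448) for wq.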
PairHash
Enabled PairHash embeddings for the model.
parameters: {"buckets":8192,"pair_dimensions":96}
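The PR does not show the PairHash implementation in this excerpt. A minimal sketch of one plausible reading of the listed parameters — hash each adjacent (previous token, current token) pair into one of 8192 buckets and look up a 96-dimensional auxiliary embedding; both the pairing rule and the multiplicative hash below are assumptions:

```python
import torch
import torch.nn as nn

class PairHashEmbedding(nn.Module):
    # Hypothetical sketch: bucketed embeddings of adjacent token pairs
    # (buckets=8192, pair_dimensions=96, matching the listed parameters).
    def __init__(self, buckets=8192, pair_dim=96):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.Embedding(buckets, pair_dim)

    def forward(self, tokens):
        # tokens: (batch, seq) int64 token ids
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at the first position
        # simple multiplicative hash of the pair (an assumption, not the PR's hash)
        h = (prev * 1000003 + tokens) % self.buckets
        return self.emb(h)  # (batch, seq, pair_dim)
```

The appeal for a 16MB artifact is that 8192 × 96 parameters give the model some bigram-level signal far more cheaply than a full pairwise table would.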
Quantization
int8
bits: 8
scope: model export
Compression
zstd
level: null
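The export path quantizes weights to int8 and then compresses the serialized bytes. A sketch of one common scheme — symmetric per-tensor int8 with a float32 scale per tensor; the winning run compresses with zstd, but this sketch substitutes stdlib zlib so it runs without extra dependencies, and the serialization layout is an assumption:

```python
import io
import zlib
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8: the largest magnitude maps to +/-127.
    scale = float(np.abs(w).max()) / 127.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def export_artifact(tensors):
    # Serialize each quantized tensor plus its scale, then compress the stream.
    # The PR uses zstd; zlib stands in here purely to keep the sketch stdlib-only.
    buf = io.BytesIO()
    for name, w in tensors.items():
        q, scale = quantize_int8(w)
        np.save(buf, q)
        buf.write(np.float32(scale).tobytes())
    return zlib.compress(buf.getvalue(), level=9)
```

int8 alone cuts a float32 checkpoint to roughly a quarter of its size; the entropy coder then squeezes the remaining redundancy in the int8 distribution, which is how a run of this size fits under the 16MB cap.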
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
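"warmdown" with warmdown_steps: 3500 suggests the learning rate holds at its base value and then decays to zero over the final 3500 steps. The linear decay shape and the total step count below are assumptions, not taken from train_gpt.py:

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    # Constant base LR, then linear warmdown to zero over the final steps
    # (a common schedule shape; the exact curve in the run is an assumption).
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```

For example, with 7000 total steps the decay begins at step 3500 and the rate is halved by step 5250.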
Evaluation
full validation on the fineweb_val_* split
parameters: {"stride":2000}
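With train_length 2048 and eval stride 2000, evaluation windows presumably advance 2000 tokens at a time, so consecutive windows overlap by 48 tokens that serve as context but should not be scored twice. A sketch of that window scheduling (the interaction of stride and window length here is an assumption; eval_length is null in the listing, so the 2048-token window is borrowed from training):

```python
def eval_windows(n_tokens, seq_len=2048, stride=2000):
    # Yield (start, score_from) pairs: each window spans seq_len tokens,
    # but only positions from score_from onward contribute to val_bpb,
    # so the seq_len - stride overlapping context tokens are not re-scored.
    start = 0
    while start < n_tokens - 1:
        score_from = 0 if start == 0 else seq_len - stride
        yield start, score_from
        start += stride
```

The first window scores all 2048 positions; every later window scores its final 2000, giving contiguous, non-overlapping coverage of the validation split.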
Regularization
weight tying
parameters: null

Novel Contributions

  • New 10-minute / 16MB-track submission record
  • 11-layer, 448-dim GQA model with MLP multiplier 2
  • PairHash embeddings with 8192 buckets and 96 pair dimensions
  • int8 + zstd export path to keep the artifact under the 16MB cap
  • Included exact train_gpt.py snapshot, train.log, submission.json, and README for the winning run