val_bpb: 1.3684
Architecture: Transformer
Optimizer: —
Artifact Size: 15,149,719 bytes
Training Techniques
Architecture
- weight tying: tied input and output embeddings (parameters: null)
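Weight tying means the token-embedding matrix and the output-projection matrix are one shared parameter set, halving the parameter cost of the vocabulary. A minimal sketch in plain Python (shapes and values here are illustrative, not the submission's):

```python
# Minimal weight-tying sketch: the embedding table W is the SAME object
# used as the output head, so both stages train one set of parameters.
def make_tied_model(vocab_size, dim):
    # Shared matrix: row i is both the embedding of token i and the
    # output-head row that produces token i's logit.
    W = [[0.01 * (i + j) for j in range(dim)] for i in range(vocab_size)]

    def embed(token_id):
        return W[token_id]                      # lookup into shared W

    def logits(hidden):
        # Output projection reuses W: logit[t] = <hidden, W[t]>
        return [sum(h * w for h, w in zip(hidden, row)) for row in W]

    return W, embed, logits

W, embed, logits = make_tied_model(vocab_size=4, dim=3)
h = embed(2)
print(len(logits(h)))  # one logit per vocab entry -> 4
```

Because `embed` and `logits` read the same `W`, any gradient update to one view updates the other.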
- KV head count: grouped-query attention with fewer KV heads than attention heads (num_heads: 8, num_kv_heads: 4)
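With the listed configuration (num_heads=8, num_kv_heads=4), grouped-query attention shares each KV head across a group of query heads, shrinking the KV cache by the group factor. A small sketch of the head mapping:

```python
# Grouped-query attention head mapping for the listed config:
# 8 query heads share 4 KV heads, so each KV head serves a group of 2.
NUM_HEADS = 8
NUM_KV_HEADS = 4
GROUP = NUM_HEADS // NUM_KV_HEADS  # 2 query heads per KV head

def kv_head_for(q_head):
    # Query head q attends against the keys/values of KV head q // GROUP.
    return q_head // GROUP

mapping = {q: kv_head_for(q) for q in range(NUM_HEADS)}
print(mapping)  # {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
```

Halving the KV heads halves the K/V projection parameters and the KV cache, which matters under a tight artifact-size cap.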
- PairHash: PairHash embeddings enabled (buckets: 8192, pair_dimensions: 96)
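The PairHash mechanism itself is not specified in this summary. One plausible reading, sketched purely as an assumption, is that adjacent token pairs are hashed into a fixed number of buckets (8192 here), each bucket indexing a small auxiliary embedding of 96 dimensions that augments the regular token embedding:

```python
# Hypothetical PairHash sketch (the real mechanism is not documented
# here): hash each adjacent token pair into one of BUCKETS slots, and
# use the bucket to index a small auxiliary embedding table.
import hashlib

BUCKETS = 8192     # from the listed parameters
PAIR_DIM = 96      # from the listed parameters

def pair_bucket(tok_a, tok_b):
    # Deterministically hash the (previous, current) token pair.
    key = f"{tok_a}:{tok_b}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % BUCKETS

# Auxiliary table: one PAIR_DIM-dim vector per bucket (zeros here).
pair_table = [[0.0] * PAIR_DIM for _ in range(BUCKETS)]

tokens = [17, 42, 42, 99]
buckets = [pair_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
print(all(0 <= b < BUCKETS for b in buckets))  # True
```

The table cost under this reading would be 8192 × 96 parameters, small enough to fit the size budget.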
Quantization
- int8 (bits: 8, scope: model export)
Compression
- zstd (level: null)
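A common way to realize an int8 export is symmetric per-tensor quantization (scale = max|w| / 127), with the quantized bytes then zstd-compressed on disk. The sketch below shows only the quantization step, as an assumption about the export path; zstd is omitted since it is not in the Python standard library:

```python
# Hedged int8-export sketch: symmetric per-tensor quantization with a
# round-trip (dequantize) check. The actual export may differ.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid 0 scale
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

w = [-0.5, 0.0, 0.25, 0.5]
q, s = quantize_int8(w)
print(q[0], q[3])  # -127 127: extremes map to the int8 range limits
```

Each float32 weight becomes one byte, a 4x reduction before compression even starts, which is how a multi-million-parameter model can fit under 16MB.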
Sequence Length
- train_length: 2048, eval_length: null
LR Schedule
- warmdown (warmdown_steps: 3500)
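A warmdown schedule typically holds the base learning rate constant and then decays it linearly to zero over the final warmdown_steps steps. A sketch under that assumption (the run's total step count and base LR are not given here, so the values below are illustrative):

```python
# Warmdown LR sketch: constant LR, then linear decay to zero over the
# last warmdown_steps steps. total_steps and base_lr are illustrative.
def warmdown_lr(step, base_lr, total_steps, warmdown_steps=3500):
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr                       # constant phase
    # Linear warmdown: base_lr -> 0 over the final warmdown_steps steps.
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps

print(warmdown_lr(0, 1e-3, 10000))      # 0.001 (constant phase)
print(warmdown_lr(10000, 1e-3, 10000))  # 0.0 (end of warmdown)
```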
Evaluation
- full validation on fineweb_val_* split (stride: 2000)
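Strided full-split evaluation usually slides a context window (2048 tokens at train length) over the validation stream in steps of the stride (2000), scoring only the tokens not already covered by the previous window so each token is counted exactly once. A sketch of the window bookkeeping, assuming that standard scheme:

```python
# Strided-evaluation sketch: slide a `window`-token context over the
# stream in steps of `stride`, scoring only newly covered tokens.
def eval_windows(n_tokens, window=2048, stride=2000):
    # Yields (ctx_start, ctx_end, n_scored): the context fed to the model
    # and how many trailing tokens of it are newly scored.
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = eval_windows(6000)
print(sum(s for _, _, s in spans))  # 6000: every token scored once
```

The 48-token overlap between consecutive windows gives each scored token some preceding context without double-counting any position in the bits-per-byte total.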
Regularization
- weight tying (parameters: null)
Novel Contributions
- New 10-minute / 16MB-track submission record
- 11-layer, 448-dim GQA model with MLP multiplier 2
- PairHash embeddings with 8192 buckets and 96 pair dimensions
- int8 + zstd export path to keep the artifact under the 16MB cap
- Included exact train_gpt.py snapshot, train.log, submission.json, and README for the winning run