val_bpb: 1.3684
Architecture: Transformer
Optimizer: —
Artifact Size: 15,149,719 bytes
Training Techniques
Architecture
- weight tying: tied input and output embeddings (parameters: null)
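Weight tying means the token-embedding matrix and the output-projection matrix are one shared parameter set, halving the parameter cost of the vocabulary. A minimal sketch in plain Python (shapes and values here are illustrative, not the submission's):

```python
# Minimal weight-tying sketch: the embedding table W is the SAME object
# used as the output head, so both stages train one set of parameters.
def make_tied_model(vocab_size, dim):
    # Shared matrix: row i is both the embedding of token i and the
    # output-head row that produces token i's logit.
    W = [[0.01 * (i + j) for j in range(dim)] for i in range(vocab_size)]

    def embed(token_id):
        return W[token_id]                      # lookup into shared W

    def logits(hidden):
        # Output projection reuses W: logit[t] = <hidden, W[t]>
        return [sum(h * w for h, w in zip(hidden, row)) for row in W]

    return W, embed, logits

W, embed, logits = make_tied_model(vocab_size=4, dim=3)
h = embed(2)
print(len(logits(h)))  # one logit per vocab entry -> 4
```

Because `embed` and `logits` read the same `W`, any gradient update to one view updates the other.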
- KV head count: grouped-query attention with fewer KV heads than attention heads (num_heads: 8, num_kv_heads: 4)
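With the listed configuration (num_heads=8, num_kv_heads=4), grouped-query attention shares each KV head across a group of query heads, shrinking the KV cache by the group factor. A small sketch of the head mapping:

```python
# Grouped-query attention head mapping for the listed config:
# 8 query heads share 4 KV heads, so each KV head serves a group of 2.
NUM_HEADS = 8
NUM_KV_HEADS = 4
GROUP = NUM_HEADS // NUM_KV_HEADS  # 2 query heads per KV head

def kv_head_for(q_head):
    # Query head q attends against the keys/values of KV head q // GROUP.
    return q_head // GROUP

mapping = {q: kv_head_for(q) for q in range(NUM_HEADS)}
print(mapping)  # {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
```

Halving the KV heads halves the K/V projection parameters and the KV cache, which matters under a tight artifact-size cap.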
- PairHash: PairHash embeddings enabled (buckets: 8192, pair_dimensions: 96)
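The PairHash mechanism itself is not specified in this summary. One plausible reading, sketched purely as an assumption, is that adjacent token pairs are hashed into a fixed number of buckets (8192 here), each bucket indexing a small auxiliary embedding of 96 dimensions that augments the regular token embedding:

```python
# Hypothetical PairHash sketch (the real mechanism is not documented
# here): hash each adjacent token pair into one of BUCKETS slots, and
# use the bucket to index a small auxiliary embedding table.
import hashlib

BUCKETS = 8192     # from the listed parameters
PAIR_DIM = 96      # from the listed parameters

def pair_bucket(tok_a, tok_b):
    # Deterministically hash the (previous, current) token pair.
    key = f"{tok_a}:{tok_b}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % BUCKETS

# Auxiliary table: one PAIR_DIM-dim vector per bucket (zeros here).
pair_table = [[0.0] * PAIR_DIM for _ in range(BUCKETS)]

tokens = [17, 42, 42, 99]
buckets = [pair_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
print(all(0 <= b < BUCKETS for b in buckets))  # True
```

The table cost under this reading would be 8192 × 96 parameters, small enough to fit the size budget.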
Quantization
- int8 (bits: 8, scope: model export)
Compression
- zstd (level: null)
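A common way to realize an int8 export is symmetric per-tensor quantization (scale = max|w| / 127), with the quantized bytes then zstd-compressed on disk. The sketch below shows only the quantization step, as an assumption about the export path; zstd is omitted since it is not in the Python standard library:

```python
# Hedged int8-export sketch: symmetric per-tensor quantization with a
# round-trip (dequantize) check. The actual export may differ.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid 0 scale
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

w = [-0.5, 0.0, 0.25, 0.5]
q, s = quantize_int8(w)
print(q[0], q[3])  # -127 127: extremes map to the int8 range limits
```

Each float32 weight becomes one byte, a 4x reduction before compression even starts, which is how a multi-million-parameter model can fit under 16MB.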
Sequence Length
- train_length: 2048, eval_length: null
LR Schedule
- warmdown (warmdown_steps: 3500)
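A warmdown schedule typically holds the base learning rate constant and then decays it linearly to zero over the final warmdown_steps steps. A sketch under that assumption (the run's total step count and base LR are not given here, so the values below are illustrative):

```python
# Warmdown LR sketch: constant LR, then linear decay to zero over the
# last warmdown_steps steps. total_steps and base_lr are illustrative.
def warmdown_lr(step, base_lr, total_steps, warmdown_steps=3500):
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr                       # constant phase
    # Linear warmdown: base_lr -> 0 over the final warmdown_steps steps.
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps

print(warmdown_lr(0, 1e-3, 10000))      # 0.001 (constant phase)
print(warmdown_lr(10000, 1e-3, 10000))  # 0.0 (end of warmdown)
```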
Evaluation
- full validation on fineweb_val_* split (stride: 2000)
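Strided full-split evaluation usually slides a context window (2048 tokens at train length) over the validation stream in steps of the stride (2000), scoring only the tokens not already covered by the previous window so each token is counted exactly once. A sketch of the window bookkeeping, assuming that standard scheme:

```python
# Strided-evaluation sketch: slide a `window`-token context over the
# stream in steps of `stride`, scoring only newly covered tokens.
def eval_windows(n_tokens, window=2048, stride=2000):
    # Yields (ctx_start, ctx_end, n_scored): the context fed to the model
    # and how many trailing tokens of it are newly scored.
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = eval_windows(6000)
print(sum(s for _, _, s in spans))  # 6000: every token scored once
```

The 48-token overlap between consecutive windows gives each scored token some preceding context without double-counting any position in the bits-per-byte total.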
Regularization
- weight tying (parameters: null)
Novel Contributions
- New 10-minute / 16MB-track submission record
- 11-layer, 448-dim GQA model with MLP multiplier 2
- PairHash embeddings with 8192 buckets and 96 pair dimensions
- int8 + zstd export path to keep the artifact under the 16MB cap
- Included exact train_gpt.py snapshot, train.log, submission.json, and README for the winning run