PR #247

open

Non-record: local RTX 4070 SP1024 8x512 KV4 seq768 500-step run

val_bpb

1.6114

Architecture

Transformer

Optimizer

—

Artifact Size

10036271 bytes

Training Techniques

Architecture

tied embeddings

Input and output embeddings are tied.

parameters: null
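
As a rough illustration of what tying means, the sketch below reuses one table both to look up input vectors and to produce output logits. The helper names are hypothetical, and the 1024-entry vocabulary used in the savings figure is an assumption read off "SP1024" in the title, not something the PR states.

```python
# Minimal sketch of tied embeddings (hypothetical helpers, not the PR's code).
# A single table serves as the input embedding and as the output projection.

def embed(token_id, table):
    """Input side: look up the row for a token id."""
    return table[token_id]

def output_logits(hidden, table):
    """Output side: score every token by a dot product with the same rows."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in table]

def tied_param_savings(vocab_size, model_dim):
    """Parameters saved by not storing a separate output matrix."""
    return vocab_size * model_dim
```

Under the assumed vocabulary of 1024 and the stated model_dim of 512, `tied_param_savings(1024, 512)` gives the size of the output matrix that tying avoids storing.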

KV head count

Uses fewer key/value heads than attention heads.

parameters: {"layers":8,"model_dim":512,"num_heads":8,"num_kv_heads":4}
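
To make the effect of the reduced KV head count concrete, here is a small sketch under stated assumptions: query heads are assigned to KV heads in contiguous groups (a common convention, not confirmed by the PR), projections have no biases, and head_dim is model_dim divided by num_heads. The function names are hypothetical.

```python
# Sketch of grouped KV heads with the parameters listed above
# (model_dim=512, num_heads=8, num_kv_heads=4). Assumptions: contiguous
# grouping of query heads, bias-free projections, head_dim = model_dim // num_heads.

def kv_head_for_query_head(q_head, num_heads=8, num_kv_heads=4):
    """Return the index of the KV head that a given query head attends with."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    group_size = num_heads // num_kv_heads
    return q_head // group_size

def attention_projection_params(model_dim=512, num_heads=8, num_kv_heads=4):
    """Count weights in the Q, K, V and output projections of one layer."""
    head_dim = model_dim // num_heads
    q = model_dim * num_heads * head_dim
    k = model_dim * num_kv_heads * head_dim
    v = model_dim * num_kv_heads * head_dim
    o = num_heads * head_dim * model_dim
    return {"q": q, "k": k, "v": v, "o": o, "total": q + k + v + o}
```

With these assumptions, each pair of query heads shares one KV head, and the K and V projections are half the size they would be with 8 KV heads.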

Sequence Length

sequence_length

train_length: 768

eval_length: null

Compression

zlib

level: null

Other

other

Post-training evaluation of the serialized model artifact after an int8 quantization and zlib compression roundtrip.

parameters: {"serialized_model_bytes":9988629,"total_submission_bytes":10036271}
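
The roundtrip idea can be sketched as follows. This is not the PR's serialization code: the symmetric per-tensor scale, the header layout, and the use of zlib's default level (the card lists the level as null) are all assumptions made for illustration.

```python
import struct
import zlib

# Illustrative int8 + zlib roundtrip (hypothetical format, not the PR's).
# Assumptions: one symmetric scale per tensor, zlib default compression level.

_HEADER = "<dI"  # float64 scale, uint32 element count

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def pack(weights, level=-1):
    """Quantize, serialize, and compress a flat list of weights."""
    q, scale = quantize_int8(weights)
    raw = struct.pack(_HEADER, scale, len(q)) + struct.pack(f"<{len(q)}b", *q)
    return zlib.compress(raw, level)

def unpack(blob):
    """Decompress and dequantize back to floats."""
    raw = zlib.decompress(blob)
    scale, n = struct.unpack_from(_HEADER, raw, 0)
    q = struct.unpack_from(f"<{n}b", raw, struct.calcsize(_HEADER))
    return [v * scale for v in q]
```

Evaluating on the output of `unpack(pack(weights))` rather than on the original floats means the reported score reflects the weights that actually fit in the artifact.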

Non-record local consumer-GPU submission under the 16MB artifact cap
Throughput-oriented search path for an 8-layer 512-dim configuration
Evaluation on the full published validation split (fineweb_val_*)
Compact local RTX 4070 Laptop GPU run with tied embeddings and reduced KV heads
Public non-record anchor for a candidate family selected through repeated local search and validation
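
The size claim in the summary can be checked from the reported byte counts. The sketch below is only arithmetic on the figures in this card. Whether "16MB" means 16,000,000 or 16 × 1024 × 1024 bytes is not stated here, so both are treated as assumptions, and `budget_report` is a hypothetical helper.

```python
# Byte-budget check using the figures reported in this card.
SERIALIZED_MODEL_BYTES = 9_988_629
TOTAL_SUBMISSION_BYTES = 10_036_271

# Two readings of the "16MB" cap; which one applies is an assumption.
CAP_DECIMAL = 16_000_000
CAP_BINARY = 16 * 1024 * 1024

def budget_report(total_bytes, model_bytes, cap_bytes):
    """Summarize non-model overhead and remaining headroom under a cap."""
    return {
        "non_model_bytes": total_bytes - model_bytes,
        "headroom_bytes": cap_bytes - total_bytes,
        "within_cap": total_bytes <= cap_bytes,
    }
```

Under either reading of the cap, the reported total is within budget, and the difference between the two reported byte counts is the part of the submission that is not the serialized model.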