val_bpb: 1.6660
Architecture: Transformer
Optimizer: —
Artifact Size: 10.94 MB
Training Techniques

Architecture
- Tied embeddings: the input and output embedding matrices share weights (parameters: none).
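A minimal NumPy sketch of what weight tying means here: one matrix serves as both the input embedding table and the output (logit) projection. The vocabulary size and random initialization are illustrative assumptions; only `model_dim = 512` comes from this submission.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, model_dim = 32, 512  # vocab size is illustrative; model_dim matches the config

# One shared matrix: input embedding table AND output projection.
E = rng.normal(scale=0.02, size=(vocab, model_dim))

tokens = np.array([3, 7, 1])
h = E[tokens]         # input embedding lookup: (3, model_dim)
logits = h @ E.T      # output projection reuses the same weights: (3, vocab)
```

Tying removes the separate `vocab x model_dim` output matrix, which matters for a sub-16 MB artifact.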
- KV head count: KV-thin attention with fewer key/value heads than query heads (parameters: num_heads: 8, num_kv_heads: 2).
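A grouped-query-style sketch of the 8-query / 2-KV head split, assuming each group of 4 query heads shares one KV head (the repeat-based sharing, head dim, and sequence length are illustrative assumptions):

```python
import numpy as np

num_heads, num_kv_heads, head_dim, seq = 8, 2, 64, 16
group = num_heads // num_kv_heads  # 4 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.normal(size=(num_heads, seq, head_dim))
k = rng.normal(size=(num_kv_heads, seq, head_dim))  # only 2 KV heads stored
v = rng.normal(size=(num_kv_heads, seq, head_dim))

# Expand K/V so every query head in a group attends to its shared KV head.
k_full = np.repeat(k, group, axis=0)  # (8, seq, head_dim)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_full  # (8, seq, head_dim)
```

Storing 2 KV heads instead of 8 shrinks the K/V projection weights to a quarter of their full-attention size, another contribution to staying under the artifact cap.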
- Depth/width: shallower, compact Transformer configuration for local GPU training (parameters: layers: 7, model_dim: 512).
Quantization
- int8 (bits: 8, scope: all)

Compression
- zlib (level: null)
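A sketch of the int8 + zlib artifact pipeline on one weight tensor. Per-tensor absmax scaling is an assumption; the submission may use a different quantization scheme, and `level: null` is read here as zlib's default compression level.

```python
import numpy as np
import zlib

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(512, 512)).astype(np.float32)  # illustrative tensor

# Quantize: per-tensor absmax scale maps weights into the int8 range.
scale = float(np.abs(w).max()) / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Compress the int8 bytes with zlib at the default level.
blob = zlib.compress(q.tobytes())

# Reconstruction for evaluation: decompress, then dequantize.
w_hat = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(w.shape) * scale
max_err = float(np.abs(w - w_hat).max())
```

The artifact check is then just a byte count, e.g. `len(blob) < 16_000_000` summed over all tensors plus the scales.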
Sequence Length
- train_length: 1024, eval_length: 1024
LR Schedule
- Warmup (parameters: warmup_steps: 4)
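A sketch of a linear warmup matching `warmup_steps = 4`. The base learning rate and the flat schedule after warmup are illustrative assumptions; only the warmup step count comes from the submission.

```python
def lr_at(step: int, base_lr: float = 3e-4, warmup_steps: int = 4) -> float:
    """Linear warmup to base_lr over warmup_steps, then hold (assumed) flat."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

schedule = [lr_at(s) for s in range(6)]
```

With only 500 iterations total, a 4-step warmup is under 1% of the run.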
Other
- Local, non-record submission trained for 500 iterations on a single RTX 4070 Laptop GPU, under the 16 MB artifact cap (parameters: artifact_cap_bytes: 16000000, iterations: 500).
Novel Contributions
- Non-record local workstation run on a single RTX 4070 Laptop GPU
- Shallower 7-layer, 512-dim Transformer with KV-thin attention (8 query heads, 2 KV heads)
- Tied input/output embeddings
- Evaluation on the full published validation split while training on only the first published training shard
- Compact int8+zlib artifact under the 16MB cap