val_bpb: 1.6572
Architecture: Transformer
Optimizer: —
Artifact Size: 10296829 bytes (≈9.8 MiB)

Training Techniques
Architecture: tied embeddings
Input and output embeddings are tied.
parameters: null
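A minimal PyTorch sketch of what the weight tying looks like; the module names and vocabulary size are illustrative, not the submission's (model_dim matches the 512 reported below):

```python
import torch.nn as nn

vocab_size, model_dim = 50257, 512  # vocab size is a placeholder

embed = nn.Embedding(vocab_size, model_dim)             # input embedding
lm_head = nn.Linear(model_dim, vocab_size, bias=False)  # output projection
lm_head.weight = embed.weight                           # tie: one shared (vocab, dim) matrix

# Both modules now point at the same storage, so the matrix is stored
# (and compressed into the artifact) only once.
assert lm_head.weight.data_ptr() == embed.weight.data_ptr()
```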
KV head count
A grouped-query ("KV4") attention configuration: 4 key/value heads are shared across 8 query heads, so each KV head serves two query heads.
parameters: {"layers": 7, "model_dim": 512, "num_heads": 8, "num_kv_heads": 4}
Sequence Length
A fixed 1024-token context is used for both training and evaluation.
parameters: {"train_length": 1024, "eval_length": 1024}
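One common way to realize a fixed training length is to chop the token stream into 1024-token blocks. The sketch below assumes that convention; the submission's actual batching scheme is not specified:

```python
import numpy as np

train_length = 1024  # from the parameters above

def make_blocks(tokens: np.ndarray, length: int = train_length) -> np.ndarray:
    """Chop a flat token stream into fixed-length blocks, dropping the ragged tail."""
    n = (len(tokens) // length) * length
    return tokens[:n].reshape(-1, length)

blocks = make_blocks(np.arange(5000))
print(blocks.shape)  # (4, 1024)
```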
LR Schedule: warmup
The learning rate is warmed up over the first 4 steps.
parameters: {"warmup_steps": 4}
Compression: zlib
The artifact is zlib-compressed; the compression level is unspecified (null), presumably the library default.
parameters: {"level": null}
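A sketch of producing a zlib-compressed artifact and checking it against the 16 MB cap mentioned in the contributions; the serialization format is an assumption, and with no level given, zlib falls back to its default:

```python
import io
import zlib
import torch

model = torch.nn.Linear(512, 512)    # stand-in for the real network
buf = io.BytesIO()
torch.save(model.state_dict(), buf)  # serialization format is illustrative

# "level": null -> no level passed, so zlib uses Z_DEFAULT_COMPRESSION (-1).
artifact = zlib.compress(buf.getvalue())

cap = 16 * 1024 * 1024               # 16 MB artifact cap
print(f"{len(artifact)} bytes, under cap: {len(artifact) <= cap}")
```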
Novel Contributions
- A non-record submission trained locally on a consumer GPU, with the artifact under the 16 MB cap
- A shallower 7-layer, 512-dim KV4 configuration found through a local search loop
- Evaluation on the full published validation split
- Tied embeddings with separate learning rates for the shared embedding matrix and the remaining weight matrices (see the sketch after this list)
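The separate learning rates can be expressed as optimizer parameter groups. Only the split itself is documented; the optimizer choice, LR values, and module layout below are placeholders:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab=50257, dim=512, layers=7):  # dim/layers from the parameters above
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(layers))
        self.lm_head = nn.Linear(dim, vocab, bias=False)
        self.lm_head.weight = self.embed.weight          # tied embedding

model = TinyLM()

# One group for the single tied matrix, one for the other weight matrices.
# AdamW and both LR values are hypothetical.
opt = torch.optim.AdamW([
    {"params": [model.embed.weight], "lr": 1e-3},
    {"params": model.blocks.parameters(), "lr": 3e-4},
])
```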