PR #125

open

Add non-record 16MB layers7 submission

by akshai0296
val_bpb: 1.3797
Architecture: Transformer
Optimizer:
Artifact Size: 10289996 bytes

Training Techniques

Architecture
tied embeddings
Uses tied input/output embeddings.
parameters: {"enabled":1}
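A minimal sketch of what weight tying means here: one matrix serves both as the input embedding table and (transposed) as the output projection. All sizes below are illustrative placeholders, not values from this submission.

```python
import numpy as np

# Illustrative sizes only; the actual model dimensions are not stated here.
vocab_size, d_model = 100, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((vocab_size, d_model))  # single shared weight matrix

def embed(token_ids):
    # Input embedding: row lookup into the shared matrix.
    return W[token_ids]

def logits(hidden):
    # Output projection: multiply by the transpose of the same matrix,
    # so no separate unembedding parameters are stored.
    return hidden @ W.T

h = embed(np.array([3, 7]))  # shape (2, d_model)
out = logits(h)              # shape (2, vocab_size)
```

Tying halves the embedding-related parameter count, which matters directly for the artifact-size metric above.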
KV head count
Uses fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
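With 8 attention heads and 4 KV heads, each K/V head is shared by 2 query heads (grouped-query attention). A hedged numpy sketch of the mechanism, with made-up head and sequence dimensions:

```python
import numpy as np

# Matches the parameters above: 8 query heads, 4 KV heads.
# head_dim and seq are arbitrary illustrative values.
num_heads, num_kv_heads, head_dim, seq = 8, 4, 8, 5
group = num_heads // num_kv_heads  # 2 query heads per KV head
rng = np.random.default_rng(1)
q = rng.standard_normal((num_heads, seq, head_dim))
k = rng.standard_normal((num_kv_heads, seq, head_dim))
v = rng.standard_normal((num_kv_heads, seq, head_dim))

# Repeat each KV head so every query head has a matching K/V.
k_rep = np.repeat(k, group, axis=0)
v_rep = np.repeat(v, group, axis=0)

scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
out = weights @ v_rep                            # (num_heads, seq, head_dim)
```

Halving the KV heads halves the K/V projection weights (and any KV cache), another lever on artifact size.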
depth reduction
Reduces model depth from the 9-layer baseline to 7 layers to improve the capacity-speed tradeoff under a strict wallclock cap.
parameters: {"layers":7}
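A back-of-envelope view of what dropping two layers frees up, using the common ~12·d² parameters-per-block estimate. The hidden size is an assumption for illustration; the submission does not state it.

```python
# Hypothetical hidden size; NOT taken from this submission.
d_model = 512

# Rough per-layer cost: ~4*d^2 for attention projections + ~8*d^2 for the MLP,
# ignoring biases and norm parameters.
per_layer = 12 * d_model * d_model

saved = (9 - 7) * per_layer  # parameters freed by going from 9 to 7 layers
```

Under a fixed wallclock cap, those freed parameters (and the two skipped forward passes per token) can be traded for more training steps.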
Sequence Length
train_length: 1024
eval_length: null
Compression
zlib
level: null
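The zlib stage is lossless, which is what makes the exact roundtrip check in the contributions below possible. A sketch, with `payload` standing in for the serialized post-quantization weights (the actual serialization format is not specified here):

```python
import zlib

# Placeholder for the serialized post-quantization weight bytes.
payload = bytes(range(256)) * 64

compressed = zlib.compress(payload)   # default level, consistent with "level: null"
restored = zlib.decompress(compressed)

assert restored == payload            # exact byte-for-byte roundtrip
```

Since quantization is the only lossy step, validating bpb on the decompressed, dequantized weights measures exactly what the shipped artifact achieves.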

Novel Contributions

  • Non-record 16MB submission documenting a shallower 7-layer variant.
  • Demonstrates that reducing depth can improve the capacity-speed tradeoff under a 600-second wallclock cap.
  • Uses tied embeddings and 4 KV heads in a compact Transformer configuration.
  • Reports a self-contained run with exact post-quantization roundtrip validation metrics.