val_bpb: 1.6660
Architecture: Transformer
Optimizer: —
Artifact Size: 10.94 MB
Training Techniques

Architecture
- Tied embeddings: the input and output embedding matrices share weights (parameters: none).
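A minimal NumPy sketch of what weight tying means here: one matrix serves as both the input embedding table and the output (logit) projection. The vocabulary size and random initialization are illustrative assumptions; only `model_dim = 512` comes from this submission.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, model_dim = 32, 512  # vocab size is illustrative; model_dim matches the config

# One shared matrix: input embedding table AND output projection.
E = rng.normal(scale=0.02, size=(vocab, model_dim))

tokens = np.array([3, 7, 1])
h = E[tokens]         # input embedding lookup: (3, model_dim)
logits = h @ E.T      # output projection reuses the same weights: (3, vocab)
```

Tying removes the separate `vocab x model_dim` output matrix, which matters for a sub-16 MB artifact.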
- KV head count: KV-thin attention with fewer key/value heads than query heads (parameters: num_heads: 8, num_kv_heads: 2).
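A grouped-query-style sketch of the 8-query / 2-KV head split, assuming each group of 4 query heads shares one KV head (the repeat-based sharing, head dim, and sequence length are illustrative assumptions):

```python
import numpy as np

num_heads, num_kv_heads, head_dim, seq = 8, 2, 64, 16
group = num_heads // num_kv_heads  # 4 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.normal(size=(num_heads, seq, head_dim))
k = rng.normal(size=(num_kv_heads, seq, head_dim))  # only 2 KV heads stored
v = rng.normal(size=(num_kv_heads, seq, head_dim))

# Expand K/V so every query head in a group attends to its shared KV head.
k_full = np.repeat(k, group, axis=0)  # (8, seq, head_dim)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_full  # (8, seq, head_dim)
```

Storing 2 KV heads instead of 8 shrinks the K/V projection weights to a quarter of their full-attention size, another contribution to staying under the artifact cap.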
- Depth/width: shallower, compact Transformer configuration for local GPU training (parameters: layers: 7, model_dim: 512).
Quantization
- int8 (bits: 8, scope: all)

Compression
- zlib (level: null)
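A sketch of the int8 + zlib artifact pipeline on one weight tensor. Per-tensor absmax scaling is an assumption; the submission may use a different quantization scheme, and `level: null` is read here as zlib's default compression level.

```python
import numpy as np
import zlib

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(512, 512)).astype(np.float32)  # illustrative tensor

# Quantize: per-tensor absmax scale maps weights into the int8 range.
scale = float(np.abs(w).max()) / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Compress the int8 bytes with zlib at the default level.
blob = zlib.compress(q.tobytes())

# Reconstruction for evaluation: decompress, then dequantize.
w_hat = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(w.shape) * scale
max_err = float(np.abs(w - w_hat).max())
```

The artifact check is then just a byte count, e.g. `len(blob) < 16_000_000` summed over all tensors plus the scales.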
Sequence Length
- train_length: 1024, eval_length: 1024
LR Schedule
- Warmup (parameters: warmup_steps: 4)
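A sketch of a linear warmup matching `warmup_steps = 4`. The base learning rate and the flat schedule after warmup are illustrative assumptions; only the warmup step count comes from the submission.

```python
def lr_at(step: int, base_lr: float = 3e-4, warmup_steps: int = 4) -> float:
    """Linear warmup to base_lr over warmup_steps, then hold (assumed) flat."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

schedule = [lr_at(s) for s in range(6)]
```

With only 500 iterations total, a 4-step warmup is under 1% of the run.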
Other
- Local, non-record submission trained for 500 iterations on a single RTX 4070 Laptop GPU, under the 16 MB artifact cap (parameters: artifact_cap_bytes: 16000000, iterations: 500).
Novel Contributions
- Non-record local workstation run on a single RTX 4070 Laptop GPU
- Shallower 7-layer, 512-dim Transformer with KV-thin attention (8 query heads, 2 KV heads)
- Tied input/output embeddings
- Evaluation on the full published validation split while training on only the first published training shard
- Compact int8+zlib artifact under the 16MB cap