val_bpb: 1.6572
Architecture: Transformer
Optimizer: —
Artifact Size: 10296829 bytes (≈9.8 MiB)

Training Techniques
Architecture: tied embeddings
Input and output embeddings are tied.
parameters: null
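A minimal PyTorch sketch of what the weight tying looks like; the module names and vocabulary size are illustrative, not the submission's (model_dim matches the 512 reported below):

```python
import torch.nn as nn

vocab_size, model_dim = 50257, 512  # vocab size is a placeholder

embed = nn.Embedding(vocab_size, model_dim)             # input embedding
lm_head = nn.Linear(model_dim, vocab_size, bias=False)  # output projection
lm_head.weight = embed.weight                           # tie: one shared (vocab, dim) matrix

# Both modules now point at the same storage, so the matrix is stored
# (and compressed into the artifact) only once.
assert lm_head.weight.data_ptr() == embed.weight.data_ptr()
```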
KV head count
A grouped-query ("KV4") attention configuration: 4 key/value heads are shared across 8 query heads, so each KV head serves two query heads.
parameters: {"layers": 7, "model_dim": 512, "num_heads": 8, "num_kv_heads": 4}
Sequence Length
A fixed 1024-token context is used for both training and evaluation.
parameters: {"train_length": 1024, "eval_length": 1024}
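One common way to realize a fixed training length is to chop the token stream into 1024-token blocks. The sketch below assumes that convention; the submission's actual batching scheme is not specified:

```python
import numpy as np

train_length = 1024  # from the parameters above

def make_blocks(tokens: np.ndarray, length: int = train_length) -> np.ndarray:
    """Chop a flat token stream into fixed-length blocks, dropping the ragged tail."""
    n = (len(tokens) // length) * length
    return tokens[:n].reshape(-1, length)

blocks = make_blocks(np.arange(5000))
print(blocks.shape)  # (4, 1024)
```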
LR Schedule: warmup
The learning rate is warmed up over the first 4 steps.
parameters: {"warmup_steps": 4}
Compression: zlib
The artifact is zlib-compressed; the compression level is unspecified (null), presumably the library default.
parameters: {"level": null}
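A sketch of producing a zlib-compressed artifact and checking it against the 16 MB cap mentioned in the contributions; the serialization format is an assumption, and with no level given, zlib falls back to its default:

```python
import io
import zlib
import torch

model = torch.nn.Linear(512, 512)    # stand-in for the real network
buf = io.BytesIO()
torch.save(model.state_dict(), buf)  # serialization format is illustrative

# "level": null -> no level passed, so zlib uses Z_DEFAULT_COMPRESSION (-1).
artifact = zlib.compress(buf.getvalue())

cap = 16 * 1024 * 1024               # 16 MB artifact cap
print(f"{len(artifact)} bytes, under cap: {len(artifact) <= cap}")
```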
Novel Contributions
- A non-record submission trained locally on a consumer GPU, with the artifact under the 16 MB cap
- A shallower 7-layer, 512-dim KV4 configuration found through a local search loop
- Evaluation on the full published validation split
- Tied embeddings with separate learning rates for the shared embedding matrix and the remaining weight matrices (see the sketch after this list)
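The separate learning rates can be expressed as optimizer parameter groups. Only the split itself is documented; the optimizer choice, LR values, and module layout below are placeholders:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab=50257, dim=512, layers=7):  # dim/layers from the parameters above
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(layers))
        self.lm_head = nn.Linear(dim, vocab, bias=False)
        self.lm_head.weight = self.embed.weight          # tied embedding

model = TinyLM()

# One group for the single tied matrix, one for the other weight matrices.
# AdamW and both LR values are hypothetical.
opt = torch.optim.AdamW([
    {"params": [model.embed.weight], "lr": 1e-3},
    {"params": model.blocks.parameters(), "lr": 3e-4},
])
```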