| Field | Value |
| --- | --- |
| val_bpb | 1.3797 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 10,289,996 bytes (≈9.8 MiB) |
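For context, bits per byte (bpb) converts the model's mean cross-entropy loss into bits of code length per byte of raw evaluation text. A minimal sketch of that conversion, with hypothetical token and byte counts (not the submission's actual numbers):

```python
import math

def bits_per_byte(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) into bits per raw byte of text."""
    total_bits = loss_nats_per_token * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Hypothetical evaluation: 1M tokens covering 4.3M bytes at 4.1 nats/token.
print(bits_per_byte(4.1, 1_000_000, 4_300_000))
```

A lower bpb means the model assigns the validation bytes higher probability, i.e. compresses them better.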
Training Techniques
Architecture
tied embeddings
Ties the input embedding and the output projection so both share a single weight matrix, removing one vocab-sized parameter tensor from the artifact.
parameters: {"enabled":1}
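A minimal NumPy sketch of weight tying: one matrix embeds input token ids and, transposed, projects hidden states back to vocabulary logits. Names and shapes are illustrative, not the submission's code:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 256, 64
W_emb = rng.standard_normal((vocab, d_model)) * 0.02  # the single shared matrix

def embed(token_ids):
    # Look up input embeddings: (seq,) -> (seq, d_model)
    return W_emb[token_ids]

def logits(hidden):
    # Output head reuses the embedding weights, transposed: (seq, d_model) -> (seq, vocab)
    return hidden @ W_emb.T

h = embed(np.array([1, 2, 3]))
print(logits(h).shape)  # (3, 256)
```

Only `W_emb` is stored, which is why tying shrinks the serialized artifact.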
KV head count
Uses fewer key/value heads than query heads (grouped-query attention), shrinking the KV projections and cache.
parameters: {"num_heads":8,"num_kv_heads":4}
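With 8 query heads and 4 KV heads, each K/V head is shared by a group of 2 query heads. A sketch of the head-sharing step, with illustrative sequence length and head dimension:

```python
import numpy as np

num_heads, num_kv_heads, seq, head_dim = 8, 4, 16, 32
group = num_heads // num_kv_heads  # 2 query heads per KV head

q = np.zeros((num_heads, seq, head_dim))
k = np.zeros((num_kv_heads, seq, head_dim))

# Expand KV heads so each group of query heads attends to the same K.
k_expanded = np.repeat(k, group, axis=0)    # (8, seq, head_dim)
scores = q @ k_expanded.transpose(0, 2, 1)  # (8, seq, seq)
print(scores.shape)  # (8, 16, 16)
```

Only 4 K/V head tensors are ever stored or cached; the expansion is a cheap view-like repeat at attention time.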
depth reduction
Reduces model depth from the 9-layer baseline to 7 layers, improving the capacity-speed tradeoff under a strict wallclock cap.
parameters: {"layers":7}
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Compression
zlib
level: null
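The artifact is zlib-compressed; with `level: null`, the library default (`Z_DEFAULT_COMPRESSION`) presumably applies. A roundtrip sketch using a stand-in payload rather than the actual model artifact:

```python
import zlib

payload = bytes(range(256)) * 1000  # stand-in for the serialized model artifact

compressed = zlib.compress(payload)  # level omitted -> zlib's default compression level
restored = zlib.decompress(compressed)

assert restored == payload  # decompression is exact, byte for byte
print(len(payload), len(compressed))
```

Because zlib is lossless, any size saving comes purely from redundancy in the serialized weights; the decompressed artifact is bit-identical.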
Novel Contributions
- Non-record submission in the 16 MB class documenting a shallower 7-layer variant.
- Demonstrates that reducing depth can improve the capacity-speed tradeoff under a 600-second wallclock cap.
- Uses tied embeddings and 4 KV heads in a compact Transformer configuration.
- Reports a self-contained run with exact post-quantization roundtrip validation metrics.
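The exact-roundtrip claim can be verified with a serialize/deserialize bit-equality check. A sketch assuming symmetric int8 quantization (the submission's actual quantization scheme is not specified here):

```python
import numpy as np

rng = np.random.default_rng(42)
weights = rng.standard_normal(1024).astype(np.float32)

# Hypothetical symmetric int8 quantization of the weights.
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Serialize, read back, and confirm the stored artifact roundtrips bit-exactly.
blob = q.tobytes()
q2 = np.frombuffer(blob, dtype=np.int8)
assert np.array_equal(q, q2)  # exact post-quantization roundtrip
print(len(blob))  # 1024 bytes for 1024 int8 weights
```

Note the guarantee is on the quantized representation: the int8 values survive storage exactly, even though quantization itself is lossy relative to the float32 originals.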