PR #125

open

Add non-record 16MB layers7 submission

by akshai0296
val_bpb: 1.3797
Architecture: Transformer
Optimizer:
Artifact Size: 10289996 bytes

Training Techniques

Architecture
tied embeddings
Uses tied input/output embeddings.
parameters: {"enabled":1}
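A minimal sketch of what weight tying means here: one matrix serves both as the input embedding table and (transposed) as the output projection. All sizes below are illustrative placeholders, not values from this submission.

```python
import numpy as np

# Illustrative sizes only; the actual model dimensions are not stated here.
vocab_size, d_model = 100, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((vocab_size, d_model))  # single shared weight matrix

def embed(token_ids):
    # Input embedding: row lookup into the shared matrix.
    return W[token_ids]

def logits(hidden):
    # Output projection: multiply by the transpose of the same matrix,
    # so no separate unembedding parameters are stored.
    return hidden @ W.T

h = embed(np.array([3, 7]))  # shape (2, d_model)
out = logits(h)              # shape (2, vocab_size)
```

Tying halves the embedding-related parameter count, which matters directly for the artifact-size metric above.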
KV head count
Uses fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
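With 8 attention heads and 4 KV heads, each K/V head is shared by 2 query heads (grouped-query attention). A hedged numpy sketch of the mechanism, with made-up head and sequence dimensions:

```python
import numpy as np

# Matches the parameters above: 8 query heads, 4 KV heads.
# head_dim and seq are arbitrary illustrative values.
num_heads, num_kv_heads, head_dim, seq = 8, 4, 8, 5
group = num_heads // num_kv_heads  # 2 query heads per KV head
rng = np.random.default_rng(1)
q = rng.standard_normal((num_heads, seq, head_dim))
k = rng.standard_normal((num_kv_heads, seq, head_dim))
v = rng.standard_normal((num_kv_heads, seq, head_dim))

# Repeat each KV head so every query head has a matching K/V.
k_rep = np.repeat(k, group, axis=0)
v_rep = np.repeat(v, group, axis=0)

scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
out = weights @ v_rep                            # (num_heads, seq, head_dim)
```

Halving the KV heads halves the K/V projection weights (and any KV cache), another lever on artifact size.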
depth reduction
Reduces model depth from the 9-layer baseline to 7 layers to improve the capacity-speed tradeoff under a strict wallclock cap.
parameters: {"layers":7}
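A back-of-envelope view of what dropping two layers frees up, using the common ~12·d² parameters-per-block estimate. The hidden size is an assumption for illustration; the submission does not state it.

```python
# Hypothetical hidden size; NOT taken from this submission.
d_model = 512

# Rough per-layer cost: ~4*d^2 for attention projections + ~8*d^2 for the MLP,
# ignoring biases and norm parameters.
per_layer = 12 * d_model * d_model

saved = (9 - 7) * per_layer  # parameters freed by going from 9 to 7 layers
```

Under a fixed wallclock cap, those freed parameters (and the two skipped forward passes per token) can be traded for more training steps.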
Sequence Length
train_length: 1024
eval_length: null
Compression
zlib
level: null
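The zlib stage is lossless, which is what makes the exact roundtrip check in the contributions below possible. A sketch, with `payload` standing in for the serialized post-quantization weights (the actual serialization format is not specified here):

```python
import zlib

# Placeholder for the serialized post-quantization weight bytes.
payload = bytes(range(256)) * 64

compressed = zlib.compress(payload)   # default level, consistent with "level: null"
restored = zlib.decompress(compressed)

assert restored == payload            # exact byte-for-byte roundtrip
```

Since quantization is the only lossy step, validating bpb on the decompressed, dequantized weights measures exactly what the shipped artifact achieves.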

Novel Contributions

  • Non-record 16MB submission documenting a shallower 7-layer variant.
  • Demonstrates that reducing depth can improve the capacity-speed tradeoff under a 600-second wallclock cap.
  • Uses tied embeddings and 4 KV heads in a compact Transformer configuration.
  • Reports a self-contained run with exact post-quantization roundtrip validation metrics.