PR #34

closed

[Partial submission] naive baseline + dispersion loss

by ChenLiu-1996
val_bpb
1.2244
Architecture
Transformer
Optimizer
Artifact Size
15,863,489 bytes

Training Techniques

Architecture
tied embeddings
Input and output embeddings are tied.
parameters: null
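Weight tying means a single matrix serves both as the input embedding table and the output projection, roughly halving embedding parameter count. A minimal numpy sketch (matrix shapes and names here are illustrative, not from the submission's code):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 16, 8

# One matrix serves both roles: row lookup on the way in,
# inner products against the same rows on the way out.
W = rng.standard_normal((vocab_size, d_model))

tokens = np.array([3, 7, 2])
h = W[tokens]      # input embedding lookup, (T, d_model); a real model runs blocks here
logits = h @ W.T   # output projection reusing the tied matrix, (T, vocab_size)
```

Under tying, the logit for token `t` at position `i` is the inner product of the hidden state with embedding row `t`, so no separate `lm_head` weights are stored.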
KV head count
Uses fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
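With `num_heads=8` and `num_kv_heads=4`, each key/value head is shared by a group of 2 query heads (grouped-query attention), shrinking the KV projections and cache. A sketch of the sharing pattern, assuming the standard repeat-KV formulation (the submission's exact attention code is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
T, num_heads, num_kv_heads, head_dim = 5, 8, 4, 4
group = num_heads // num_kv_heads  # 2 query heads per KV head

q = rng.standard_normal((num_heads, T, head_dim))
k = rng.standard_normal((num_kv_heads, T, head_dim))
v = rng.standard_normal((num_kv_heads, T, head_dim))

# Repeat each KV head for its group of query heads, then run standard attention.
k_full = np.repeat(k, group, axis=0)  # (num_heads, T, head_dim)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_full  # (num_heads, T, head_dim)
```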
Regularization
dispersion loss
parameters: null
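The PR does not spell out its dispersion loss. One common formulation penalizes mean pairwise cosine similarity between embedding rows, pushing representations apart on the unit sphere; the sketch below is that assumed form, not the submission's actual code:

```python
import numpy as np

def dispersion_loss(E):
    """Mean off-diagonal pairwise cosine similarity of the rows of E.
    Minimizing this term spreads the rows apart on the unit sphere.
    (Assumed formulation; the PR does not define its dispersion loss.)"""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = En @ En.T                       # pairwise cosine similarities
    n = E.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]
    return off_diag.mean()
```

This term would be added to the cross-entropy objective with a small weight; orthogonal rows give a loss of 0, identical rows a loss of 1.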
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
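The int8 tensors are then byte-serialized and zlib-compressed to produce the submitted artifact. A sketch of the round trip (the `level=9` choice is an assumption; the PR reports `level: null`):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
q = rng.integers(-127, 128, size=100_000, dtype=np.int8)  # stand-in for quantized weights

raw = q.tobytes()
packed = zlib.compress(raw, level=9)  # level is an assumption; the PR leaves it unspecified
restored = np.frombuffer(zlib.decompress(packed), dtype=np.int8)
```

How much zlib actually saves depends on the entropy of the quantized weights; the lossless round trip is what matters for the artifact.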
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Training ran under a 10-minute wallclock cap on 8xH100 GPUs, with validation on the full validation split every 200 steps.
parameters: {"max_wallclock_seconds":600,"num_gpus":8,"val_every_steps":200}
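The wallclock-capped loop with periodic validation can be sketched as below; the function names and callback structure are hypothetical, only the `max_wallclock_seconds=600` and `val_every_steps=200` values come from the submission:

```python
import time

def train(step_fn, validate_fn, max_wallclock_seconds=600, val_every_steps=200):
    """Run training steps until the wallclock budget is spent,
    validating every val_every_steps steps. Sketch only; the real
    loop would also handle checkpointing and distributed setup."""
    start, step = time.monotonic(), 0
    while time.monotonic() - start < max_wallclock_seconds:
        step_fn(step)
        step += 1
        if step % val_every_steps == 0:
            validate_fn(step)
    return step
```

Checking the clock every step (rather than every epoch) is what makes a hard 600-second budget enforceable.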

Novel Contributions

  • Simple baseline with dispersion loss
  • Tied input/output embeddings
  • Reduced KV head count
  • Int8 quantized submission with zlib compression
  • Training under a 10-minute wallclock cap on 8xH100