PR #554 (open)

Non-record: 11L + XSA4 H100 frontier (1.4612 BPB legal)

by chrisnkuno

val_bpb: 1.4612
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: 15,983,603 bytes

Training Techniques

Architecture
  • XSA: cross-layer self-attention, applied only in the final 4 layers (parameters: {"layers": 4})
  • Tied embeddings: input and output embeddings are tied
  • MLP activation: ReLU^2 (squared ReLU) in MLP layers
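For reference, the ReLU^2 activation listed above can be sketched in a few lines of plain Python (elementwise on a list here; the actual model would of course apply it to tensors inside each MLP):

```python
def relu_squared(x):
    """ReLU^2 activation: square of max(0, x), applied elementwise.

    Compared with plain ReLU, the squared variant has a continuous
    derivative at zero and grows faster for large positive inputs.
    """
    return [max(0.0, v) ** 2 for v in x]

# Negative inputs are zeroed, positive inputs are squared.
print(relu_squared([-2.0, -0.5, 0.0, 1.0, 3.0]))  # [0.0, 0.0, 0.0, 1.0, 9.0]
```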
Sequence Length
  • train_length: 256, eval_length: 256
LR Schedule
  • warmdown (parameters: {"warmdown_iters": 200})
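The PR only reports warmdown_iters=200, not the decay shape. A common warmdown form (assumed here, not confirmed by the submission) holds the learning rate constant and then decays it linearly to zero over the final warmdown_iters steps:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=200):
    """Constant LR, then linear decay to zero over the last `warmdown_iters` steps.

    Hypothetical reconstruction: the submission only reports
    {"warmdown_iters": 200}, so the exact shape is an assumption.
    """
    steps_left = total_steps - step
    if steps_left >= warmdown_iters:
        return base_lr
    return base_lr * steps_left / warmdown_iters

# Example with a 1000-step run: full LR until step 800, then linear decay.
print(warmdown_lr(500, 1000, 0.01))  # 0.01
print(warmdown_lr(900, 1000, 0.01))  # 0.005
```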
Compression
  • custom packed_zstd (level: null)
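The packed_zstd format is custom and not specified in the PR; the general pattern is to serialize parameters into one contiguous binary buffer and compress that buffer. A rough sketch of the idea, using struct packing with stdlib zlib as a stand-in for zstd (the real submission's layout and codec settings are unknown):

```python
import struct
import zlib  # stand-in for zstd: same pack-then-compress pattern


def pack_and_compress(values, level=9):
    """Pack float32 values into a little-endian buffer, then compress it."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return zlib.compress(raw, level)


def decompress_and_unpack(blob):
    """Inverse: decompress, then unpack back into a list of floats."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# Round trip: values exactly representable in float32 survive unchanged.
weights = [0.0, 1.5, -2.25, 3.0]
assert decompress_and_unpack(pack_and_compress(weights)) == weights
```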
Evaluation
  • stride-based eval (parameters: {"stride": 256})
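Stride-based eval with stride 256 presumably means scoring the validation stream in fixed windows of 256 tokens. A sketch of how such windowed scoring converts to BPB (the `nll_fn` hook and the nats-to-bits conversion are illustrative assumptions, not the PR's code):

```python
import math


def stride_eval_bpb(nll_fn, token_ids, num_bytes, stride=256):
    """Score non-overlapping windows of `stride` tokens and report bits per byte.

    `nll_fn(window)` is a hypothetical hook returning the model's total
    negative log-likelihood (in nats) for one window.
    """
    total_nll_nats = 0.0
    for start in range(0, len(token_ids), stride):
        total_nll_nats += nll_fn(token_ids[start:start + stride])
    return total_nll_nats / math.log(2) / num_bytes


# Toy model charging ln(2) nats per token, i.e. exactly 1 bit per token:
bpb = stride_eval_bpb(lambda w: len(w) * math.log(2), list(range(512)), 512)
print(bpb)  # 1.0
```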
Other
  • Checkpoint frontier saving every 25 steps (parameters: {"checkpoint_interval_steps": 25})
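"Checkpoint frontier saving every 25 steps" is read here as: evaluate on a 25-step cadence and keep only checkpoints that improve the best validation BPB so far. A hypothetical sketch of that policy (class name and improvement criterion are assumptions, not the PR's code):

```python
class FrontierSaver:
    """Keep only checkpoints on the improving val-BPB frontier.

    Hypothetical reconstruction of "checkpoint frontier saving every
    25 steps"; the actual policy in the PR may differ.
    """

    def __init__(self, interval=25):
        self.interval = interval
        self.best_bpb = float("inf")
        self.frontier = []  # (step, val_bpb) pairs that improved the best

    def maybe_save(self, step, val_bpb):
        # Only consider steps on the checkpoint cadence.
        if step == 0 or step % self.interval != 0:
            return False
        # Save only when validation BPB improves on the best seen so far.
        if val_bpb < self.best_bpb:
            self.best_bpb = val_bpb
            self.frontier.append((step, val_bpb))
            return True
        return False


saver = FrontierSaver()
print(saver.maybe_save(25, 1.60))  # True  (first frontier point)
print(saver.maybe_save(30, 1.40))  # False (off-cadence step)
print(saver.maybe_save(50, 1.65))  # False (worse than best)
print(saver.maybe_save(75, 1.55))  # True  (new best)
```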

Novel Contributions

  • Use of XSA (cross-layer self-attention) only in the final 4 layers of an 11-layer decoder-only transformer
  • ReLU^2 activation in MLP layers
  • Training on a single H100 80GB GPU, with a 16 MB artifact submission
  • Custom packed serialization with packed_zstd compression
  • Checkpoint frontier saving every 25 steps
  • Demonstration that artifact size, rather than raw BPB, is the main bottleneck