| Field | Value |
|---|---|
| val_bpb | 1.4612 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 15,983,603 bytes (≈16 MB) |
Training Techniques
Architecture
- **XSA**: cross-layer self-attention, applied only in the final 4 layers (parameters: `{"layers": 4}`)
- **Tied embeddings**: input and output embeddings are tied
- **MLP activation**: squared ReLU (ReLU²) in the MLP layers
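Squared ReLU is simple to state. A minimal framework-agnostic sketch (plain Python rather than the submission's actual training code):

```python
def relu_squared(x: float) -> float:
    """Squared ReLU: max(x, 0) ** 2, applied elementwise in the MLP.

    Zero for negative inputs like plain ReLU, but grows quadratically
    for positive inputs and has a continuous first derivative at 0.
    """
    return max(x, 0.0) ** 2


# Inside an MLP block this replaces the usual ReLU/GELU:
#   h = relu_squared(W1 @ x + b1);  y = W2 @ h + b2
```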
Sequence Length
- train_length: 256; eval_length: 256
LR Schedule
- **warmdown** (parameters: `{"warmdown_iters": 200}`)
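The card gives only `warmdown_iters: 200`; a common reading of "warmdown" is a constant learning rate followed by a linear decay to zero over the final iterations. A sketch under that assumption (the exact decay shape is not specified here):

```python
def warmdown_lr(step: int, total_steps: int, warmdown_iters: int = 200,
                base_lr: float = 1.0) -> float:
    """Constant LR, then linear decay to 0 over the last `warmdown_iters` steps."""
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return base_lr
    # Fraction of the warmdown remaining: 1.0 at decay_start, 0.0 at the end.
    remaining = (total_steps - step) / warmdown_iters
    return base_lr * remaining
```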
Compression
- custom **packed_zstd** (level: unspecified)
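The packed_zstd layout itself is not documented in the card. The general idea of packed serialization, pack weights into a compact fixed-width binary form and then compress the byte stream, can be sketched as follows; zlib stands in for zstd here (zstd is not in the Python standard library), and the float16 packing is an illustrative assumption:

```python
import struct
import zlib


def pack_params(params: list[float]) -> bytes:
    """Pack parameters as little-endian float16, then compress the bytes.

    Illustrative only: zlib substitutes for zstd, and the real
    submission's packed format is not specified in the card.
    """
    raw = struct.pack(f"<{len(params)}e", *params)  # "e" = IEEE 754 half-float
    return zlib.compress(raw, level=9)


def unpack_params(blob: bytes) -> list[float]:
    """Invert pack_params: decompress, then unpack float16 values."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 2}e", raw))
```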
Evaluation
- **stride-based eval** (parameters: `{"stride": 256}`)
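With the stride equal to the eval length (both 256 here), evaluation windows tile the token stream without overlap, so each token's loss is counted exactly once. A sketch of the window generation (the handling of a short tail, dropped below, is an assumption):

```python
def eval_windows(n_tokens: int, length: int = 256, stride: int = 256):
    """Yield (start, end) spans for stride-based evaluation.

    With stride == length the windows are non-overlapping and cover the
    stream left to right; a final fragment shorter than `length` is
    dropped in this sketch.
    """
    start = 0
    while start + length <= n_tokens:
        yield (start, start + length)
        start += stride
```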
Other
- Checkpoint frontier saving every 25 steps (parameters: `{"checkpoint_interval_steps": 25}`)
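"Checkpoint frontier" is not defined in the card. One plausible reading: every 25 steps, keep a checkpoint only if it improves on the best validation loss seen so far, so the saved checkpoints form a monotonically improving frontier. A hypothetical sketch under that interpretation:

```python
def frontier_saves(val_losses: dict[int, float], interval: int = 25) -> list[int]:
    """Return the steps at which a checkpoint would be saved.

    Hypothetical policy: at each multiple of `interval`, save only when
    validation loss beats the best seen so far, so the saved set is an
    improving frontier rather than every periodic checkpoint.
    """
    best = float("inf")
    saved = []
    for step in sorted(val_losses):
        if step % interval != 0:
            continue
        if val_losses[step] < best:
            best = val_losses[step]
            saved.append(step)
    return saved
```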
Novel Contributions
- Cross-layer self-attention (XSA) applied only in the final 4 layers of an 11-layer decoder-only transformer
- Squared ReLU (ReLU²) activation in MLP layers
- Training on a single H100 80GB GPU, with a ~16 MB submitted artifact
- Custom packed serialization with packed_zstd compression
- Checkpoint frontier saving every 25 steps
- Demonstration that artifact size, rather than raw BPB, is the main bottleneck