PR #981
Non-record: Sliding Patch Attention + MoE (2-layer compact run)
by BurguerJohn
val_bpb
1.4893
Architecture
Transformer
Optimizer
—
Artifact Size
3938328 bytes (~3.9 MB)
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
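Weight tying reuses one matrix for both the input embedding lookup and the output logit projection. A minimal sketch (vocabulary and model sizes are hypothetical, not taken from this submission):

```python
import numpy as np

vocab, d = 50304, 64                   # hypothetical sizes for illustration
W = np.random.randn(vocab, d) * 0.02   # single shared embedding matrix

def embed(ids):
    return W[ids]                      # input side: row lookup

def unembed(h):
    return h @ W.T                     # output side: logits via the same matrix
```

Tying halves the embedding-related parameter count, which matters for a ~3.9 MB artifact.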
GQA
Uses grouped-query attention with fewer KV heads than query heads.
parameters: {"num_heads":8,"num_kv_heads":4}
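With 8 query heads sharing 4 KV heads, each KV head serves a group of 2 query heads. A minimal numpy sketch of that grouping (the real attention kernel in the PR may differ):

```python
import numpy as np

def gqa_attention(q, k, v, num_heads=8, num_kv_heads=4):
    """Grouped-query attention sketch: q is (num_heads, T, d),
    k and v are (num_kv_heads, T, d); head counts match the card."""
    group = num_heads // num_kv_heads        # query heads per KV head -> 2
    k = np.repeat(k, group, axis=0)          # broadcast KV heads up to 8
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)            # softmax over key positions
    return w @ v                             # (num_heads, T, d)
```

The KV cache shrinks by the group factor (2x here) at inference time.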
U-Net skip connections
Includes encoder/decoder-style skip connections in the experimental branch.
parameters: null
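The card only notes that encoder/decoder-style skips exist in the experimental branch; one common layout, sketched here as an assumption, saves each encoder activation and adds it back to the mirrored decoder block:

```python
def unet_stack(x, enc_blocks, dec_blocks):
    """Hypothetical U-Net-style skip layout: encoder activations are
    saved and re-added to decoder inputs in reverse order."""
    skips = []
    for f in enc_blocks:            # first half: run and stash activations
        x = f(x)
        skips.append(x)
    for f in dec_blocks:            # second half: add skips back, last-in first-out
        x = f(x + skips.pop())
    return x
```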
attention modifications
Experimental sliding-patch attention and router-path attention variants are present in the codebase.
parameters: null
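The PR does not define "sliding patch attention" precisely; one plausible reading, sketched purely as an assumption, is a causal mask that restricts each token to a local window (patch) of recent positions:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Hypothetical interpretation: token i attends only to the
    `window` most recent positions j with j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)   # boolean (seq_len, seq_len) mask
```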
MoE
Mixture-of-experts routing code paths are included, but the logged run reports moe_layers: 0/2, so MoE was inactive in the measured submission.
parameters: {"moe_layers":0,"total_layers":2}
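Since moe_layers is 0/2, the routing path was never exercised in the measured run; the sketch below shows a generic top-1 router of the kind such code paths implement, not the PR's actual implementation:

```python
import numpy as np

def moe_layer(x, experts_w, router_w):
    """Top-1 MoE routing sketch: a softmax router picks one expert
    per token and scales its output by the routing probability."""
    logits = x @ router_w                     # (T, num_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    choice = probs.argmax(-1)                 # chosen expert per token
    out = np.empty_like(x)
    for e, w in enumerate(experts_w):
        mask = choice == e                    # tokens routed to expert e
        out[mask] = (x[mask] @ w) * probs[mask, e:e+1]
    return out
```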
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Compression
zlib
level: null
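The compression level is not recorded on the card. A minimal sketch of the round trip, assuming the artifact is a zlib-compressed weight blob (the fp16 dtype and array size here are illustrative, not from the submission):

```python
import zlib
import numpy as np

weights = np.zeros(1000, dtype=np.float16)    # hypothetical quantized weights
raw = weights.tobytes()
packed = zlib.compress(raw)                   # level unspecified in the card
restored = np.frombuffer(zlib.decompress(packed), dtype=np.float16)
```

zlib is lossless, so the artifact decompresses to bit-identical weights.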
Novel Contributions
- Sliding patch attention in the experimental training script
- Mixture-of-experts/router code paths included in the branch
- Compact 2-layer non-record run on a single H100
- Tied-embedding compact baseline submission with post-quantization export