PR #5
[WIP] Sparse Attention + Recursive Weight Sharing for 16MB Efficiency (closed)
by albertorkive
val_bpb: 1.2244
Architecture: Transformer
Optimizer: —
Artifact Size: —
Training Techniques
Architecture
sliding window attention
Restricts each token's causal attention to a fixed-size local window, reducing the quadratic cost of full attention to roughly linear in sequence length.
parameters: {"window_size":null}
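The window size is left unspecified in the PR (window_size is null). As a rough illustration only, here is a minimal sketch of the attention pattern in PyTorch, assuming a hypothetical window of 256 tokens; the dense mask shows the pattern but a production kernel would have to skip out-of-window blocks to actually realize the cost savings.

```python
import torch
import torch.nn.functional as F

def sliding_window_causal_mask(seq_len: int, window_size: int) -> torch.Tensor:
    """Boolean mask: position i may attend to j iff i - window_size < j <= i."""
    idx = torch.arange(seq_len)
    rel = idx[:, None] - idx[None, :]          # i - j
    return (rel >= 0) & (rel < window_size)    # causal and within the local window

def sliding_window_attention(q, k, v, window_size: int = 256):
    # q, k, v: (batch, heads, seq_len, head_dim); window_size is a hypothetical default
    mask = sliding_window_causal_mask(q.size(-2), window_size).to(q.device)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```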
weight tying
Shares weights across logical layers via recursive weight sharing to increase effective depth without increasing stored parameters.
parameters: {"physical_layers":null,"logical_layers":null}
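Neither physical_layers nor logical_layers is specified in the PR. The sketch below shows the general idea under assumed counts of 4 stored blocks unrolled into 12 logical layers; the block factory and both counts are placeholders, not the PR's values.

```python
import torch.nn as nn

class RecursivelySharedStack(nn.Module):
    """Reuses a small set of stored blocks across a deeper logical stack."""

    def __init__(self, make_block, physical_layers: int = 4, logical_layers: int = 12):
        super().__init__()
        assert logical_layers % physical_layers == 0
        # Only `physical_layers` blocks are stored (and counted toward the artifact size).
        self.blocks = nn.ModuleList(make_block() for _ in range(physical_layers))
        self.logical_layers = logical_layers

    def forward(self, x):
        # Cycle through the stored blocks: effective depth = logical_layers,
        # while the parameter count stays at physical_layers worth of weights.
        for step in range(self.logical_layers):
            x = self.blocks[step % len(self.blocks)](x)
        return x
```

Under these assumed counts the effective depth triples while the stored parameter count, and hence the artifact size, is unchanged.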
Novel Contributions
- Sliding window attention for sparse causal attention
- Recursive weight sharing across logical layers
- Architecture skeleton optimized for 16MB artifact efficiency
- Hooks for torch.compile and quantization-ready layers (see the sketch after this list)
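The PR does not show how the compile hook or the quantization path is wired up. A hypothetical sketch of what such a hook and a 16MB artifact-budget check could look like, assuming int8 (1 byte per parameter) storage:

```python
import torch

def compile_if_available(model: torch.nn.Module) -> torch.nn.Module:
    # torch.compile exists in PyTorch >= 2.0; fall back to eager execution otherwise.
    return torch.compile(model) if hasattr(torch, "compile") else model

def artifact_size_bytes(model: torch.nn.Module, bytes_per_param: int = 1) -> int:
    # Assumes 1 byte per parameter (int8 quantization) when estimating the stored artifact.
    return sum(p.numel() for p in model.parameters()) * bytes_per_param

# Usage (model construction omitted):
# model = compile_if_available(model)
# assert artifact_size_bytes(model) <= 16 * 1024 * 1024, "over the 16MB budget"
```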