PR #5

closed

[WIP] Sparse Attention + Recursive Weight Sharing for 16MB Efficiency

by albertorkive
val_bpb: 1.2244
Architecture: Transformer
Optimizer:
Artifact Size:

Training Techniques

Architecture
sliding window attention
Restricts causal attention to a local window, reducing the quadratic cost of full attention.
parameters: {"window_size":null}
weight tying
Shares weights across logical layers via recursive weight sharing to increase effective depth without increasing stored parameters.
parameters: {"physical_layers":null,"logical_layers":null}

Novel Contributions

  • Sliding window attention for sparse causal attention
  • Recursive weight sharing across logical layers
  • Architecture skeleton optimized for 16MB artifact efficiency
  • Hooks for torch.compile and quantization-ready layers (see the sketch after this list)
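
A hedged end-to-end sketch of how these pieces might be wired together: a small stack of stored blocks looped to the logical depth, then wrapped in torch.compile. All dimensions and layer counts are placeholders not given in the PR, and the nn.Linear layers simply stand in for the quantization-ready layers mentioned above.

```python
import torch
import torch.nn as nn

# dim, layer counts, and batch/sequence sizes are illustrative placeholders.
dim, physical_layers, logical_layers = 256, 2, 8

blocks = nn.ModuleList([
    nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(),
                  nn.Linear(4 * dim, dim))
    for _ in range(physical_layers)
])

def forward(x: torch.Tensor) -> torch.Tensor:
    # Recursive weight sharing: the stored blocks are applied repeatedly.
    for _ in range(logical_layers // physical_layers):
        for block in blocks:
            x = x + block(x)  # residual connection around each shared block
    return x

compiled_forward = torch.compile(forward)  # torch.compile hook
out = compiled_forward(torch.randn(1, 16, dim))
```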