PR #5

closed

[WIP] Sparse Attention + Recursive Weight Sharing for 16MB Efficiency

by albertorkive
val_bpb: 1.2244
Architecture: Transformer
Optimizer:
Artifact Size:

Training Techniques

Architecture
sliding window attention
Restricts causal attention to a local window, reducing the quadratic cost of full attention.
parameters: {"window_size":null}
weight tying
Shares weights across logical layers via recursive weight sharing to increase effective depth without increasing stored parameters.
parameters: {"physical_layers":null,"logical_layers":null}

Novel Contributions

  • Sliding window attention for sparse causal attention
  • Recursive weight sharing across logical layers
  • Architecture skeleton optimized for 16MB artifact efficiency
  • Hooks for torch.compile and quantization-ready layers (see the sketch after this list)
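
A hedged end-to-end sketch of how these pieces might be wired together: a small stack of stored blocks looped to the logical depth, then wrapped in torch.compile. All dimensions and layer counts are placeholders not given in the PR, and the nn.Linear layers simply stand in for the quantization-ready layers mentioned above.

```python
import torch
import torch.nn as nn

# dim, layer counts, and batch/sequence sizes are illustrative placeholders.
dim, physical_layers, logical_layers = 256, 2, 8

blocks = nn.ModuleList([
    nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(),
                  nn.Linear(4 * dim, dim))
    for _ in range(physical_layers)
])

def forward(x: torch.Tensor) -> torch.Tensor:
    # Recursive weight sharing: the stored blocks are applied repeatedly.
    for _ in range(logical_layers // physical_layers):
        for block in blocks:
            x = x + block(x)  # residual connection around each shared block
    return x

compiled_forward = torch.compile(forward)  # torch.compile hook
out = compiled_forward(torch.randn(1, 16, dim))
```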