val_bpb: 1.2252
Architecture: Hybrid
Optimizer: Muon
Artifact Size: ~15.86 MB
Training Techniques
Architecture
token-shift mixing
Replaced most attention layers with RWKV-inspired local token-shift mixing that blends the current token with the previous token using learned per-dimension interpolation weights.
parameters: {"layers":8}
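A minimal numpy sketch of the token-shift mixer described above: each position is a per-dimension interpolation between its own embedding and the previous token's. The shapes and the zero-padding of the first position are illustrative assumptions.

```python
import numpy as np

def token_shift(x, mu):
    """RWKV-style token shift: blend each token with its predecessor.

    x:  (T, D) sequence of token embeddings
    mu: (D,) learned per-dimension interpolation weights in [0, 1]
    """
    prev = np.zeros_like(x)
    prev[1:] = x[:-1]                # shift right by one; the first token sees zeros
    return (1.0 - mu) * x + mu * prev

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
mu = np.full(8, 0.25)                # illustrative weights; in training these are learned
y = token_shift(x, mu)
```

Because the mix is per-dimension, each channel can learn its own balance between "current" and "previous" information.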
attention window
Used short-window (128-token) quadratic attention in three layers, with the final attention layer retaining full context.
parameters: {"layers":3,"window":128}
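The windowed attention above can be sketched as ordinary quadratic attention with a band-shaped causal mask. Single-head, unbatched, and without the full-context final layer; the window size below is small only to keep the example readable.

```python
import numpy as np

def windowed_causal_attention(q, k, v, window=128):
    """Quadratic attention where position i attends only to positions
    (i - window, i], i.e. itself and the previous window-1 tokens."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    mask = (j <= i) & (j > i - window)           # causal band of width `window`
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(7)
q = rng.standard_normal((10, 8))
k = rng.standard_normal((10, 8))
v = rng.standard_normal((10, 8))
o = windowed_causal_attention(q, k, v, window=4)
```

Position 0 can only attend to itself, so its output is exactly `v[0]`.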
GQA
Used grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
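A sketch of the KV-head sharing in grouped query attention: with 8 query heads and 4 KV heads, each KV head is broadcast to a group of 2 query heads. Causal masking is omitted here for brevity.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (H, T, d); k, v: (Hkv, T, d) with H a multiple of Hkv."""
    H, T, d = q.shape
    Hkv = k.shape[0]
    group = H // Hkv
    k = np.repeat(k, group, axis=0)     # each KV head serves `group` query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax over keys (unmasked, for brevity)
    return w @ v

rng = np.random.default_rng(1)
q = rng.standard_normal((8, 4, 16))
k = rng.standard_normal((4, 4, 16))
v = rng.standard_normal((4, 4, 16))
out = grouped_query_attention(q, k, v)
```

Halving the KV heads halves the KV cache at inference time while keeping all 8 query heads.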
MLP3x
Used a 3x expansion MLP.
parameters: {"expansion":3}
LeakyReLU
Used LeakyReLU squared activation in the MLP instead of SwiGLU.
parameters: {"slope":0.5}
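The two MLP entries above can be combined into one sketch: a 3x-expansion MLP whose activation is LeakyReLU squared (slope 0.5) rather than SwiGLU. Whether the negative branch keeps its sign after squaring is not specified; this version squares both branches, which is one plausible reading.

```python
import numpy as np

D = 16
rng = np.random.default_rng(9)
W1 = rng.standard_normal((D, 3 * D)) * 0.1   # 3x expansion
W2 = rng.standard_normal((3 * D, D)) * 0.1

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU with slope 0.5, then squared (sign handling is an assumption).
    y = np.where(x > 0, x, slope * x)
    return y * y

def mlp3x(x):
    """3x-expansion MLP with LeakyReLU-squared activation (no gating,
    unlike SwiGLU, so it needs only two weight matrices)."""
    return leaky_relu_sq(x @ W1) @ W2

out = mlp3x(rng.standard_normal((5, D)))
a = leaky_relu_sq(np.array([-2.0, 0.0, 3.0]))
```

Dropping SwiGLU's gate saves a third projection matrix, which matters for a size-constrained artifact.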
BigramHash
Added a hashed bigram embedding to capture local token-pair context.
parameters: null
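A sketch of a hashed bigram embedding: each (previous, current) token pair is hashed into a fixed bucket table and the bucket's vector is added to the token embedding. The bucket count, table init, and multiplicative hash below are illustrative choices, not the submission's exact scheme.

```python
import numpy as np

VOCAB, BUCKETS, D = 1000, 4096, 32
rng = np.random.default_rng(2)
bigram_table = rng.standard_normal((BUCKETS, D)) * 0.02   # learned in training

def bigram_hash_embed(tokens):
    """Hash each (prev, cur) token pair into a bucket and look up its vector."""
    t = np.asarray(tokens)
    prev = np.concatenate(([0], t[:-1]))        # pad position 0 with token id 0
    idx = (prev * 1000003 + t) % BUCKETS        # cheap multiplicative hash (illustrative)
    return bigram_table[idx]                    # (T, D), added to the token embeddings

emb = bigram_hash_embed([5, 17, 17, 42])
same = bigram_hash_embed([7, 9, 7, 9])          # positions 1 and 3 share the pair (7, 9)
```

Hashing keeps the table size fixed regardless of vocabulary size, at the cost of occasional bucket collisions between unrelated pairs.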
SmearGate
Applied a learned gate after embedding normalization to blend each token with the previous token before the first layer.
parameters: null
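A sketch of the smear gate: after embedding normalization, a learned per-dimension gate mixes in the previous token before the first layer. The sigmoid-of-a-learned-vector parameterization and the additive blend are assumptions; the entry does not specify the exact form.

```python
import numpy as np

def smear_gate(x, w):
    """Additively 'smear' the previous token into each position,
    gated per dimension by sigmoid(w)."""
    g = 1.0 / (1.0 + np.exp(-w))       # (D,) gate values in (0, 1)
    prev = np.zeros_like(x)
    prev[1:] = x[:-1]                  # first position has no predecessor
    return x + g * prev

rng = np.random.default_rng(3)
x = rng.standard_normal((6, 8))
y = smear_gate(x, np.zeros(8))         # zero logits -> gate of 0.5 everywhere
```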
Value Residual
Added a shared value embedding projected into KV-head space and applied with a learned per-layer scale.
parameters: {"layers":[9,10]}
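One way to read the value-residual entry, sketched below: a shared per-token value embedding is projected into the KV-head shape and added to each listed layer's value heads with a learned scalar. The projection shape and the scalar-per-layer form are assumptions.

```python
import numpy as np

D, HKV, DH = 32, 4, 8
rng = np.random.default_rng(8)
W_proj = rng.standard_normal((D, HKV * DH)) * 0.05   # shared value-embedding projection

def value_residual(v_heads, value_emb, layer_scale):
    """Add a shared value embedding, projected into KV-head space,
    scaled by a learned per-layer scalar.

    v_heads:   (HKV, T, DH) this layer's value heads
    value_emb: (T, D) shared value embedding
    """
    extra = (value_emb @ W_proj).reshape(-1, HKV, DH)   # (T, HKV, DH)
    return v_heads + layer_scale * extra.transpose(1, 0, 2)

v = rng.standard_normal((HKV, 6, DH))
e = rng.standard_normal((6, D))
out = value_residual(v, e, layer_scale=0.1)
```

With `layer_scale=0`, a layer falls back to its own values, so the network can learn per layer how much of the shared signal to use.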
XSA
Projected attention outputs away from the value direction to encourage diverse head representations.
parameters: {"layers":[7,10]}
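A sketch of one reading of the projection step: remove from each attention output its component along the corresponding value direction, leaving only the orthogonal part. The exact XSA formulation is not given, so treat this as illustrative.

```python
import numpy as np

def project_away(out, v):
    """Subtract the component of each output vector along its value vector:
    out - (out . v_hat) v_hat, leaving the part orthogonal to v."""
    v_hat = v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)
    coeff = np.sum(out * v_hat, axis=-1, keepdims=True)
    return out - coeff * v_hat

rng = np.random.default_rng(4)
o = rng.standard_normal((5, 16))
v = rng.standard_normal((5, 16))
p = project_away(o, v)
```

The result is (numerically) orthogonal to `v` at every position, which is the sense in which heads are pushed toward diverse representations.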
Partial RoPE
Applied rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
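A sketch of partial RoPE matching the parameters above: rotate only the first 16 of 64 head dimensions and pass the remaining 48 through unchanged. The half-split pairing of rotated dimensions is one common convention.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` dimensions
    of each position's head vector; leave the rest untouched.

    x: (T, D) per-head vectors for one head across T positions
    """
    T, D = x.shape
    half = rot_dims // 2
    pos = np.arange(T)[:, None]                   # (T, 1)
    freqs = base ** (-np.arange(half) / half)     # (half,) geometric frequencies
    ang = pos * freqs                             # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]     # pair up the rotated dims
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

rng = np.random.default_rng(5)
q = rng.standard_normal((4, 64))
qr = partial_rope(q)
```

Since rotation is norm-preserving and position 0 gets a zero angle, the first row and the untouched 48 dimensions come back unchanged.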
Quantization
int6
bits: 6
scope: all
late QAT
bits: null
scope: all
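The int6 scheme (and the fake-quantization a late QAT phase would train through) can be sketched as symmetric round-to-nearest with levels in [-31, 31]. Per-tensor scaling is an assumption; the submission may use per-channel or per-block scales.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor quantization to 6 bits (signed levels -31..31).
    Returns integer codes plus the scale needed to dequantize."""
    qmax = 2 ** 5 - 1                                   # 31 for signed 6-bit
    scale = np.max(np.abs(w)) / qmax
    q6 = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q6, scale

rng = np.random.default_rng(6)
w = rng.standard_normal((8, 8)).astype(np.float32)
q6, s = quantize_int6(w)
w_hat = q6.astype(np.float32) * s                       # dequantized ("fake-quant") weights
```

In QAT, `w_hat` replaces `w` in the forward pass (with a straight-through gradient), so the network adapts to the rounding error before the final export.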
Weight Averaging
EMA
parameters: null
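EMA weight averaging keeps a slow-moving copy of the weights for evaluation. The update rule below is standard; the decay value is illustrative, since the entry reports no parameters.

```python
import numpy as np

def ema_update(avg, new, decay=0.999):
    """One EMA step over model weights: avg <- decay*avg + (1-decay)*new."""
    return decay * avg + (1.0 - decay) * new

w_avg = np.zeros(4)
for step in range(3):
    w = np.full(4, float(step + 1))        # stand-in for the live training weights
    w_avg = ema_update(w_avg, w, decay=0.9)
```

The averaged copy lags the live weights, smoothing out late-training noise; the final checkpoint is taken from `w_avg`, not `w`.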
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
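Muon's distinguishing step is orthogonalizing the momentum-accumulated gradient of each weight matrix via a Newton-Schulz iteration before applying it. The quintic coefficients below follow the published Muon recipe; the shapes and step count are illustrative, and momentum/decay settings are unreported here.

```python
import numpy as np

def orthogonalize(G, steps=5):
    """Approximately orthogonalize a gradient matrix with Muon's
    quintic Newton-Schulz iteration, driving singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)     # normalize by Frobenius norm
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

rng = np.random.default_rng(10)
G = rng.standard_normal((8, 8))
X = orthogonalize(G)
s = np.linalg.svd(X, compute_uv=False)     # singular values cluster near 1
```

Equalizing the update's singular values means every direction in the weight matrix moves at a similar rate, which is the intuition behind Muon's fast convergence.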
Compression
zlib
level: null
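Since the quantized weights take few distinct integer values, zlib can shrink the serialized artifact further. A sketch with stand-in data; the compression level is illustrative, as none is reported.

```python
import zlib
import numpy as np

# Stand-in for a quantized weight tensor: int6-range codes stored as int8 bytes.
rng = np.random.default_rng(11)
codes = np.clip(np.round(rng.standard_normal(100_000) * 10), -31, 31).astype(np.int8)

raw = codes.tobytes()
packed = zlib.compress(raw, level=9)   # level 9 is an illustrative choice
ratio = len(packed) / len(raw)         # < 1.0: low-entropy codes compress well
```

Decompression with `zlib.decompress` recovers the exact byte stream, so this stage is lossless on top of the lossy quantization.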
Regularization
logit softcap
parameters: {"threshold":30}
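The logit softcap with threshold 30 is typically the tanh form: logits are squashed smoothly into (-30, 30) while staying near-identity around zero.

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Soft-cap logits to (-cap, cap) via cap * tanh(logits / cap).
    For |logits| << cap the map is approximately the identity."""
    return cap * np.tanh(logits / cap)

z = softcap(np.array([-100.0, 0.0, 1.0, 100.0]))
```

Bounding the logits keeps extreme pre-softmax values from destabilizing training without a hard clip's dead gradients.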
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Novel Contributions
- Hybrid RWKV-inspired token-shift layers replacing most attention layers
- Short-window attention combined with a final full-context attention layer
- Learned per-dimension interpolation token mixing using previous-token blending
- Combination of hybrid architecture with BigramHash, SmearGate, XSA, Partial RoPE, EMA, late QAT, and Muon