PR #1007

open

Submission/hybrid rwkv token shift

by dillon-blake

val_bpb

1.2252

Architecture

Hybrid

Optimizer

Muon

Artifact Size

~15.86 MB

Training Techniques

Architecture

Hybrid

Replaced attention in 8 of the 11 layers with RWKV-inspired token-shift mixing, keeping 3 attention layers (short-window, plus a final full-context layer).

parameters: {"layers":11,"attention_layers":3,"token_shift_layers":8}
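As a rough illustration of token-shift mixing, the sketch below blends each token with its predecessor using one weight per dimension. The function name, the zero vector used before the first token, and the exact blend formula are assumptions for illustration; the PR's actual layer may differ.

```python
def token_shift(x, mu):
    """Blend each token with the previous one, dimension by dimension.

    x:  list of T token vectors, each a list of D floats
    mu: list of D mixing weights in [0, 1] (learned in a real model)
    """
    out = []
    prev = [0.0] * len(mu)  # assumption: a zero vector precedes the first token
    for tok in x:
        # mu = 1 keeps the current token, mu = 0 copies the previous one
        out.append([m * c + (1.0 - m) * p for m, c, p in zip(mu, tok, prev)])
        prev = tok
    return out

mixed = token_shift([[1.0, 2.0], [3.0, 4.0]], mu=[1.0, 0.5])
print(mixed)
```

Unlike attention, this costs O(T) per layer and needs no key/value cache, which is presumably why most layers use it.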

GQA

Used grouped query attention with fewer KV heads than attention heads.

parameters: {"heads":8,"kv_heads":4}
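With 8 query heads and 4 KV heads, each KV head serves two query heads. A minimal sketch of that mapping follows; the helper name is hypothetical, and contiguous grouping is an assumption.

```python
def kv_head_for(query_head, n_heads=8, n_kv_heads=4):
    """Return the index of the KV head shared by the given query head."""
    group_size = n_heads // n_kv_heads  # query heads per KV head
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(8)]
print(mapping)
```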

BigramHash

Added hashed bigram embeddings to capture local token-pair context.

parameters: null
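One way to read "hashed bigram embeddings" is that each (previous token, current token) pair is hashed to a row of a small embedding table. The hash function, bucket count, and padding id below are hypothetical; the PR lists no parameters for this technique.

```python
def bigram_bucket(prev_id, cur_id, num_buckets):
    """Hash a token pair to a bucket index (illustrative hash, not the PR's)."""
    return (prev_id * 1_000_003 + cur_id) % num_buckets

def bigram_ids(tokens, num_buckets=4096, pad_id=0):
    """Return one bucket index per position; the first token pairs with pad_id."""
    prev = [pad_id] + tokens[:-1]
    return [bigram_bucket(p, c, num_buckets) for p, c in zip(prev, tokens)]

ids = bigram_ids([5, 7, 5, 7])
print(ids)
```

The resulting indices would look up rows in a learned table whose output is added to the ordinary token embedding. Distinct pairs may collide, which is the price of keeping the table small.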

SmearGate

Applied a learned gate to blend each token with the previous token after embedding normalization.

parameters: null
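A possible form of such a gate, sketched under assumptions: one learned logit per dimension is squashed by a sigmoid and used to smear the previous token into the current one. The function name, the sigmoid parametrization, and the zero vector before the first token are not taken from the PR.

```python
import math

def smear_gate(x, gate_logits):
    """Blend each token with its predecessor using sigmoid gates.

    x:           list of T token vectors (already normalized embeddings)
    gate_logits: list of D learned logits, one per dimension
    """
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    prev = [0.0] * len(gate_logits)  # assumption: zeros before the first token
    out = []
    for tok in x:
        out.append([(1.0 - g) * c + g * p for g, c, p in zip(gates, tok, prev)])
        prev = tok
    return out

smeared = smear_gate([[2.0], [4.0]], gate_logits=[0.0])
print(smeared)
```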

Value Residual

Injected value embeddings into attention/value pathways with learned per-layer scaling.

parameters: null

XSA

Projected attention outputs away from the value direction to encourage diverse head representations.

parameters: {"layers":[7,10]}
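The projection described above can be sketched as removing from an attention output y its component along a value vector v, i.e. y - (y·v / v·v) v. Which value vector is used, and whether this happens per head or per token, is not stated in the summary, so treat the helper below as a hypothetical illustration.

```python
def project_away(y, v, eps=1e-8):
    """Remove from y its component along v (eps guards against a zero v)."""
    dot_yv = sum(a * b for a, b in zip(y, v))
    dot_vv = sum(b * b for b in v)
    scale = dot_yv / (dot_vv + eps)
    return [a - scale * b for a, b in zip(y, v)]

out = project_away([3.0, 4.0], [1.0, 0.0])
print(out)
```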

Partial RoPE

Applied rotary position embeddings to only a subset of head dimensions.

parameters: {"dimensions":16,"total_dimensions":64}
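A minimal sketch of partial RoPE for a single head vector: only the first 16 of 64 dimensions are rotated, and the remaining 48 pass through unchanged. The half-split pairing (dimension i with i + 8) and the base of 10000 are assumptions; implementations also commonly pair adjacent dimensions.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first rotary_dims entries of vec by position-dependent angles."""
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * cos_t - b * sin_t
        out[i + half] = a * sin_t + b * cos_t
    return out

rotated = partial_rope([1.0] * 64, position=3)
```

Because each pair undergoes a pure rotation, the vector norm is preserved, and the unrotated dimensions carry position-independent content.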

LeakyReLU

Used LeakyReLU squared activation in the MLP instead of SwiGLU.

parameters: {"slope":0.5}
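Read literally, "LeakyReLU squared" with slope 0.5 applies LeakyReLU and then squares the result. The scalar sketch below assumes that reading; note that squaring discards the sign of negative inputs.

```python
def leaky_relu_squared(x, slope=0.5):
    """Apply LeakyReLU with the given negative slope, then square."""
    y = x if x >= 0 else slope * x
    return y * y

print(leaky_relu_squared(2.0), leaky_relu_squared(-2.0))
```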

MLP3x

Used a 3x expansion MLP.

parameters: {"expansion":3}
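To give a sense of scale, the sketch below counts the weights in a two-matrix MLP with 3x expansion. The model width of 512 is an assumption (8 heads times 64 dimensions per head), and the absence of biases and gating matrices is also assumed.

```python
def mlp_params(d_model, expansion=3):
    """Weights in an up-projection plus a down-projection, without biases."""
    hidden = expansion * d_model
    return d_model * hidden + hidden * d_model

print(mlp_params(512))
```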

Quantization

int6

bits: 6

scope: all
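A minimal sketch of symmetric 6-bit quantization, assuming one scale per weight row and a clipped range of -31..31. The PR summary does not state the scale granularity or rounding scheme, so both are illustrative choices.

```python
def quantize_int6(row):
    """Map floats to integers in [-31, 31] with a single per-row scale."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = [max(-31, min(31, round(w / scale))) for w in row]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from quantized integers."""
    return [v * scale for v in q]

q, scale = quantize_int6([31.0, -10.0, 0.0])
print(q, scale, dequantize(q, scale))
```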

QAT

bits: null

scope: all

Compression

zlib

level: null
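The following sketch shows how 6-bit values could be bit-packed (four values into three bytes) and then compressed with zlib. The packing layout, the helper names, and compression level 9 are assumptions; the summary leaves the level unspecified.

```python
import zlib

def pack_int6(values):
    """Pack signed 6-bit ints (-32..31) four at a time into three bytes."""
    unsigned = [v + 32 for v in values]        # shift into 0..63
    unsigned += [0] * (-len(unsigned) % 4)     # pad to a multiple of four
    out = bytearray()
    for i in range(0, len(unsigned), 4):
        a, b, c, d = unsigned[i:i + 4]
        word = (a << 18) | (b << 12) | (c << 6) | d
        out += word.to_bytes(3, "big")
    return bytes(out)

def unpack_int6(data, count):
    """Invert pack_int6, returning the first count values."""
    values = []
    for i in range(0, len(data), 3):
        word = int.from_bytes(data[i:i + 3], "big")
        values += [((word >> s) & 63) - 32 for s in (18, 12, 6, 0)]
    return values[:count]

weights = [0, -1, 31, -31, 5] * 200
packed = pack_int6(weights)
blob = zlib.compress(packed, 9)
restored = unpack_int6(zlib.decompress(blob), len(weights))
print(len(packed), len(blob))
```

Bit-packing alone stores each weight in 0.75 bytes; zlib then removes whatever redundancy remains in the packed stream.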

Optimizer

Muon

weight_decay: null

momentum: null

other_params: null

Weight Averaging

EMA

parameters: {"late_qat":true}
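For reference, an exponential moving average of weights updates a shadow copy after each training step. The decay value and zero initialization below are illustrative assumptions; the PR summary gives neither.

```python
def ema_update(ema, weights, decay=0.999):
    """Move the shadow weights a fraction (1 - decay) toward the live weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

ema = [0.0, 0.0]
for step_weights in ([1.0, 2.0], [1.0, 2.0]):
    ema = ema_update(ema, step_weights, decay=0.5)
print(ema)
```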

Regularization

logit softcap

parameters: {"cap":30}
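A common form of logit softcapping is cap * tanh(logit / cap), which is close to the identity for small logits and saturates at ±cap. The sketch assumes this tanh form with the listed cap of 30.

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)

print(softcap(1.0), softcap(1000.0))
```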

Sequence Length

sequence_length

train_length: 1024

eval_length: null

Novel Contributions

Hybrid RWKV-inspired token-shift layers replacing most attention layers
Short-window attention in only a few layers with one final full-context attention layer
Learned per-dimension token interpolation for efficient local mixing
Combination of hybrid architecture with BigramHash, SmearGate, XSA, Partial RoPE, and value embeddings
Int6-quantized, zlib-compressed artifact (~15.86 MB) under the 16 MB limit
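The difference between the short-window layers and the final full-context layer can be sketched as a difference in attention masks. The window size of 2 below is purely illustrative; the summary does not state the window length used.

```python
def causal_window_mask(seq_len, window=None):
    """Return mask[q][k] = True where query q may attend to key k.

    window=None gives full causal attention; otherwise each query sees
    only itself and the window - 1 tokens before it.
    """
    return [
        [k <= q and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]

short = causal_window_mask(4, window=2)
full = causal_window_mask(4)
print(short[3], full[3])
```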