PR #1007

open

Submission/hybrid rwkv token shift

by dillon-blake
val_bpb: 1.2252
Architecture: Hybrid
Optimizer: Muon
Artifact Size: ~15.86 MB

Training Techniques

Architecture
Hybrid
Replaced most attention layers with RWKV-inspired token-shift mixing while keeping a few short-window/full-context attention layers.
parameters: {"layers":11,"attention_layers":3,"token_shift_layers":8}
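The token-shift mixing described above can be sketched as a per-dimension interpolation between each token and its predecessor. This is a minimal illustration, not the submission's code: the function name `token_shift`, the zero padding for the first position, and the example weights are all assumptions.

```python
def token_shift(xs, mu):
    """Blend each token vector with the previous one, dimension by dimension.

    xs: list of token vectors (lists of floats), ordered by position.
    mu: per-dimension mixing weights in [0, 1]; learned in a real model,
        fixed here for illustration.
    """
    prev = [0.0] * len(mu)  # assumption: position 0 is mixed with zeros
    out = []
    for x in xs:
        out.append([m * xi + (1.0 - m) * pi for m, xi, pi in zip(mu, x, prev)])
        prev = x
    return out


xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Dimension 0 keeps the current token; dimension 1 averages current and previous.
mixed = token_shift(xs, mu=[1.0, 0.5])
```

Unlike attention, this mixing looks back exactly one position and costs O(T) in sequence length, which is presumably why a few attention layers are kept for longer-range context.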
GQA
Used grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
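With 8 query heads and 4 KV heads, each KV head serves two query heads. The sketch below shows one common mapping, assuming contiguous grouping; the helper name `kv_head_for` is hypothetical and the submission may group heads differently.

```python
def kv_head_for(query_head, num_heads=8, num_kv_heads=4):
    """Return the KV head index shared by a given query head."""
    group_size = num_heads // num_kv_heads  # 2 query heads per KV head here
    return query_head // group_size


mapping = [kv_head_for(h) for h in range(8)]
```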
BigramHash
Added hashed bigram embeddings to capture local token-pair context.
parameters: null
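A hashed bigram embedding maps each (previous token, current token) pair to a row of a fixed-size table, so the table stays small even though the number of possible pairs is huge. The sketch below computes only the bucket indices. The hash constant, the table size, and the dummy id used at position 0 are assumptions, since the entry lists no parameters.

```python
def bigram_bucket(prev_id, cur_id, num_buckets):
    """Hash a token pair into a bucket index (hypothetical hash function)."""
    return (prev_id * 1000003 + cur_id) % num_buckets


def bigram_indices(token_ids, num_buckets):
    """Return one bucket index per position for looking up bigram embeddings."""
    indices = []
    prev = 0  # assumption: the first token is paired with a dummy id of 0
    for tok in token_ids:
        indices.append(bigram_bucket(prev, tok, num_buckets))
        prev = tok
    return indices


idx = bigram_indices([5, 7, 5, 7], num_buckets=4096)
```

Distinct pairs can collide in the same bucket; that is the usual trade-off of hashing against table size.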
SmearGate
Applied a learned gate to blend each token with the previous token after embedding normalization.
parameters: null
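One plausible reading of the gate is a sigmoid of learned per-dimension logits that controls how much of the previous token is smeared into the current one. The sketch below follows that reading. The name `smear_gate`, the sigmoid parameterization, and leaving the first token unchanged are assumptions, not details taken from the PR.

```python
import math


def smear_gate(xs, gate_logits):
    """Blend each token with the previous one using a sigmoid gate per dimension."""
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    out = []
    prev = None
    for x in xs:
        if prev is None:
            out.append(list(x))  # assumption: position 0 has nothing to blend with
        else:
            out.append([(1.0 - g) * xi + g * pi for g, xi, pi in zip(gates, x, prev)])
        prev = x
    return out


# A logit of 0 gives a gate of 0.5, i.e. an even blend with the previous token.
smeared = smear_gate([[2.0], [4.0]], gate_logits=[0.0])
```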
Value Residual
Injected value embeddings into attention/value pathways with learned per-layer scaling.
parameters: null
XSA
Projected attention outputs away from the value direction to encourage diverse head representations.
parameters: {"layers":[7,10]}
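Projecting a vector away from a direction means subtracting its component along that direction. The sketch below applies this to a single attention output `y` and value vector `v`. Which value vector is used and the `eps` stabilizer are assumptions; the entry only states that outputs are projected away from the value direction.

```python
def project_away(y, v, eps=1e-8):
    """Remove from y its component along v, leaving y orthogonal to v."""
    dot_yv = sum(a * b for a, b in zip(y, v))
    dot_vv = sum(b * b for b in v)
    scale = dot_yv / (dot_vv + eps)  # eps guards against a zero value vector
    return [a - scale * b for a, b in zip(y, v)]


y_perp = project_away([3.0, 4.0], [1.0, 0.0])
```

After the projection, the output has (numerically) no component left along `v`.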
Partial RoPE
Applied rotary position embeddings to only a subset of head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
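Partial RoPE rotates only some of each head's dimensions and passes the rest through unchanged. The sketch below rotates the first 16 of 64 dimensions for a single head vector. The adjacent-pair layout, the choice of the first 16 dimensions, and the frequency base of 10000 are assumptions, as the PR summary does not specify them.

```python
import math


def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply a rotary embedding to the first rot_dims entries of x only."""
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)  # one frequency per dimension pair
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out


x = [1.0] * 64
y = partial_rope(x, pos=3)
```

Because each pair undergoes a pure rotation, the vector norm is preserved, and dimensions 16 to 63 carry no positional signal.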
LeakyReLU
Used LeakyReLU squared activation in the MLP instead of SwiGLU.
parameters: {"slope":0.5}
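As a scalar sketch, the activation applies LeakyReLU with slope 0.5 and then squares the result. Squaring the plain value (rather than preserving the sign) is an assumption here; the entry does not say which variant is used.

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU followed by squaring; negative inputs give small positive outputs."""
    y = x if x >= 0 else slope * x
    return y * y


positive = leaky_relu_squared(3.0)   # 3.0 squared
negative = leaky_relu_squared(-2.0)  # (0.5 * -2.0) squared
```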
MLP3x
Used a 3x expansion MLP.
parameters: {"expansion":3}
Quantization
int6
bits: 6
scope: all
QAT
bits: null
scope: all
Compression
zlib
level: null
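The quantization and compression entries above can be illustrated with a small round trip: scale weights into a 6-bit integer range, store them as bytes, and compress with zlib. This is a sketch under stated assumptions: symmetric quantization to [-31, 31] with one scale per tensor, one value per byte rather than bit-packing, and zlib's default level because the entry lists no level.

```python
import zlib


def quantize_int6(weights):
    """Symmetric quantization to integers in [-31, 31] with a single scale."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 31.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]


def compress(q):
    """Store each int6 value in one byte (offset by 32) and zlib-compress."""
    return zlib.compress(bytes(v + 32 for v in q))


def decompress(blob):
    """Invert compress(): recover the list of int6 values."""
    return [b - 32 for b in zlib.decompress(blob)]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int6(weights)
restored = dequantize(decompress(compress(q)), scale)
```

In quantization-aware training, a quantize-then-dequantize step like this is typically applied in the forward pass so the model learns weights that survive rounding.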
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"late_qat":true}
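An exponential moving average keeps a slowly updated copy of the weights that is used in place of the raw final weights. A minimal sketch of one update step follows; the decay value is hypothetical, since the entry does not list one.

```python
def ema_update(avg, current, decay=0.999):
    """Move the averaged weights a small step toward the current weights."""
    return [decay * a + (1.0 - decay) * c for a, c in zip(avg, current)]


# One step with an illustrative decay of 0.9.
averaged = ema_update([1.0, 2.0], [0.0, 2.0], decay=0.9)
```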
Regularization
logit softcap
parameters: {"cap":30}
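A logit softcap is commonly implemented as a scaled tanh, which leaves small logits nearly unchanged and bounds large ones by the cap. The sketch below uses the cap of 30 from the entry; the tanh form itself is an assumption about how the submission applies it.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly limit a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)


small = softcap(1.0)     # close to 1.0, nearly unchanged
large = softcap(1000.0)  # saturates near the cap
```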
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Hybrid RWKV-inspired token-shift layers replacing most attention layers
  • Short-window attention in only a few layers with one final full-context attention layer
  • Learned per-dimension token interpolation for efficient local mixing
  • Combination of hybrid architecture with BigramHash, SmearGate, XSA, Partial RoPE, and value embeddings
  • Int6 quantized and zlib-compressed artifact under the 16MB limit
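The difference between the short-window and full-context attention layers mentioned above comes down to the causal mask. The sketch below builds both from one helper. The window size is hypothetical, as the PR summary does not state it, and the name `sliding_window_mask` is illustrative only.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Attention is causal (j <= i) and limited to the last `window` positions.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


short = sliding_window_mask(4, window=2)  # each token sees itself and one predecessor
full = sliding_window_mask(4, window=4)   # window >= seq_len gives plain causal attention
```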