PR #1600

open

Non-record submission: HELIX and HELIX MoR K7R2 U-Net (architecture report + finalized metadata)

by sayujshah
val_bpb: 1.2781
Architecture: Transformer
Optimizer: Muon
Artifact Size: 9,973,239 bytes

Training Techniques

Architecture
GQA
Grouped-query attention with 8 query heads and 4 KV heads.
parameters: {"num_heads":8,"num_kv_heads":4}
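A minimal sketch of the grouped-query attention pattern with the reported head counts (8 query heads sharing 4 KV heads, so each KV head serves 2 query heads); function and variable names are illustrative, not from the submission's code:

```python
import numpy as np

def gqa_attention(q, k, v, num_heads=8, num_kv_heads=4):
    """Grouped-query attention: num_heads query heads share num_kv_heads K/V heads.
    q: (num_heads, T, d); k, v: (num_kv_heads, T, d). Shapes are illustrative."""
    group = num_heads // num_kv_heads        # query heads per KV head (2 here)
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(num_heads):
        kv = h // group                      # map query head -> shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d) # (T, T) attention logits
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)   # softmax over keys
        out[h] = w @ v[kv]
    return out
```

Relative to full multi-head attention, this halves the K/V projection parameters and cache, which matters under a tight artifact budget.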
Partial RoPE
Uses rotary position embeddings on only part of the head dimensions.
parameters: {"dimensions":16}
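A sketch of partial RoPE with the reported 16 rotated dimensions, assuming the standard rotary formulation applied to the leading slice of the head dimension (the exact slice and frequency base used by the submission are not stated):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first rot_dims of the head
    dimension only; the remaining dims pass through unrotated. x: (T, head_dim)."""
    T, d = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation frequencies
    angles = np.arange(T)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]        # the paired rotated dims
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

At position 0 the rotation is the identity, and the unrotated tail of each head carries position-independent features.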
XSA
XSA applied in the final blocks.
parameters: null
depth recurrence
Recurrence-style virtual depth with repeated unique blocks to increase effective depth without linearly increasing parameters.
parameters: {"unique_blocks":5,"iterations":2}
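The recurrence scheme above (5 unique blocks, 2 iterations, so effective depth 10 with 5 blocks' worth of weights) can be sketched as a simple reuse loop; this is a schematic, not the submission's actual forward pass:

```python
def recurrent_depth(x, blocks, iterations=2):
    """Reuse the same stack of unique blocks `iterations` times:
    effective depth = len(blocks) * iterations, while parameter count
    stays at len(blocks) sets of weights."""
    for _ in range(iterations):
        for block in blocks:
            x = block(x)
    return x
```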
U-Net skip connections
U-Net style skip structure across stages to stabilize information flow through repeated computation.
parameters: null
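One common way to realize U-Net style skips in a transformer stack is to cache each "down" stage's activation and fuse it back into the mirrored "up" stage; the sketch below assumes additive fusion (concatenation followed by a projection is another common choice, and the report does not specify which is used):

```python
def unet_stack(x, down_blocks, up_blocks):
    """U-Net style skips: cache each down-stage output and add it back into
    the matching up-stage, stabilizing information flow through the
    repeated-block computation."""
    skips = []
    for block in down_blocks:
        x = block(x)
        skips.append(x)             # save activation for the mirrored stage
    for block in up_blocks:
        x = block(x + skips.pop())  # fuse skip from the matching depth
    return x
```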
D-TPA
Differential tensor product attention with factored QKV and differential attention path.
parameters: {"rank":4}
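A rough sketch of what the D-TPA description suggests: queries and keys produced through rank-4 factored projections, with the attention map formed as the difference of two softmaxes (the differential-attention path). All names, the fixed lambda, and the exact factorization are assumptions for illustration; the report does not spell out the precise construction:

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def dtpa_head(x, Wa_q, Wb_q, Wa_k, Wb_k, Wv, lam=0.5):
    """One hypothetical D-TPA head. Q and K come from rank-4 factored
    projections (Wa_*: d_model x rank, Wb_*: rank x 2*d_head); the attention
    map is a difference of two softmaxes (differential attention)."""
    d = Wv.shape[1]
    q = x @ Wa_q @ Wb_q                 # factored query, (T, 2*d_head)
    k = x @ Wa_k @ Wb_k                 # factored key
    v = x @ Wv
    q1, q2 = np.split(q, 2, axis=-1)    # two query/key groups for the
    k1, k2 = np.split(k, 2, axis=-1)    # differential pair
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    return (a1 - lam * a2) @ v          # differential attention path
```

The factored projections keep the parameter cost at rank * (d_model + 2*d_head) per map instead of d_model * 2*d_head.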
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.023,"scalar_lr":0.025,"tied_embed_lr":0.035,"adamw_wd":0.01}
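The metadata implies parameters are routed between two optimizers: hidden matrices to Muon (matrix_lr 0.023, weight decay 0.04) and scalars/embeddings to AdamW (scalar_lr 0.025, tied_embed_lr 0.035, weight decay 0.01). A minimal sketch of such a routing rule, with the shape/name heuristic being an assumption:

```python
def route_params(named_shapes):
    """Split parameters between Muon and AdamW by shape: 2D hidden matrices
    go to Muon (matrix_lr=0.023, wd=0.04); scalars, vectors, and embedding
    tables go to AdamW (wd=0.01, with separate scalar/tied-embed LRs)."""
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        if len(shape) == 2 and "embed" not in name:
            muon.append(name)    # orthogonalized Muon updates suit matrices
        else:
            adamw.append(name)   # everything else gets AdamW
    return muon, adamw
```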
Weight Averaging
EMA + SWA
parameters: {"decay":0.997,"enabled":true}
Compression
lzma
level: null
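Since the compression level is listed as null, the packaging step presumably uses lzma defaults; a round-trip sketch with the stdlib module (function name is illustrative):

```python
import lzma

def pack_artifact(raw: bytes) -> bytes:
    """Compress the serialized weights with lzma for final packaging.
    The report lists no level, so default preset/filters are assumed."""
    return lzma.compress(raw)
```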
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmup
parameters: {"warmup_steps":20}
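Assuming linear warmup over the reported 20 steps (the schedule after warmup is not described, so the base LR is simply held):

```python
def warmup_lr(step, base_lr, warmup_steps=20):
    """Linear warmup over warmup_steps, then hold base_lr.
    The post-warmup decay, if any, is not specified in the report."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)
```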
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.01}

Novel Contributions

  • HELIX architecture with differential tensor product attention and recurrence-style virtual depth
  • U-Net skip connections for stabilizing repeated-block computation
  • High-capacity FFN design under a strict 16MB artifact budget
  • Muon plus AdamW optimizer routing with EMA/SWA for robustness
  • int6 per-row quantization with lzma compression for final packaging
  • Non-record research submission with full documentation and reproducible artifacts
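The int6 per-row quantization contribution can be sketched as symmetric per-row quantization into the signed 6-bit range [-31, 31], with one scale per row; the actual 6-bit bit-packing and the subsequent lzma stage are omitted, and the symmetric scheme is an assumption:

```python
import numpy as np

def quantize_int6_per_row(w):
    """Per-row symmetric int6 quantization sketch: each row gets its own
    scale so values map into [-31, 31]. Bit-packing into 6-bit fields and
    the lzma pass are handled separately."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)           # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights: error is at most scale/2 per row."""
    return q.astype(np.float32) * scale
```

Per-row scales keep outlier rows from inflating the quantization error of the whole tensor, which helps hold val_bpb while fitting the 16MB artifact budget.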