PR #1600

open

Non-record submission: HELIX and HELIX MoR K7R2 U-Net (architecture report + finalized metadata)

by sayujshah
val_bpb: 1.2781
Architecture: Transformer
Optimizer: Muon
Artifact Size: 9,973,239 bytes

Training Techniques

Architecture
GQA
Grouped-query attention with 8 query heads and 4 KV heads.
parameters: {"num_heads":8,"num_kv_heads":4}
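A minimal sketch of the grouped-query attention pattern with the reported head counts (8 query heads sharing 4 KV heads, so each KV head serves 2 query heads); function and variable names are illustrative, not from the submission's code:

```python
import numpy as np

def gqa_attention(q, k, v, num_heads=8, num_kv_heads=4):
    """Grouped-query attention: num_heads query heads share num_kv_heads K/V heads.
    q: (num_heads, T, d); k, v: (num_kv_heads, T, d). Shapes are illustrative."""
    group = num_heads // num_kv_heads        # query heads per KV head (2 here)
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(num_heads):
        kv = h // group                      # map query head -> shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d) # (T, T) attention logits
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)   # softmax over keys
        out[h] = w @ v[kv]
    return out
```

Relative to full multi-head attention, this halves the K/V projection parameters and cache, which matters under a tight artifact budget.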
Partial RoPE
Uses rotary position embeddings on only part of the head dimensions.
parameters: {"dimensions":16}
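A sketch of partial RoPE with the reported 16 rotated dimensions, assuming the standard rotary formulation applied to the leading slice of the head dimension (the exact slice and frequency base used by the submission are not stated):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first rot_dims of the head
    dimension only; the remaining dims pass through unrotated. x: (T, head_dim)."""
    T, d = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation frequencies
    angles = np.arange(T)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]        # the paired rotated dims
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

At position 0 the rotation is the identity, and the unrotated tail of each head carries position-independent features.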
XSA
XSA applied in the final blocks.
parameters: null
depth recurrence
Recurrence-style virtual depth with repeated unique blocks to increase effective depth without linearly increasing parameters.
parameters: {"unique_blocks":5,"iterations":2}
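The recurrence scheme above (5 unique blocks, 2 iterations, so effective depth 10 with 5 blocks' worth of weights) can be sketched as a simple reuse loop; this is a schematic, not the submission's actual forward pass:

```python
def recurrent_depth(x, blocks, iterations=2):
    """Reuse the same stack of unique blocks `iterations` times:
    effective depth = len(blocks) * iterations, while parameter count
    stays at len(blocks) sets of weights."""
    for _ in range(iterations):
        for block in blocks:
            x = block(x)
    return x
```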
U-Net skip connections
U-Net style skip structure across stages to stabilize information flow through repeated computation.
parameters: null
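One common way to realize U-Net style skips in a transformer stack is to cache each "down" stage's activation and fuse it back into the mirrored "up" stage; the sketch below assumes additive fusion (concatenation followed by a projection is another common choice, and the report does not specify which is used):

```python
def unet_stack(x, down_blocks, up_blocks):
    """U-Net style skips: cache each down-stage output and add it back into
    the matching up-stage, stabilizing information flow through the
    repeated-block computation."""
    skips = []
    for block in down_blocks:
        x = block(x)
        skips.append(x)             # save activation for the mirrored stage
    for block in up_blocks:
        x = block(x + skips.pop())  # fuse skip from the matching depth
    return x
```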
D-TPA
Differential tensor product attention with factored QKV and differential attention path.
parameters: {"rank":4}
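A rough sketch of what the D-TPA description suggests: queries and keys produced through rank-4 factored projections, with the attention map formed as the difference of two softmaxes (the differential-attention path). All names, the fixed lambda, and the exact factorization are assumptions for illustration; the report does not spell out the precise construction:

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def dtpa_head(x, Wa_q, Wb_q, Wa_k, Wb_k, Wv, lam=0.5):
    """One hypothetical D-TPA head. Q and K come from rank-4 factored
    projections (Wa_*: d_model x rank, Wb_*: rank x 2*d_head); the attention
    map is a difference of two softmaxes (differential attention)."""
    d = Wv.shape[1]
    q = x @ Wa_q @ Wb_q                 # factored query, (T, 2*d_head)
    k = x @ Wa_k @ Wb_k                 # factored key
    v = x @ Wv
    q1, q2 = np.split(q, 2, axis=-1)    # two query/key groups for the
    k1, k2 = np.split(k, 2, axis=-1)    # differential pair
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    return (a1 - lam * a2) @ v          # differential attention path
```

The factored projections keep the parameter cost at rank * (d_model + 2*d_head) per map instead of d_model * 2*d_head.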
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.023,"scalar_lr":0.025,"tied_embed_lr":0.035,"adamw_wd":0.01}
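The metadata implies parameters are routed between two optimizers: hidden matrices to Muon (matrix_lr 0.023, weight decay 0.04) and scalars/embeddings to AdamW (scalar_lr 0.025, tied_embed_lr 0.035, weight decay 0.01). A minimal sketch of such a routing rule, with the shape/name heuristic being an assumption:

```python
def route_params(named_shapes):
    """Split parameters between Muon and AdamW by shape: 2D hidden matrices
    go to Muon (matrix_lr=0.023, wd=0.04); scalars, vectors, and embedding
    tables go to AdamW (wd=0.01, with separate scalar/tied-embed LRs)."""
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        if len(shape) == 2 and "embed" not in name:
            muon.append(name)    # orthogonalized Muon updates suit matrices
        else:
            adamw.append(name)   # everything else gets AdamW
    return muon, adamw
```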
Weight Averaging
EMA + SWA
parameters: {"decay":0.997,"enabled":true}
Compression
lzma
level: null
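Since the compression level is listed as null, the packaging step presumably uses lzma defaults; a round-trip sketch with the stdlib module (function name is illustrative):

```python
import lzma

def pack_artifact(raw: bytes) -> bytes:
    """Compress the serialized weights with lzma for final packaging.
    The report lists no level, so default preset/filters are assumed."""
    return lzma.compress(raw)
```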
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmup
parameters: {"warmup_steps":20}
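Assuming linear warmup over the reported 20 steps (the schedule after warmup is not described, so the base LR is simply held):

```python
def warmup_lr(step, base_lr, warmup_steps=20):
    """Linear warmup over warmup_steps, then hold base_lr.
    The post-warmup decay, if any, is not specified in the report."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)
```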
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.01}

Novel Contributions

  • HELIX architecture with differential tensor product attention and recurrence-style virtual depth
  • U-Net skip connections for stabilizing repeated-block computation
  • High-capacity FFN design under a strict 16MB artifact budget
  • Muon plus AdamW optimizer routing with EMA/SWA for robustness
  • int6 per-row quantization with lzma compression for final packaging
  • Non-record research submission with full documentation and reproducible artifacts
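The int6 per-row quantization contribution can be sketched as symmetric per-row quantization into the signed 6-bit range [-31, 31], with one scale per row; the actual 6-bit bit-packing and the subsequent lzma stage are omitted, and the symmetric scheme is an assumption:

```python
import numpy as np

def quantize_int6_per_row(w):
    """Per-row symmetric int6 quantization sketch: each row gets its own
    scale so values map into [-31, 31]. Bit-packing into 6-bit fields and
    the lzma pass are handled separately."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)           # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights: error is at most scale/2 per row."""
    return q.astype(np.float32) * scale
```

Per-row scales keep outlier rows from inflating the quantization error of the whole tensor, which helps hold val_bpb while fitting the 16MB artifact budget.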