PR #377
openHybrid INL + Sort-Split MoE (1.41/1.46 bpb TTT, 15.5MB, 1xH100)
by Complexity-MLView on GitHub
val_bpb
1.4072
Architecture
Hybrid Transformer
Optimizer
—
Artifact Size
15.5MB
Training Techniques
Architecture
GQA + RoPE
Classical grouped-query attention with rotary positional embeddings in early layers.
parameters: {"layers":[0,1,2,3,4]}
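A minimal numpy sketch of the RoPE half used in these early layers (illustrative only; the PR pairs this with grouped-query attention, not shown here):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (seq, dim).

    Standard RoPE: channels are split into pairs and each pair is
    rotated by a position- and frequency-dependent angle.
    """
    seq, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)        # per-pair frequencies
    ang = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation of each (x1, x2) pair; norms are preserved.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Because each pair is only rotated, token norms are unchanged and position 0 passes through untouched.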
INL BetaMu attention
Error-driven O(n) attention that replaces the quadratic QKV attention matrix with a causal cumsum over the error signal (x - mu).
parameters: {"layers":[5,6,7,8]}
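The PR does not show the INL BetaMu update itself; a hypothetical sketch of the cumsum idea (mu, beta, and the running-mean gating are all assumptions):

```python
import numpy as np

def inl_betamu(x, mu, beta):
    """Hypothetical O(n) attention sketch: instead of a T x T attention
    matrix, accumulate the error signal (x - mu) with a causal cumsum,
    so each position sees a running summary of everything before it.
    mu (equilibrium) and beta (gain) would be learned in practice.
    """
    err = x - mu                              # (T, D) error vs. equilibrium
    summary = np.cumsum(err, axis=0)          # causal prefix sum, O(T*D)
    counts = np.arange(1, x.shape[0] + 1)[:, None]
    return x + beta * summary / counts        # running-mean correction
```

The whole layer is a single cumsum plus elementwise ops, which is where the O(n) claim comes from.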
Sort-Split MoE
Deterministic argsort-plus-fixed-split routing across 4 experts: every expert receives an equal share of tokens, so all experts stay busy and the routing remains compatible with fullgraph compilation.
parameters: {"experts":4}
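A sketch of the assumed sort-and-split routing (router scores and function names are illustrative):

```python
import numpy as np

def sort_split_route(scores, n_experts=4):
    """Deterministic sort-and-split routing: argsort tokens by a router
    score, then split the sorted order into equal contiguous chunks,
    one per expert.  Every expert gets exactly T / n_experts tokens,
    so load is perfectly balanced and the op has fixed shapes,
    which keeps it compatible with fullgraph compilation.
    """
    order = np.argsort(scores)            # deterministic token ordering
    chunks = np.split(order, n_experts)   # equal fixed-size splits
    assign = np.empty_like(order)
    for e, idx in enumerate(chunks):
        assign[idx] = e                   # token index -> expert id
    return assign
```

Unlike top-k routing, there is no auxiliary load-balancing loss to tune: balance is exact by construction.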
ALiBi
Learned slopes per head used as positional encoding in INL layers.
parameters: null
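The bias construction, sketched in numpy (standard ALiBi fixes the slopes to a geometric series; per this PR they are learned parameters instead):

```python
import numpy as np

def alibi_bias(slopes, seq_len):
    """ALiBi-style bias: per-head slope times token distance, added to
    attention logits so far-away tokens are penalized linearly.
    slopes has shape (heads,); output is (heads, seq, seq).
    """
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])   # |i - j| distance matrix
    return -slopes[:, None, None] * dist[None]   # 0 on the diagonal
```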
Token-routed MoE
Deterministic token_id % 4 routing across 4 experts using a mask-multiply pattern, so all experts run with fixed shapes on every batch.
parameters: {"experts":4}
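A sketch of the mask-multiply pattern (expert callables are stand-ins for the real MLPs):

```python
import numpy as np

def token_routed_moe(x, token_ids, experts):
    """Token-id routing: expert = token_id % n_experts.  Every expert
    runs on the full batch and outputs are combined by masking, so
    there are no data-dependent shapes (compile-friendly), at the cost
    of redundant compute.
    """
    n = len(experts)
    route = token_ids % n                               # (T,) expert ids
    out = np.zeros_like(x)
    for e, f in enumerate(experts):
        mask = (route == e)[:, None].astype(x.dtype)    # (T, 1) selector
        out += mask * f(x)                              # mask-multiply
    return out
```

Since token ids are roughly uniform mod 4, this routing is balanced without any learned router at all.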
PID Dynamics / INL Ultra-Lite
A learnable equilibrium mu is carried through all layers with fixed alpha/beta/gate coefficients and a clamped velocity, stabilizing hidden-state trajectories.
parameters: {"layers":9}
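One plausible reading of the update rule, as a per-layer step (coefficients and the exact PID form are assumptions, not from the PR):

```python
import numpy as np

def pid_step(h, v, mu, alpha=0.9, beta=0.1, gate=0.5, v_max=1.0):
    """Hypothetical INL Ultra-Lite step applied between layers: the
    hidden state h is pulled toward a learnable equilibrium mu via a
    velocity v with fixed alpha/beta/gate coefficients; v is clamped
    to keep trajectories stable.
    """
    v = alpha * v + beta * (mu - h)      # damped pull toward equilibrium
    v = np.clip(v, -v_max, v_max)        # clamped velocity
    return h + gate * v, v
```

With these coefficients the discrete system is a stable damped spiral, so the hidden state converges toward mu instead of drifting across depth.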
SwiGLU
Replaces relu^2 activation with SwiGLU in expert MLPs.
parameters: {"experts":4}
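The replacement activation, sketched for one expert MLP (weight names are illustrative):

```python
import numpy as np

def swiglu(x, W_gate, W_up, W_down):
    """SwiGLU expert MLP (replacing relu(x)^2): the up projection is
    gated elementwise by SiLU of a parallel gate projection, then
    projected back down.
    """
    gate = x @ W_gate
    silu = gate / (1.0 + np.exp(-gate))       # SiLU(z) = z * sigmoid(z)
    return (silu * (x @ W_up)) @ W_down
```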
LR Schedule
cosine warm restarts (SGDR)
parameters: {"cycle_lengths":[5000,10000,20000],"peak_lr_decay":0.7}
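The listed parameters imply three cosine cycles of growing length with the peak LR decayed 0.7x at each restart; a sketch (base_lr/min_lr are illustrative values, not from the PR):

```python
import math

def sgdr_lr(step, base_lr=1e-3, cycle_lengths=(5000, 10000, 20000),
            peak_lr_decay=0.7, min_lr=0.0):
    """Cosine warm restarts (SGDR): decay from the cycle's peak to
    min_lr along a half-cosine, restarting at each cycle boundary
    with the peak multiplied by peak_lr_decay."""
    start = 0
    for i, length in enumerate(cycle_lengths):
        if step < start + length:
            peak = base_lr * peak_lr_decay ** i
            t = (step - start) / length          # progress within cycle
            return min_lr + 0.5 * (peak - min_lr) * (1 + math.cos(math.pi * t))
        start += length
    return min_lr  # past the final cycle
```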
Weight Averaging
SWA
parameters: {"checkpoints":23}
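SWA here is a plain parameter-wise mean over the 23 saved checkpoints; sketched over dicts of lists for illustration:

```python
def swa_average(checkpoints):
    """Stochastic weight averaging: average each parameter across all
    checkpoints.  checkpoints is a list of {name: [weights]} dicts
    with identical shapes."""
    n = len(checkpoints)
    avg = {k: [0.0] * len(v) for k, v in checkpoints[0].items()}
    for ckpt in checkpoints:
        for k, v in ckpt.items():
            for i, w in enumerate(v):
                avg[k][i] += w / n
    return avg
```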
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
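A sketch of the artifact pipeline these two sections describe: symmetric per-tensor int8 quantization of all weights, then zlib on the raw bytes (scale handling is simplified; the PR's exact scheme is not shown):

```python
import zlib
import numpy as np

def pack(weights):
    """Quantize a float tensor to int8 with a symmetric per-tensor
    scale, then zlib-compress the bytes."""
    scale = max(np.abs(weights).max() / 127.0, 1e-12)
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale

def unpack(blob, scale, shape):
    """Invert pack(): decompress, reinterpret as int8, rescale."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return q.reshape(shape).astype(np.float32) * scale
```

Round-trip error is bounded by half the quantization step, and the int8 + zlib combination is what gets the artifact down to the 15.5MB range.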
Novel Contributions
- Hybrid architecture combining classical GQA attention with INL error-driven O(n) attention
- Sort-and-split MoE routing with deterministic argsort + fixed split
- Token-routed MoE with perfectly balanced 4-expert routing
- PID-style dynamics with a learnable equilibrium mu traversing all layers
- ALiBi positional encoding in INL layers
- Cosine warm restarts learning-rate schedule
- SWA checkpoint averaging