PR #1352
Add non-record submission: Multi-model cross-attention with dimensional asymmetry
by alientony
val_bpb: 1.2450
Architecture: Hybrid
Optimizer: —
Artifact Size: ~13.6 MB
Training Techniques
Architecture
multi-model single representation
Three different model types share a single representation and interact via cross-attention across asymmetric dimensions.
parameters: {"num_models":3,"model_types":["transformer","mlp","causal_depthwise"],"model_dims":[468,198,186],"shared_representation_dim":852,"cross_attention_dim":480}
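The parameters above imply that three backbones of unequal width (468, 198, 186) exchange information through a common 480-dim cross-attention space before writing into an 852-dim shared representation. A minimal numpy sketch of that pattern follows; the projection layout and the "attend to the other two models" scheme are assumptions for illustration, not the submission's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Dims taken from the submission's parameters.
MODEL_DIMS = [468, 198, 186]   # transformer, mlp, causal_depthwise
SHARED_DIM = 852               # shared representation
XATTN_DIM = 480                # common cross-attention width

rng = np.random.default_rng(0)

# Each model projects its own width into the shared cross-attention space,
# so models with asymmetric dims can still attend to one another.
q_proj = [rng.standard_normal((d, XATTN_DIM)) * d ** -0.5 for d in MODEL_DIMS]
k_proj = [rng.standard_normal((d, XATTN_DIM)) * d ** -0.5 for d in MODEL_DIMS]
v_proj = [rng.standard_normal((d, XATTN_DIM)) * d ** -0.5 for d in MODEL_DIMS]
out_proj = rng.standard_normal((XATTN_DIM, SHARED_DIM)) * XATTN_DIM ** -0.5

def cross_attend(states):
    """states[i]: (seq, MODEL_DIMS[i]) hidden states of model i."""
    outs = []
    for i, h in enumerate(states):
        q = h @ q_proj[i]                              # (seq, XATTN_DIM)
        # Keys/values come from the *other* models, concatenated along seq.
        ks = np.concatenate([s @ k_proj[j] for j, s in enumerate(states) if j != i])
        vs = np.concatenate([s @ v_proj[j] for j, s in enumerate(states) if j != i])
        attn = softmax(q @ ks.T / np.sqrt(XATTN_DIM))  # (seq, 2*seq)
        outs.append(attn @ vs @ out_proj)              # (seq, SHARED_DIM)
    return outs

seq = 5
states = [rng.standard_normal((seq, d)) for d in MODEL_DIMS]
updates = cross_attend(states)
```

Each model receives an update in the shared 852-dim space regardless of its native width, which is what lets the asymmetric dims coexist.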
depth recurrence
Depth recurrence reapplies the block stack to its own output, reinforcing model behavior across steps without adding parameters.
parameters: {"recurrence":1}
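The mechanism can be sketched with a toy residual block standing in for the full layer stack (the block itself is a placeholder; only the recurrence loop reflects the technique):

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 16

# Toy residual block standing in for the full layer stack.
W = rng.standard_normal((DIM, DIM)) * 0.1

def block(x):
    return x + np.tanh(x @ W)   # residual update

def forward(x, recurrence=1):
    # With recurrence=1 (as in the submission's parameters) the stack runs
    # once; higher values re-feed the output for extra effective depth
    # at no extra parameter cost.
    for _ in range(recurrence):
        x = block(x)
    return x

x = rng.standard_normal((4, DIM))
assert np.allclose(forward(x, recurrence=2), block(block(x)))
```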
KV head count
Attention uses fewer KV heads than query heads (grouped-query attention): each KV head is shared by a group of query heads, shrinking the KV cache.
parameters: {"heads":6,"kv_heads":2}
conv kernel
Kernel size of the convolutional (causal depthwise) branch of the architecture.
parameters: {"kernel_size":4}
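A causal depthwise convolution with kernel size 4 lets each position mix only its own channel over the current and three previous timesteps. A small numpy sketch (channel count and inputs are arbitrary; the left-padding is what makes it causal):

```python
import numpy as np

KERNEL_SIZE = 4
rng = np.random.default_rng(3)

def causal_depthwise_conv(x, w):
    """x: (seq, channels); w: (KERNEL_SIZE, channels), one filter per channel."""
    seq, ch = x.shape
    # Left-pad so position t only sees inputs t-3 .. t (causal).
    xp = np.concatenate([np.zeros((KERNEL_SIZE - 1, ch)), x])
    out = np.zeros_like(x)
    for t in range(seq):
        out[t] = (xp[t:t + KERNEL_SIZE] * w).sum(axis=0)
    return out

x = rng.standard_normal((6, 3))
w = rng.standard_normal((KERNEL_SIZE, 3))
y = causal_depthwise_conv(x, w)

# Causality check: perturbing a future input leaves earlier outputs untouched.
x2 = x.copy(); x2[5] += 1.0
assert np.allclose(causal_depthwise_conv(x2, w)[:5], y[:5])
```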
layers
Model depth used in the submission.
parameters: {"layers":4}
Quantization
int8
bits: 8
scope: all
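The card states only bits=8 with scope "all"; a common scheme matching that description is symmetric per-tensor int8 quantization, sketched below (the exact scheme used by the submission is not specified, so this is an assumption):

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8 quantization: one scale for the whole tensor.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(4)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int8(w)

# Worst-case round-trip error is half a quantization step.
err = np.abs(dequantize(q, s) - w).max()
assert err <= s / 2 + 1e-6
```

Storing int8 values plus one float scale per tensor cuts the artifact to roughly a quarter of fp32 size before compression.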
Compression
zlib
level: null
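Applying zlib to the quantized weights is straightforward with Python's standard library; the snippet below uses random int8 data as a stand-in for the real artifact, and `level: null` in the card is read here as zlib's default compression level (an assumption):

```python
import zlib
import numpy as np

rng = np.random.default_rng(5)
# Stand-in for the quantized int8 weight artifact.
weights = rng.integers(-10, 10, size=100_000, dtype=np.int8)

raw = weights.tobytes()
blob = zlib.compress(raw)          # default level, matching 'level: null'
back = np.frombuffer(zlib.decompress(blob), dtype=np.int8)

assert np.array_equal(back, weights)   # lossless round trip
```

Quantization concentrates the value distribution, which is what makes a generic byte-level compressor like zlib effective on the weights.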
Novel Contributions
- Multi-model architecture with a shared representation
- Cross-attention across asymmetric model dimensions
- Combining transformer, MLP, and causal_depthwise models
- Observation that loss curve shape is largely determined by model dimension
- Improved convergence quality and stability through complementary model-type pairing