PR #1352
Add non-record submission: Multi-model cross-attention with dimensional asymmetry
by alientony
val_bpb: 1.2450
Architecture: Hybrid
Optimizer: —
Artifact Size: ~13.6 MB
Training Techniques
Architecture
multi-model single representation
Three different model types share a single representation and interact via cross-attention across asymmetric dimensions.
parameters: {"num_models":3,"model_types":["transformer","mlp","causal_depthwise"],"model_dims":[468,198,186],"shared_representation_dim":852,"cross_attention_dim":480}
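The parameters above imply that three backbones of unequal width (468, 198, 186) exchange information through a common 480-dim cross-attention space before writing into an 852-dim shared representation. A minimal numpy sketch of that pattern follows; the projection layout and the "attend to the other two models" scheme are assumptions for illustration, not the submission's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Dims taken from the submission's parameters.
MODEL_DIMS = [468, 198, 186]   # transformer, mlp, causal_depthwise
SHARED_DIM = 852               # shared representation
XATTN_DIM = 480                # common cross-attention width

rng = np.random.default_rng(0)

# Each model projects its own width into the shared cross-attention space,
# so models with asymmetric dims can still attend to one another.
q_proj = [rng.standard_normal((d, XATTN_DIM)) * d ** -0.5 for d in MODEL_DIMS]
k_proj = [rng.standard_normal((d, XATTN_DIM)) * d ** -0.5 for d in MODEL_DIMS]
v_proj = [rng.standard_normal((d, XATTN_DIM)) * d ** -0.5 for d in MODEL_DIMS]
out_proj = rng.standard_normal((XATTN_DIM, SHARED_DIM)) * XATTN_DIM ** -0.5

def cross_attend(states):
    """states[i]: (seq, MODEL_DIMS[i]) hidden states of model i."""
    outs = []
    for i, h in enumerate(states):
        q = h @ q_proj[i]                              # (seq, XATTN_DIM)
        # Keys/values come from the *other* models, concatenated along seq.
        ks = np.concatenate([s @ k_proj[j] for j, s in enumerate(states) if j != i])
        vs = np.concatenate([s @ v_proj[j] for j, s in enumerate(states) if j != i])
        attn = softmax(q @ ks.T / np.sqrt(XATTN_DIM))  # (seq, 2*seq)
        outs.append(attn @ vs @ out_proj)              # (seq, SHARED_DIM)
    return outs

seq = 5
states = [rng.standard_normal((seq, d)) for d in MODEL_DIMS]
updates = cross_attend(states)
```

Each model receives an update in the shared 852-dim space regardless of its native width, which is what lets the asymmetric dims coexist.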
depth recurrence
Depth recurrence reapplies the block stack to its own output, reinforcing model behavior across steps without adding parameters.
parameters: {"recurrence":1}
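The mechanism can be sketched with a toy residual block standing in for the full layer stack (the block itself is a placeholder; only the recurrence loop reflects the technique):

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 16

# Toy residual block standing in for the full layer stack.
W = rng.standard_normal((DIM, DIM)) * 0.1

def block(x):
    return x + np.tanh(x @ W)   # residual update

def forward(x, recurrence=1):
    # With recurrence=1 (as in the submission's parameters) the stack runs
    # once; higher values re-feed the output for extra effective depth
    # at no extra parameter cost.
    for _ in range(recurrence):
        x = block(x)
    return x

x = rng.standard_normal((4, DIM))
assert np.allclose(forward(x, recurrence=2), block(block(x)))
```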
KV head count
Attention uses fewer KV heads than query heads (grouped-query attention): each KV head is shared by a group of query heads, shrinking the KV cache.
parameters: {"heads":6,"kv_heads":2}
conv kernel
Kernel size of the convolutional (causal depthwise) branch of the architecture.
parameters: {"kernel_size":4}
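A causal depthwise convolution with kernel size 4 lets each position mix only its own channel over the current and three previous timesteps. A small numpy sketch (channel count and inputs are arbitrary; the left-padding is what makes it causal):

```python
import numpy as np

KERNEL_SIZE = 4
rng = np.random.default_rng(3)

def causal_depthwise_conv(x, w):
    """x: (seq, channels); w: (KERNEL_SIZE, channels), one filter per channel."""
    seq, ch = x.shape
    # Left-pad so position t only sees inputs t-3 .. t (causal).
    xp = np.concatenate([np.zeros((KERNEL_SIZE - 1, ch)), x])
    out = np.zeros_like(x)
    for t in range(seq):
        out[t] = (xp[t:t + KERNEL_SIZE] * w).sum(axis=0)
    return out

x = rng.standard_normal((6, 3))
w = rng.standard_normal((KERNEL_SIZE, 3))
y = causal_depthwise_conv(x, w)

# Causality check: perturbing a future input leaves earlier outputs untouched.
x2 = x.copy(); x2[5] += 1.0
assert np.allclose(causal_depthwise_conv(x2, w)[:5], y[:5])
```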
layers
Model depth used in the submission.
parameters: {"layers":4}
Quantization
int8
bits: 8
scope: all
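The card states only bits=8 with scope "all"; a common scheme matching that description is symmetric per-tensor int8 quantization, sketched below (the exact scheme used by the submission is not specified, so this is an assumption):

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8 quantization: one scale for the whole tensor.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(4)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int8(w)

# Worst-case round-trip error is half a quantization step.
err = np.abs(dequantize(q, s) - w).max()
assert err <= s / 2 + 1e-6
```

Storing int8 values plus one float scale per tensor cuts the artifact to roughly a quarter of fp32 size before compression.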
Compression
zlib
level: null
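Applying zlib to the quantized weights is straightforward with Python's standard library; the snippet below uses random int8 data as a stand-in for the real artifact, and `level: null` in the card is read here as zlib's default compression level (an assumption):

```python
import zlib
import numpy as np

rng = np.random.default_rng(5)
# Stand-in for the quantized int8 weight artifact.
weights = rng.integers(-10, 10, size=100_000, dtype=np.int8)

raw = weights.tobytes()
blob = zlib.compress(raw)          # default level, matching 'level: null'
back = np.frombuffer(zlib.decompress(blob), dtype=np.int8)

assert np.array_equal(back, weights)   # lossless round trip
```

Quantization concentrates the value distribution, which is what makes a generic byte-level compressor like zlib effective on the weights.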
Novel Contributions
- Multi-model architecture with a shared representation
- Cross-attention across asymmetric model dimensions
- Combining transformer, MLP, and causal_depthwise models
- Observation that loss curve shape is largely determined by model dimension
- Improved convergence quality and stability through complementary model-type pairing