deep learning - MultiModal Cross attention - Stack Overflow


I am dealing with two embeddings, text and image. Both are the last_hidden_state of transformer models (BERT and ViT), so the shapes are (batch, seq, emb_dim). I want to feed text information into the image features using a cross-attention mechanism, and I was wondering whether this code gives me what I need:

import torch.nn as nn
# text_last, img_last: last_hidden_state of BERT / ViT, each of shape (batch, seq, emb_dim)
cross_attention = nn.MultiheadAttention(embed_dim=768, num_heads=12, dropout=0.1)
attn_output, attn_output_weights = cross_attention(text_last, img_last, img_last)

I tried this code, but I am not sure whether it is the right approach.
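For reference, a minimal runnable sketch of the same call is shown below. The tensor sizes (batch of 4, 32 text tokens, 197 image patches) are illustrative stand-ins rather than values from the post, and batch_first=True is assumed so the module accepts (batch, seq, emb_dim) tensors directly. Note that in cross attention the output keeps the query's sequence length, so whichever modality is passed as the query is the one that gets updated by the other.

import torch
import torch.nn as nn

# Illustrative sizes only; emb_dim=768 matches BERT-base / ViT-base hidden size.
batch, text_len, img_len, emb_dim = 4, 32, 197, 768

text_last = torch.randn(batch, text_len, emb_dim)  # stand-in for BERT last_hidden_state
img_last = torch.randn(batch, img_len, emb_dim)    # stand-in for ViT last_hidden_state

# batch_first=True makes the module take (batch, seq, emb_dim) inputs;
# the default (batch_first=False) expects (seq, batch, emb_dim) instead.
cross_attention = nn.MultiheadAttention(embed_dim=768, num_heads=12,
                                        dropout=0.1, batch_first=True)

# query=img_last, key/value=text_last: each image patch attends over the text tokens,
# so the output is image-shaped but text-informed. Swap the roles for the reverse direction.
attn_output, attn_weights = cross_attention(img_last, text_last, text_last)

print(attn_output.shape)   # torch.Size([4, 197, 768])
print(attn_weights.shape)  # torch.Size([4, 197, 32]), averaged over heads by default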
