Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attention to extract visual information for MLLM input.

Visual Prompt Generation: Cross-Attention in Q-Former

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

Figure 7. Overview of QFormer

A. Detailed Architecture of QFormer

The architecture overview is depicted in Figure 7. Specifically, QFormer is initialized as a BERT-based model [8] comprising a total of L = 12 layers. In contrast to typical BERT models that process textual inputs, QFormer takes R = 32 learnable query embeddings as input. These embeddings are used to extract visual information from the input visual data during Stage-1 pretraining in BLIP2 [22]. After projection, they serve as visual prompt embeddings for the LLM inputs.
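
The following is a minimal PyTorch sketch of this query-to-prompt flow, not the authors' implementation: the module names and tensor dimensions (e.g., a 1408-dimensional visual feature space and a 4096-dimensional LLM embedding space) are illustrative assumptions; only the 32 learnable queries and the final projection step mirror the description above.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; only NUM_QUERIES = 32 and the BERT-base hidden size are taken from the text.
NUM_QUERIES = 32
QFORMER_DIM = 768    # BERT-base hidden size
VISUAL_DIM = 1408    # visual encoder feature dimension (assumption)
LLM_DIM = 4096       # LLM embedding dimension (assumption)


class QueryToVisualPrompt(nn.Module):
    """Sketch: learnable queries -> QFormer -> projection to visual prompt embeddings for the LLM."""

    def __init__(self, qformer: nn.Module):
        super().__init__()
        # R = 32 learnable query embeddings, shared across all samples.
        self.query_tokens = nn.Parameter(torch.zeros(1, NUM_QUERIES, QFORMER_DIM))
        nn.init.normal_(self.query_tokens, std=0.02)
        self.qformer = qformer                        # e.g., a 12-layer stack as sketched below
        self.proj = nn.Linear(QFORMER_DIM, LLM_DIM)   # maps query outputs into the LLM input space

    def forward(self, visual_embeds: torch.Tensor) -> torch.Tensor:
        # visual_embeds: (batch, num_visual_tokens, VISUAL_DIM), e.g., ViT patch features.
        queries = self.query_tokens.expand(visual_embeds.size(0), -1, -1)
        # Queries attend to the visual features via cross-attention inside the QFormer.
        queries = self.qformer(queries, visual_embeds)
        # The last layer's query embeddings become 32 visual prompt tokens for the LLM.
        return self.proj(queries)                     # (batch, 32, LLM_DIM)
```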

Inside the QFormer, each layer includes a self-attention module composed of a Multi-Head Attention component and a Forward module (consisting of Linear, LayerNorm, and a residual connection). A cross-attention module, initialized with random values, is inserted every G layers; there, the learnable query embeddings interact with the visual embeddings. In the main paper, for conciseness, we condensed the representation of the multi-head attention and forward modules into self-(cross-)attention modules, and we illustrated only the modifications made to the cross-attention module in MIVPG, as the self-attention modules remain unchanged. The final QFormer output is the last layer's query embeddings.
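
As a rough illustration of one layer and of the cross-attention insertion schedule, the sketch below follows the description above. The hidden size of 768, 12 attention heads, GELU feed-forward, post-LayerNorm residual blocks, and a cross-attention frequency of G = 2 are assumptions (BERT-base/BLIP-2-style defaults), not values taken from this paper.

```python
import torch
import torch.nn as nn


class QFormerLayer(nn.Module):
    """Illustrative QFormer layer: self-attention over the queries, optional cross-attention
    to the visual embeddings, and a feed-forward block with LayerNorm and residual connections."""

    def __init__(self, dim: int = 768, visual_dim: int = 1408, num_heads: int = 12,
                 has_cross_attention: bool = False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.has_cross_attention = has_cross_attention
        if has_cross_attention:
            # Cross-attention is randomly initialized; keys and values come from the visual features.
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, kdim=visual_dim,
                                                    vdim=visual_dim, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, visual_embeds: torch.Tensor) -> torch.Tensor:
        q = self.norm1(queries + self.self_attn(queries, queries, queries, need_weights=False)[0])
        if self.has_cross_attention:
            q = self.norm2(q + self.cross_attn(q, visual_embeds, visual_embeds,
                                               need_weights=False)[0])
        return self.norm3(q + self.ffn(q))


class QFormer(nn.Module):
    """Stack of L = 12 layers with cross-attention inserted every G layers (G = 2 assumed here)."""

    def __init__(self, num_layers: int = 12, cross_attention_freq: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            QFormerLayer(has_cross_attention=(i % cross_attention_freq == 0))
            for i in range(num_layers)
        )

    def forward(self, queries: torch.Tensor, visual_embeds: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            queries = layer(queries, visual_embeds)
        return queries  # last layer's query embeddings
```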

For a more comprehensive understanding, readers are encouraged to refer to [22].


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington ([email protected]);

(2) Wenyi Wu, Amazon ([email protected]);

(3) Qi Li, Amazon ([email protected]);

(4) Rob Barton, Amazon ([email protected]);

(5) Boxin Du, Amazon ([email protected]);

(6) Shioulin Sam, Amazon ([email protected]);

(7) Karim Bouyarmane, Amazon ([email protected]);

(8) Ismail Tutar, Amazon ([email protected]);

(9) Junzhou Huang, The University of Texas at Arlington ([email protected]).

:::


:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::
