Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attention to extract visual information for MLLM input.

Visual Prompt Generation: Cross-Attention in Q-Former


Abstract and 1. Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

Figure 7. Overview of QFormer

A. Detailed Architecture of QFormer

The architecture overview is depicted in Figure 7. Specifically, QFormer is initialized as a BERT-based model [8] comprising L = 12 layers. In contrast to typical BERT models, which process textual inputs, QFormer takes R = 32 learnable query embeddings as input. During Stage-1 pretraining in BLIP2 [22], these embeddings are used to extract information from the input visual data; after projection, they serve as visual prompt embeddings for the LLM inputs.
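
As a rough sketch of how these query embeddings are realized, the snippet below treats them as a learnable parameter tensor shared across samples and projected into the LLM embedding space after the Q-Former. The hidden sizes (768 for the Q-Former, 4096 for the LLM) and the single linear projection are illustrative assumptions, not values taken from this paper.

```python
import torch
import torch.nn as nn

# Illustrative sizes: R = 32 queries from the paper; the hidden widths are assumptions.
R, D_QFORMER, D_LLM = 32, 768, 4096

# Learnable query embeddings, shared across all samples.
query_tokens = nn.Parameter(torch.zeros(1, R, D_QFORMER))
nn.init.normal_(query_tokens, std=0.02)

# Projection mapping the final query embeddings into the LLM input space.
llm_proj = nn.Linear(D_QFORMER, D_LLM)

batch_size = 4
queries = query_tokens.expand(batch_size, -1, -1)  # (B, R, D_QFORMER)
# ... the Q-Former refines `queries` against the visual embeddings here ...
visual_prompts = llm_proj(queries)                 # (B, R, D_LLM), fed to the LLM as soft prompts
```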

Inside the QFormer, each layer includes a self-attention module composed of a Multi-Head Attention component and a Forward module (consisting of a Linear layer, LayerNorm, and a residual connection). A cross-attention module, initialized with random values, is inserted every G layers; there, the learnable query embeddings interact with the visual embeddings. In the main paper, for conciseness, we condensed the multi-head attention and forward modules into self-(cross-)attention modules and illustrated only the modifications that MIVPG makes to the cross-attention module, as the self-attention modules remain unchanged. The final QFormer output is the last layer's query embeddings.
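
The layer structure described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the head count and feed-forward width are assumed, and `cross_attention_freq` plays the role of G (BLIP2 uses G = 2, i.e., cross-attention in every other layer).

```python
import torch
import torch.nn as nn

class QFormerLayer(nn.Module):
    """One Q-Former layer: self-attention, optional cross-attention, feed-forward.
    Each sublayer is followed by a residual connection and LayerNorm, as in BERT."""

    def __init__(self, dim=768, heads=12, ffn_dim=3072, has_cross_attention=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.has_cross_attention = has_cross_attention
        if has_cross_attention:
            # Randomly initialized; the queries attend to visual embeddings here.
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, visual_embeds):
        # Self-attention among the query embeddings.
        attn_out, _ = self.self_attn(queries, queries, queries)
        queries = self.norm1(queries + attn_out)
        # Cross-attention: queries (Q) extract information from visual embeddings (K, V).
        if self.has_cross_attention:
            cross_out, _ = self.cross_attn(queries, visual_embeds, visual_embeds)
            queries = self.norm2(queries + cross_out)
        # Position-wise feed-forward.
        queries = self.norm3(queries + self.ffn(queries))
        return queries

class QFormer(nn.Module):
    def __init__(self, num_layers=12, cross_attention_freq=2, dim=768):
        super().__init__()
        self.layers = nn.ModuleList(
            QFormerLayer(dim, has_cross_attention=(i % cross_attention_freq == 0))
            for i in range(num_layers)
        )

    def forward(self, queries, visual_embeds):
        for layer in self.layers:
            queries = layer(queries, visual_embeds)
        return queries  # the final layer's query embeddings are the Q-Former output
```

With `cross_attention_freq = 2`, the queries pass through self-attention in every layer but attend to the visual embeddings only in every other layer, matching the insertion pattern described above.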

For a more comprehensive understanding, readers are encouraged to refer to [22].


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington ([email protected]);

(2) Wenyi Wu, Amazon ([email protected]);

(3) Qi Li, Amazon ([email protected]);

(4) Rob Barton, Amazon ([email protected]);

(5) Boxin Du, Amazon ([email protected]);

(6) Shioulin Sam, Amazon ([email protected]);

(7) Karim Bouyarmane, Amazon ([email protected]);

(8) Ismail Tutar, Amazon ([email protected]);

(9) Junzhou Huang, The University of Texas at Arlington ([email protected]).

:::


:::info This paper is available on arXiv under a CC BY 4.0 Deed (Attribution 4.0 International) license.

:::
