TL;DR
- We propose UniVG-R1, a reasoning guided MLLM for universal visual grounding, which employs GRPO training combined with a cold-start initialization to effectively enhance reasoning capabilities across multimodal contexts.
- A high-quality CoT grounding dataset is introduced, encompassing diverse tasks, each meticulously annotated with detailed reasoning chains to facilitate advanced reasoning-based grounding.
- We identify a difficulty bias in GRPO training and propose a difficulty-aware weight adjustment strategy. Experiments validate that GRPO equipped with this strategy consistently enhances model performance.
- Extensive experiments demonstrate that our model achieves state-of-the-art performance across multiple grounding benchmarks, showcasing its versatility and generalizability.
UniVG-R1 tackles a wide range of visual grounding tasks with complex and implicit instructions. By combining GRPO training with a cold-start initialization, it effectively reasons over instructions and visual inputs, significantly improving grounding performance. Our model achieves state-of-the-art results on MIG-Bench and exhibits superior zero-shot performance on four reasoning-guided grounding benchmarks with an average 23.4% improvement.
Abstract
Traditional visual grounding methods primarily focus on single-image scenarios with simple textual references. However, extending these methods to real-world scenarios that involve implicit and complex instructions, particularly in conjunction with multiple images, poses significant challenges, which is mainly due to the lack of advanced reasoning ability across diverse multi-modal contexts. In this work, we aim to address the more practical universal grounding task, and propose UniVG-R1, a reasoning guided multimodal large language model (MLLM) for universal visual grounding, which enhances reasoning capabilities through reinforcement learning (RL) combined with cold-start data. Specifically, we first construct a high-quality Chain-of-Thought (CoT) grounding dataset, annotated with detailed reasoning chains, to guide the model towards correct reasoning paths via supervised fine-tuning. Subsequently, we perform rule-based reinforcement learning to encourage the model to identify correct reasoning chains, thereby incentivizing its reasoning capabilities. In addition, we identify a difficulty bias arising from the prevalence of easy samples as RL training progresses, and we propose a difficulty-aware weight adjustment strategy to further strengthen the performance. Experimental results demonstrate the effectiveness of UniVG-R1, which achieves state-of-the-art performance on MIG-Bench with a 9.1% improvement over the previous method. Furthermore, our model exhibits strong generalizability, achieving an average improvement of 23.4% in zero-shot performance across four image and video reasoning grounding benchmarks.
Pipeline
We adopt a two-stage training process. The first stage employs CoT-SFT, with the training data construction shown in (a). The second stage utilizes GRPO equipped with a difficulty-aware weight adjustment strategy in (b). The GRPO training process is illustrated in (c), where the policy model generates multiple responses, and each is assigned a distinct reward.
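The accuracy reward is described on this page only in terms of mIoU, so the following is a minimal sketch of how such a reward could be computed, not the authors' implementation. It assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates; the function names `box_iou` and `mean_iou` are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predicted_boxes, target_box):
    """Average IoU of several sampled predictions against one target box."""
    return sum(box_iou(p, target_box) for p in predicted_boxes) / len(predicted_boxes)
```

Averaging the per-response IoU over the group gives one number per sample, which is the kind of quantity the difficulty discussion later on this page refers to as mIoU.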
Results
Difficulty-Aware Weight Adjustment Strategy
During the stage 2 reinforcement learning process, we observe that most samples progressively become easier for the model, with the proportion of easy samples increasing and the proportion of hard samples steadily decreasing. Since the GRPO algorithm normalizes rewards to calculate the relative advantage within each group, easy samples (e.g., \(\textit{mIoU} = 0.8\)) receive the same policy gradient update as hard samples (e.g., \(\textit{mIoU} = 0.2\)). This leads to a difficulty-bias issue. In particular, during the later stages of training, as easy samples become predominant, most updates are derived from these easier instances, making it difficult for the model to focus on hard samples.
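The effect of group normalization can be checked numerically. The sketch below assumes the common GRPO form of the advantage, (reward minus group mean) divided by group standard deviation; the reward values and the small `eps` stabilizer are illustrative assumptions, not numbers from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# An easy sample (mean reward 0.8) and a hard one (mean reward 0.2)
# with the same spread of rewards across their responses.
easy = group_advantages([0.7, 0.8, 0.9])
hard = group_advantages([0.1, 0.2, 0.3])

# After normalization the two groups are indistinguishable.
same = all(abs(e - h) < 1e-6 for e, h in zip(easy, hard))
```

Because the group mean is subtracted out, the absolute level of the reward, and therefore the difficulty of the sample, no longer influences the size of the update.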
To address this problem, we propose a difficulty-aware weight adjustment strategy, which dynamically adjusts the weight of each sample based on its difficulty. Specifically, we introduce a difficulty coefficient \( \phi \propto -\textit{mIoU} \) to quantify the difficulty level of each sample, where the function \( \phi \) is negatively correlated with \(\textit{mIoU}\). For each sample, \(\textit{mIoU}\) is computed as the average accuracy reward over its sampled responses, and the coefficient reweights the sample accordingly. The detailed formula is provided below.
\[
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)} \left[
\frac{1}{G}\sum_{i=1}^{G} {\color{blue} \phi(\mathit{mIoU})} \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i - \beta\, \mathbb{D}_{KL}(\pi_{\theta} \| \pi_{ref})
\right]
\]
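As a rough illustration of the weighted term in this objective, the sketch below computes \(\frac{1}{G}\sum_i \phi(\textit{mIoU}) \cdot \text{ratio}_i \cdot A_i\) for a single sample and omits the KL penalty. The specific choice `phi = 2 - mIoU` is a hypothetical stand-in (any positive function that decreases with mIoU would fit the description here); the page does not state which function the authors use.

```python
from statistics import mean, pstdev


def difficulty_weight(miou):
    """Hypothetical phi: larger for harder samples (lower mIoU)."""
    return 2.0 - miou


def weighted_surrogate(iou_rewards, ratios, eps=1e-6):
    """(1/G) * sum_i phi(mIoU) * ratio_i * A_i for one sample, KL term omitted.

    iou_rewards: accuracy reward of each of the G sampled responses.
    ratios: pi_theta(o_i|q) / pi_theta_old(o_i|q) for each response.
    """
    miou = mean(iou_rewards)  # average accuracy reward of the group
    sigma = pstdev(iou_rewards)
    advantages = [(r - miou) / (sigma + eps) for r in iou_rewards]
    phi = difficulty_weight(miou)
    return sum(phi * rho * a for rho, a in zip(ratios, advantages)) / len(iou_rewards)


ratios = [0.9, 1.0, 1.2]  # illustrative probability ratios
easy_term = weighted_surrogate([0.7, 0.8, 0.9], ratios)
hard_term = weighted_surrogate([0.1, 0.2, 0.3], ratios)
```

With identical ratios and reward spreads, the hard sample's term is larger than the easy sample's by the factor `phi(0.2) / phi(0.8)`, which is the intended effect of the reweighting.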
Visualization
Acknowledgement
Our work is primarily based on Migician, VLM-R1, LLaMA-Factory, and lmms-eval. We are sincerely grateful for their excellent work.
BibTeX
`@article{bai2025univg,
title={UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning},
author={Bai, Sule and Li, Mingxing and Liu, Yong and Tang, Jing and Zhang, Haoji and Sun, Lei and Chu, Xiangxiang and Tang, Yansong},
journal={arXiv preprint arXiv:2505.14231},
year={2025}
}`
Required Materials
- 24+ Normal Rarity Siege Crossbows (ilvl 79-82)
- 24+ Perfect Orbs of Transmutation
- 24+ Perfect Orbs of Augmentation
- A few Exalted Orbs
- Omen of the Liege
- Omen of Sinistral Necromancy
- Preserved Jawbone
- Perfect Essence of Battle
- Perfect Essence of Haste
- Greater Essence of Abrasion
Step 1: Getting % Phys on a Magic Siege Crossbow
You can either buy a Magic Siege Crossbow with Tier 2+ (155%+ Physical Damage) or roll it yourself. Using Perfect Transmutes and Perfect Augments, you have a 1 in 25 chance to hit T2 % increased Physical Damage (weight of 50/2255 on prefixes).
If you get unlucky, you can use the Reforging Bench to 3-to-1 your Siege Crossbow bases to try to get ones that you can Perfect Aug a Prefix onto for another shot at T2 % Phys.
You should now have a Magic Siege Crossbow with Tier 2 % increased Physical Damage and any Suffix (use a Perfect Augment on your Crossbow if it only has the % Phys before moving to the next step).
Step 2: Getting the Flat Physical Damage
This next step is easy: just use a Greater Essence of Abrasion to get flat Physical Damage (Tier 3 equivalent).
Step 3: Getting the Grenade Damage Modifier
Pay attention here so you don't make a mistake! In this step, we use Omen of the Liege, Omen of Sinistral Necromancy and a Preserved Jawbone
- Right click BOTH Omens to make sure they are active
- Use the Preserved Jawbone on the Crossbow - this guarantees it will add a Prefix and that Prefix will grant an Amanamu Modifier (the Grenade Modifier is one of these)
- Take it to the Well of Souls and Reveal the modifier
The modifier we want is % increased Grenade Damage/Grenade Duration and is seemingly guaranteed when Revealing an Amanamu Prefix
Step 4: Filling the Suffixes
Before we move on to applying our Perfect Essences, we HAVE TO FILL OUR SUFFIXES!
- Use an Exalted Orb until your Crossbow has THREE Suffixes
We will now be using a Perfect Essence of Haste and a Perfect Essence of Battle. These Essences remove a random modifier before adding their special modifier. Both special modifiers happen to be Suffixes, which means that if our Suffixes are full, the Essence has to remove a Suffix to make space. The Physical Damage and Grenade Damage modifiers are all Prefixes, so they are safe as long as our Suffixes are full before we use these Essences.
- Use a Perfect Essence of Haste. This will add the 20-25% chance to gain Onslaught on Kill (this is cheaper than Essence of Battle, so we use it first)
- Next, use a Perfect Essence of Battle to add +6 to Attack Skills. This has a 1 in 3 chance to remove the Perfect Essence of Haste mod we just added. If this happens, use another Perfect Essence of Haste and hope it doesn't remove the Essence of Battle modifier. Luckily these can only remove Suffixes, so our Prefixes are safe on this step
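If you want to budget Essences for this step, the back-and-forth above can be worked out exactly. This sketch assumes every Essence used after the first Haste has the same 1 in 3 chance of stripping the other Essence mod (the guide only states this for the Battle Essence), and that you keep alternating until both mods are on the item. The function name is made up for this example.

```python
from fractions import Fraction


def expected_essence_uses(p_strip=Fraction(1, 3)):
    """Return (expected Haste uses, expected Battle uses).

    After the first Haste, attempts alternate Battle, Haste, Battle, ...
    Attempt k is only needed if the previous k - 1 attempts each stripped
    the other Essence mod, which happens with probability p_strip ** (k - 1).
    """
    ratio = p_strip ** 2  # two failed attempts in a row bring you back to the same Essence
    battle = 1 / (1 - ratio)  # attempts 1, 3, 5, ...
    haste = 1 + p_strip / (1 - ratio)  # the initial Haste plus attempts 2, 4, 6, ...
    return haste, battle


haste_uses, battle_uses = expected_essence_uses()
```

Under these assumptions you use about 1.4 Haste and 1.1 Battle Essences on average (11/8 and 9/8), so one of each is the likely outcome but a spare Haste is a sensible buffer.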
Once you have both Essence modifiers on the Crossbow, you are done and should have something like this
Thanks to Monsieur for sharing this Crossbow craft with me - it's super great for my Doomslayer Deadeye build