<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- Primary Meta Tags -->
<meta name="title" content="FINER: MLLMs Hallucinate under Fine-grained Negative Queries">
<meta name="description" content="FINER studies hallucinations in multimodal large language models under fine-grained negative queries, introduces two benchmarks, and proposes FINER-Tuning to reduce them.">
<meta name="keywords" content="MLLM, multimodal large language models, hallucination, vision-language models, benchmark, DPO, fine-grained reasoning, visual question answering, FINER">
<meta name="author" content="Rui Xiao, Sanghwan Kim, Yongqin Xian, Zeynep Akata, Stephan Alaniz">
<meta name="robots" content="index, follow">
<meta name="language" content="English">
<!-- Academic/Research Specific -->
<meta name="citation_title" content="FINER: MLLMs Hallucinate under Fine-grained Negative Queries">
<meta name="citation_author" content="Rui Xiao">
<meta name="citation_author" content="Sanghwan Kim">
<meta name="citation_author" content="Yongqin Xian">
<meta name="citation_author" content="Zeynep Akata">
<meta name="citation_author" content="Stephan Alaniz">
<meta name="citation_publication_date" content="2026">
<meta name="citation_conference_title" content="CVPR">
<meta name="citation_pdf_url" content="https://YOUR_DOMAIN.com/static/pdfs/paper.pdf">
<!-- Additional SEO -->
<meta name="theme-color" content="#2563eb">
<meta name="msapplication-TileColor" content="#2563eb">
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="default">
<!-- Preconnect for performance -->
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link rel="preconnect" href="https://ajax.googleapis.com">
<link rel="preconnect" href="https://documentcloud.adobe.com">
<link rel="preconnect" href="https://cdn.jsdelivr.net">
<title>FINER: MLLMs Hallucinate under Fine-grained Negative Queries - Rui Xiao, Sanghwan Kim, Yongqin Xian, Zeynep Akata, Stephan Alaniz | Academic Research</title>
<!-- Favicon and App Icons -->
<link rel="icon" type="image/x-icon" href="static/images/favicon.ico">
<link rel="apple-touch-icon" href="static/images/favicon.ico">
<!-- Critical CSS - Load synchronously -->
<link rel="stylesheet" href="static/css/bulma.min.css">
<link rel="stylesheet" href="static/css/index.css">
<!-- Non-critical CSS - Load asynchronously -->
<link rel="preload" href="static/css/bulma-carousel.min.css" as="style" onload="this.onload=null;this.rel='stylesheet'">
<link rel="preload" href="static/css/bulma-slider.min.css" as="style" onload="this.onload=null;this.rel='stylesheet'">
<link rel="preload" href="static/css/fontawesome.all.min.css" as="style" onload="this.onload=null;this.rel='stylesheet'">
<link rel="preload" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css" as="style" onload="this.onload=null;this.rel='stylesheet'">
<!-- Fallback for browsers that don't support preload -->
<noscript>
<link rel="stylesheet" href="static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="static/css/bulma-slider.min.css">
<link rel="stylesheet" href="static/css/fontawesome.all.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
</noscript>
<!-- Fonts - Optimized loading -->
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700;800&display=swap" rel="stylesheet">
<!-- Defer non-critical JavaScript -->
<script defer src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="https://documentcloud.adobe.com/view-sdk/main.js"></script>
<script defer src="static/js/fontawesome.all.min.js"></script>
<script defer src="static/js/bulma-carousel.min.js"></script>
<script defer src="static/js/bulma-slider.min.js"></script>
<script defer src="static/js/index.js"></script>
<!-- Structured Data for Academic Papers -->
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "ScholarlyArticle",
"headline": "FINER: MLLMs Hallucinate under Fine-grained Negative Queries",
"description": "FINER studies hallucinations in multimodal large language models under fine-grained negative queries, introduces two benchmarks, and proposes FINER-Tuning to reduce them.",
"author": [
{
"@type": "Person",
"name": "Rui Xiao",
"affiliation": {
"@type": "Organization",
"name": "Technical University of Munich"
}
},
{
"@type": "Person",
"name": "Sanghwan Kim",
"affiliation": {
"@type": "Organization",
"name": "Technical University of Munich"
}
},
{
"@type": "Person",
"name": "Yongqin Xian",
"affiliation": {
"@type": "Organization",
"name": "Google"
}
},
{
"@type": "Person",
"name": "Zeynep Akata",
"affiliation": {
"@type": "Organization",
"name": "Technical University of Munich"
}
},
{
"@type": "Person",
"name": "Stephan Alaniz",
"affiliation": {
"@type": "Organization",
"name": "Télécom Paris"
}
}
],
"datePublished": "2026-01-01",
"publisher": {
"@type": "Organization",
"name": "CVPR"
},
"url": "https://YOUR_DOMAIN.com/YOUR_PROJECT_PAGE",
"image": "https://YOUR_DOMAIN.com/static/images/social_preview.png",
"keywords": ["MLLM", "hallucination", "benchmark", "DPO", "fine-grained reasoning", "computer vision"],
"abstract": "Multimodal large language models struggle with hallucinations, particularly with fine-grained queries. We introduce FINER, along with two benchmarks, FINER-CompreCap and FINER-DOCCI, and propose FINER-Tuning to reduce hallucinations under fine-grained negative queries.",
"isAccessibleForFree": true,
"license": "https://creativecommons.org/licenses/by/4.0/",
"mainEntity": {
"@type": "WebPage",
"@id": "https://YOUR_DOMAIN.com/YOUR_PROJECT_PAGE"
},
"about": [
{
"@type": "Thing",
"name": "Multimodal Large Language Models"
},
{
"@type": "Thing",
"name": "Hallucination Evaluation"
}
]
}
</script>
</head>
<body>
<!-- Scroll to Top Button -->
<button class="scroll-to-top" onclick="scrollToTop()" title="Scroll to top" aria-label="Scroll to top">
<i class="fas fa-chevron-up"></i>
</button>
<main id="main-content">
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">FINER: MLLMs Hallucinate under Fine-grained Negative Queries</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">
<a href="https://www.eml-munich.de/people/rui-xiao" target="_blank">Rui Xiao</a><sup>1,2</sup>,</span>
<span class="author-block">
<a href="https://kim-sanghwan.github.io/" target="_blank">Sanghwan Kim</a><sup>1,2,3</sup>,</span>
<span class="author-block">
<a href="https://xianyongqin.github.io/" target="_blank">Yongqin Xian</a><sup>4</sup>,</span>
<span class="author-block">
<a href="https://www.eml-munich.de/people/zeynep-akata" target="_blank">Zeynep Akata</a><sup>1,2,3</sup>,</span>
<span class="author-block">
<a href="https://www.eml-munich.de/people/stephan-alaniz" target="_blank">Stephan Alaniz</a><sup>5</sup>
</span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block">
<sup>1</sup>Technical University of Munich
<sup>2</sup>Munich Center for Machine Learning
<sup>3</sup>Helmholtz Munich<br>
<sup>4</sup>Google
<sup>5</sup>Télécom Paris
</span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<span class="link-block">
<a href="https://arxiv.org/pdf/<ARXIV_PAPER_ID>.pdf" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Paper</span>
</a>
</span>
<span class="link-block">
<span class="external-link button is-normal is-rounded is-dark disabled-button" aria-disabled="true">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</span>
</span>
<span class="link-block">
<span class="external-link button is-normal is-rounded is-dark disabled-button" aria-disabled="true">
<span class="icon">
<img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg"
alt="Hugging Face" style="height: 1em; vertical-align: middle;">
</span>
<span>Models</span>
</span>
</span>
<span class="link-block">
<span class="external-link button is-normal is-rounded is-dark disabled-button" aria-disabled="true">
<span class="icon">
<img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg"
alt="Hugging Face" style="height: 1em; vertical-align: middle;">
</span>
<span>FINER-Tuning</span>
</span>
</span>
<span class="link-block">
<span class="external-link button is-normal is-rounded is-dark disabled-button" aria-disabled="true">
<span class="icon">
<img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg"
alt="Hugging Face" style="height: 1em; vertical-align: middle;">
</span>
<span>FINER-CompreCap</span>
</span>
</span>
<span class="link-block">
<span class="external-link button is-normal is-rounded is-dark disabled-button" aria-disabled="true">
<span class="icon">
<img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg"
alt="Hugging Face" style="height: 1em; vertical-align: middle;">
</span>
<span>FINER-DOCCI</span>
</span>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<!-- Paper abstract -->
<section class="section hero is-light">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and “what” questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites and enhancing general multimodal capabilities across six benchmarks.
</p>
</div>
</div>
</div>
</div>
</section>
<!-- End paper abstract -->
<section class="section">
<div class="container is-max-desktop content">
<h2 class="title is-3">Motivational Study</h2>
<p>
We begin with a simple question: can MLLMs still reject a false statement when most of the query is correct? To test this, we progressively make negative queries more fine-grained. Starting from a single wrong object, we then add correct attributes and relations, so that each query contains only one contradiction while the rest remains visually consistent. This gives seven levels of granularity, from coarse to fine.
</p>
<p>
The trend is striking. As the query becomes more detailed, the base model becomes much more likely to answer “Yes” to claims that should be rejected (false-positive hallucinations). For InternVL3.5-14B, accuracy drops from around 80% to around 20% on FINER-CompreCap, and to around 15% on FINER-DOCCI. In other words, subtle mistakes hidden inside otherwise correct descriptions are much harder for MLLMs to detect. FINER-Tuning noticeably improves this behavior, especially at the finest levels, motivating the need for benchmarks and training data that specifically target fine-grained hallucinations.
</p>
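The coarse-to-fine ladder described above can be sketched as follows. This is a toy illustration with hypothetical scene data and phrasing, not the paper's actual query-generation prompts:

```python
# Toy sketch of the coarse-to-fine negative-query ladder (hypothetical scene:
# the image shows a small brown dog on a sofa, but the query claims a cat).

def build_negative_queries(wrong_obj, correct_details):
    """Level 1 queries the wrong object alone; each later level attaches one
    more *correct* detail, so every query has exactly one contradiction."""
    queries = []
    phrase = wrong_obj
    queries.append(f"Is there {phrase} in the image?")
    for detail in correct_details:
        phrase = f"{phrase} {detail}"
        queries.append(f"Is there {phrase} in the image?")
    return queries

levels = build_negative_queries(
    "a cat",  # wrong: the animal is actually a dog
    ["that is small", "and brown", "sitting on the sofa"],  # all true of the dog
)
for i, query in enumerate(levels, 1):
    print(f"level {i}: {query}")
```

Each level keeps exactly one contradiction (the wrong object) while adding details that genuinely hold in the image, which is what makes the finest levels hard to reject.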
<figure>
<img class="motivation-figure" src="static/images/motivational_study.png" alt="Motivational study showing performance under increasingly fine-grained negative queries">
<figcaption>Figure 1. As negative queries become more fine-grained, base MLLMs become more likely to hallucinate. FINER-Tuning improves robustness, especially at higher granularity.</figcaption>
</figure>
</div>
</section>
<section class="section">
<div class="container is-max-desktop content">
<h2 class="title is-3">FINER-Benchmarks</h2>
<p>
FINER is built to test whether an MLLM can spot a small mistake hidden inside a detailed query. We start from a scene graph containing objects, attributes, and relations. For each element, we generate several plausible but incorrect alternatives, such as replacing an object, changing a color, or modifying a relation. We then compose both positive and negative questions from these scene graphs.
</p>
<p>
The benchmark contains four settings. <strong>Multi-obj</strong> checks whether the model can detect one wrong object among several correct ones. <strong>Multi-attr</strong> does the same for attributes. <strong>Multi-rel</strong> focuses on relations. <strong>Wh</strong> asks “what” questions with one incorrect attribute embedded in the query. Instead of simple yes/no evaluation, FINER uses multiple-choice questions, forcing the model to identify the correct visual content.
</p>
<p>
We release two benchmark variants. <strong>FINER-CompreCap</strong> is built from CompreCap scene graphs. <strong>FINER-DOCCI</strong> is built from dense DOCCI captions: we extract synthetic scene-graph-like annotations with Gemini, filter them with Qwen2.5-VL-72B, and use human verification to set the filtering thresholds. Together, the two benchmarks cover tens of thousands of MCQs and systematically probe fine-grained hallucinations beyond coarse object-level mismatches.
</p>
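As a toy illustration of the multiple-choice format, here is how one Multi-attr question might be composed from a scene-graph triple. The templates and option wording are hypothetical stand-ins, not the released benchmark format:

```python
# Sketch of composing one Multi-attr MCQ: one option keeps the ground-truth
# attribute, the rest are plausible but incorrect alternatives, so a yes-bias
# cannot score well. (Illustrative only; the real benchmark files may differ.)
import random

def make_mcq(obj, true_attr, distractor_attrs, seed=0):
    rng = random.Random(seed)
    options = [true_attr] + list(distractor_attrs)
    rng.shuffle(options)  # ground truth must not always be option A
    letters = "ABCD"
    question = f"What is the color of the {obj}?"
    lines = [question] + [f"{letters[i]}. {o}" for i, o in enumerate(options)]
    answer = letters[options.index(true_attr)]
    return "\n".join(lines), answer

mcq, gold = make_mcq("umbrella", "red", ["blue", "green", "yellow"])
print(mcq)
print("answer:", gold)
```

Forcing the model to pick the correct attribute, rather than merely agree or disagree, is what separates this format from simple yes/no probing.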
<figure>
<img src="static/images/FINER-benchmarks.png" alt="Pipeline for constructing the FINER benchmarks">
<figcaption>Figure 2. FINER benchmark construction: extract or build scene graphs, generate plausible negatives, and compose paired multiple-choice questions across four settings.</figcaption>
</figure>
</div>
</section>
<section class="section">
<div class="container is-max-desktop content">
<h2 class="title is-3">FINER-Tuning</h2>
<p>
FINER-Tuning is a data-driven training pipeline designed to make MLLMs better at rejecting fine-grained false queries. We start from dense long captions from Pixmo, avoiding overlap with COCO and the DOCCI training split. Using Phi-4-14B, we extract four kinds of positive phrases that mirror our benchmark settings: object summaries, attribute summaries, relation summaries, and composed phrases for “what” questions.
</p>
<p>
We then generate minimally edited negative counterparts by changing exactly one semantic component, such as an object, an attribute, or a relation. From these positive and negative phrases, we build both positive and negative query-answer pairs. The accepted response always states the correct visual fact, while the rejected response gives the wrong one. For object, attribute, and relation questions, we use templates; for the freer Wh setting, we let the LLM generate the question-answer pairs directly.
</p>
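A minimal sketch of how one such accepted/rejected pair might be assembled. The query template and the phrases below are hypothetical; the paper's exact templates are not reproduced here:

```python
# Sketch of building a DPO preference pair from a positive phrase and its
# minimally edited negative (exactly one semantic component is changed).

def make_preference_pair(positive_phrase, negative_phrase):
    """The chosen response states the correct visual fact; the rejected
    response endorses the single edited (false) component."""
    prompt = f'Is this statement correct? "{negative_phrase}"'
    chosen = f"No, that is incorrect. In the image, {positive_phrase}."
    rejected = f"Yes, that is correct. In the image, {negative_phrase}."
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = make_preference_pair(
    "the red umbrella leans against the wooden fence",   # holds in the image
    "the blue umbrella leans against the wooden fence",  # one edited attribute
)
```

Because the two phrases differ in a single component, the preference signal isolates exactly the kind of subtle contradiction the benchmarks test.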
<p>
Finally, we train the model with Direct Preference Optimization (DPO), so that it prefers grounded answers over hallucinated ones. This trains the model not only to answer correctly, but also to explicitly reject subtle false claims embedded in otherwise plausible queries.
</p>
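For reference, the standard DPO objective this pipeline optimizes can be written as a per-pair loss. The beta value and the toy log-probabilities below are illustrative, not the paper's training configuration:

```python
# Standard DPO loss: -log sigmoid(beta * ((log pi_w - log ref_w) - (log pi_l - log ref_l))),
# where w is the chosen (grounded) response and l the rejected (hallucinated) one.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the loss shrinks as the policy assigns relatively more
# probability to the grounded answer than to the hallucinated one.
loss_bad = dpo_loss(-10.0, -9.0, -10.0, -10.0)   # policy still prefers the rejected answer
loss_good = dpo_loss(-8.0, -12.0, -10.0, -10.0)  # policy prefers the accepted answer
```

Minimizing this loss pushes the model to widen the gap between grounded and hallucinated answers relative to the frozen reference model.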
<figure>
<img src="static/images/FINER-Tuning.png" alt="Training data generation pipeline for FINER-Tuning">
<figcaption>Figure 3. FINER-Tuning extracts fine-grained positive phrases, generates minimally edited negatives, constructs accepted/rejected answer pairs, and trains with DPO.</figcaption>
</figure>
</div>
</section>
<section class="section">
<div class="container is-max-desktop content">
<h2 class="title is-3">Results on FINER-Benchmarks</h2>
<p>
FINER is challenging even for strong frontier MLLMs. Performance drops sharply when models must reject subtle mistakes involving multiple objects, attributes, or relations, and the Wh setting remains particularly difficult. Prior hallucination-reduction methods that work on earlier benchmarks transfer poorly to FINER, showing that coarse hallucination benchmarks do not fully capture this problem.
</p>
<p>
FINER-Tuning consistently improves all four base models on both FINER-CompreCap and FINER-DOCCI. The gains are especially strong on the more fine-grained settings. For example, InternVL3.5-14B improves by up to <strong>24.2%</strong> on FINER-CompreCap, and the tuned 14B model becomes competitive with much larger or closed models in several settings. We also find a clear trend: performance declines as the number of queried attributes or relations increases, but FINER-Tuning reduces this drop and brings larger gains precisely where the questions are hardest.
</p>
</div>
</section>
<!-- Image carousel 1 -->
<section class="hero is-small">
<div class="hero-body">
<div class="container">
<div id="results-carousel-finer" class="carousel results-carousel">
<div class="item">
<div class="carousel-image-wrap">
<img src="static/images/FINER-benchmarks-results-tab.png" alt="Table of results on FINER benchmarks"/>
</div>
</div>
<div class="item">
<div class="carousel-image-wrap">
<img src="static/images/FINER-benchmarks-results-fig.png" alt="Figure showing performance trends on FINER benchmarks"/>
</div>
</div>
</div>
</div>
</div>
</section>
<!-- End image carousel 1 -->
<section class="section">
<div class="container is-max-desktop content">
<h2 class="title is-3">Results on General Hallucination Benchmarks</h2>
<p>
A key question is whether training on FINER only helps on FINER itself, or whether it generalizes to broader hallucination evaluation. Encouragingly, FINER-Tuning transfers well. Across eight existing hallucination benchmarks, it consistently improves Qwen2.5-VL and InternVL3.5 on both discriminative and generative settings. On DASH, for instance, it improves the two InternVL3.5 variants by 6.2% and 5.5%, and it also lowers hallucination on MMHal-Bench and improves scores on HaloQuest.
</p>
<p>
This matters because FINER-Tuning is not aimed at a single narrow benchmark. Instead, it teaches models to detect subtle contradictions in queries, and this stronger discrimination ability carries over to other hallucination suites as well.
</p>
</div>
</section>
<!-- Image carousel 2 -->
<section class="hero is-small">
<div class="hero-body">
<div class="container">
<div id="results-carousel-hallu" class="carousel results-carousel">
<div class="item">
<div class="carousel-image-wrap">
<img src="static/images/Other_hallu_tab_1.png" alt="Results on other hallucination benchmarks part 1"/>
</div>
</div>
<div class="item">
<div class="carousel-image-wrap">
<img src="static/images/Other_hallu_tab_2.png" alt="Results on other hallucination benchmarks part 2"/>
</div>
</div>
<div class="item">
<div class="carousel-image-wrap">
<img src="static/images/Other_hallu_tab_3.png" alt="Results on other hallucination benchmarks part 3"/>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop content">
<h2 class="title is-3">Results on General Multimodal Capabilities</h2>
<p>
Hallucination reduction often comes with an alignment tax, where gains on the target task hurt general ability. FINER-Tuning avoids this trade-off. On six general-purpose multimodal benchmarks, it maintains or improves the performance of strong base models, including gains on MMStar, MMVP, NaturalBench, and V* Bench. For InternVL3.5-14B, the average score improves by <strong>1.4%</strong>. This suggests that FINER provides a useful training signal that complements, rather than damages, a model’s broader multimodal capabilities.
</p>
<figure>
<img class="general-figure" src="static/images/General_cap_tab.png" alt="Results on general multimodal capabilities">
<figcaption>Figure 4. Results on general capabilities.</figcaption>
</figure>
</div>
</section>
<section class="section">
<div class="container is-max-desktop content">
<h2 class="title is-3">Ablation Studies</h2>
<p>
We conduct two ablation studies to understand what drives FINER-Tuning. First, we ablate the <strong>training strategy</strong> by comparing DPO against SFT, and by training with only negative queries versus both positive and negative queries. Interestingly, SFT can even hurt performance, while DPO is consistently stronger. Training with both positive and negative queries gives the best overall results, showing that FINER-Tuning benefits from learning both to confirm correct statements and to reject subtle false ones.
</p>
<p>
Second, we ablate <strong>training subset selection</strong> by training on only one subset at a time: Multi-obj, Multi-attr, Multi-rel, or Wh. Models trained on a single subset perform best on their matching test setting, but still transfer somewhat to the other settings. Training on all subsets gives the most balanced performance overall, suggesting that FINER-Tuning learns a broader fine-grained rejection capability rather than overfitting to one query type. We also include a series of additional ablation studies in the supplementary material.
</p>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="ablation-grid">
<figure class="ablation-item">
<img src="static/images/Ablation_tab_1.png" alt="First ablation study table">
<figcaption>Training strategy ablation.</figcaption>
</figure>
<figure class="ablation-item">
<img src="static/images/Ablation_tab_2.png" alt="Second ablation study table">
<figcaption>Training subset ablation.</figcaption>
</figure>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop content">
<h2 class="title is-3">Qualitative Results on FINER Benchmarks</h2>
</div>
</section>
<!-- Image carousel 3 -->
<section class="hero is-small">
<div class="hero-body">
<div class="container">
<div id="qualitative-results" class="carousel results-carousel">
<div class="item">
<div class="carousel-image-wrap">
<img src="static/images/multi-obj-quantitative.png" alt="Qualitative example for the Multi-obj setting"/>
</div>
</div>
<div class="item">
<div class="carousel-image-wrap">
<img src="static/images/multi-attr-quantitative.png" alt="Qualitative example for the Multi-attr setting"/>
</div>
</div>
<div class="item">
<div class="carousel-image-wrap">
<img src="static/images/multi-rel-quantitative.png" alt="Qualitative example for the Multi-rel setting"/>
</div>
</div>
<div class="item">
<div class="carousel-image-wrap">
<img src="static/images/wh-quantitative.png" alt="Qualitative example for the Wh setting"/>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop content">
<h2 class="title is-3">Limitations and Future Work</h2>
<p>
FINER still has several limitations. Although we include human filtering, the scale of the benchmarks, especially FINER-DOCCI, makes full manual curation impractical. As a result, some samples may still contain annotation errors or ambiguous cases (we will provide more detailed visualizations later). In addition, our current multi-relation setting is limited to at most three relations, which does not yet capture richer relational compositions that appear in real-world scenes.
</p>
<p>
A natural next step is to build a larger and more challenging version of FINER with more objects, attributes, and relations per image, while increasing the level of human validation. We view FINER as a starting point for studying hallucinations hidden inside fine-grained queries, and hope it encourages broader work in this direction. Looking ahead, we believe the same idea can be extended beyond general-domain vision-language benchmarks to high-stakes settings such as medicine, finance, and law, where even subtle fine-grained errors can be costly.
</p>
</div>
</section>
<!-- BibTex citation -->
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<div class="bibtex-header">
<h2 class="title">BibTeX</h2>
<button class="copy-bibtex-btn" onclick="copyBibTeX()" title="Copy BibTeX to clipboard">
<i class="fas fa-copy"></i>
<span class="copy-text">Copy</span>
</button>
</div>
<pre id="bibtex-code"><code>@inproceedings{xiao2026finer,
title={FINER: MLLMs Hallucinate under Fine-grained Negative Queries},
author={Xiao, Rui and Kim, Sanghwan and Xian, Yongqin and Akata, Zeynep and Alaniz, Stephan},
booktitle={CVPR},
year={2026}
}</code></pre>
</div>
</section>
<!-- End BibTex citation -->
<footer class="footer">
<div class="container">
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p>
This page was built using the <a href="https://github.com/eliahuhorwitz/Academic-project-page-template" target="_blank">Academic Project Page Template</a>, which was adopted from the <a href="https://nerfies.github.io" target="_blank">Nerfies</a> project page.
You are free to borrow the source code of this website, but please link back to this page in the footer. <br>
This website is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/" target="_blank">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</p>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>