
OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks

Zhihao Peng*1, Cheng Wang*1, Shengyuan Liu*1, Zhiying Liang*2, Zanting Ye3, Min Jie Ju4, Peter YM Woo5, Yixuan Yuan†1

1Department of Electronic Engineering, The Chinese University of Hong Kong 2Sun Yat-sen Memorial Hospital, Sun Yat-sen University 3School of Biomedical Engineering, Southern Medical University
4Zhongshan Hospital, Fudan University 5Department of Neurosurgery, Prince of Wales Hospital

*Core Contributors
†Correspondence to: yxyuan@ee.cuhk.edu.hk

Highlights

1. We introduce OmniBrainBench, the first comprehensive multimodal benchmark specifically designed to evaluate MLLMs across the complete spectrum of brain imaging analysis with closed- and open-ended evaluations, covering 9,527 clinically verified VQA pairs, 31,706 images, and 15 modalities.

2. We develop a multi-dimensional evaluation framework that mirrors the clinical workflow from anatomical and imaging assessment to therapeutic cycle management, assessing the capabilities of MLLMs across 15 multi-stage clinical tasks within brain imaging analysis.

3. We conduct extensive evaluations of 24 models across open-source general-purpose, medical-specialized, and proprietary MLLMs to reveal critical gaps in their visual-clinical reasoning, providing a detailed analysis of MLLMs in brain imaging.

Abstract

Overview of the OmniBrainBench dataset.

Brain imaging analysis is crucial for diagnosing and treating brain disorders, and multimodal large language models (MLLMs) are increasingly supporting it. However, current brain imaging visual question-answering (VQA) benchmarks either cover a limited number of imaging modalities or are restricted to coarse-grained pathological descriptions, hindering a comprehensive assessment of MLLMs across the full clinical continuum. To address these limitations, we introduce OmniBrainBench, the first comprehensive multimodal VQA benchmark specifically designed to assess the multimodal comprehension capabilities of MLLMs in brain imaging analysis with closed- and open-ended evaluations. OmniBrainBench comprises 15 distinct brain imaging modalities collected from 30 verified medical sources, yielding 9,527 validated VQA pairs and 31,706 images. It simulates clinical workflows and encompasses 15 multi-stage clinical tasks rigorously validated by a professional radiologist. Evaluations of 24 state-of-the-art models, spanning open-source general-purpose, medical-specialized, and proprietary MLLMs, highlight the substantial challenges posed by OmniBrainBench. Experiments reveal that proprietary MLLMs such as GPT-5 (63.37%) outperform open-source and medical MLLMs yet lag far behind physicians (91.35%), while medical MLLMs show wide variance between closed- and open-ended VQA. Open-source general-purpose MLLMs generally trail but excel in specific tasks, and all MLLMs fall short in complex preoperative reasoning, revealing a critical visual-to-clinical gap. OmniBrainBench establishes a new standard for assessing MLLMs in brain imaging analysis and highlights the gaps between current models and physicians.

Statistics

Construction Process


Construction process of OmniBrainBench with (a) data collection, (b) question augmentation, and (c) data filtering.
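The augmentation and filtering rules are not detailed on this page; as a rough illustration of the data-filtering stage, the sketch below drops candidate VQA pairs with missing images, near-empty questions, or duplicate question text. The field names (`question`, `answer`, `image_paths`) and thresholds are illustrative assumptions, not the benchmark's actual pipeline.

```python
import os

# Illustrative sketch of the data-filtering stage (assumed fields and rules,
# not the benchmark's actual pipeline): drop candidate VQA pairs with missing
# images, near-empty questions, or duplicate question text.
def filter_vqa_pairs(pairs: list[dict]) -> list[dict]:
    kept, seen_questions = [], set()
    for pair in pairs:
        question = pair.get("question", "").strip()
        if len(question) < 10:  # discard trivially short questions
            continue
        if question.lower() in seen_questions:  # deduplicate by question text
            continue
        if not all(os.path.exists(p) for p in pair.get("image_paths", [])):
            continue  # every referenced image file must exist
        seen_questions.add(question.lower())
        kept.append(pair)
    return kept
```

Pairs surviving such automatic checks would still need expert review; as noted above, all tasks in OmniBrainBench were validated by a professional radiologist.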

Experiment Results

Analysis of Closed-ended Evaluation


Table 1: Performance of different MLLMs on five specialized clinical phases with 15 secondary subtasks on closed-ended VQA of OmniBrainBench. The best-performing model in each category is highlighted in bold, and the second best is underlined.

Observation 1: Brain imaging analysis is challenging for MLLMs, with significant gaps between MLLMs and physicians. Physicians achieve an average accuracy of 91.35% across all tasks, whereas the highest-performing model, Gemini-2.5-Pro, reaches only 66.58%, a gap of 24.77 percentage points. This disparity underscores the intrinsic complexity of brain imaging analysis, which demands both precise visual interpretation and specialized clinical expertise. It also indicates that, while open-source models benefit from structured contextual inputs, they remain limited in knowledge-intensive and reasoning-dependent domains, highlighting the critical need for domain-specific pretraining and stronger reasoning capabilities.
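For reference, closed-ended accuracy can be scored with a simple option-letter match. The snippet below is a minimal sketch, assuming each record stores the clinical subtask, the ground-truth letter, and the model's free-form answer; the field names and extraction regex are illustrative, not the benchmark's official scoring script.

```python
import re
from collections import defaultdict

def extract_choice(response: str) -> str | None:
    """Return the first standalone option letter (A-E) in a model response."""
    match = re.search(r"\b([A-E])\b", response.strip().upper())
    return match.group(1) if match else None

def per_task_accuracy(records: list[dict]) -> dict[str, float]:
    """Accuracy per clinical subtask, averaged over its VQA pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        total[rec["task"]] += 1
        correct[rec["task"]] += int(extract_choice(rec["prediction"]) == rec["answer"])
    return {task: correct[task] / total[task] for task in total}

# Toy example: one right and one wrong answer on risk stratification
demo = [
    {"task": "risk_stratification", "answer": "B", "prediction": "The answer is B."},
    {"task": "risk_stratification", "answer": "C", "prediction": "A, because ..."},
]
print(per_task_accuracy(demo))  # {'risk_stratification': 0.5}
```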

Observation 2: Medical MLLMs exhibit heterogeneous performance. The highest-performing medical model, HuatuoGPT-V-34B, achieves a mean accuracy of 63.56%, making it competitive with leading proprietary MLLMs and giving it superior performance in the IMI (69.55%) and RS (40.84%) clinical phases. In contrast, other medical MLLMs, e.g., MedGemma-4B (48.04%) and Llava-Med-7B (38.84%), display markedly lower aggregate scores, consistent with their general performance deficit. This suggests that domain-specific training should pay greater attention to balancing model generalization with task adaptability.

Observation 3: Task difficulty varies widely, exposing a gap between visual perception and medical comprehension. MLLMs and physicians consistently achieve high scores in tasks such as prognostic factor analysis, clinical sign prediction, drug response prediction, and postoperative outcome assessment, where perfect scores of 100.00% occur. Conversely, tasks such as risk stratification and preoperative assessment prove much harder, with significantly lower scores across all MLLMs (e.g., the highest-performing MLLM scores 40.84% in risk stratification). These findings highlight the importance of integrating medical knowledge and clinical reasoning beyond visual perception to bridge the performance gap in complex diagnostic and decision-making tasks.

Analysis of Open-ended Evaluation


Table 2: Performance of different MLLMs on open-ended VQA of OmniBrainBench. Higher values indicate better performance in generation quality, semantic similarity, and fluency.

Observation 1: The Lingshu series leads both the open-source group and the overall ranking. Lingshu-32B decisively outperforms the larger HuatuoGPT-V-34B, leading in lexical precision, fluency, and semantic alignment across all key metrics. This indicates that a targeted multimodal architecture and data-efficient training can deliver better generation quality than sheer parameter scale, suggesting that efficiency can matter more than size in practice.

Observation 2: Open-source MLLMs exhibit far greater performance variance than their proprietary counterparts. While leaders such as Lingshu claim the top spots on ROUGE-1, ROUGE-L, and BERTScore, many others, especially medical variants, sit near the bottom. This indicates that the open ecosystem's rapid, decentralized innovation fuels both strong advances and pronounced instability in model quality.

Observation 3: Proprietary MLLMs are more balanced across metrics than open-source MLLMs. Even so, the strongest open-source MLLMs surpass proprietary ones on ROUGE-1 and BLEU, demonstrating higher lexical consistency and fluency and underscoring the growing efficiency and accessibility of open models.
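The open-ended metrics above are standard text-generation scores. As a point of reference, the sketch below computes ROUGE-1, ROUGE-L, BLEU, and BERTScore with common open-source packages (rouge-score, sacrebleu, bert-score); the example strings, stemming choice, and BERTScore backbone are assumptions and may differ from the benchmark's exact configuration.

```python
from rouge_score import rouge_scorer        # pip install rouge-score
import sacrebleu                            # pip install sacrebleu
from bert_score import score as bert_score  # pip install bert-score

# Toy prediction/reference pair; a real evaluation would iterate over the benchmark.
predictions = ["Axial T2-weighted MRI shows a left frontal mass with surrounding edema."]
references = ["T2-weighted MRI demonstrates a left frontal lesion with perilesional edema."]

# ROUGE-1 / ROUGE-L F-measures, averaged over the corpus
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = [scorer.score(ref, pred) for ref, pred in zip(references, predictions)]
rouge1 = sum(s["rouge1"].fmeasure for s in rouge) / len(rouge)
rougeL = sum(s["rougeL"].fmeasure for s in rouge) / len(rouge)

# Corpus-level BLEU
bleu = sacrebleu.corpus_bleu(predictions, [references]).score

# BERTScore F1 captures semantic similarity beyond n-gram overlap
_, _, f1 = bert_score(predictions, references, lang="en")

print(f"ROUGE-1 {rouge1:.4f}  ROUGE-L {rougeL:.4f}  "
      f"BLEU {bleu:.2f}  BERTScore-F1 {f1.mean().item():.4f}")
```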


Figure 1: Diverse Modality Evaluation.


Figure 2: Performance of models on different numbers of images.

Case Study

In this section, we conduct a case study of multiple MLLMs on OmniBrainBench under various scenarios. The evaluation is structured into two primary tracks, closed-ended VQA and open-ended VQA, allowing a nuanced assessment of model capabilities across different task formats.

GPT-5 Closed-ended VQA Samples
Figure 3: Correct/Error samples in GPT-5 closed-ended VQA.
GPT-5 Open-ended VQA Samples
Figure 4: Correct/Error samples in GPT-5 open-ended VQA.
Claude-4.5-Sonnet Closed-ended VQA Samples
Figure 5: Correct/Error samples in Claude-4.5-Sonnet closed-ended VQA.
Claude-4.5-Sonnet Open-ended VQA Samples
Figure 6: Correct/Error samples in Claude-4.5-Sonnet open-ended VQA.
Gemini-2.5-Pro Closed-ended VQA Samples
Figure 7: Correct/Error samples in Gemini-2.5-Pro closed-ended VQA.
Gemini-2.5-Pro Open-ended VQA Samples
Figure 8: Correct/Error samples in Gemini-2.5-Pro open-ended VQA.
Deepseek-V3.1 Closed-ended VQA Samples
Figure 9: Correct/Error samples in Deepseek-V3.1 closed-ended VQA.
Deepseek-V3.1 Open-ended VQA Samples
Figure 10: Correct/Error samples in Deepseek-V3.1 open-ended VQA.
Qwen3-VL-30B Closed-ended VQA Samples
Figure 11: Correct/Error samples in Qwen3-VL-30B closed-ended VQA.
Qwen3-VL-30B Open-ended VQA Samples
Figure 12: Correct/Error samples in Qwen3-VL-30B open-ended VQA.
Lingshu-32B Closed-ended VQA Samples
Figure 13: Correct/Error samples in Lingshu-32B closed-ended VQA.
Lingshu-32B Open-ended VQA Samples
Figure 14: Correct/Error samples in Lingshu-32B open-ended VQA.