Medical AI Scientist

Domain-aware autonomous scientific discovery for clinical AI:
from evidence-grounded ideation and medical experimentation to manuscript drafting and review.

Evaluated on Med-AI-Bench: 19 medical AI tasks across 6 data modalities.

Why This Matters

Existing autonomous "AI Scientist" systems are largely domain-agnostic.
In medicine, that generality limits reliability, clinical relevance, and translational feasibility. Medical AI Scientist introduces a framework explicitly designed for healthcare constraints, specialized data modalities, and evidence standards.

Core Contributions

  1. Clinically grounded ideation with structured literature evidence.
  2. Medical-specific execution pipelines and evaluation protocols.
  3. Iterative manuscript drafting with relevance and ethics checks.

System and Research Modes

The framework supports increasing autonomy across three modes.

Mode 1: Reproduction

Faithful re-implementation of specified hypotheses or target papers.

Mode 2: Innovation

Literature-inspired ideation and adaptation for clinically meaningful improvements.

Mode 3: Exploration

Open-ended discovery under task-driven objectives and domain constraints.

Benchmark

Med-AI-Bench

A structured benchmark for evaluating autonomous medical AI research from hypothesis quality to experimental execution and paper-level outputs.

  • 171 curated cases
  • 19 medical AI tasks
  • 6 data modalities
  • 3 autonomy modes

How Med-AI-Bench Is Built

Each case is grounded in peer-reviewed reference papers and organized for multi-stage evaluation: idea quality, research-plan completeness, executable experimentation, and paper-level output quality.

Key Results

LLM-based evaluation of idea generation, idea completion, and experimentation.

Idea Refinement

Evidence-Enhanced Ideas

In Mode 2 (Innovation) ideation, Medical AI Scientist does not stop at proposing a plausible method. It cross-checks raw ideas against medical literature and engineering evidence, then refines them into designs with fewer unsupported assumptions and a clearer implementation path.

BioASQ Factoid QA

Case 1: BioASQ span extraction becomes evidence-grounded and calibration-aware

Human idea score: 23 (ours) vs. 14 (best baseline)

Baseline ideas

GPT-5

Proposed a memory-augmented QA architecture with UMLS-aware hierarchy reasoning, external memory, and reinforcement learning. The design is broad and ambitious, but it introduces several moving parts without a tight link to BioASQ factoid constraints.

Gemini 2.5 Pro

Proposed a unified multi-task framework for factoid, list, and yes/no QA. It is implementable, but it stays generic and does not directly address answer calibration or surface-form mismatch in factoid extraction.

Our raw idea

Started from a span-entailment co-training design with an in-document evidence graph and answer-type priors. The direction already targeted grounded span selection, but the mechanism for span normalization and synonym robustness remained underspecified.

Evidence used

  • Medical paper: Sequence tagging for biomedical extractive question answering shows that biomedical QA should move beyond single-span extraction plus brittle post-processing. This supports replacing loosely coupled span heuristics with a globally normalized span model.
  • Medical paper: External features enriched model for biomedical question answering reports gains from lexical and syntactic features on BioASQ factoid QA, supporting explicit use of question-conditioned alignment and feature-aware span scoring.
  • Engineering paper: From flat direct models to segmental CRF models supports modeling spans as segments rather than independent start/end points, which directly motivated the segmental CRF layer.
  • Engineering paper: Alignment Information via Optimal Transport and Pre-training for Neural Machine Translation supports using optimal transport as an explicit token-alignment prior, motivating OT-guided question-context alignment before span decoding.

Evidence-enhanced final idea

Optimal-Transport Alignment Guided Segmental CRF with Neural Edit-Transduction Normalizer. The final design sharpens the original concept into three concrete pieces: OT-based question-context alignment, a globally normalized segmental CRF for span selection, and a lightweight edit-transduction normalizer for biomedical synonym and orthographic variation.
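To make the first piece concrete, here is a minimal sketch of entropic optimal-transport alignment between question and context token embeddings, using the standard Sinkhorn-Knopp iterations. This is an illustrative stand-in, not the paper's implementation; the function name, embedding shapes, and hyperparameters (`reg`, `n_iter`) are assumptions for the example.

```python
import numpy as np

def sinkhorn_alignment(q_emb, c_emb, reg=0.1, n_iter=50):
    """Soft question-context token alignment via entropic optimal transport.

    q_emb: (m, d) question token embeddings
    c_emb: (n, d) context token embeddings
    Returns an (m, n) transport plan P; P[i, j] is the alignment mass
    moved from question token i to context token j.
    """
    # Cost = 1 - cosine similarity between token embeddings.
    qn = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    cn = c_emb / np.linalg.norm(c_emb, axis=1, keepdims=True)
    cost = 1.0 - qn @ cn.T

    m, n = cost.shape
    a, b = np.full(m, 1.0 / m), np.full(n, 1.0 / n)  # uniform token masses
    K = np.exp(-cost / reg)                          # Gibbs kernel
    u = np.ones(m)
    for _ in range(n_iter):                          # Sinkhorn-Knopp updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return (u[:, None] * K) * v[None, :]             # transport plan P
```

The row marginals of the returned plan match the question-side masses exactly after the final update, so each question token distributes a fixed budget of attention over context tokens; the plan can then serve as an alignment prior for the downstream span scorer.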

Why it is better

  • Replaces vague reasoning loops with task-matched span modeling.
  • Uses literature-backed lexical and syntactic signals instead of ad hoc heuristics.
  • Improves feasibility because each module maps to a standard, implementable component.
  • Reduces hallucination risk by grounding answer selection in aligned evidence and normalized spans.
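The "task-matched span modeling" point can be illustrated with a globally normalized span scorer: every candidate span up to a length cap receives an additive score, and the softmax runs over all spans jointly rather than over independent start/end distributions. This is a simplified stand-in for the segmental CRF layer, with assumed feature shapes and a hypothetical additive start/end scoring rule.

```python
import numpy as np

def span_log_probs(token_feats, w_start, w_end, max_len=8):
    """Globally normalized span selection over all candidate spans.

    token_feats: (n, d) contextual token features
    w_start, w_end: (d,) weight vectors for start/end scoring
    Returns (spans, log_probs), where log_probs sums to 1 in probability
    space across every candidate span, unlike independent start/end heads.
    """
    n = token_feats.shape[0]
    starts = token_feats @ w_start            # per-token start scores
    ends = token_feats @ w_end                # per-token end scores
    spans, scores = [], []
    for i in range(n):
        for j in range(i, min(i + max_len, n)):
            spans.append((i, j))              # inclusive span [i, j]
            scores.append(starts[i] + ends[j])
    scores = np.array(scores)
    # Log-partition over all candidate spans (stable log-sum-exp).
    log_z = np.log(np.exp(scores - scores.max()).sum()) + scores.max()
    return spans, scores - log_z
```

Because normalization is global, impossible spans (end before start, over-length) are simply excluded from the hypothesis space instead of being pruned by post hoc heuristics, which is the feasibility argument made above.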

Case Studies


Resources

BibTeX

@misc{wu2026medicalaiscientist,
  title={Towards a Medical AI Scientist},
  author={Hongtao Wu and Boyun Zheng and Dingjie Song and Yu Jiang and Jianfeng Gao and Lei Xing and Lichao Sun and Yixuan Yuan},
  year={2026},
  eprint={2603.28589},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.28589},
}