Medical AI Scientist

Domain-aware autonomous scientific discovery for clinical AI:
from evidence-grounded ideation and medical experimentation to manuscript drafting and review.

Evaluated on Med-AI-Bench: 19 medical AI tasks across 6 data modalities.

Why This Matters

Existing autonomous "AI Scientist" systems are largely domain-agnostic.
In medicine, that generality limits reliability, clinical relevance, and translational feasibility. Medical AI Scientist introduces a framework explicitly designed for healthcare constraints, specialized data modalities, and evidence standards.

Core Contributions

  1. Clinically grounded ideation with structured literature evidence.
  2. Medical-specific execution pipelines and evaluation protocols.
  3. Iterative manuscript drafting with relevance and ethics checks.

System and Research Modes

The framework supports increasing autonomy across three modes.

Mode 1: Reproduction

Faithful re-implementation of specified hypotheses or target papers.

Mode 2: Innovation

Literature-inspired ideation and adaptation for clinically meaningful improvements.

Mode 3: Exploration

Open-ended discovery under task-driven objectives and domain constraints.

Benchmark

Med-AI-Bench

A structured benchmark for evaluating autonomous medical AI research from hypothesis quality to experimental execution and paper-level outputs.

  • 171 curated cases
  • 19 medical AI tasks
  • 6 data modalities
  • 3 autonomy modes

How Med-AI-Bench Is Built

Each case is grounded in peer-reviewed reference papers and organized for multi-stage evaluation: idea quality, research-plan completeness, executable experimentation, and paper-level output quality.

Key Results

LLM-based evaluation of idea generation, idea completion, and experimentation.

Idea Refinement

Evidence-Enhanced Ideas

In Mode 2 (Innovation) ideation, Medical AI Scientist does not stop at proposing a plausible method. It cross-checks raw ideas against medical literature and engineering evidence, then refines them into designs with fewer unsupported assumptions and a clearer implementation path.

BioASQ Factoid QA

Case 1: BioASQ span extraction becomes evidence-grounded and calibration-aware

Human idea score: 23 (ours) vs. 14 (best baseline)

Baseline ideas

GPT-5

Proposed a memory-augmented QA architecture with UMLS-aware hierarchy reasoning, external memory, and reinforcement learning. The design is broad and ambitious, but it introduces several moving parts without a tight link to BioASQ factoid constraints.

Gemini 2.5 Pro

Proposed a unified multi-task framework for factoid, list, and yes/no QA. It is implementable, but it stays generic and does not directly address answer calibration or surface-form mismatch in factoid extraction.

Our raw idea

Started from a span-entailment co-training design with an in-document evidence graph and answer-type priors. The direction already targeted grounded span selection, but the mechanism for span normalization and synonym robustness remained underspecified.

Evidence used

  • Medical paper: Sequence tagging for biomedical extractive question answering shows that biomedical QA should move beyond single-span extraction plus brittle post-processing. This supports replacing loosely coupled span heuristics with a globally normalized span model.
  • Medical paper: External features enriched model for biomedical question answering reports gains from lexical and syntactic features on BioASQ factoid QA, supporting explicit use of question-conditioned alignment and feature-aware span scoring.
  • Engineering paper: From flat direct models to segmental CRF models supports modeling spans as segments rather than independent start/end points, which directly motivated the segmental CRF layer.
  • Engineering paper: Alignment Information via Optimal Transport and Pre-training for Neural Machine Translation supports using optimal transport as an explicit token-alignment prior, motivating OT-guided question-context alignment before span decoding.

Evidence-enhanced final idea

Optimal-Transport Alignment Guided Segmental CRF with Neural Edit-Transduction Normalizer. The final design sharpens the original concept into three concrete pieces: OT-based question-context alignment, a globally normalized segmental CRF for span selection, and a lightweight edit-transduction normalizer for biomedical synonym and orthographic variation.
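To make the first piece concrete, here is a minimal sketch of entropic optimal-transport alignment between question and context token embeddings, using the standard Sinkhorn-Knopp iterations. This is an illustrative stand-in, not the paper's implementation; the function name, embedding shapes, and hyperparameters (`reg`, `n_iter`) are assumptions for the example.

```python
import numpy as np

def sinkhorn_alignment(q_emb, c_emb, reg=0.1, n_iter=50):
    """Soft question-context token alignment via entropic optimal transport.

    q_emb: (m, d) question token embeddings
    c_emb: (n, d) context token embeddings
    Returns an (m, n) transport plan P; P[i, j] is the alignment mass
    moved from question token i to context token j.
    """
    # Cost = 1 - cosine similarity between token embeddings.
    qn = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    cn = c_emb / np.linalg.norm(c_emb, axis=1, keepdims=True)
    cost = 1.0 - qn @ cn.T

    m, n = cost.shape
    a, b = np.full(m, 1.0 / m), np.full(n, 1.0 / n)  # uniform token masses
    K = np.exp(-cost / reg)                          # Gibbs kernel
    u = np.ones(m)
    for _ in range(n_iter):                          # Sinkhorn-Knopp updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return (u[:, None] * K) * v[None, :]             # transport plan P
```

The row marginals of the returned plan match the question-side masses exactly after the final update, so each question token distributes a fixed budget of attention over context tokens; the plan can then serve as an alignment prior for the downstream span scorer.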

Why it is better

  • Replaces vague reasoning loops with task-matched span modeling.
  • Uses literature-backed lexical and syntactic signals instead of ad hoc heuristics.
  • Improves feasibility because each module maps to a standard, implementable component.
  • Reduces hallucination risk by grounding answer selection in aligned evidence and normalized spans.
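The "task-matched span modeling" point can be illustrated with a globally normalized span scorer: every candidate span up to a length cap receives an additive score, and the softmax runs over all spans jointly rather than over independent start/end distributions. This is a simplified stand-in for the segmental CRF layer, with assumed feature shapes and a hypothetical additive start/end scoring rule.

```python
import numpy as np

def span_log_probs(token_feats, w_start, w_end, max_len=8):
    """Globally normalized span selection over all candidate spans.

    token_feats: (n, d) contextual token features
    w_start, w_end: (d,) weight vectors for start/end scoring
    Returns (spans, log_probs), where log_probs sums to 1 in probability
    space across every candidate span, unlike independent start/end heads.
    """
    n = token_feats.shape[0]
    starts = token_feats @ w_start            # per-token start scores
    ends = token_feats @ w_end                # per-token end scores
    spans, scores = [], []
    for i in range(n):
        for j in range(i, min(i + max_len, n)):
            spans.append((i, j))              # inclusive span [i, j]
            scores.append(starts[i] + ends[j])
    scores = np.array(scores)
    # Log-partition over all candidate spans (stable log-sum-exp).
    log_z = np.log(np.exp(scores - scores.max()).sum()) + scores.max()
    return spans, scores - log_z
```

Because normalization is global, impossible spans (end before start, over-length) are simply excluded from the hypothesis space instead of being pruned by post hoc heuristics, which is the feasibility argument made above.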

Case Studies


Resources

BibTeX

@misc{wu2026medicalaiscientist,
  title={Towards a Medical AI Scientist},
  author={Hongtao Wu and Boyun Zheng and Dingjie Song and Yu Jiang and Jianfeng Gao and Lei Xing and Lichao Sun and Yixuan Yuan},
  year={2026},
  eprint={2603.28589},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.28589},
}