
NeuroBench

Benchmark evaluation for NeuroClaw

NeuroBench is the benchmark suite used to evaluate NeuroClaw on end-to-end neuroimaging workflows, reproducibility readiness, and skill-guided execution.

Benchmark Overview

What NeuroBench covers

Coverage

Structural MRI, functional MRI, diffusion MRI, EEG, and multimodal integration tasks.

Evaluation

Planning quality, reasonableness of tool and skill choices, code/command correctness, and reproducibility readiness.

Task Design

Each task directory contains a task.md instruction file with explicit inputs, outputs, and checks.
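
As a hedged illustration only (the task title, file names, and check below are made-up examples, not taken from an actual task), a task.md might look like:

# Task: skull-strip a T1-weighted image
## Inputs
- data/sub-01_T1w.nii.gz
## Outputs
- output/brain_mask.nii.gz
## Checks
- brain_mask.nii.gz exists and matches the input voxel grid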

Benchmark Runs

How to execute tasks

NeuroBench supports baseline and skill-enabled runs, which you can execute from the Web UI or from the CLI batch runner.

  • with-skills: run with the skills loaded from skills/.
  • no-skills: run without skills, as a baseline for comparison.
  • --benchmark-compare-skills: run paired with-skills and no-skills variants of the same tasks.
  • Outputs are written to output/ (see the quick check after the commands below).
# Web benchmark mode
python core/agent/main.py --web --benchmark

# CLI benchmark batch runner
python core/agent/main.py --benchmark

# Paired skill comparison
python core/agent/main.py --benchmark --benchmark-compare-skills
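
After a run completes, you can inspect what was produced without assuming anything about the report format; a minimal sketch using standard shell tools:

# List the artifacts a benchmark run wrote to output/
find output/ -type f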

Scoring

Use --score-benchmark to score the existing reports in output/ against a weighted rubric evaluated by GPT-5.4.

python core/agent/main.py --score-benchmark
python core/agent/main.py --score-benchmark --score-workers 8
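
In the second command, judging by its name, --score-workers presumably sets the number of reports scored concurrently (8 here).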

Benchmark Workflow

Run tasks first, then score the generated reports to analyze quality and efficiency.
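
Putting the documented commands together, a minimal end-to-end session might look like this (both commands are taken verbatim from the sections above):

# 1. Execute the benchmark tasks (paired with-skills / no-skills runs)
python core/agent/main.py --benchmark --benchmark-compare-skills

# 2. Score the reports written to output/
python core/agent/main.py --score-benchmark --score-workers 8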