DataSharpener

Case Study

Clinical Trial Insights: Automated Web Scraping, Data Engineering, and Tableau Analytics

Analyzing global medical research trends using real data from ClinicalTrials.gov

🔍 Overview

Clinical trials are the backbone of medical research, yet understanding global trends—such as which diseases are most researched, how trials progress through phases, and how activity changes over time—requires structured and continuous data collection.

In this project, I built an end-to-end data analytics pipeline that automates the collection, cleaning, and visualization of clinical trial information from ClinicalTrials.gov, one of the world’s largest public medical trial registries.

The final solution includes:

  • A Python web scraper that pulls clinical trial metadata
  • A data cleaning & enrichment pipeline (duration, year extraction, condition grouping)
  • A Tableau dashboard for exploring conditions, statuses, phases, durations, trends, and intervention types

This project demonstrates my ability to combine data engineering, analytics, and visualization in a real-world healthcare context.

🧱 Data Pipeline Architecture

1. Data Collection (Web Scraping)

Using Python (requests, JSON parsing), I retrieved clinical trials related to:

  • Obesity
  • Hypertension
  • Diabetes
  • Type 2 diabetes
  • Depression
  • Breast cancer

The scraper uses the ClinicalTrials.gov classic study_fields API, pulling structured fields such as:

  • NCTId
  • Condition
  • Overall Status
  • Study Type
  • Phase
  • Start & Completion Dates
  • Intervention Type
  • Location Country (when available)

A total of 4,503 clinical trials were collected.
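
The retrieval step could be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper. It assumes the classic `study_fields` endpoint at `https://clinicaltrials.gov/api/query/study_fields` with `expr`, `fields`, `min_rnk`, `max_rnk`, and `fmt` parameters, and a `StudyFieldsResponse` / `StudyFields` JSON layout in which every field is a list of strings. The helper names and the `|` separator are hypothetical. ClinicalTrials.gov has since moved to a v2 API, so the URL and parameters describe the classic interface and may no longer respond.

```python
# Assumed classic endpoint; field names follow the classic API's naming.
API_URL = "https://clinicaltrials.gov/api/query/study_fields"
FIELDS = [
    "NCTId", "Condition", "OverallStatus", "StudyType", "Phase",
    "StartDate", "CompletionDate", "InterventionType", "LocationCountry",
]


def build_params(condition, min_rank=1, max_rank=1000):
    """Build the query parameters for one page of results (assumed names)."""
    return {
        "expr": condition,
        "fields": ",".join(FIELDS),
        "min_rnk": min_rank,
        "max_rnk": max_rank,
        "fmt": "json",
    }


def flatten_study(study):
    """Join each list-valued field into a single '|'-separated string."""
    return {field: "|".join(study.get(field, [])) for field in FIELDS}


def parse_response(payload):
    """Turn one decoded JSON response into a list of flat row dicts."""
    studies = payload.get("StudyFieldsResponse", {}).get("StudyFields", [])
    return [flatten_study(study) for study in studies]


def fetch_trials(condition, max_rank=1000):
    """Fetch and flatten one page of trials (performs a network request)."""
    import requests  # imported here so the parsing helpers work without it

    response = requests.get(
        API_URL, params=build_params(condition, 1, max_rank), timeout=30
    )
    response.raise_for_status()
    return parse_response(response.json())
```

Keeping the parsing separate from the HTTP call means the flattening logic can be checked against a saved response without touching the network.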

2. Data Cleaning & Enrichment (Python + pandas)

Transformations included:

  • Parsing Start/Completion dates
  • Extracting Start Year
  • Calculating trial duration in days and months
  • Normalizing conditions into a main_condition
  • Standardizing statuses (e.g., “Active, not recruiting”)
  • Filtering out trials without duration data for some analyses
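
A rough pandas sketch of these transformations is shown below. The column names (`StartDate`, `CompletionDate`, `Condition`, `OverallStatus`), the keyword-to-group mapping, and the 30.44 days-per-month factor are all assumptions made for illustration; the project's actual grouping rules and conversion are not described here. It also assumes the date strings share one format that pandas can infer, so mixed formats would need extra handling.

```python
import pandas as pd

# Hypothetical keyword-to-group mapping for main_condition.
CONDITION_GROUPS = {
    "breast cancer": "Breast cancer",
    "diabetes": "Diabetes",
    "obesity": "Obesity",
    "hypertension": "Hypertension",
    "depressi": "Depression",
}

DAYS_PER_MONTH = 30.44  # average month length, an assumed conversion factor


def main_condition(raw):
    """Map a free-text condition string to the first matching group."""
    text = str(raw).lower()
    for keyword, group in CONDITION_GROUPS.items():
        if keyword in text:
            return group
    return "Other"


def enrich(df):
    """Add parsed dates, start year, durations, and main_condition."""
    out = df.copy()
    for col in ("StartDate", "CompletionDate"):
        # Unparseable or missing dates become NaT instead of raising.
        out[col] = pd.to_datetime(out[col], errors="coerce")
    out["start_year"] = out["StartDate"].dt.year
    out["duration_days"] = (out["CompletionDate"] - out["StartDate"]).dt.days
    out["duration_months"] = (out["duration_days"] / DAYS_PER_MONTH).round(2)
    out["main_condition"] = out["Condition"].map(main_condition)
    out["OverallStatus"] = out["OverallStatus"].str.strip()
    return out


def with_duration(df):
    """Keep only trials that have both dates, for duration analyses."""
    return df.dropna(subset=["duration_days"])
```

Rows with a missing start or completion date keep a missing duration rather than being dropped outright, so they remain available for analyses that do not depend on duration.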

Final dataset highlights:

  • 4,503 total trials
  • 1,707 unique medical conditions
  • 1,501 trials with full duration data
  • Average trial duration: 35.55 months

The cleaned file is used as the data source for Tableau.

📊 Tableau Dashboard Overview

The interactive dashboard allows exploration of clinical trials across disease areas, statuses, phases, and timelines.

1. Top Conditions by Research Volume

The top researched conditions include:

  • Obesity
  • Hypertension
  • Diabetes mellitus / Type 2 diabetes
  • Depression

These conditions show the highest concentration of clinical trials in the dataset.

2. Trial Status Breakdown

A treemap visualization shows:

  • Completed: 63%
  • Recruiting: 15%
  • Terminated: 6.18%
  • Active, not recruiting: 5.58%
  • Not yet recruiting: 4.72%

Completed trials dominate, while terminated trials form a smaller but important segment.
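
Shares like these can be reproduced outside Tableau with a short pandas calculation. The sketch below is illustrative only; `status_shares` is a hypothetical helper, and the sample statuses in the usage note are made up rather than taken from the dataset.

```python
import pandas as pd


def status_shares(statuses):
    """Return the percentage of trials per status, largest share first."""
    shares = pd.Series(statuses).value_counts(normalize=True) * 100
    return shares.round(2)
```

For example, `status_shares(["Completed", "Completed", "Completed", "Recruiting"])` gives 75.0 for Completed and 25.0 for Recruiting.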

3. Trends Over Time (2000–2025)

Clinical trial activity rose steeply from 2000 to 2015 and peaked around 2018.

From 2020 to 2025, the number of new trials declined sharply, possibly influenced by:

  • Post-COVID changes in research funding
  • Shifts in trial priorities
  • Operational challenges in running large studies

The trend stabilizes at lower levels after 2021.
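
The yearly series behind a chart like this can be computed by counting trials per start year. The sketch below assumes a DataFrame with a hypothetical `start_year` column and fills years without trials with zero so the timeline has no gaps; it is one possible approach, not necessarily how the dashboard derives its numbers.

```python
import pandas as pd


def trials_per_year(df, first_year=2000, last_year=2025):
    """Count trials per start year, including years with zero trials."""
    years = df["start_year"].dropna().astype(int)
    in_range = years[years.between(first_year, last_year)]
    counts = in_range.value_counts().sort_index()
    return counts.reindex(range(first_year, last_year + 1), fill_value=0)
```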

4. Trial Duration by Phase (Box Plot)

Median trial durations:

  • Early Phase 1: 27.4 months
  • Phase 1: 15.2 months
  • Phase 1/2: 24.3 months
  • Phase 2: 27.4 months
  • Phase 2/3: 28.4 months
  • Phase 3: 24.4 months
  • Phase 4: 21.1 months

Observations:

  • Phase 2 and Phase 3 trials show similar duration profiles (medians of 27.4 and 24.4 months).
  • Phase 1 trials are notably shorter, with a median of 15.2 months.
  • Phase 4 trials have a median of about 21 months.
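
The quartiles that a box plot draws can be sketched in pandas as follows. The `Phase` and `duration_months` column names and the `duration_summary_by_phase` helper are assumptions for illustration, and the quartiles use pandas' default linear interpolation, which may differ slightly from Tableau's method.

```python
import pandas as pd


def duration_summary_by_phase(df):
    """Return first quartile, median, and third quartile of duration per phase."""
    valid = df.dropna(subset=["Phase", "duration_months"])
    quartiles = (
        valid.groupby("Phase")["duration_months"]
        .quantile([0.25, 0.5, 0.75])
        .unstack()
    )
    quartiles.columns = ["q1", "median", "q3"]
    return quartiles.round(1)
```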

5. Intervention Types

Most trials fall into:

  • Drug-based interventions (largest)
  • Behavioral interventions
  • Medical devices

This highlights where innovation and research investments are concentrated.
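
Because a single trial can list several intervention types, counting them usually means splitting each record first. The sketch below assumes a hypothetical `InterventionType` column in which multiple values are joined with `|`; the real storage format is not described here.

```python
import pandas as pd


def intervention_counts(df):
    """Count trials per intervention type, one count per listed type."""
    types = (
        df["InterventionType"]
        .dropna()
        .str.split("|")
        .explode()
        .str.strip()
    )
    return types[types != ""].value_counts()
```

A trial that lists both a drug and a behavioral intervention contributes to both counts, so the totals can exceed the number of trials.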

Embedded dashboard

🛠 Tech Stack

  • Python: requests, pandas, pathlib
  • Data Storage: CSV (clean + raw versions)
  • Visualization: Tableau Desktop / Tableau Public
  • Source: Public data from ClinicalTrials.gov (scraped via public API)
