2023-11-17

Building correct API query URL to filter data from clinicaltrials.gov by multiple keywords

I am trying to get some data from a public API and need some help figuring out the correct query syntax for the request URL.

Below is my script. (Never mind fixing or improving the function, it is working well enough so far.)

What I need is the correct query URL.

I would like to get a list of clinical studies from clinicaltrials.gov for search term "EGFR", but narrow the search down so that only results are returned that have "Recruiting" OR "Active, not recruiting" in the "OverallStatus" field. Here are the possible values for the "OverallStatus" field.

I am having a hard time figuring out the API docs. There is a page with Search Expressions and Syntax, but they don't explain how to search for multiple values. How do I build the query string to search for multiple possible values in a field?

I appreciate any insights!

library(tidyverse)
library(httr)
library(jsonlite)
library(glue)


get_studies_df <- function(query_url){
  
  # get clinical studies data
  res <- httr::GET(query_url)
  
  if(!httr::status_code(res) == 200){
    #if request failed return empty data frame
    empty_df <- stats::setNames(data.frame(matrix(ncol = 5, nrow = 0)), c("Rank", "NCTId", "Condition", "BriefTitle", "OverallStatus"))
    return(empty_df)
  }
  
  # get data from response obj
  data <- httr::content(res, as="text", encoding = "UTF-8") %>%
    jsonlite::fromJSON()
  
  # prepare clinical studies data frame
  studies_df <- data$StudyFieldsResponse$StudyFields %>%
    # combine conditions if there is more than one
    dplyr::rowwise() %>%
    mutate(Condition = paste(Condition, collapse = ", ")) %>%
    dplyr::ungroup()
  # unlist data frame columns to show full length text
  for (i in c(1:ncol(studies_df))){
    studies_df[,i] <- unlist(studies_df[,i])
  }
  
  return(studies_df)
  
}
 

### here are all the query strings I tried ###

# get all studies for EGFR (WORKING, but finds 5000+ studies, way too many)
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"

# get "Recruiting" studies only (WORKING)
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"

# get "Active" studies only (WORKING)
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"

### I'm trying to get "Recruiting" OR "Active" studies. These are NOT WORKING ###

# returns only "Active"
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"

# returns nothing
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+RANGE[Recruiting,Active]&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"

# returns only "Active"
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting+AREA[OverallStatus]+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"

# returns everything ("Recruiting", "Completed", "Unknown status", "Active, not recruiting") ??
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting+OR+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"


df <- get_studies_df(query_url)

Output table:

enter image description here



No comments:

Post a Comment