2022-09-16

Iterate through column names to apply different summary functions by week in an R data frame using dplyr

I am trying to iterate through global health epidemic data from a database. The data consist of daily cases, cumulative cases, daily deaths, and cumulative deaths (plus some other covariates that aren't relevant here). The table is structured as follows: for each country (with its name, region, and ID) and each date (though not all dates are present for all countries*), the daily and cumulative cases, deaths, etc. are listed.

The data looks something like this:

# A tibble: 40 x 7
   iso_code continent location    date       total_cases new_cases week   
   <chr>    <chr>     <chr>       <date>           <dbl>     <dbl> <chr>  
 1 AFG      Asia      Afghanistan 2020-02-24           5         5 2020-08
 2 AFG      Asia      Afghanistan 2020-02-25           5         0 2020-08
 3 AFG      Asia      Afghanistan 2020-02-26           5         0 2020-08
 4 AFG      Asia      Afghanistan 2020-02-27           5         0 2020-08
 5 AFG      Asia      Afghanistan 2020-02-28           5         0 2020-08
 6 AFG      Asia      Afghanistan 2020-02-29           5         0 2020-08
 7 AFG      Asia      Afghanistan 2020-03-01           5         0 2020-09
 8 AFG      Asia      Afghanistan 2020-03-02           5         0 2020-09
 9 AFG      Asia      Afghanistan 2020-03-03           5         0 2020-09
10 AFG      Asia      Afghanistan 2020-03-04           5         0 2020-09
# ... with 30 more rows

I need to summarize the daily data into weekly data. This is no problem for a single column: using methods described here, I should be able to aggregate the data for each week and each country as follows:

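library(dplyr)

sumByColumn <- function(df, colName) {
  # Daily (and smoothed daily) cases/deaths: sum within each location-week.
  df %>%
    group_by(location, week) %>%
    # `!!colName :=` names the output after the column, not literally "colName"
    summarize(!!colName := sum(!!sym(colName)), .groups = "drop")
}

idByColumn <- function(df, colName) {
  # Cumulative cases/deaths: pass the values through unchanged.
  # Note: identity() keeps one row per day, so groups are not collapsed.
  df %>%
    group_by(location, week) %>%
    summarize(!!colName := identity(!!sym(colName)), .groups = "drop")
}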

(Note that daily case/death columns are summed within each week, whereas cumulative case/death columns are passed through the identity function unchanged. The cumulative columns are the ones denoted id_cols in the list of column names of df.)

However, when I loop sumByColumn()/idByColumn() over every column of the data frame df, as follows, I run into an error:

for (col in 1:ncol(df)) {
  colName = colnames(df)[col]
  if (col%in%id_cols) {
    df_weekly = idByColumn(df_weekly,colName)
  } else {
    df_weekly = sumByColumn(df_weekly,colName)
  }
}

I get:

Error in !sym(colName) : invalid argument type
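
An error of this form generally means that `!!` was evaluated by base R's `!` operator rather than by dplyr, which can happen if, for example, another package's summarize() masks dplyr's. Calling dplyr::summarize() explicitly would rule that out. Separately, the per-column loop can be avoided with a single grouped summarize() using across(). The following is only a sketch under several assumptions: id_cols is a character vector of cumulative column names, the weekly value wanted for a cumulative column is its last observation in the week, and the toy data and the name sum_cols are made up for illustration.

```r
library(dplyr)

# Toy data standing in for the real table (values are made up).
df <- tibble(
  location    = rep(c("Afghanistan", "Albania"), each = 4),
  date        = rep(as.Date("2020-02-28") + 0:3, times = 2),
  week        = rep(c("2020-08", "2020-08", "2020-09", "2020-09"), times = 2),
  new_cases   = c(5, 0, 1, 2, 3, NA, 0, 4),
  total_cases = c(5, 5, 6, 8, 3, 3, 3, 7)
)

id_cols  <- c("total_cases")  # cumulative columns (assumed to be names)
sum_cols <- c("new_cases")    # daily columns (hypothetical helper name)

df_weekly <- df %>%
  arrange(location, date) %>%          # so last() is the latest day of the week
  group_by(location, week) %>%
  dplyr::summarize(
    across(all_of(sum_cols), ~ sum(.x, na.rm = TRUE)),  # weekly totals
    across(all_of(id_cols), ~ last(.x)),                # end-of-week cumulative
    .groups = "drop"
  )

print(df_weekly)
```

This returns one row per country and week. Whether the last value is the right weekly summary for a cumulative column is a design choice; max() would be an alternative if the cumulative series never decreases.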

Note: I have counted how many times each country appears in the data frame, which corresponds to the number of days the disease was tracked there (counts listed below). Is there a way to account for this? For example, as I go through the weeks, if a country has no data for a given week, or if different numbers of countries report data in different weeks, I would like to ignore the missing data rather than return NA.

916
916
910
892
884
899
971
938
899
946 
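
Regarding the uneven counts: a grouped summary only produces rows for country-week combinations that actually occur in the data, and sum() returns NA for a week only when a value in it is NA and na.rm is not set. A minimal sketch of one way to handle both cases follows; the toy data and the column name days_reported are assumptions for illustration, not part of the real dataset.

```r
library(dplyr)

# Toy data: Albania has no usable value in week 2020-09 (made-up numbers).
df <- tibble(
  location  = c("Afghanistan", "Afghanistan", "Afghanistan", "Albania", "Albania"),
  week      = c("2020-08", "2020-08", "2020-09", "2020-08", "2020-09"),
  new_cases = c(5, NA, 2, 3, NA)
)

weekly <- df %>%
  group_by(location, week) %>%
  dplyr::summarize(
    days_reported = sum(!is.na(new_cases)),       # days with a non-missing value
    new_cases     = sum(new_cases, na.rm = TRUE), # ignore missing days
    .groups = "drop"
  ) %>%
  filter(days_reported > 0)  # drop weeks in which nothing was reported

print(weekly)
```

Keeping days_reported alongside the weekly sum makes it possible to tell a genuine zero from a week with partial reporting.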

