2021-12-25

SQL to dplyr translation - problem with summarise

I am trying to transalte SQL code to R using dplyr library.

My current solution returns proper number of rows (I check it with the result that I obtain from sqldf function call). However there is a problem with the values which are returned and stored in Votes column which are much higher as they should be.

Comments about overall structure of the code would be also appreciated.

SQL:

SELECT
  PostId,
  MAX(CASE WHEN VoteDate = 'new' THEN Total ELSE 0 END) NewVotes, 
  MAX(CASE WHEN VoteDate = 'old' THEN Total ELSE 0 END) OldVotes, 
  SUM(Total) AS Votes
FROM ( 
    SELECT
      PostId,
      CASE STRFTIME('%Y', CreationDate)
        WHEN '2021' THEN 'new' 
        WHEN '2020' THEN 'new' 
        ELSE 'old'
        END VoteDate,
      COUNT(*) AS Total 
    FROM Votes
    WHERE VoteTypeId IN (1, 2, 5)
    GROUP BY PostId, VoteDate 
) AS VotesByAge
GROUP BY VotesByAge.PostId
HAVING NewVotes=0

dplyr:

Votes %>% filter(VoteTypeId %in% c(1,2,5)) %>% 
group_by(PostId, VoteDate = case_when(year(CreationDate) == 2021 | year(CreationDate) == 2020 ~ "new", TRUE ~ "old")) %>% 
mutate(Total = n()) %>% select(PostId, VoteDate, Total) %>% 
group_by(PostId) %>% 
mutate(NewVotes = max(case_when(VoteDate == "new" ~ as.integer(Total), TRUE ~ as.integer(0)))) %>% 
mutate(OldVotes = max(case_when(VoteDate == "old" ~ as.integer(Total), TRUE ~ as.integer(0)))) %>% 
filter(NewVotes == 0) %>% group_by(PostId, NewVotes, OldVotes) %>%
summarise(Votes = sum(Total))

I use data from https://archive.org/details/stackexchange (reduced in size)

structure(list(Id = 1:20, PostId = c(1L, 1L, 2L, 3L, 5L, 2L, 
7L, 6L, 1L, 7L, 2L, 3L, 3L, 11L, 2L, 9L, 3L, 2L, 12L, 16L), VoteTypeId = c(2L, 
2L, 2L, 2L, 5L, 5L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
5L, 2L, 2L), CreationDate = c("2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", 
"2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", 
"2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", 
"2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", 
"2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", 
"2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", 
"2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000"
), UserId = c(NA, NA, NA, NA, 13L, 13L, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, 30L, NA, NA), BountyAmount = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), row.names = c(NA, 
20L), class = "data.frame")


from Recent Questions - Stack Overflow https://ift.tt/3EvNqzD
https://ift.tt/eA8V8J

No comments:

Post a Comment