SQL to dplyr translation - problem with summarise
I am trying to transalte SQL code to R using dplyr library.
My current solution returns proper number of rows (I check it with the result that I obtain from sqldf function call). However there is a problem with the values which are returned and stored in Votes column which are much higher as they should be.
Comments about overall structure of the code would be also appreciated.
SQL:
SELECT
PostId,
MAX(CASE WHEN VoteDate = 'new' THEN Total ELSE 0 END) NewVotes,
MAX(CASE WHEN VoteDate = 'old' THEN Total ELSE 0 END) OldVotes,
SUM(Total) AS Votes
FROM (
SELECT
PostId,
CASE STRFTIME('%Y', CreationDate)
WHEN '2021' THEN 'new'
WHEN '2020' THEN 'new'
ELSE 'old'
END VoteDate,
COUNT(*) AS Total
FROM Votes
WHERE VoteTypeId IN (1, 2, 5)
GROUP BY PostId, VoteDate
) AS VotesByAge
GROUP BY VotesByAge.PostId
HAVING NewVotes=0
dplyr:
Votes %>% filter(VoteTypeId %in% c(1,2,5)) %>%
group_by(PostId, VoteDate = case_when(year(CreationDate) == 2021 | year(CreationDate) == 2020 ~ "new", TRUE ~ "old")) %>%
mutate(Total = n()) %>% select(PostId, VoteDate, Total) %>%
group_by(PostId) %>%
mutate(NewVotes = max(case_when(VoteDate == "new" ~ as.integer(Total), TRUE ~ as.integer(0)))) %>%
mutate(OldVotes = max(case_when(VoteDate == "old" ~ as.integer(Total), TRUE ~ as.integer(0)))) %>%
filter(NewVotes == 0) %>% group_by(PostId, NewVotes, OldVotes) %>%
summarise(Votes = sum(Total))
I use data from https://archive.org/details/stackexchange (reduced in size)
structure(list(Id = 1:20, PostId = c(1L, 1L, 2L, 3L, 5L, 2L,
7L, 6L, 1L, 7L, 2L, 3L, 3L, 11L, 2L, 9L, 3L, 2L, 12L, 16L), VoteTypeId = c(2L,
2L, 2L, 2L, 5L, 5L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
5L, 2L, 2L), CreationDate = c("2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000",
"2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000",
"2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000",
"2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000",
"2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000",
"2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000",
"2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000", "2011-06-21T00:00:00.000"
), UserId = c(NA, NA, NA, NA, 13L, 13L, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 30L, NA, NA), BountyAmount = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), row.names = c(NA,
20L), class = "data.frame")
from Recent Questions - Stack Overflow https://ift.tt/3EvNqzD
https://ift.tt/eA8V8J
Comments
Post a Comment