Keep only duplicated rows with a subset
I have a dataframe that I'd like to explore, looking only at the rows that are duplicated based on two or more columns.
For example:
import polars as pl

df = pl.DataFrame({"A": [1, 6, 5, 4, 5, 6],
                   "B": ["A", "B", "C", "D", "C", "A"],
                   "C": [2, 2, 2, 1, 1, 1]})
I'd like to return the rows whose combination of columns A and B is duplicated. I've tried:
df.filter(pl.col(["A", "B"]).is_duplicated()) # Error: This is ambiguous. Try to combine the predicates with the 'all' or 'any' expression.
Adding .all() in between gives the same error:

df.filter(pl.col(["A", "B"]).all().is_duplicated()) # Same error as above
unique with keep="none" returns the opposite of the result I'd like to have, so I tried the below:

df.unique(subset=["A", "B"], keep="none").is_not() # AttributeError: 'DataFrame' object has no attribute 'is_not'
Expected output would be to see only the rows:
[5, "C", 2]
[5, "C", 1]