2023-03-15

Keep only duplicated rows with a subset

I have a DataFrame that I'd like to explore, keeping only the rows that are duplicated based on two or more columns.

For example:

import polars as pl

df = pl.DataFrame({"A": [1, 6, 5, 4, 5, 6],
                   "B": ["A", "B", "C", "D", "C", "A"],
                   "C": [2, 2, 2, 1, 1, 1]})

I'd like to return only the rows whose combination of columns A and B is duplicated. I've tried:

df.filter(pl.col(["A", "B"]).is_duplicated())
# ComputeError: This is ambiguous. Try to combine the predicates with the 'all' or 'any' expression.

When I add .all() in between, the result is the same error as above:

df.filter(pl.col(["A", "B"]).all().is_duplicated())  # Same error as above

unique with keep="none" returns the opposite of the result I'd like to have, so I tried the following:

df.unique(subset=["A", "B"], keep="none").is_not()
# AttributeError: 'DataFrame' object has no attribute 'is_not'

The expected output would be only the rows:

[5, "C", 2]
[5, "C", 1]

