Graph (networkit) - create edges from the list of duplicated records for any column pair in pandas
I'm trying to create a graph with edges only between nodes (record indexes in the dataframe) that share the same values in any two or more columns.
What I'm doing: I build a list of all possible pairs of column names and iterate over them searching for duplicates; for each duplicated group I extract the indexes and create edges.
The problem is that for huge datasets (millions of records) this solution is too slow and uses too much memory.
What I do:
    df = pd.DataFrame({
        'A': [1, 2, 3, 4, 5],
        'B': [1, 1, 1, 1, 2],
        'C': [1, 1, 2, 3, 3],
        'D': [2, 7, 9, 8, 4]})
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 2 |
| 1 | 2 | 1 | 1 | 7 |
| 2 | 3 | 1 | 2 | 9 |
| 3 | 4 | 1 | 3 | 8 |
| 4 | 5 | 2 | 3 | 4 |
Here, rows 0 and 1 have the same values in two columns, B and C.
So, for nodes 0, 1, 2, 3, 4 I need to create the single edge 0-1. Every other pair of records agrees on at most one field.
    import numpy as np
    import pandas as pd
    import networkit as nk

    num_nodes = len(df)
    column_names = np.array(df.columns)

    graph = nk.Graph(num_nodes, directed=False, weighted=False)

    # Get the indices of all unique pairs
    indices = np.triu_indices(len(column_names), k=1)
    # Get the unique pairs of column names
    unique_pairs = np.column_stack((column_names[indices[0]], column_names[indices[1]]))

    for col1, col2 in unique_pairs:
        # Keep only rows whose (col1, col2) combination occurs more than once
        duplicated_rows = df[[col1, col2]].dropna()
        duplicated_rows = duplicated_rows[duplicated_rows.duplicated(subset=[col1, col2], keep=False)]
        for _, group in duplicated_rows.groupby([col1, col2]):
            tb_ids = group.index.tolist()
            for i in range(len(tb_ids)):
                for j in range(i + 1, len(tb_ids)):
                    # guard against parallel edges when two rows match on 3+ columns
                    if not graph.hasEdge(tb_ids[i], tb_ids[j]):
                        graph.addEdge(tb_ids[i], tb_ids[j])
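As a sanity check on the sample data, the same pair-wise logic can be run without networkit, collecting undirected edges into a set (a small self-contained sketch; on this dataframe the only expected edge is 0-1):

```python
import itertools

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [1, 1, 1, 1, 2],
    'C': [1, 1, 2, 3, 3],
    'D': [2, 7, 9, 8, 4]})

edges = set()
for col1, col2 in itertools.combinations(df.columns, 2):
    pairs = df[[col1, col2]].dropna()
    # rows whose (col1, col2) combination occurs more than once
    dup = pairs[pairs.duplicated(keep=False)]
    for _, group in dup.groupby([col1, col2]):
        for u, v in itertools.combinations(group.index, 2):
            edges.add((min(u, v), max(u, v)))

print(edges)  # {(0, 1)}
```

Using a set also makes the deduplication explicit: rows that match on three or more columns are found through several column pairs but still produce one edge.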
Main question: how can I speed up / improve this solution? I was thinking about parallelizing over column combinations, but then I can't figure out how to create the edges in the graph correctly.
Appreciate any help.
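One direction worth trying (a sketch under assumptions, not a definitive answer): replace the Python-level nested loops with a vectorised self-merge per column pair, collect all candidate edges into a NumPy array, deduplicate once, and only then add them to the graph. The `candidate_edges` helper below is hypothetical:

```python
import itertools

import numpy as np
import pandas as pd

def candidate_edges(df):
    """Hypothetical helper: for each column pair, a self-merge on the two
    columns yields every pair of rows agreeing on both values; keeping
    index_x < index_y counts each pair once, and np.unique deduplicates
    edges discovered through several column pairs."""
    idx = df.reset_index()  # the 'index' column holds the node id
    chunks = []
    for col1, col2 in itertools.combinations(df.columns, 2):
        sub = idx[['index', col1, col2]].dropna(subset=[col1, col2])
        merged = sub.merge(sub, on=[col1, col2])
        merged = merged[merged['index_x'] < merged['index_y']]
        chunks.append(merged[['index_x', 'index_y']].to_numpy(dtype=np.int64))
    if not chunks:
        return np.empty((0, 2), dtype=np.int64)
    return np.unique(np.vstack(chunks), axis=0)

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [1, 1, 1, 1, 2],
    'C': [1, 1, 2, 3, 3],
    'D': [2, 7, 9, 8, 4]})

edges = candidate_edges(df)
print(edges.tolist())  # [[0, 1]]
```

The resulting array can then be fed to the graph in a single serial pass (`for u, v in edges: graph.addEdge(u, v)`), which sidesteps the question of concurrent edge insertion. One caveat: a self-merge is quadratic in the size of each duplicated group, so for columns containing very large groups of identical values the memory footprint should be checked first.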