2023-09-22

Graph (networkit) - create edges from the list of duplicated records for any columns pair in pandas

I'm trying to create graph with edges only for nodes/(records index in dataframe) that have the same values in any 2 or more columns.
What I'm doing - I create a list with all possible combination pairs of column names and go through them searching for duplicates, for which I extract indexes and create edges.
The problem is that for huge datasets (millions of records) - this solution is too slow and requires too much memory.

What I do:

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [1, 1, 1, 1, 2],
    'C': [1, 1, 2, 3, 3],
    'D': [2, 7, 9, 8, 4]})  
A B C D
0 1 1 1 2
1 2 1 1 7
2 3 1 2 9
3 4 1 3 8
4 5 2 3 4

Here, rows 0 and 1 have 2 same values in columns B and C.
So, for nodes 0,1,2,3,4 I need to create edge 0-1. Other records have at maximum 1 same field between each other.

    graph = nk.Graph(num_nodes, directed=False, weighted=False)

    # Get the indices of all unique pairs
    indices = np.triu_indices(len(column_names), k=1)
    # Get the unique pairs of column names
    unique_pairs = np.column_stack((column_names[indices[0]], column_names[indices[1]]))

    for col1, col2 in unique_pairs:
        # Filter the dataframe directly
        duplicated_rows = df[[col1, col2]].dropna()
        duplicated_rows = duplicated_rows[duplicated_rows.duplicated(subset=[col1, col2], keep=False)]

    for _, group in duplicated_rows.groupby([col1, col2]):
        tb_ids = group.index.tolist()
        for i in range(len(tb_ids)):
            for j in range(i + 1, len(tb_ids)):
                graph.addEdge(tb_ids[i], tb_ids[j])

Main question - how to speed up / improve this solution? I was thinking about parallelization by column combination - but in this case can't figure out how to create edges in a graph properly.
Appreciate any help.



No comments:

Post a Comment