2023-01-22

Improve Pandas performance for very large dataframes?

I have a few Pandas dataframes with several million rows each. The dataframes have columns containing JSON objects, each with 100+ fields. I have a set of 24 functions that run sequentially on the dataframes, process the JSON (for example, compute some string distance between two fields in the JSON), and return a JSON with some new fields added. After all 24 functions execute, I get a final JSON that is then usable for my purposes.
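To make the setup concrete, here is a minimal sketch of what one of these per-row steps looks like. The field names, the `payload` column, and the choice of rapidfuzz as the edit-distance package are stand-ins for illustration, not the actual code:

```python
import json

import pandas as pd
from rapidfuzz.distance import Levenshtein  # stand-in for the external edit-distance package


def add_name_distance(record: dict) -> dict:
    """One of the 24 steps: compare two string fields and add a derived field.
    The field names here are made up for illustration."""
    record["name_distance"] = Levenshtein.distance(
        record.get("name_a", ""), record.get("name_b", "")
    )
    return record


# Tiny stand-in for one of the real dataframes: a column of JSON strings.
df = pd.DataFrame(
    {"payload": [json.dumps({"name_a": "Acme Corp", "name_b": "Acme Corporation"})]}
)

# Each step currently runs row by row over the JSON column.
df["payload"] = df["payload"].apply(lambda s: json.dumps(add_name_distance(json.loads(s))))
print(df["payload"].iloc[0])
```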

I am wondering what the best ways are to speed up processing of this dataset. A few things I have considered and read up on:

  • It is tricky to vectorize because many operations are not as straightforward as "subtract this column's values from another column's values".
  • I read up on the Pandas documentation; two options it suggests are Cython (which may be tricky for the string edit distance, especially since I am using an external Python package for it) and Numba/JIT (which is noted to work best for numerical computations).
  • Controlling the number of threads or processes could be an option, since the 24 functions can mostly run without any dependencies on each other; see the sketch after this list.
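
On the last point: because the heavy work is pure-Python string processing, threads may be limited by the GIL, so one option is process-based parallelism over chunks of rows, keeping all steps together per chunk. This is only a sketch under assumptions; the step function, chunk count, and worker count are illustrative:

```python
import json
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd


def add_name_length(record: dict) -> dict:
    # Placeholder for one of the 24 step functions; the real steps do string-distance work.
    record["name_length"] = len(record.get("name_a", ""))
    return record


PIPELINE = [add_name_length]  # in reality, the full list of 24 step functions


def run_pipeline(payload: str) -> str:
    # Apply every step to one JSON record and return the enriched JSON string.
    record = json.loads(payload)
    for step in PIPELINE:
        record = step(record)
    return json.dumps(record)


def process_chunk(chunk: pd.Series) -> pd.Series:
    # Run the whole pipeline on one chunk of rows inside a worker process.
    return chunk.map(run_pipeline)


if __name__ == "__main__":
    df = pd.DataFrame(
        {"payload": [json.dumps({"name_a": "Acme", "name_b": "Acme Corp"})] * 1000}
    )
    chunks = np.array_split(df["payload"], 8)  # roughly one chunk per worker
    with ProcessPoolExecutor(max_workers=8) as ex:
        # Index alignment on the concatenated result puts rows back in place.
        df["payload"] = pd.concat(list(ex.map(process_chunk, chunks)))
    print(df["payload"].iloc[0])
```

This exploits row independence rather than the independence of the 24 functions, which avoids having to merge the fields each function adds; the trade-off is that each worker pays the cost of serializing its chunk of JSON strings to and from the subprocess.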

