Improve Pandas performance for very large dataframes?

I have a few Pandas dataframes with several million rows each. The dataframes have columns containing JSON objects, each with 100+ fields. I have a set of 24 functions that run sequentially on the dataframes, process the JSON (for example, compute some string distance between two fields in the JSON), and return a JSON with some new fields added. After all 24 functions execute, I get a final JSON that is usable for my purposes.
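To make the setup concrete, here is a simplified sketch of what one of the 24 steps looks like today. The dataframe, the `payload` column, the field names, and the choice of `rapidfuzz` for the string distance are illustrative placeholders, not my actual code:

```python
import json

import pandas as pd
from rapidfuzz.distance import Levenshtein  # hypothetical stand-in for the external string-distance package

# Toy dataframe standing in for the real one (millions of rows, JSON with 100+ fields).
df = pd.DataFrame({
    "payload": [
        json.dumps({"name_a": "jonathan", "name_b": "johnathan"}),
        json.dumps({"name_a": "acme corp", "name_b": "acme corporation"}),
    ]
})

# Illustrative stand-in for one of the 24 steps: parse the JSON, compute a
# string distance between two fields, and write the result back as a new field.
def add_name_distance(raw_json: str) -> str:
    obj = json.loads(raw_json)
    obj["name_distance"] = Levenshtein.distance(obj["name_a"], obj["name_b"])
    return json.dumps(obj)

# Currently each of the 24 steps is applied row by row like this, one step after another.
df["payload"] = df["payload"].apply(add_name_distance)
```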

I am wondering what the best ways are to speed up processing for this data. A few things I have considered and read up on:

  • It is tricky to vectorize because many operations are not as straightforward as "subtract this column's values from another column's values".
  • I read through some of the Pandas performance documentation, and the options it suggests are Cython (which may be tricky for the string edit distance, especially since I am using an external Python package for it) and Numba/JIT (which is described as best suited to numerical computations).
  • Controlling the number of threads or worker processes could be an option, since the 24 functions can mostly operate without any dependencies on each other; a rough sketch of this is below the list.
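To illustrate the last bullet, here is a rough sketch of fanning the independent steps out across processes with the standard library's `concurrent.futures`. The step functions and the field-merging scheme are made-up placeholders; the real steps would need to return their new fields in a form that can be merged back into each row's JSON:

```python
import json
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

# Hypothetical stand-ins for independent steps: each reads a row's JSON and
# returns only the new fields it computes, so results can be merged afterwards.
def step_a(raw_json: str) -> dict:
    obj = json.loads(raw_json)
    return {"a_len": len(obj.get("name_a", ""))}

def step_b(raw_json: str) -> dict:
    obj = json.loads(raw_json)
    return {"b_len": len(obj.get("name_b", ""))}

INDEPENDENT_STEPS = [step_a, step_b]  # in reality, the subset of the 24 with no dependencies

def run_step(args):
    step, payloads = args
    # Apply one step to every row's JSON; returns a list of dicts of new fields.
    return [step(p) for p in payloads]

def process_parallel(df: pd.DataFrame) -> pd.DataFrame:
    payloads = df["payload"].tolist()
    # One worker process per independent step, each running over all rows.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_step, [(s, payloads) for s in INDEPENDENT_STEPS]))
    # Merge the new fields produced by every step back into each row's JSON.
    merged = []
    for i, raw in enumerate(payloads):
        obj = json.loads(raw)
        for step_result in results:
            obj.update(step_result[i])
        merged.append(json.dumps(obj))
    out = df.copy()
    out["payload"] = merged
    return out

if __name__ == "__main__":
    df = pd.DataFrame({"payload": [json.dumps({"name_a": "foo", "name_b": "foobar"})]})
    print(process_parallel(df)["payload"].iloc[0])
```

I used processes rather than threads in the sketch because the work is CPU-bound pure Python, where the GIL prevents threads from running in parallel; in practice I would probably also chunk the rows rather than shipping the whole column to every worker.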

