2021-06-29

Python & Time Series Cleaning - Removing regions/chunks of huge timeseries that contain no useful data

Data: 20 hours of triaxial accelerometer data, sampled at 10 kHz. About 3 billion points in 40 GB of CSV, spread across 20 files to reduce the memory needed to inspect a given chunk of data.

Accelerometer Example

Problem: Large regions of low or no activity, which I don't care about, make up the vast majority of the data points. There is no reason to preserve these regions, and they slow down all of the processing. Plotting is just the beginning: I need to do signal processing and complex transformations on the data, which are very computationally intensive. The files are also unreasonably large, and I will be doing this sort of testing many times.

The data is stored in a four-column Dask dataframe and plotted using HoloViews with the matplotlib extension (and perhaps an interactive Bokeh dashboard once the data size can be reduced). I want to reduce the amount of data without destroying regions of interest. The easiest way to do this is to remove extended periods of random noise between regions of activity.

What I've Tried:

  • Manually editing the data and removing regions by locating their indices on charts. This takes forever and is not something I want to repeat the next time I need to process this sort of data.
  • Filtering using various signal processing methods. This significantly changes the regions of interest and destroys information.

Ideas for Solving:

  • Iterating over the rows and creating a boolean column that indicates whether the absolute values have exceeded a threshold, then deleting any run of consecutive 0's (below threshold) that is longer than some chunk length N before the next "1". I would make N long enough to avoid destroying points in the regions of interest.
  • Instead of deleting regions, implementing something like the Ramer–Douglas–Peucker algorithm. But this would take an inordinate amount of time to iterate over the data. I also don't know how I would preserve the timestamp column, as (I believe) the data would need to be transformed into a series.
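
The first idea above can be sketched without a row-by-row loop. This is only a minimal illustration on a small pandas frame, not a tested solution for the full dataset. The function name drop_quiet_runs, the column names t, x, y, z, and the threshold and min_run values are all hypothetical. Whole rows are kept or dropped, so the timestamp column stays attached to the surviving samples.

```python
import numpy as np
import pandas as pd


def drop_quiet_runs(df, axis_cols, threshold, min_run):
    """Drop runs of more than `min_run` consecutive rows in which every
    axis stays at or below `threshold` in absolute value.
    (Hypothetical helper, written for illustration only.)"""
    # True where at least one axis exceeds the threshold (the "1" flag).
    active = (df[axis_cols].abs() > threshold).any(axis=1).to_numpy()
    quiet = ~active

    # Pad with False so every quiet run has a rising and a falling edge,
    # then read off each run's start and (exclusive) end index.
    padded = np.concatenate(([False], quiet, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = edges[::2], edges[1::2]

    # Loop over runs (not rows) and mark only the long ones for removal.
    keep = np.ones(len(df), dtype=bool)
    for start, end in zip(starts, ends):
        if end - start > min_run:
            keep[start:end] = False
    return df[keep]


# Tiny made-up example: two bursts separated by a long quiet stretch.
sample = pd.DataFrame({
    "t": np.arange(12),
    "x": [0, 0, 5, 0, 0, 0, 0, 0, 6, 0, 0, 0],
    "y": 0.0,
    "z": 0.0,
})
cleaned = drop_quiet_runs(sample, ["x", "y", "z"], threshold=1.0, min_run=3)
print(cleaned["t"].tolist())
```

With a Dask dataframe, something like this could presumably be applied per partition via map_partitions, with the caveat that a quiet run spanning a partition boundary would be measured as two shorter runs.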

I'm most certainly not a computer scientist and I feel as though this must be a solved problem. I'm not aware of a solution and much searching has yielded nothing that would work for tabular time series.

While it would be great to do this in Python and make it part of my process before saving the graphs (which take about 5-10 minutes to plot and have a ton of wasted space), I am open to any free solutions, scripts, or programs that would let me edit out the regions automatically or manually with a UI. (I'm almost imagining a video-editor-style UI where I can drag-select regions.)



from Recent Questions - Stack Overflow https://ift.tt/2Udc8TW