[Nanocubes-discuss] How to improve performance of a nanocube?
I have been playing around with Nanocubes for 2 months now. We are working with roughly 8 billion records per day. From our first analysis, we found that piping 1 billion points into a nanocube takes roughly 6 hours. Is there any way we can improve this performance? We are planning to build nanocubes for the entire day within the day itself. Is there any way I can parallelize this process?
Re: [Nanocubes-discuss] How to improve performance of a nanocube?
If your data is already sorted in time, the short answer is no: the current codebase does not support anything extra to speed up the insertion process. One suggestion that may or may not be feasible in your use case is to use coarser bins (e.g. 1-hour time bins instead of 1-minute bins; 22 spatial levels instead of 25).
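To see why coarser bins help, here is a minimal sketch (the helper names and the quadtree-path representation are my own illustration, not part of the Nanocubes API): coarsening reduces the number of distinct keys the cube must index, which shrinks the structure and speeds up insertion.

```python
def time_bin(ts_epoch, bin_seconds):
    # Quantize an epoch timestamp (seconds) into a coarser bin index.
    return ts_epoch // bin_seconds

def coarsen_quadtree_path(path, target_level):
    # Truncate a quadtree path (child indices 0-3, root to leaf)
    # to fewer spatial levels, merging nearby points into one tile.
    return path[:target_level]

# One day of per-minute events: 1-minute bins give 1440 distinct keys,
# 1-hour bins give only 24.
minute_keys = {time_bin(t, 60) for t in range(0, 86400, 60)}
hour_keys = {time_bin(t, 3600) for t in range(0, 86400, 60)}
print(len(minute_keys), len(hour_keys))  # 1440 24
```

Fewer distinct keys means fewer nodes created per insert, at the cost of coarser query resolution.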
If you are brave and willing to write some code, the simplest way to speed up the creation of a nanocube is to partition your data and insert the partitions in parallel into multiple smaller nanocubes. The drawback of this approach is that at query time you have to merge the results from each small nanocube to produce the final answer; in practice, this overhead might not be so bad. We have plans to support parallel insertion, but we don't have a release date at this point. If you have a rich spatial dimension (points in lots of different places), one interesting way to partition your data, with extra benefits in memory usage and insertion speed compared to a round-robin approach, is done in this project: https://github.com/Pyroluk/quadtree_partition
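The partition-then-merge idea can be sketched roughly as follows. This is not Nanocubes code: each "cube" below is just a per-key counter standing in for one small nanocube, and the shards could be built by independent processes in parallel.

```python
from collections import Counter

def build_shard(records):
    # Stand-in for building one small nanocube from one data shard:
    # count events per (spatial_tile, time_bin) key.
    cube = Counter()
    for key in records:
        cube[key] += 1
    return cube

def query(cubes, key):
    # At query time, merge partial results from every shard
    # by summing the per-shard counts.
    return sum(cube[key] for cube in cubes)

# Round-robin partition of the input into 4 shards,
# each of which could be inserted in parallel.
records = [("tile_a", 0), ("tile_b", 1), ("tile_a", 0), ("tile_a", 1)]
shards = [records[i::4] for i in range(4)]
cubes = [build_shard(s) for s in shards]
print(query(cubes, ("tile_a", 0)))  # 2
```

The merge step here is a simple sum because counts are additive; the same additivity is what makes querying several small cubes and combining their answers equivalent to querying one big cube.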