CSV binning of large text files: records #limits?

TK
Dear creators and users

I have a question concerning the operation of the "nanocube-binning-csv" script.

I'm loading a text CSV file containing roughly 20 300 000 records. Unfortunately, the results are incorrect: either the string values of a column are not recognized properly, or whole records are ignored (I suspect the latter based on the tbin value).

To be more specific: either the tbin value is incorrect (it starts 3 years after the year of the first record's timestamp) for an unsorted CSV file,
or
the categories are recognized incorrectly (that is, in files of 2/1/0.5/0.25/0.1/0.01 mln records only 2 distinct values of the selected column are recognized, whereas a manual check finds the third value just past row ~37 000, and more beyond it).

Example: we have a text CSV file containing 20 mln records, with latitude, longitude and timestamp columns along with a "name" column.
Issue #1: when loading an unsorted CSV file of 2/1/0.5/0.25/0.1/0.01 mln records, all names in the dataset are recognized (e.g. Dave, James, Brian, Judy, Carrie, Mary and Edward), but the beginning of the timescale is wrong: the "metadata: tbin" field's value in the DMP file is "2009-01-01_00:00:00_3600s" even though Judy was born in 2006.
Issue #2: when loading a sorted CSV file, only two names (Brian and Dave) are found and put into the DMP file, even though Judy's record/name appears as early as row ~37 000 out of, say, 1 mln. The good thing is that the tbin is correct: "metadata: tbin 2006-01-01_00:00:00_3600s".

Have you encountered such behaviour, or do you recognize a mistake I might be making? At first I thought I had made an error in my pre-processing of those files, in the sorting stage, with wrong newline characters/terminators, or something else. But after triple-checking I've come here for advice.

I'm not sure whether it's relevant, but when launching the "nanocube-binning-csv" script I get these warnings:

"/home/x/nanocube-3.2.1/bin/nanocube-binning-csv:259: FutureWarning: the coerce=True keyword is deprecated, use errors='coerce' instead
  data[d] = pd.to_datetime(data[d],coerce=True)
/home/x/nanocube-3.2.1/bin/nanocube-binning-csv:273: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  return data.sort(self.timecol)"
My version of nanocube is the one from the GitHub README page (obtained via "wget https://github.com/laurolins/nanocube/archive/3.2.1.zip"). I've noticed recent commits; would you recommend the master version instead?

Also, does the "nanocube-binning-csv" script have any limits on its input files, or requirements regarding their formatting?

Best regards
TK

Re: [Nanocubes-discuss] CSV binning of large text files: records #limits?

salivian
Hi TK,

The binning script processes the CSV file in chunks; if the earliest record does not appear in the first chunk, the program will guess a wrong time offset. For an unsorted file, you may try the --offset option,
e.g. --offset=2007-01-01, to force the time offset to start at 2007-01-01.

This new option, together with the fix for the pandas coerce=True deprecation, is now in the master branch on GitHub. Please pull the HEAD version.
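For example, a minimal sketch of what the invocation could look like with the HEAD version (the column options and the output redirection are just placeholders for whatever you already pass; only --offset is the new part, and the exact option names for your version are listed by --help):

    # force the time bins to start at 2007-01-01, even though the earliest
    # record of the unsorted file is not in the first chunk
    nanocube-binning-csv --sep=',' \
        --timecol='timestamp' --latcol='latitude' --loncol='longitude' \
        --catcol='name' \
        --offset=2007-01-01 \
        input.csv > input.dmp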

Thanks!


Horace


Re: [Nanocubes-discuss] CSV binning of large text files: records #limits?

TK
Hi Horace, thank you for the quick reply!

I pulled the HEAD version and the "coerce" warning disappeared.
I used the --offset option, which fixed the time value. I'm now loading the data into memory to see how it performs with the whole dataset.

For a moment I suspected that maybe not all records were being processed by the "nanocube-binning-csv" script. But then I used the "nanocube-view-dmp" script on both output DMP files:
* the one produced from the unsorted records (with the corrected timestamp), and
* the one produced from the sorted records (where only 2 of 7 distinct values in the column were recognized by the binning script).

The view-dmp script showed that the number of records in the DMP/processed format equals the original number of records. So the only remaining issue is probably the binning script not recognizing all distinct values present in the category column of the sorted file.

Why do I insist on sorted files instead of the unsorted one, which probably works (I'll be able to confirm that once the whole DMP finally gets loaded into memory and I can explore it)? It's because of the number of records I'm planning to process: 20 mln is not the maximum yet, and from what I've found so far (I think it was written here, on the mailing list), loading records sorted by the time dimension is simply faster.
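(For reference, a rough sketch of one way to do that sort without loading the file into memory; it assumes a single header row and the timestamp in the third comma-separated column, so the delimiter and column number would need adjusting to the actual layout:)

    # keep the header, then sort the data rows by the timestamp column;
    # ISO 8601 timestamps sort correctly as plain text
    head -n 1 input.csv > sorted.csv
    tail -n +2 input.csv | sort -t, -k3,3 >> sorted.csv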

Do you think that the only reason just 2 of 7 unique values of the category column are recognized by the binning script is that the others don't appear in the first X rows? Can I add them using a specific option, or by "cheating" with some dummy entries at the beginning?

Best regards
TK

Re: [Nanocubes-discuss] CSV binning of large text files: records #limits?

salivian
The issue is chunking: we try to infer the header from the first chunk only.

1. You can try increasing --chunksize so the first chunk sees more values of the variable.
2. Alternatively, you can specify the header yourself with the --ncheader <header file> argument: take the incomplete header that was generated and add the missing values to its value map (see the sketch below).
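Roughly along these lines (a sketch only: the chunk size, file names and option syntax are examples, the column options from your usual command are omitted, and the exact header format is described in $NANOCUBE_SRC/scripts/README.md):

    # (1) a larger chunk, so more distinct category values are seen while
    #     the header is inferred from the first chunk
    nanocube-binning-csv --chunksize=2000000 sorted.csv > sorted.dmp

    # (2) or take a header generated earlier, add the missing category
    #     values to its value map by hand, and pass it back in
    nanocube-binning-csv --ncheader=header.txt sorted.csv > sorted.dmp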

I hope this helps.


Thanks!

Horace


Re: [Nanocubes-discuss] CSV binning of large text files: records #limits?

TK
[using HEAD version]

You're right, that solves the problem, thank you for pointing it out. Also, I've finally noticed that this information was in the $NANOCUBE_SRC/scripts/README.md file all along and I should have read it before posting.

I solved the problem by increasing the chunksize or providing an ncheader file (although I haven't loaded the whole DMP built with those results yet).

After fixing the issue I've noticed a new behaviour: with the chunksize increased to 1 or 2 mln (1 000 000 or 2 000 000), nanocube-leaf runs surprisingly quickly until around 61% of the record count and then slows down to its "usual" loading/aggregating speed. Is that normal/expected behaviour? I think it's related to fos's question in the "Hiccup in speed of building nanocube" thread.

I'm asking partly out of curiosity and partly because I need to reduce, or ideally minimize, the time it takes to load a DMP file for exploration via the web interface. For the latter, I was going to use Cryopid/Cryopid2 (ver. 0.6.9 from SourceForge). I've just spent some time resolving its dependencies; it needs a few symlinks, but digging deeper it doesn't seem very compatible with today's OSes (Debian tested, Fedora planned). I'm not sure it's worth it; maybe there is another known, working solution for this? So far I've also come across CRIU.

So my question would be: if loading the DMP file as quickly as described two paragraphs above is "normal", would you recommend anything besides sorting the records by timestamp (for binning, as advised in one of the threads) or process "freezers" like Cryopid to make a nanocube ready for users faster than the "usual" loading time allows?

I suppose I might be making some significant mistake, because for a single categorical variable and just 20 000 000 aggregated records I'm getting a significant "loading" time (in nanocube-leaf, version 3.2.2), while kartheek muthyala, in the thread "How to improve performance of a nanocube?", reports a loading time of 6 hrs for 1 billion points. By comparison, it took me over 22 hrs to load 16 mln records.
If I'm not making a mistake here and it depends on something else (like the number of distinct values within the category column), should I follow Lauro's advice from that thread and try to partition the data into smaller nanocubes? The spatial dimension of the data I'm testing right now (and probably of most datasets coming next, as far as I can tell) is not really spread around the globe.
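(If partitioning turns out to be the way to go, I'd probably split the CSV by time before binning, roughly like this; the column number and the ISO 8601 timestamp format are assumptions about my own data:)

    # one smaller CSV per year of the timestamp (assumed to be the 3rd
    # column), each of which could feed its own nanocube instance
    header=$(head -n 1 input.csv)
    tail -n +2 input.csv | awk -F, -v hdr="$header" '{
        y = substr($3, 1, 4)
        f = "part_" y ".csv"
        if (!(f in seen)) { print hdr > f; seen[f] = 1 }
        print > f
    }'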