
[Nanocubes-discuss] Recreating example


[Nanocubes-discuss] Recreating example

Alex Bongiovanni
I'm trying to recreate your BrightKite example, as an exercise to make sure I understand how to load a nanocube and how the drawing aspect works.

I'm confused about all the files in /usr/local/bin named things like nc_q25_u2_u4.  What do these files correspond to?

Following the logic of the example at https://github.com/laurolins/nanocube/wiki I tried to load the BrightKite data while dropping the device number.  I assumed that I would load nc_q25_u2_u4, q25 being the position and u2_u4 being time; the example had loaded nc_q25_c1_u2_u4, the device number corresponding to c1 being the only difference between the example data and the BrightKite data.

Now, I can get the data to load just fine if I add a device number to each point of BrightKite data when creating the .dmp file (and I believe my .dmp files are formatted properly).  So what am I doing wrong?  And what schema is associated with the other files?

--
Alex Bongiovanni
University of Maryland

Re: [Nanocubes-discuss] Recreating example

laurolins
Hi Alex,

Each executable corresponds to a different nanocube schema. 

- “q25” indicates a quadtree dimension with 25 levels, which in our demos we use to encode spatial bins.
- “c1” indicates a categorical dimension with one byte resolution.

The last two parameters “uX_uY” (e.g. u2_u4) indicate respectively the number of bytes to encode “time” and the “count” variable (“u” stands for unsigned integer).

If you are only interested in counts of events in space and time, then a schema q25_u2_u4 would be appropriate (4^25 spatial bins, 2 bytes for time which can encode up to 2^16 bins, cumulative counts of up to 2^32 - 1).
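As a quick check of those numbers, here is a small Python sketch of the arithmetic behind q25_u2_u4. The variable names are only illustrative and are not part of nanocube itself.

```python
# Capacity implied by each part of the q25_u2_u4 schema described above.
QUADTREE_LEVELS = 25   # "q25"
TIME_BYTES = 2         # first "u2": bytes used for time
COUNT_BYTES = 4        # last "u4": bytes used for the count

spatial_bins = 4 ** QUADTREE_LEVELS       # each quadtree level splits a cell into 4
time_bins = 2 ** (8 * TIME_BYTES)         # distinct values of a 2-byte unsigned int
max_count = 2 ** (8 * COUNT_BYTES) - 1    # largest value of a 4-byte unsigned int

print(f"spatial bins at the finest level: {spatial_bins:,}")
print(f"time bins: {time_bins:,}")
print(f"largest storable count: {max_count:,}")
```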

In case you want an additional categorical dimension for each event (e.g. “device”) then we need a “cZ” dimension, so a schema q25_c1_u2_u4 would be a valid one. Note that the specific resolution numbers can be changed to fine-tune the maximum number of bins in each dimension (e.g. q18_u4_u8).
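To make the naming convention concrete, here is a rough Python sketch that splits a schema name into its parts. It assumes only what is described above (q = quadtree levels, c = categorical bytes, the trailing uX_uY = time and count bytes); parse_schema and Dimension are hypothetical helpers, not something shipped with nanocube.

```python
import re
from typing import List, NamedTuple


class Dimension(NamedTuple):
    kind: str   # "q", "c" or "u"
    size: int   # levels for "q", bytes for "c" and "u"


def parse_schema(schema: str) -> List[Dimension]:
    """Split a name such as 'nc_q25_c1_u2_u4' into (kind, size) pairs."""
    if schema.startswith("nc_"):
        schema = schema[3:]
    dims = []
    for token in schema.split("_"):
        match = re.fullmatch(r"([qcu])(\d+)", token)
        if match is None:
            raise ValueError(f"unrecognised schema token: {token!r}")
        dims.append(Dimension(match.group(1), int(match.group(2))))
    # Per the description above, the last two entries are time and count.
    if len(dims) < 2 or any(d.kind != "u" for d in dims[-2:]):
        raise ValueError("schema must end with uX_uY (time, count)")
    return dims


for dim in parse_schema("nc_q25_c1_u2_u4"):
    print(dim.kind, dim.size)
```

The same function accepts q25_u2_u4 or q18_u4_u8, which differ only in which dimensions are present and how large they are.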

The idea is that the program “ncserve” automatically selects the right nanocube executable to run depending on the header of the input .dmp file. You don’t need to interact directly with the “nc_<params>” programs. The only thing to be aware of is that the specific schema you want might not be built by default. In that case you can build it manually using the python script "ncbuild" (there is an example of this on the wiki page, search for ncbuild).
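If you want to check whether a given schema has already been built on your machine, something like the following Python sketch may help. It only looks for an executable named nc_<schema> on your PATH (which usually includes /usr/local/bin); it is not how ncserve itself works, and find_nanocube_binary is a made-up helper name.

```python
import shutil
from typing import Optional


def find_nanocube_binary(schema: str) -> Optional[str]:
    """Return the full path of nc_<schema> if it is on PATH, else None."""
    return shutil.which(f"nc_{schema}")


for schema in ("q25_u2_u4", "q25_c1_u2_u4", "q25_u2_u8"):
    path = find_nanocube_binary(schema)
    status = path if path else "not found; it may need to be built with ncbuild"
    print(f"nc_{schema}: {status}")
```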

Lauro


In reply to Alex Bongiovanni's message of Jan 15, 2014, 1:50 PM.
_______________________________________________
Nanocubes-discuss mailing list
[hidden email]
http://mailman.nanocubes.net/mailman/listinfo/nanocubes-discuss_mailman.nanocubes.net


Re: [Nanocubes-discuss] Recreating example

Alex Bongiovanni
That makes a lot more sense now, thank you.  I'm afraid this only raises new questions for me however:

If I was to load an *extremely* large dataset, I would need to create a new schema with ncbuild that uses a larger count variable (and run on a sufficiently powerful system of course)?
If I wanted to load a dataset with 50 categories (for the sake of the overstatement) I could do so, I would just need to use ncbuild to create the appropriate schema?
I read in another post that the maximum category size is 256, could I use ncbuild to define a schema with dimension c2 to increase that size or is it hardcoded?

Also, I'm not sure if you're aware, but your live demos don't seem to be working.


In reply to Lauro Lins's message of Wed, Jan 15, 2014, 3:19 PM.

--
Alex Bongiovanni
University of Maryland

Re: [Nanocubes-discuss] Recreating example

Carlos Scheidegger
Wow, thanks for that. They were working as of a few hours ago. I'm going to check.
-carlos

On Jan 16, 2014, at 8:42 AM, Alex Bongiovanni <[hidden email]> wrote:

Also, I'm not sure if you're aware, but your live demos don't seem to be working.



Re: [Nanocubes-discuss] Recreating example

laurolins
In reply to this post by Alex Bongiovanni
Hi Alex,

> That makes a lot more sense now, thank you.  I'm afraid this only raises new questions for me however:
>
> If I was to load an *extremely* large dataset, I would need to create a new schema with ncbuild that uses a larger count variable (and run on a sufficiently powerful system of course)?

Yes. If your counter will exceed 2^32-1 then you need a larger storage space for it. You can use u1, u2, u3, u4, u5, u6, u7, or u8 as the “count” storage space. A schema like nc_q25_u2_u8 is not built by default, so, yes, you would need to run ncbuild.
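To pick the count width for a given dataset, a small Python sketch like this one works, assuming the count is stored as an N-byte unsigned integer with N between 1 and 8 as described above. count_bytes_needed is a hypothetical helper, not part of ncbuild.

```python
def count_bytes_needed(max_count: int) -> int:
    """Smallest N (1 to 8) such that an N-byte unsigned int can hold max_count."""
    if max_count < 0:
        raise ValueError("counts cannot be negative")
    # bit_length() is 0 for zero, so at least one byte is always used.
    n_bytes = max(1, (max_count.bit_length() + 7) // 8)
    if n_bytes > 8:
        raise ValueError("count does not fit in 8 bytes (u8)")
    return n_bytes


for expected_events in (10_000, 2 ** 32 - 1, 2 ** 32, 10 ** 15):
    print(f"{expected_events:,} events -> u{count_bytes_needed(expected_events)}")
```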

> If I wanted to load a dataset with 50 categories (for the sake of the overstatement) I could do so, I would just need to use ncbuild to create the appropriate schema?

Yes, but there are two problems with this:

1. We are using boost::mpl to encode vectors of types and I believe 50 entries may be too many. In practice we haven’t played with nanocubes with more than 7-8 dimensions.
2. When adding dimensions, the growth of the possible bins for which we need to pre-compute data is potentially exponential (the size of the data structure can grow quickly).

(We are thinking of a way to solve problem 1 in a future release.)
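To get a feel for the growth mentioned in point 2, here is a deliberately simplified Python illustration. It assumes every added dimension is categorical with the same number of categories plus one aggregate bin, and it counts all possible combinations of bins; a real nanocube only stores what the data actually touches, so treat this as a rough upper bound and not as the real memory footprint.

```python
def max_bin_combinations(categories_per_dim: int, n_dims: int) -> int:
    """Upper bound on bin combinations across n_dims categorical dimensions.

    Each dimension contributes its categories plus one aggregate bin.
    """
    return (categories_per_dim + 1) ** n_dims


for n_dims in (1, 2, 4, 8):
    combos = max_bin_combinations(10, n_dims)
    print(f"{n_dims} dimension(s) of 10 categories: up to {combos:,} combinations")
```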

> I read in another post that the maximum category size is 256, could I use ncbuild to define a schema with dimension c2 to increase that size or is it hardcoded?

Actually, for a categorical variable, if we use 1 byte (i.e. c1) the numbers 0, …, 254 can be used to represent categories and 255 is used to represent the aggregate bin of all categories (in that dimension).
If you use 2 bytes (i.e. c2) then the numbers 0, …, 65534 can be used to represent categories and 65535 is used to represent the aggregate bin of all categories (in that dimension). I have done some tests using c2 and it should work. So the answer is yes, you can use c2 with the master branch of nanocubes (on the 1.0 branch we don’t have that possibility, but I see you are using the master branch).
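The same rule can be written down for any byte width. The Python sketch below just restates the arithmetic above (the largest value is reserved for the aggregate bin); categorical_range is a hypothetical helper name.

```python
from typing import Tuple


def categorical_range(n_bytes: int) -> Tuple[int, int]:
    """Return (largest usable category id, aggregate bin id) for a cN dimension."""
    if n_bytes < 1:
        raise ValueError("a categorical dimension needs at least one byte")
    total_values = 256 ** n_bytes
    # The top value is reserved for the aggregate of all categories.
    return total_values - 2, total_values - 1


for n_bytes in (1, 2):
    largest, aggregate = categorical_range(n_bytes)
    print(f"c{n_bytes}: categories 0..{largest}, aggregate bin {aggregate}")
```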

> Also, I'm not sure if you're aware, but your live demos don't seem to be working.
>

We are looking into that. Thank you, Alex.

Best,
Lauro


Re: [Nanocubes-discuss] Recreating example

Alex Bongiovanni
Thank you, you have thoroughly explained everything.


In reply to Lauro Lins's message of Thu, Jan 16, 2014, 9:20 AM.

--
Alex Bongiovanni
University of Maryland