[Nanocubes-discuss] Histograms and category queries

Alex Bongiovanni
I'm having some issues with the histograms and the category queries, specifically when there are two (or more?) histograms at one time and when you select more than one item in a histogram.

In your demo with the BrightKite data, if you select multiple things from one histogram, it only uses the selection that is lowest in index.  So if you select both 'Mon' and 'Tue', it will act as though only 'Mon' is selected, even though that isn't true, or what I would guess the correct behavior is supposed to be (though I could be wrong).

I reworked the histogram code for the master branch and attempted to make it so that selecting two things uses a sequence query. However, the shape of what the server returned depended on the number of histograms and on which one you were selecting from. For example, with two histograms of 7 values each where you select two values at a time, selecting from one histogram gives an array of size 7 where each element is an array of 2 values, while selecting from the other gives an array of size 2 where both elements are arrays of 7 values.

All this to say, how ought I to be interpreting the response from the server?  Are the values returned in order, or should I be using the addresses paired with them (which sometimes are all the same)?  How do I know how to parse if I've selected 2+ histogram items?

Also, I believe the reason that my time-series was so radically different from yours was that I was using daily resolution, while you used weekly.

--
Alex Bongiovanni
University of Maryland

Re: [Nanocubes-discuss] Histograms and category queries

laurolins
Hi Alex,

> In your demo with the BrightKite data, if you select multiple things from one histogram, it only uses the selection that is lowest in index.  So if you select both 'Mon' and 'Tue', it will act as though only 'Mon' is selected, even though that isn't true, or what I would guess the correct behavior is supposed to be (though I could be wrong).

Yes, Alex. I was able to reproduce this bug and opened up an issue on github.

> I reworked the histogram code for the master branch and attempted to make it so that selecting two things uses a sequence query, but I found that what was returned from the server was not the same depending on the number of histograms and which one you were selecting from (eg for two histograms of 7 values each where you select two values at a time, selecting from one histogram will give an array of size 7 where each element is an array of 2 values, and selecting from the other histogram will give an array of size 2 where both elements are arrays with 7 values).
>
> All this to say, how ought I to be interpreting the response from the server?  Are the values returned in order, or should I be using the addresses paired with them (which sometimes are all the same)?  How do I know how to parse if I've selected 2+ histogram items?

For now the results are out-of-order and the client (e.g. your front-end code) has to sort them using the address field. The syntax for pulling a time series is

/@<time_dim_name>=<time_bin_0>:<bucket_size>:<number_of_buckets>

So the number of time series elements you get back (out of order) is the number you pass in the <number_of_buckets> field. Each time series element consists of an aggregate (e.g. count) of the “events” in <bucket_size> consecutive time bins, and the first “bucket” starts at <time_bin_0>. The bottom of the wiki page https://github.com/laurolins/nanocube/wiki has an example of how to map a time bucket address into its [a,b) time bin interval.
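
The bucket arithmetic described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code from the wiki: the helper names `bucket_interval` and `time_series_query` are hypothetical, the dimension name `time` in the usage lines is made up, and the sketch assumes a bucket address decodes to a zero-based bucket index.

```python
def bucket_interval(time_bin_0, bucket_size, bucket_index):
    """Return the half-open [a, b) range of time bins covered by one bucket.

    Assumption: bucket_index counts from 0, so bucket 0 starts at time_bin_0.
    """
    a = time_bin_0 + bucket_index * bucket_size
    b = a + bucket_size
    return a, b


def time_series_query(time_dim_name, time_bin_0, bucket_size, number_of_buckets):
    """Format the time-series part of a query path using the syntax above."""
    return f"/@{time_dim_name}={time_bin_0}:{bucket_size}:{number_of_buckets}"


# Usage with made-up numbers: 10 buckets of 7 time bins each, starting at bin 0.
query_part = time_series_query("time", 0, 7, 10)
third_bucket = bucket_interval(0, 7, 2)  # bins 14 up to (not including) 21
```

Since the buckets come back unordered, the client would sort them by index first and then apply `bucket_interval` to each one.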


> Also, I believe the reason that my time-series was so radically different from yours was that I was using daily resolution, while you used weekly.


Yes, you are right. The time series plot is using weekly data as its time binning scheme.

Thank you, Alex.
Lauro

Re: [Nanocubes-discuss] Histograms and category queries

Alex Bongiovanni
My time series works fine. My problem is that when I try to sort histogram data by address, there isn't always a consistent address scheme (insofar as I can tell).

When I dumped the BK data I had my script also calculate the day of the week and month and add them as categories.  So I have two histograms, one for each category - day of week and hour of day.  I'll refer to the day of week histogram as H1, and hour of day histogram as H2 for clarity (this is the order in which they are displayed).

The values for H1 with no constraints have address and value pairs as expected:
{"addr":"0","children":[{"addr":"0","value":34818},{"addr":"1","value":33449},{"addr":"2","value":35449},{"addr":"3","value":37216},{"addr":"4","value":39260},{"addr":"5","value":36685},{"addr":"6","value":33122}]}

But if I pick a constraint from H2, the values returned for H1 do not have an address alongside the value (it's up one level):
{"addr":"0","children":[{"addr":"0","children":[{"addr":"0","value":1943}]},{"addr":"1","children":[{"addr":"0","value":2065}]},{"addr":"2","children":[{"addr":"0","value":2145}]},{"addr":"3","children":[{"addr":"0","value":2258}]},{"addr":"4","children":[{"addr":"0","value":2327}]},{"addr":"5","children":[{"addr":"0","value":2551}]},{"addr":"6","children":[{"addr":"0","value":2149}]}]}

And, if instead I select my constraint from H1 (the reverse situation), the returned values for H2 are formatted as would be expected, the addresses with the values:
{"addr":"0","children":[{"addr":"0","children":[{"addr":"15","value":1844},{"addr":"14","value":1802},{"addr":"13","value":1959},{"addr":"12","value":2230},{"addr":"11","value":2217},{"addr":"10","value":2295},{"addr":"f","value":1860},{"addr":"e","value":1807},{"addr":"d","value":1189},{"addr":"c","value":631},{"addr":"b","value":467},{"addr":"a","value":429},{"addr":"9","value":523},{"addr":"8","value":615},{"addr":"7","value":742},{"addr":"6","value":869},{"addr":"5","value":1087},{"addr":"4","value":1319},{"addr":"3","value":1568},{"addr":"2","value":1836},{"addr":"1","value":1799},{"addr":"0","value":1943},{"addr":"16","value":1803},{"addr":"17","value":1984}]}]}

So it seems there is something going on server-side, depending on which dimension is being queried, that affects what gets returned.  Selecting two things from a single histogram gives yet another result, but let's take this one step at a time.

I guess my point is that selecting from different histograms returns different things, and it doesn't seem like the correct behavior.
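
For reference, ordering one level of such a response by address might look like the Python sketch below. It assumes the `addr` strings are hexadecimal, which is an inference from the hour-of-day sample above (24 addresses running `0` to `9`, `a` to `f`, then `10` to `17`) and not something confirmed in this thread. The helper name `sorted_children` is hypothetical, and the sample is a four-entry excerpt of the response above.

```python
import json


def sorted_children(node):
    """Return a node's children ordered by address.

    Assumption: addresses are hexadecimal strings, as the sample suggests.
    """
    return sorted(node["children"], key=lambda child: int(child["addr"], 16))


# Four entries taken from the hour-of-day response, in the order received.
response = json.loads(
    '{"addr":"0","children":['
    '{"addr":"10","value":2295},{"addr":"f","value":1860},'
    '{"addr":"a","value":429},{"addr":"9","value":523}]}'
)

ordered = [(child["addr"], child["value"]) for child in sorted_children(response)]
# ordered now runs 9, a, f, 10, i.e. hours 9, 10, 15, 16
```

Sorting the strings directly would put `10` before `9`, which is why the key converts each address to an integer first.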


On Fri, Feb 7, 2014 at 1:40 PM, Lauro Lins <[hidden email]> wrote:

_______________________________________________
Nanocubes-discuss mailing list
[hidden email]
http://mailman.nanocubes.net/mailman/listinfo/nanocubes-discuss_mailman.nanocubes.net




--
Alex Bongiovanni
University of Maryland

Re: [Nanocubes-discuss] Histograms and category queries

laurolins
Hi Alex,
Can you include the query URLs to the examples you just sent?

In the meantime here is a comment that might (or might not) clarify things a bit. All results in the master branch are in the form of a tree with values only at the leaves, so if you write something to drill down on hour and weekday dimensions

query/@weekday=255+1/@hour=255+1           (forgive the ugly 255 label for the root node of c1 dimensions)

every root-to-leaf path consists of a root node connected by an edge labeled with a weekday address/label (e.g. Mon), connected in turn by an edge labeled with an hour address/label (e.g. 9h) to a leaf node that holds a value (e.g. the count for Mondays at 9h). This query could produce up to 24 x 7 leaves if every combination has data. The addresses of the children nodes are not listed in sorted order in the result.

Lauro

P.S. The order of the dimensions for each layer of the resulting tree is given in the “levels” field of the resulting JSON object. Note that although we wrote the weekday dimension first in the query above, that doesn’t mean the “weekday” level of the tree comes before the “hour” level. You need to look at the “levels” field of the JSON result.
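
A minimal Python sketch of walking such a result tree follows. It assumes the “levels” field is a list of dimension names ordered from the top layer of the tree down; the exact shape of that field is not shown in this thread, so treat that as an assumption. The helper name `flatten` is hypothetical, and the sample tree is a two-branch excerpt of the kind of response discussed in this thread.

```python
def flatten(node, levels, path=()):
    """Yield (labels, value) for every leaf of a result tree.

    labels maps each level name to the address on the path to that leaf.
    Assumption: levels lists dimension names from the top layer down.
    """
    if "value" in node:
        yield dict(zip(levels, path)), node["value"]
        return
    for child in node.get("children", []):
        yield from flatten(child, levels, path + (child["addr"],))


# Assumed level order for this sample; in practice read it from "levels".
levels = ["day_of_week", "hour_of_day"]
root = {
    "addr": "0",
    "children": [
        {"addr": "1", "children": [{"addr": "0", "value": 2065}]},
        {"addr": "0", "children": [{"addr": "0", "value": 1943}]},
    ],
}

rows = list(flatten(root, levels))
# Index the counts by whichever dimension a histogram displays.
by_day = {labels["day_of_week"]: value for labels, value in rows}
```

Keying each value by level name rather than by position means the client does not depend on the order in which the dimensions were written in the query.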

Re: [Nanocubes-discuss] Histograms and category queries

Alex Bongiovanni
The queries, in order, were:
/query/pos=[qaddr(0,0,0),qaddr(1,1,0)]/@day_of_week=255+1

/query/pos=[qaddr(0,0,0),qaddr(1,1,0)]/@hour_of_day=<0>/@day_of_week=255+1

/query/pos=[qaddr(0,0,0),qaddr(1,1,0)]/@day_of_week=<0>/@hour_of_day=255+1

I think I understand about the tree now, and it makes a lot of sense now that you explain it, thank you.


On Fri, Feb 7, 2014 at 5:17 PM, Lauro Lins <[hidden email]> wrote:




--
Alex Bongiovanni
University of Maryland

Re: [Nanocubes-discuss] Histograms and category queries

Alex Bongiovanni
Well, that did the trick. Now that I understand the structure, it was simple to whip up a method to parse the tree.  As usual, when I think it's the server, it's really my fault.


On Mon, Feb 10, 2014 at 7:58 AM, Alex Bongiovanni <[hidden email]> wrote:

--
Alex Bongiovanni
University of Maryland