A friend of mine asked about incremental quantiles, which I took to mean: get incremental quantiles without doing much work.
Quantiles are an inherently controversial topic among statisticians. For evidence of this, look at the quantile command in the R stats package, which lists nine methods for calculating this quantity from a data set.
For those who don’t know: quantiles and their various flavors (percentile, decile, quartile and so on) are essentially a method for splitting data into equal-size buckets. This is how standardized exams can tell you that a particular score means you did better than some percent of the population.
Not knowing much about the topic myself, I decided to see what opinion the authors of KDB had.
KDB does not provide a built-in quantile method, though there is a stats.q package which implements four of the nine methods mentioned above. However, KDB does come with a built-in verb called xrank.
First, let’s understand the problem a bit. The hint is that xrank is a special type of rank, which means that quantiling is somehow related to ranking. Ranking is the most general version of quantiling. Suppose you had a list and you wanted to put each item in its own bucket and then label the buckets from smallest to greatest. That is exactly what rank does. Rank tells you where each item in the list would belong if the list were sorted. For example:
q)rank 43 4 77 54 23 0 67 57 27 38
5 1 9 6 2 0 8 7 3 4
xrank is the special case where you only want x buckets. This is easily achieved by first ranking the list, multiplying each rank by x, and then taking the integer division by the number of elements in the list. Suppose we want 4 buckets on the previous list:
q)4 xrank 43 4 77 54 23 0 67 57 27 38
2 0 3 2 0 0 3 2 1 1
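As a sanity check, the same buckets fall straight out of the rank-multiply-divide recipe described above (a quick sketch; L is just a name for the example list):
q)L:43 4 77 54 23 0 67 57 27 38
q)(4*rank L) div count L
2 0 3 2 0 0 3 2 1 1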
Okay, so we see that we can easily group the items in a list into roughly equal buckets. We also learned that bucketing is basically sorting and then dropping a little marker every (length of the list divided by number of buckets) items.
If stars are values and vertical bars are markers, bucketing is sorting the stars and then dropping vertical bars in between:
***********|***********|***********|***********
What is very nice about this particular opinion expressed by KDB is that no model is better than a wrong model. Specifically, the only way to bucket data is to look at all the data you have, sort it, and then insert the markers at equidistant points. Depending on the distribution, the buckets might or might not be spaced evenly with respect to the values. KDB does not care, because the values have already been sorted.
So, having this wonderful tool in hand, we proceed to look at the problem of incremental quantiles. Our first reaction is: can’t be done! How can we possibly know where to put the markers if we don’t use the whole data set? Ignore that and march on.
We find that there are some interesting approaches to incremental sorting. This is a possible avenue, but it seems complicated.
In cases like these, I like to call upon my genie. My genie can somehow solve the problem so quickly that the method he/she (it’s a gender-neutral genie) uses is irrelevant. What does my genie return to me? The genie tells me into which bucket I should place my current value, and it does this using all the data I have up to this point (my genie doesn’t time travel, or at least she/he doesn’t tell me about it).
Now I proceed with the following assumptions. If the data is negligibly small, then I don’t need a genie; I can just use xrank as above. However, if the data is relatively large, much larger than the number of buckets, then with a very small error you can become the genie.
You will need:
2 lists
Yep that’s it! Everything else will be provided to you.
- We start by using our good old xrank on all the data, until the data gets too large.
- We use the first list to hold the “you must be at least this high to be in this bucket” value for each bucket. These are our partition markers.
- We use the second list to keep track of the running count in each bucket. Our counts (kind of sounds like accounts).
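For concreteness, with quartiles the two lists might look something like this (the numbers are made up; the names echo the bkts and c keys of the dictionary used in the code at the bottom):
q)bkts:0.1 0.3 0.5 0.8   / partition markers: the smallest value each bucket admits
q)c:250 251 249 250      / running count of values that have landed in each bucket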
Every time a new value comes in, we find its bucket using the bin verb, which applies binary search to the list on its left and returns the bucket number for the input on its right. For example:
q)0 5 10 20 bin 17
2
Because 17 is less than 20, it goes in the previous bin, whose minimum height is 10.
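One edge case worth noting: if the value is smaller than the first marker, bin returns -1, which is why the code at the bottom clamps the result with |[0;...]:
q)0 5 10 20 bin -3
-1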
We update the counts to say that the current bucket has another item.
Here is where the magic happens: we multiply the current bucket’s count by the number of buckets. If that number exceeds the total number of observations by more than some percentage, we rebucket; otherwise, it is within our error band and we are ready for the next value.
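As a toy sketch of that check (made-up numbers, mirroring the test inside optimbucket below), suppose a new value has just landed in bucket 2:
q)t:1000; n:4; e:0.01; c:250 249 251 250
q)t+:1; c[2]+:1          / a new value arrives and lands in bucket 2
q)(t*1+e)>n*c[2]         / still within the error band, so no rebucket needed
1b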
One final note: if our error band is very small, we will have to do the sort operation much more often. However, the law of large numbers is on our side; as long as we are sampling from the same distribution the buckets will be right, and as soon as the distribution changes the buckets will be wrong and we will rebucket. The sensitivity to a change in distribution is determined only by the error band.
I really like this approach, because it doesn’t actually care about how you get the values; all that matters is the values you have access to. Your buckets are your strongest bet on where things should land, and if a bucket gets heavier than the rest, you become suspicious of your bet.
At the bottom is all the code that implements this logic.
//optimbucket is an optimized bucketing strategy
// it trades accuracy on buckets for speed
// by keeping track of how equally buckets are sorting data
//it takes a dictionary with the following keys:
//key t total observed elements
//key n number of buckets
//key bkts an ascending list of floors of each bucket,
// length should be n
//key e acceptable error before expensive recalculation of buckets
//key c totals for each bucket,
// length should be n
//key s source for complete list of data in case of recalculation
//key b the bucket of the current element
// (could be empty)
// This is where the result is returned
optimbucket:{[d;v] d[`t]+:1; $[(d[`t]*d[`e]+1)>d[`n]*d[`c;d[`b]:|[0;d[`bkts] bin v]]+:1; d;fallback[d;v]]}
/helper functions:
/redobuckets gets the buckets from the actual source
redobuckets:{[d] g:value asc {(min x; count x)} each s group xrank[d`n;s:get d`s]; d[`bkts]:`s#g[;0]; d[`t]:sum d[`c]:g[;1];d}
/fallback is our method if buckets are wrong
fallback:{[d;v] d:redobuckets[d]; d[`b]:d[`bkts] bin v;d}
/Test this
//append and bucket: gets a value, adds it to the store s,
/and returns the bucket
apbk:{store,::x;`d set optimbucket[get `d;x]; (get `d)[`b]}
/gen generates data with pseudo normal distribution around 0
/it sums x uniform draws; x of 10 is usually enough
gen:{neg[(x%2)]+(sum x?x)%x}
/setup store for data
store:10?1.0
/setup dictionary
/we will do quartiles so n is 4
d:`t`n`bkts`e`c`s`b!(0;4;til 4;.01;0 0 0 0;`store;0N)
/initialize
d:redobuckets[d]
/gen some data and watch how d changes
data:gen each 1000#10
constantdata:1000#1
uniformdata:1000?1.0
apbk each data
d
apbk each constantdata
d
apbk each uniformdata
========================================
sample output
========================================
q)d
t | 10
n | 4
bkts| `s#0.01867974 1.218468 6.321394 8.083308
e | 0.01
c | 3 2 3 2
s | `store
b | -1
q)apbk each data
0 0 0 0 1 0 1 1 2 2 0 2 0 0 1 1 2 1 1 2 1 0 0 0 1 1 0 1 2 2 0 2 2 1 0 0 3 0 2..
q)d
t | 1010
n | 4
bkts| `s#-3 -1.1 -0.5 0.2
e | 0.01
c | 254 252 253 251
s | `store
b | 0
q)apbk each constantdata
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3..
q)d
t | 2010
n | 4
bkts| `s#-3 -0.5 1.9 2
e | 0.01
c | 501 501 501 507
s | `store
b | 3
q)apbk each uniformdata
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1..
q)d
t | 3010
n | 4
bkts| `s#-3 0.04688868 0.6175779 2
e | 0.01
c | 748 757 758 747
s | `store
b | 2