Incremental Quantiles/Percentiles with KDB/Q

A friend of mine asked about incremental quantiles, which I took to mean get incremental Quantiles without doing much work .

Quantiles are an inherently controversial topic among statisticians. For evidence of this look at the quantile command in the R stats package which lists nine methods for calculating this quantity from a data set.

For those who don’t know, essentially quantiles and their various flavors: percentile, decile, quartile and so on, is a method for creating equal size buckets for data. This is how standardized exams can tell you that a particular score implies you did better than some percent of the population.

Not knowing much about the topic myself. I decided to see what opinion the authors of KDB had.

KDB does not provide a built-in quantile method. KDB does have a stats.q package which implements 4 of the 9 methods mentioned above.  However, KDB does come with a built in verb called xrank.

First, let’s understand the problem a bit. The hint is that xrank is a special type of rank. Which means that quantilling is somehow related to ranking. Ranking is the most general version of quantilling. Suppose you had a list and you wanted to put each item in it’s own bucket and then label the buckets from smallest to greatest. That is exactly what rank does. Rank tells you where each item in the list belongs if it were sorted. For example:

q)rank 43 4 77 54 23 0 67 57 27 38
5 1 9 6 2 0 8 7 3 4

xrank is the special case where you only want x buckets. This is easily achieved by first ranking the list multiplying each rank by x and then  using integer division by the number of elements in the list. Suppose we want 4 buckets on the previous list:

q) 4 xrank 43 4 77 54 23 0 67 57 27 38
2 0 3 2 0 0 3 2 1 1

Okay, so we see that we can easily group items in a list into roughly equal buckets. We also learned that bucketing is basically sorting and then adding little markers every length of the list divided by number of buckets.

If stars are values and vertical bars are markers. Bucketing is sorting the stars and then dropping vertical bars in between:

***********|***********|***********|***********

What is very nice about this particular opinion expressed by KDB is that no model is better than a wrong model. Specifically, the only way to bucket data is to look at all the data that you have sort it and then insert the markers at equidistant points. Depending the on distribution the buckets might be spaced evenly with respect to the values or not. KDB does not care because the values have already been sorted.

So having this wonderful tool in hand, we proceed to look at the problem of incremental quantile. Our first reaction is: Can’t be done! How can we possibly know where to put the markers if we don’t use the whole data set. Ignore that and march on.

We find that there exists some interesting things with regard to incremental sorting. This is a possible avenue, but it seems complicated.

In cases like these, I like to call upon my genie. My genie can somehow solve the problem so quickly that the method he/she(it’s a gender neutral genie) uses is irrelevant. What does my genie return to me, well the genie tells me into which bucket I should place my current value, it does this using all the data I have up till this point (My genie doesn’t time travel or at least she/he doesn’t tell me about it).

Now I proceed with the following assumptions. If the data was negligibly small, then I don’t need a genie I can just use xrank as above. However, if the data is relatively large much larger than the number of buckets, then with a very small error you can become the genie.

You will need:
2 lists

Yep that’s it! Everything else will be provided to you.

  1. We start by using our good old xrank on all the data until the data is too large.
  2. We  use the first list to find the “You must be this high to be in this bucket” for each bucket. These are our partition markers.
  3. We use a second list to keep track of the total counts in each bucket. Our counts (kind of sounds like accounts).

Every time a new value comes in we find it’s bucket using the bin verb. Which applies binary search on the list to it’s left and returns the bucket number for the input to it’s right. For example:

q)0 5 10 20 bin 17
2

Because 17 is less than 20. So it goes in the previous bin whose minimum height is 10.

We update the counts to say that the current bucket has another item.

Here is where the magic happens: We multiply the current count against the number of buckets, if that number is higher than the total number of observations by some percent we rebucket. Otherwise if it is within our error band, we are ready for the next value.

One final note, if our error band is very small, this will have to do the sort operation much more often. However, the law of large numbers is on our side, that is as long as we are sampling from the same distribution the buckets will be right, as soon as the distribution changes the buckets will be wrong and we will rebucket. The sensitivity to a change is distribution is determined only by the error band.

I really like this approach, because it doesn’t actually care about how you get the values, all that matters is the values you have access to. Your buckets are your strongest bet on where things should land, if a bucket gets heavier than the rest, you become suspicious of your bet.

At the bottom is all the code that implements this logic.

//optimbucket is an optimized bucketing strategy
// it trades accuracy on buckets for speed
// by keeping track of how equally buckets are sorting data
//it takes a dictionary with the following keys:
//key t total observed elements
//key n number of buckets
//key bkts an ascending list of floors of each bucket,
// length should be n
//key e acceptable error before expensive recaluculation of buckets
//key c totals for each bucket,
// length should be n
//key s source for complete list of data in case of recalculation
//key b the bucket of the current element
// (could be empty)
// This is where the result is returned
optimbucket:{[d;v] d[`t]+:1; $[(d[`t]*d[`e]+1)>d[`n]*d[`c;d[`b]:|[0;d[`bkts] bin v]]+:1; d;fallback[d;v]]}

/helper functions:
/redobuckets gets the buckets from the actual source
redobuckets:{[d] g:value asc {(min x; count x)} each s group xrank[d`n;s:get d`s]; d[`bkts]:`s#g[;0]; d[`t]:sum d[`c]:g[;1];d}
/fallback is our method if buckets are wrong
fallback:{[d;v] d:redobuckets[d]; d[`b]:d[`bkts] bin v;d}
/Test this
//append and bucket, gets a value adds it to s the store
/and returns the bucket
apbk:{store,::x;`d set optimbucket[get `d;x]; (get `d)[`b]}

/gen generates data with pseudo normal distribution around 0
/x 10 uniform is usually enough
gen:{neg[(x%2)]+(sum x?x)%x}

/setup store for data
store:10?1.0
/setup dictionary
/we will do quartiles so n is 4
d:`t`n`bkts`e`c`s`b!(0;4;til 4;.01;0 0 0 0;`store;0N)
/intialize
d:redobuckets[d]
/gen some data and watch how d changes
data:gen each 1000#10
constantdata:1000#1
uniformdata:1000?1.0
apbk each data
d
apbk each constantdata
d
apbk each uniformdata
========================================
sample output
========================================
q)d
t | 10
n | 4
bkts| `s#0.01867974 1.218468 6.321394 8.083308
e | 0.01
c | 3 2 3 2
s | `store
b | -1
q)apbk each data
0 0 0 0 1 0 1 1 2 2 0 2 0 0 1 1 2 1 1 2 1 0 0 0 1 1 0 1 2 2 0 2 2 1 0 0 3 0 2..
q)d
t | 1010
n | 4
bkts| `s#-3 -1.1 -0.5 0.2
e | 0.01
c | 254 252 253 251
s | `store
b | 0
q)apbk each constantdata
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3..
q)d
t | 2010
n | 4
bkts| `s#-3 -0.5 1.9 2
e | 0.01
c | 501 501 501 507
s | `store
b | 3
q)apbk each uniformdata
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1..
q)d
t | 3010
n | 4
bkts| `s#-3 0.04688868 0.6175779 2
e | 0.01
c | 748 757 758 747
s | `store
b | 2

 

Advertisements

Prototype for Fuzzy Grouping

This is a sketch of how you can take user input fields from a survey with free text and try to do some kind of grouping of the data. I will describe my strategy but first a warning. This code should not really be published, but the idea could be interesting.

WARNING: The following code is

  • not performance optimized, AT ALL
  • makes use of global variables as a hack
  • generates different quality groups based on the random order of the input list

Alright, you’re still reading.

I came across this problem when dealing with classification of service tickets. Each service ticket represents a unique real world event and it would be nice to classify the events. In particular we are interested in finding a group of incidents that are in some way common such that we can avoid them in the future (optimistic little creatures we are). Each of these tickets comes with a free-text field that describes the resolution, in a few sentences. Humans are pretty lazy, so they are often copying from previous resolution descriptions, which at least makes it easier to group similar tickets, (of course it leaves open the possibility that the resolution description is irrelevant to the incident). However, humans are also careless so when they copy and paste, they may leave off some parts, or simply type something similar.

My strategy was to use the fuzzywuzzy library kindly developed and open-sourced by seatgeek. The library has several useful features.

Firstly it can find the ratio between two strings and it has several methods available. One has to know a little about their domain to determine which strategy to use. For example:

fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") ⇒ 96
fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 100
fuzz.token_sort_ratio(\
  "New York Mets vs Atlanta Braves", \
  "Atlanta Braves vs New York Mets") ⇒ 100
fuzz.token_set_ratio(\
 "mariners vs angels",\
 "los angeles angels of anaheim at seattle mariners") ⇒ 90

Secondly, given a particular strategy and a list of words fuzzy wuzzy can try to extract the best inexact matches and it can use both a hard limit on the number of items to return and a minimum ratio to be considered a match.

Given a dataset, pandas and fuzzywuzzy we can group some data.

Imports:

import pandas as pd
import numpy as np
import difflib
from fuzzywuzzy import fuzz, process

First let’s clean the data a bit:

df = pd.read_csv(“dataset.csv”) df.userinput=df.userinput.apply(fuzz.utils.asciidammit)

We need to setup the list of items to group: choices = list(df.userinput) Unfortunately, this list is a global variable so that as we progress choices that have been added to the groups previously don’t get added again. In theory, you can get a serious performance increase from removing them from the query, but that requires adding an extra bit of branching logic in the extract function. (maybe later)

Next, we create an extract function that will take only a token and return a Boolean array that masks those elements not in it’s group. This way we can take advantage of pandas apply functionality. We also remove from the choices variable those that we have already grouped.


def extract(s): global choices res = process.extractBests(s, choices, \ scorer=fuzz.token_sort_ratio, score_cutoff=80 , limit=100) r = [x[0] for x in res] choices = [x for x in choices if x not in r] return list(df.userinput.isin(r))

Now we create our masks

masks = df.userinput.apply(extract)

We can then use our mask to filter and see all the elements that are similar to a particular item. For example, this gives us what matched the first row:

df[mask[0]]

Here is how we can see how many are in each group:

masks.apply(np.count_nonzero)

Finally, I present the complete code that includes a method for running on small subsets of the original set until some portion of choices is covered.

choices = list(df.userinput)
def extract(s):
 global choices
 res = process.extractBests(s, choices, scorer=fuzz.token_sort_ratio, score_cutoff=80 , limit=10000)
 r = [x[0] for x in res]
 choices = [x for x in choices if x not in r]
 return list(df.userinput.isin(r))
i=1
ratio = 1/4.0
while len(choices)>ratio * len(choices):
 i=i*2
 choices
 choices = list(df.Resolution_Text_ascii)
 masks = df.Resolution_Text_ascii[:i].apply(extract)
masks.apply(np.count_nonzero);

Also here is a possible branching version, but not as clean as I would
like it where it would return the group it’s a member of.

def extract(s):
global choices
if s in choices:
res = process.extractBests(s, choices, scorer=fuzz.token_sort_ratio, score_cutoff=80 , limit=10000)
r = [x[0] for x in res]
choices = [x for x in choices if x not in r]
return list(df.userinput.isin(r))
else:
return list()

Async Calls in KDB/Q

Sometimes you don’t need a long post to explain something relatively complicated.

KDB/Q has something called inter-process communcation (IPC) it allows two different independent KDB/Q processes to communicate. This is built-in so we don’t need to dig deep.

Usually every result in KDB/Q returns synchronously, meaning a verb[noun] will immediately be processed and no other work can get done during this time.

However, in general it is much more useful to allow to processes to talk asynchronously, like email as opposed to a phone call. IE get it done when you can please.

KDB/Q provides such a construct, but it’s still a bit tricky. So here is a version that works.

Start a KDB session with the -P or run \P in the session followed by a port number (5000 is the convention).   This session is now listening on that number.

In a different terminal start a session and create the handle that will talk to that listener. Assume we are listening on port 5001, we can use either the hostname or ipaddress, if we leave it blank it will assume localhost.

h: hopen `:hostname:port

If we use neg[h] “something that will run” it will send it asynchronously, but we won’t get a response with the result of what happened. So…

I made a very simple utility that lets the process you send to reply asynchronously with a result and that variable will get set.

async:{[h;v;n;r] neg[h] ({neg[.z.w] (x set y)}[r;v[n]])}

The verb takes a handle (h), a verb (v), a noun (n) and a result symbol (r). It uses the handle to send an asynchronous request. The request asks the other process to apply the verb to the noun and send back the result and set it to the result symbol.

Here is a really simple example:

q)async[h;{x*x};4;`test]
q)test
16

We can also project this verb so that it creates a verb that has an (i)mplicit handle and result.

iasync:async[h;;`result]

iasync [v; n]

That’s it. Enjoy.

 

UPDATE: I have found a more useful async is one that allows passing in an arbitrary function. This allows us to use the words set to perform that task done here.

the new async looks like this:

async:{[q;cb] neg[.z.w] (cb,value q)}

It simply says place the query into a list of it’s own and the cb into a list of it’s own. the callback will be called with the value of the query evaluated on the service.

For example:

neg[h] (async;({0N!x+x};2);(set;`a))

Will set `a on your service to the value of the query in this case 4. and display 4 on the service.  We can also define a function on our service that takes some number of arguments fill them in now and have the function be called with the result when the callback is executed.

f:{0N!2+x+y}
q)neg[3i] (async;({0N!x+x};2);(`f;3))
7

 

 

K for Good Thought: [Part 3] k,q,kdb tutorial adverbs, operators

Now we come to one of the most innovative features in APL family of languages. That is the language feature of operators.

An operator is similar to a verb (function) but unlike a verb which produces a noun an operator always produces a verb. It’s functions is very similar to an adverb in English.

OED

Adverb
A word or phrase that modifies the meaning of an adjective, verb, or other adverb, expressing manner, place, time, or degree (e.g. gently, here, now, very). Some adverbs, for example sentence adverbs, can also be used to modify whole sentences.

I will from now on refer to operators as adverbs so that our understanding can be enhanced by our natural understanding of language. An adverb serves to help us understand how a verb will modify a noun.

For example, in the previous tutorial we saw that we can create a verb like blend which applies to some arguments. Let’s simply define blend as taking two nouns and squashing them together. For instance the blend of cat and dog is simply catdog. In fact, we have seen this type of operation earlier with the comma(,) which allowed us to append the right argument to the left argument.

q)5# reverse til 15
14 13 12 11 10

q)list1: til 10
q)list2: 5# reverse til 15
q)list3:list1,list2
q)list3
0 1 2 3 4 5 6 7 8 9 14 13 12 11 10

list3 is now the longer list that contains both list1 and list2, order is preserved. So list1 will come before list2 and all the elements in list2 will still be in their original order. So to make this a bit more interesting, let’s say that blend does something a bit different. It adds a hyphen between the two arguments.

So blend[“cat”;”dog”] will create cat-dog. Here is a simple definition for blend.

blend: {x,"-",y}

We see that blend takes two implicit arguments, x and y, adds a hyphen in between and returns one single list.

What if instead of blending two items, we wanted to blend more. We could of course create a new function called blend3 that blends three items, but K/Q is much too cool for that instead it provides an adverb over in Q and “/” in K.

So for instance in Q:

 q)blend over ("cat";"dog";"mouse") 
"cat-dog-mouse"

in K:

k)blend/("cat";"dog";"mouse")

So we have generalized our function from two arguments to a list. What is happening is that blend is used with the first item from the list as the first argument and the second item as the second argument. Then the result is used as the first argument and the next item on the list is used as the second. This continues until there are no more items.

To get the intermediate results we can use scan instead of over in Q and \ instead of / in K.

Here is what that looks like:

q)blend scan ("cat";"dog";"mouse"; "horse")
"cat"
"cat-dog"
"cat-dog-mouse"
"cat-dog-mouse-horse"

Now we can generalize this further. By creating a verb that is similar to what python calls join. In python the join verb takes a list of items and adds a string separator in between each item in the list. This is often needed to make comma separated lists.

Here is K/Q implementation:

join:{y,x,z}

This verb takes three arguments, the first arguments will be placed between  the next two arguments.

q)join["!!!! ";"hello";"world"]
"hello!!!! world"

But if we use the projection idea from before which creates another function by supplying some of the arguments. Then we can easily see how to make blend from join.

blend: join ["-";;]

so we can simply write:

q)join["-"] over ("cat";"dog";"horse")
"cat-dog-horse"

OR:

q)join[","] over ("cat";"dog";"horse")
"cat,dog,horse"

We are starting to see sentences forming in K/Q. We read sentences in English from left to right and we do the same in K/Q. The sentences are actually evaluated in right to left order. So it is correct to read it as ‘join a comma over the list’ even though K looks at it  the other way.

It’s sort of like if you read the sentence

Lily jumped high over the yellow hedge.

We don’t know what she jumped over until the end yet we understand the sentence.

If we were to stage the sentence we would first need a yellow hedge then we need Lily and ask her to jump high. The staging of the sentence is how the computer interprets things, but we humans like to read sentences the first way.

This gets to a fundamental point about K/Q there are no precedence rules. The only slight exception is that you can use parentheses to  force pieces of a sentence to be evaluated together, much like you would do in English with hyphenation.

For instance in this phrase:

an off-campus apartment

We could say:

an apartment that is off campus

Then we would have the same meaning without hyphens and usually if we have a choice at least in Q/K we try to rearrange things so that we don’t need to use parentheses as keeping track of how many were opened and closed is hard on the eyes and forces the reader to scan the line many times.  As Donald Knuth said “Programs are meant to be read by humans and only incidentally for computers to execute.”  I will have another post on K/Q style, so back to our topic.

There are only six adverbs in K/Q and a user can not make their own but they are pretty powerful and can be used in many combinations. (Technically if you add your own functions into the q.k script you can create verbs that behave very similarly to adverbs. It is not recommended unless you are an advanced user and have found some particularly useful adverbs to add to the language.)

Let’s go through them and how they work.adverbs

Continue reading