Another Dependency Management Scheme

This post is based on an accumulator pattern I encountered when converting recursive functions to be tail recursive.

For a long time I have been looking for a way to cleanly handle errors that arise in functions. It's well known among programmers that only about 20% of code does any real work; the rest is there to take care of IO and error conditions. That's one of the afterthoughts that Joe Morasco has in How to learn a new programming language.

It takes me a few hours to recall how to make the linked list
work and get something on the air. To allow someone else to play the
game unsupervised (which means doing some error handling) usually
takes me a few days of programming.

KDBQ lets you omit a lot of ceremony and boilerplate so you can get to work faster. Often a function that does something useful, i.e. real work, is a few lines. However, you still need to do the error handling somewhere, and it's best if we can keep our functions doing mostly work. If there are specific error conditions that need to be handled then by all means handle them; otherwise assume that the input you get is what you expect. (I know this sounds crazy to all those paranoid defensive programmers, but frankly you might like where this post goes.)

The only ceremony that I insist on is that every function you write takes a dictionary as the input. I find this to be pretty liberating in general as you can easily add functionality over time without breaking an API and it allows for the function to implement sensible default values while allowing customization.
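For example, here is a minimal sketch of that convention; the keys `rate and `qty are made-up placeholders and have nothing to do with the pet example that follows:

/price takes a dictionary; defaults are filled in, anything the caller passes wins
price:{[args]
  defaults:`rate`qty!(1.5;1);
  args:defaults,args;          /keys supplied by the caller override the defaults
  args[`rate]*args[`qty]}

q)price[()!()]
1.5
q)price[enlist[`qty]!enlist 10]
15f

Adding a new optional key later is just another entry in defaults, so the API never breaks.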

What I eventually settled on was something that would allow me to create dependent functions by ordering them.

I think of them as “getters.” Each function “gets” a value using a dictionary of configurations.

A list of getters can be easily seen on a page and read. If you know what each getter is doing then you can keep reading, otherwise you can go look at the definition.

I then would create a getter dictionary; each key in the getter dictionary will eventually be assigned a value. The value is either the value that was already in the input or the result of applying the current getter to the dictionary.

To make this a bit clearer, suppose that I want to create a function that gives my pet animal a name.

I have the following dependencies:

What pet animal am I getting? Turtle, Elephant, Dog, Snake

getAnimalType:{[args] first 1?`Turtle`Elephant`Dog`Snake}

Is the animal a boy or a girl? (obviously turtles and elephants are more likely to be girls)

getBoyorGirl:{[args] 
   d:`Turtle`Elephant`Dog`Snake!0.7 0.7 0.5 0.5;
    $[first d[args[`AnimalType]]>1?1.0;`girl;`boy]}

Choose a name based on those two criteria.

getName:{[args] $[`boy~args[`BoyorGirl];`Kunee;`Kuni]} 
/(we like the name IPA transcription kʊniː)

So here is the basic idea: We combine these three separate verbs into a larger verb:

getPetAnimalName:{[args] 
  getters:()!();
  getters[`AnimalType]:getAnimalType;
  getters[`BoyorGirl]:getBoyorGirl;
  getters[`Name]:getName;
  :setIfMissing/[args;key getters;value getters]}

We can read this function by looking at what keys are being set in the getters dictionary. The keys tell you what this function plans on using or setting; the values tell you how the function will set them if the key is missing in the dictionary you pass in.

So, for example, if you don't know what you want, you can simply pass an empty dictionary to getPetAnimalName, which will give you back something like:

q)getPetAnimalName[()!()]
AnimalType| Turtle
BoyorGirl | girl
Name      | Kuni

Now suppose you are set on a particular pet animal; you can pass a dictionary that has the animal preset.

q)presetAnimalDict:enlist[`AnimalType]!enlist[`Elephant]
q)getPetAnimalName[presetAnimalDict]
AnimalType| Elephant
BoyorGirl | girl
Name      | Kuni

So far we have simply seen that this construct of setIfMissing allows us to push in default arguments and have default verbs that use our preset arguments. Now we can show that we also get basic error handling for free.

Suppose that we break the function that chooses the animal type. We would expect that the function will not return a boy or a girl if we pass nothing, but it should work as usual if we do pass it an animal type.

/break function getAnimalType
q)getAnimalType:{[args] break;first 1?`Turtle`Elephant`Dog`Snake}
q)getPetAnimalName[()!()] /we get back as much as was possible 
Name | Kuni
q)getPetAnimalName[presetAnimalDict] /works as before
AnimalType| Elephant
BoyorGirl | girl
Name      | Kuni

Mechanics of how the verb setIfMissing operates:

The premise of this verb, as its name suggests, is that it sets a key in the dictionary that was passed in if that key is missing. If the key already exists, it does nothing. To set a key, we call the function that was passed in, handing it the dictionary we have so far. This allows functions to depend on values set by previous functions or on arguments that were passed in. If a function throws an error, it catches the error and moves on to the next key that is given to it.
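Before reading the real implementation, it may help to see how the over adverb threads a dictionary through a list of keys and setters; this is a sketch with a trivial setter standing in for setIfMissing:

q)step:{[d;k;v] d,enlist[k]!enlist v}
q)step/[()!();`a`b`c;1 2 3]
a| 1
b| 2
c| 3

This is exactly the shape of the setIfMissing/[args;key getters;value getters] call used in getPetAnimalName.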

I have created essentially two different versions of this verb. They follow two different philosophies.

Simple Version: with primitive error handling and logging

The first version is the most simple and just logs what keys it is setting and what keys it failed to set. Here it is, pretty short:

.log:{neg[2] (string .z.Z)," : ",x,"\t",y;}
setIfMissing:{[a;k;f]  .log["INFO";"setting key: ",string k];
  .[{[a;k;f]$[ k in key a;a; @[a;k;:;f[a]]]};
    (a;k;f); 
     {[a;k;e] .log["ERROR";
   "Fail to set key: ",(string k)," Error: ",e];a}[a;k]]};

Here I have fully elucidated every step:

/first we create a logging utility
.log:{
 neg[2] /this is the handle to standard error 
 (string .z.Z) /we convert the time into a string
 ," : ",x,"\t",y; / and append the type of error to the message
 };
/so for example 
/.log["INFO";"HELLO WORLD"] 
/prints 2017.09.17T09:06:14.309 : INFO HELLO WORLD

/now we define setIfMissing
setIfMissing:{[a;k;f] /a is the dictionary,k is the key, f is the function
 .log["INFO";"setting key: ",string k]; /log that we will be setting k
 .[ /protected execution will trap any errors
 {[a;k;f] /a,k,f are the same as the beginning
 $[ k in key a; /check if k is in the dictionary
 a; /if it is return a
 @[ /general apply, allow you to modify contents of a container
 a; /a is our container or dictionary in this case
 k; /k is the key or index in the container we will modify
 :; /amend (set) is the function we will use 
 f[a] /is the value we will set, which is the verb f applied to a
 / this is what allows f to depend on any set items in the dictionary
 ] /this ends general apply 
 ] /this ends the conditional of whether k was in a
 }; / this ends the first function in the protected execution
 (a;k;f); /These are the arguments to the protected execution
 {[a;k;e] /a is still the dictionary, k is the key and e is the error
 .log["ERROR";"Fail to set key: ",(string k)," Error: ",e]; /log error
 a} /return the dictionary, and ends the trap function
 / this allows us to chain functions even if there is an error
 [a;k] /pass in the value of a and k, 
 /otherwise the error handler only sees the error
 ] /ends the protected execution
 }; /ends setIfMissing

 

Fancier version with introspection and profiling:

The second version of setIfMissing is likely to please those defensive programmers that I mentioned earlier.  I will not go through and elucidate this entire function. Instead I will point out the salient differences in this significantly more involved version of setIfMissing.

The first difference is that there is no longer text logging. Instead the dictionary that you pass in gets updated with all of the information about what happened. Two tables get added: simErrors and simProfiling. So for example here is what our examples from before look like:

q)getPetAnimalName[()!()]
            | ::
simErrors   | (+(,`failed)!,`AnimalType`BoyorGirl`Name)!+(,`reason)!,("break"..
simProfiling| (+(,`simKey)!,,`initialized)!+(,`time)!,,2017.09.17T09:53:24.471
q)getPetAnimalName[presetAnimalDict]
AnimalType  | `Elephant
            | ::
simErrors   | (+(,`failed)!,())!+(,`reason)!,()
simProfiling| (+(,`simKey)!,`initialized`AnimalType`BoyorGirl`Name`WelcomeM..
BoyorGirl   | `girl
Name        | `Kuni

We can look at those tables:

q)getPetAnimalName[()!()][`simErrors]
failed    | reason 
----------| ------------
AnimalType| "break" 
BoyorGirl | "AnimalType"
Name      | "BoyorGirl" 
q)getPetAnimalName[()!()][`simProfiling]
simKey     | time 
-----------| -----------------------
initialized| 2017.09.17T09:55:59.166
q)getPetAnimalName[presetAnimalDict][`simErrors]
failed| reason
------| ------
q)getPetAnimalName[presetAnimalDict][`simProfiling]
simKey     | time 
-----------| -----------------------
initialized| 2017.09.17T10:09:04.062
AnimalType | 2017.09.17T10:09:04.062
BoyorGirl  | 2017.09.17T10:09:04.062
Name       | 2017.09.17T10:09:04.062

As we can see, we now get all of the logging information from the object itself. This is very powerful because it allows us to do queries against this data later to either improve performance by using the profiling table or to see where the errors are coming from.

The second feature takes advantage of the simErrors table. Here is where we become super defensive: instead of naively trying to run every function against every key, we can preemptively fail if we know a function depends on an earlier key that failed. This is the reason we no longer return a name if BoyorGirl failed: since getName explicitly depends on BoyorGirl being set, we can preemptively bail and not even bother running the query. Therefore, the reason given will either be the error message from running the function or the keys that were required. To demonstrate this a bit further, let's add another function to our getPetAnimalName: a welcome message.

getWelcomeMessage:{[args] 
  "Welcome Home, ",string[args`AnimalType]," ",string[args`Name],"!"}
getPetAnimalName:{[args] 
 getters:()!();
 getters[`AnimalType]:getAnimalType;
 getters[`BoyorGirl]:getBoyorGirl;
 getters[`Name]:getName;
 getters[`WelcomeMessage]:getWelcomeMessage;
 :setIfMissing/[args;key getters;value getters]}

Now if we run getPetAnimalName:

q)getPetAnimalName[()!()][`simErrors]
failed        | reason 
--------------| -----------------
AnimalType    | "break" 
BoyorGirl     | "AnimalType" 
Name          | "BoyorGirl" 
WelcomeMessage| "AnimalType,Name"

As we can see, all keys that were required are mentioned as the reason for the failure. In particular WelcomeMessage failed because both AnimalType and Name were missing. We can do this by inspecting the function before calling it: if any of the keys in the failed column of the simErrors table appear in the function, we report them and bail.
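As a rough sketch of that inspection, the fancier setIfMissing shown below tokenizes the getter's source text with -4! and treats every literal symbol in the body as a potential dependency:

q)toks:-4!last value getName       /last value f is the source text of the lambda
q)`$1_'toks where toks like "`*"   /every literal symbol appearing in the body
`boy`BoyorGirl`Kunee`Kuni

Any of these that also appear in the failed column of simErrors becomes the reason string, and the getter is skipped.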

Finally, the last feature of setIfMissing is that we can pass the dictionary by name or by value. So far in all the examples we have passed the dictionary by value and gotten a new value back with some keys set. However, if we call setIfMissing with a name of a dictionary it will return the name and modify the dictionary in place.

q)presetAnimalDict
AnimalType| Elephant
q)getPetAnimalName[`presetAnimalDict]
`presetAnimalDict
q)presetAnimalDict
AnimalType    | `Elephant
              | ::
simErrors     | (+(,`failed)!,())!+(,`reason)!,()
simProfiling  | (+(,`simKey)!,`initialized`AnimalType`BoyorGirl`Name`WelcomeM..
BoyorGirl     | `girl
Name          | `Kuni
WelcomeMessage| "Welcome Home, Elephant Kuni!"

Finally here is the code for the introspecting and more defensive setIfMissing:

setIfMissing:{[a;k;f] 
 if[not `simErrors in key a[];
 simT:``simErrors`simProfiling!
 (::;([failed:()]reason:());([simKey:(1#`initialized)]time:1#.z.Z));
 $[neg[11h]~type a;
 a set a[],simT;
 a:a,simT]
 ];
 .[{[a;k;f]
 $[ k in key a;
 @[a;`simErrors`simProfiling;
 {[x;y]y@x};
 ({[k;d]d _ k}[k];{[k;d]d upsert (k;.z.Z)}[k])]; 
 [$[any c:@[0!a[`simErrors];`failed] in `$1_'s where(s:-4!last value f) like "`*";
 'raze "," sv string @[0!a[`simErrors];`failed] where c;
 res:f[a[]]
 ];
 @[a;(`simErrors;`simProfiling;k);
 {[x;y]y@x};
 ({[k;d]d _ k}[k];{[k;d]d upsert (k;.z.Z)}[k];{y;x}[res])]]
 ]};
 (a;k;f);
 {[a;k;e]  @[a;`simErrors;upsert;(k;e)]}[a;k]]};


/#### Short and Sweet Documentation:
How to use:
/Create a list of getter functions that take a dictionary as input
getA:{[kwargs]5};
getB:{[kwargs]42};
getC:{[kwargs]kwargs[`A]+kwargs[`B]};




/Create a meta function that composes these getters
getABC:{[kwargs]
 getters:()!();
 getters[`A]:getA;
 getters[`B]:getB;
 getters[`C]:getC;
 :kwargs:setIfMissing/[kwargs;key getters;value getters]}

/Apply the meta function to either a dictionary or a name of a dictionary
emptyD:()!()
q)getABC[emptyD]
sigil       | ::
simErrors   | (+(,`failed)!,())!+(,`reason)!,()
simProfiling| (+(,`simKey)!,`started`A`B`C)!+(,`time)!,2017.09.14T13:44:24.05..
A           | 5
B           | 42
C           | 47
q)getABC[`emptyD]
`emptyD
q)emptyD
sigil       | ::
simErrors   | (+(,`failed)!,())!+(,`reason)!,()
simProfiling| (+(,`simKey)!,`started`A`B`C)!+(,`time)!,2017.09.14T13:45:02.09..
A           | 5
B           | 42
C           | 47
q)notSoEmptyD:`A`B!(10;14)
q)getABC[notSoEmptyD]
sigil       | ::
A           | 10
B           | 14
simErrors   | (+(,`failed)!,())!+(,`reason)!,()
simProfiling| (+(,`simKey)!,`started`A`B`C)!+(,`time)!,2017.09.14T13:47:59.24..
C           | 24




/We can look at profiling information
q)emptyD.simProfiling
simKey | time 
-------| -----------------------
started| 2017.09.14T13:45:02.099
A      | 2017.09.14T13:45:02.099
B      | 2017.09.14T13:45:02.099
C      | 2017.09.14T13:45:02.099

/We get automatic error handling and preemptive failure
/let's add an error into getA
getA:{[kwargs]'`broken};
emptyD:()!();
q)emptyD:getABC[emptyD]
sigil       | ::
simErrors   | (+(,`failed)!,`A`C)!+(,`reason)!,("broken";,"A")
simProfiling| (+(,`simKey)!,`started`B)!+(,`time)!,2017.09.14T13:51:06.419 20..
B           | 42
q)emptyD.simErrors
failed| reason 
------| --------
A     | "broken"
C     | ,"A"
/If the value is set the meta function will still work.
q)getABC[notSoEmptyD]
sigil       | ::
A           | 10
B           | 14
simErrors   | (+(,`failed)!,())!+(,`reason)!,()
simProfiling| (+(,`simKey)!,`started`A`B`C)!+(,`time)!,2017.09.14T13:53:46.03..
C           | 24




Caveats:
 The sim prefix (simErrors, simProfiling) is reserved for setIfMissing's bookkeeping keys in the dictionary

Mechanics of the As-Of-Join Primitive

Recently, I had the necessity to perform an operation that didn’t exist in KDB, but whose scent reminded me of a primitive provided in KDB – As Of Join (aj). I will sketch the problem that motivated me. Then I will describe what the As Of Join operation does and how it is implemented in KDB. Finally, I will describe the new verb I created that solves the original problem.

Primal Scenario:

We have the following tables produced by two systems in the same company: a table of user interactions and another table from an operational system recording transactions. To make this a bit more concrete, our operational system manages inventory in a particular warehouse and so has the following columns: Date, ItemId, Qty, timestamp. Here are 10 sample rows and the code to generate them:

/
For the sake of making things a bit narrower, 
I am truncating a GUID with the following helper functions
\
q)trunID:{`$8#string x}
q)genID:{trunID each x?0Ng}
q)sysTable:([] Date:20170805+til 10;
      ItemId:genID 10;
      Qty:10?10;
     timestamp:00:00:00+10?100000)
q)show sysTable
Date     ItemId   Qty timestamp
-------------------------------
20170805 72de67b9 2   14:27:23
20170806 02e583e3 6   23:43:36
20170807 cffa7aa8 4   07:42:14
20170808 327b47a8 1   26:28:04
20170809 d1b9ccd1 0   11:48:17
20170810 6cb019b3 2   04:59:38
20170811 bb466f80 9   17:54:00
20170812 70112cc0 6   00:54:33
20170813 c9d56ab5 0   22:19:31
20170814 96b75d97 8   10:20:35

The user interactions table keeps track of user related events. It has the following columns: UserId, Date, EventType, ItemId, Qty, timestamp. Here are 10 sample rows and code to generate it:

q)userTable:([]UserId:genID 10;
           Date:20170805+til 10;
           EventType:10#`buy;
           ItemId:genID 10;
           Qty:10?10;
           timestamp:00:00:00+10?100000)
q)show userTable

UserId   Date     EventType ItemId   Qty timestamp
--------------------------------------------------
88c5bbf5 20170805 buy       b19c6770 6   25:51:08 
42eeccad 20170806 buy       f12c9963 6   01:31:20 
c5510eba 20170807 buy       9ee93847 1   22:36:58 
5f344195 20170808 buy       7bfa7efa 8   20:50:09 
9a4f68c6 20170809 buy       24e2f268 5   16:27:03
26ba3a3e 20170810 buy       94922173 4   17:13:45 
c046e464 20170811 buy       4dec3017 9   13:16:27 
0e09baa4 20170812 buy       f5f7b90e 2   24:29:14 
a9f44e57 20170813 buy       0cb8391a 7   07:38:13 
1e1f2a49 20170814 buy       2df03fb0 0   23:10:12 

We will remove all rows that don't have EventType=`buy, so we can ignore this column from now on, but I left it in as an example of the kind of data cleaning you might perform. Also, for the rest of this tutorial I create a universe of itemids and userids to sample from; this creates a more realistic simulation where itemIds will overlap in the two tables:
itemids:genID 1000
userids:genID 100000

We want to perform some analytics to figure out which warehouses we should use for different products based on user geography. To do this we need to join the inventory data to the user data. Unfortunately, there is nothing that links these two tables together; these systems are independent even though they record the same real world events. You work with what you've got. Since these events are connected, if we see a transaction in the user table that has the same date, same ItemId and same Qty, we can be pretty sure that it is the same event. This can be performed with a simple left join like so:

/This join takes less than half a second on 1 million rows 
/in both tables
q)\t userTable lj `ItemId`Date`Qty xkey sysTable
453

However, our company actually has many, many users, so using just Date, ItemId and Qty is going to produce many possible matches, of which KDB will always choose the first. This is exacerbated by the fact that most times Qty is 1. So we need to incorporate the timestamp column to get better matching results. We would like to match two rows if Date, ItemId and Qty match and the timestamps are sufficiently close. In particular, we would like to take the first match from the userTable where, everything else being equal, the timestamps are within 5 minutes of each other (assume that if they are outside this window, some other system must be responsible for that row).

The As Of Join Primitive for TimeSeries analysis

The idea behind As Of Join is that you can look at the state of the universe in another table as of a certain time. The most common example involves joining a trades table to a quotes table. I won't belabour that example. Instead, I'll translate this into our fictional internet store company. I want to know what the inventory was as of a particular time when a product was ordered. I have an inventory table:

q)invTable:([]ItemId:i#itemids;
    Qty:i?1000;
    DateTimeStamp: 
       (i#til 10)+(2017.08.05D00:00:00.000+i?1000000000000000))
/let's sort it in asc order of the DateTimeStamp
q)invTable:`DateTimeStamp xasc invTable
q)show invTable
ItemId   Qty DateTimeStamp                
------------------------------------------
30c86bca 375 2017.08.05D00:00:00.075460419
73d8673e 772 2017.08.05D00:00:00.114482831
2e6d75b2 166 2017.08.05D00:00:00.330689365
6c55cdcf 860 2017.08.05D00:00:00.336812816
a3715f66 364 2017.08.05D00:00:00.457884749
9abd4b9f 396 2017.08.05D00:00:00.726431612
89e2e90e 37  2017.08.05D00:00:00.795442611
c7891977 999 2017.08.05D00:00:00.817328697

Given  an ItemId and DateTimeStamp we would like to know how much of that Item we had as of that Time.

In other words, we want the Qty for that ItemId where the time is as close as possible to the given time but no greater.
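For a single item this is one aj call with a one-row query table; the ItemId below is just one of the sampled ids shown above, so your ids will differ:

q)aj[`ItemId`DateTimeStamp;
     ([]ItemId:1#`30c86bca;DateTimeStamp:1#2017.08.06D12:00:00);
     invTable]
/returns that item's row with its Qty as of noon on the 6th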

To make the problem a bit more interesting, we can imagine that we have a table of all the products and we want all the inventory as of 2017.08.07D00:00:00. We can do this pretty easily:

/First generate a table with all of the products and that timestamp;
q)queryTable:([ItemId:itemids]DateTimeStamp:2017.08.07D00:00:00)
q)show queryTable
ItemId   DateTimeStamp                
--------------------------------------
d44f9458 2017.08.07D00:00:00.000000000
aa25ba3a 2017.08.07D00:00:00.000000000
beec200c 2017.08.07D00:00:00.000000000
6e9b637f 2017.08.07D00:00:00.000000000
70c3756c 2017.08.07D00:00:00.000000000
a8f4636b 2017.08.07D00:00:00.000000000
befc6ca7 2017.08.07D00:00:00.000000000

/
 Then we put the columns we want to join 
 in order from least frequent to most frequent
 with the time column last. 
 Then the two tables, query table first
\
q)aj[`ItemId`DateTimeStamp;queryTable;invTable]
ItemId   DateTimeStamp                 Qty
------------------------------------------
d44f9458 2017.08.07D00:00:00.000000000 143
aa25ba3a 2017.08.07D00:00:00.000000000 774
beec200c 2017.08.07D00:00:00.000000000 925
6e9b637f 2017.08.07D00:00:00.000000000 711
70c3756c 2017.08.07D00:00:00.000000000 447
a8f4636b 2017.08.07D00:00:00.000000000 190
/
If we want to see the time that actually matched
we can use its sister command
aj0
\
q)aj0[`ItemId`DateTimeStamp;queryTable;invTable]
ItemId   DateTimeStamp                 Qty
------------------------------------------
d44f9458 2017.08.06D03:43:09.477791637 143
aa25ba3a 2017.08.06D03:45:38.244558871 774
beec200c 2017.08.06D08:46:36.299622582 925
6e9b637f 2017.08.06D19:46:02.531682106 711
70c3756c 2017.08.06D07:43:15.845127668 447
a8f4636b 2017.08.06D03:44:03.353382316 190

/
We can verify that all the times are before 
 midnight on the 7th of August
\

Let’s now look at the internal mechanics of how this actually works.

First KDB finds all the candidate matches based on all the columns up to the last one. Then the aj verb makes use of the binary search function (bin) in KDB. bin takes two arguments of the same type: the first is a list sorted in ascending order, the second is either a list or an atom. For each element of the second argument it returns the index of the last element in the first list that is less than or equal to it. A simple example will illustrate this:

q)0 1 2 3 4 5 6 7 bin 4
4
q)0 1 2 3.5 4 5 6 7 bin 3.0
2
q)0 1 2 3.5 4 5 6 7 bin 0.1 0.2 1.9 3.6
0 0 1 3

So assuming the candidate matches are sorted, picking out the latest as of the query time becomes as simple as performing binary search on each row we are trying to match against the candidate matches, which is exactly how KDB does it.
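Conceptually, for a single ItemId the lookup is nothing more than bin against that item's sorted times (the values here are hypothetical):

q)times:2017.08.05D00:00:00 2017.08.06D12:00:00 2017.08.07D06:00:00
q)qtys:10 20 30
q)qtys[times bin 2017.08.07D00:00:00]   /quantity as of the latest time at or before the query
20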

Another way of thinking about “bin” is that it returns the last element from a set of candidate matches, where match means being before the current element.

With this view in mind, it should be clear that we can't use As Of Join to solve the primal scenario: as-of-join essentially gives you the last row at or before the time you are asking for, whereas we are looking for the first row. If we consider shifting the timestamps so that all the candidates within the tolerance are before the matching row's timestamp, we would still get the last match instead of the first one.

Luckily, there is another function called binr that returns the first index whose element is equal to or greater than the query. Which leads us to:

q)0 1 2 3 4 5 6 7 binr 4
4
q)0 1 2 3.5 4 5 6 7 binr 3.0
3
q)0 1 2 3.5 4 5 6 7 binr 0.1 0.2 1.9 3.6
1 1 2 4

Since binr returns the first index that is equal or greater, we can think of it as returning the first match from a set of candidates where matches are determined as being greater or equal to the element being matched. This bin right function, so named because it returns the binary search from the right, is now enough to go and solve the primal scenario.

A New Verb AJR as of join right

Let’s take a quick look how aj is implemented:

k){.Q.ft[{d:x_z;$[&/j:-1<i:(x#z)bin x#y;y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y}

Sure enough there is the bin verb right in the middle of the verb.

So we can create another verb called ajr and define it like this:

k){.Q.ft[{d:x_z;$[&/j:-1<i:(x#z)binr x#y;y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y}

A quick note: the easiest way is to simply go into k mode by typing one backslash. But in a script it's easier to just write:

k)ajr:{.Q.ft[{d:x_z;$[&/j:-1<i:(x#z)binr x#y;
     y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y}

or

"k" "ajr:{.Q.ft[{d:x_z;$[&/j:-1<i:(x#z)binr x#y;
     y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y}"

However you define this verb, you can now answer the question: what was the inventory at the first recorded point at or after 2017.08.07D00:00:00.000000000?

q)ajr[`ItemId`DateTimeStamp;queryTable;invTable]
ItemId   DateTimeStamp                 Qty
------------------------------------------
d44f9458 2017.08.07D00:00:00.000000000 38 
aa25ba3a 2017.08.07D00:00:00.000000000 5 
beec200c 2017.08.07D00:00:00.000000000 723
6e9b637f 2017.08.07D00:00:00.000000000 783
70c3756c 2017.08.07D00:00:00.000000000 56 
a8f4636b 2017.08.07D00:00:00.000000000 112
befc6ca7 2017.08.07D00:00:00.000000000 871
c8bc95e4 2017.08.07D00:00:00.000000000 586

/Similarly we can define the ajr0 verb as follows:
/ and that way we can see what time it matched
q)k)ajr0:{.Q.ft[{d: z;$[&/j:-1<i:(x#z)binr x#y;
    y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y}
q)ajr0[`ItemId`DateTimeStamp;queryTable;invTable]
ItemId   DateTimeStamp                 Qty
------------------------------------------
d44f9458 2017.08.12D15:17:13.710407838 38 
aa25ba3a 2017.08.11D22:32:05.314057246 5 
beec200c 2017.08.13D03:55:10.402670877 723
6e9b637f 2017.08.11D19:32:36.058830770 783
70c3756c 2017.08.12D18:53:29.659736235 56 
a8f4636b 2017.08.15D10:47:32.423256268 112
befc6ca7 2017.08.09D03:48:48.974271199 871
c8bc95e4 2017.08.10D12:42:14.533837072 586

 

So returning to our primal scenario, all we need to do is add the tolerance to the userTable timestamp. Then we can use this verb to join on all the previous columns and take the first match within the tolerance.

/
first I will create another column 
that is 5 minutes offset into the future to match on
\
q)update timestamp:timestamp+00:05,
  timestampOriginal:timestamp from `userTable
q)show userTable
UserId   Date     EventType ItemId   Qty timestamp timestampOriginal
--------------------------------------------------------------------
1faaa20e 20170805 buy       992098f0 7   19:58:02  19:53:02
a419e88a 20170806 buy       327d41d0 8   15:12:55  15:07:55
79cdd3d7 20170807 buy       bc8ff04f 0   00:38:48  00:33:48
4155154c 20170808 buy       b7318a04 3   04:21:50  04:16:50
a22652c8 20170809 buy       9e35c5a0 7   09:01:45  08:56:45
277f1b93 20170810 buy       8cc964a4 5   07:56:52  07:51:52
a7966916 20170811 buy       c2c5bfaf 9   22:59:05  22:54:05

/
Then we will perform the new AS OF JOIN RIGHT
\
q)ajr[`Date`ItemId`Qty`timestamp;sysTable;userTable]
Date     ItemId   Qty timestamp UserId   EventType timestampOriginal
--------------------------------------------------------------------
20170805 2fc6d170 5   00:50:21  a5732847 buy       07:45:49
20170805 78bc9724 5   00:54:57  3f039eb6 buy       04:12:17
20170805 3135e6a8 8   00:57:40  705b75d5 buy       05:03:23
20170805 4a4b9e36 9   01:42:09  6987be4e buy       02:00:52
20170805 a6b5e1f3 5   01:59:56  3850deb3 buy       05:53:19
20170805 d9d0a16c 9   02:12:31  620bda64 buy       10:39:09
20170805 7c2652a7 3   02:30:49  7af80798 buy       04:19:24
20170805 c90c6086 5   02:42:07  d4256817 buy       04:55:00
20170805 be19e79f 6   02:43:57  5c0d3d58 buy       09:38:14
20170805 264e5489 9   03:36:12  82b8912f buy       04:20:57
20170805 0ba0bfd2 0   04:01:53  4f996c1d buy       16:11:22
20170805 cb3e189d 0   04:07:03  675cd3b7 buy       08:43:38
20170805 f2c41bca 6   04:36:03  855525d7 buy       09:33:50
20170805 8eec99e2 2   04:36:26  15cab5af buy       05:07:27
20170805 34894f41 9   04:42:40  ba2ca22d buy       06:11:16

/
It will match to the first row at or after the (shifted) timestamp.
Since we pushed the timestamp up 5 minutes, the match will be anywhere from
within the tolerance up to possibly the last index if there is no real match.
The issue is that we can now be outside the tolerance on the other side;
to prevent this we can simply filter.
\
q)joined:ajr[`Date`ItemId`Qty`timestamp;sysTable;userTable]
/
We filter where the matches exist and 
timestamps are within the tolerance
\
q)select from joined where not null UserId,00:05>timestampOriginal-timestamp
Date     ItemId   Qty timestamp UserId   EventType timestampOriginal
--------------------------------------------------------------------
20170805 ac2dbea6 1   07:21:21  99303d09 buy       07:20:27
20170805 b842007c 2   11:05:40  5171565d buy       11:04:20
20170805 2d948578 7   19:10:14  75344162 buy       19:13:26
20170805 c061f673 5   23:20:43  40cf305a buy       23:20:55
20170805 9396fa95 2   24:25:48  d7482537 buy       24:22:43
20170806 a73e1906 0   04:45:55  5c7ebb13 buy       04:43:15
20170806 49ad69e9 6   08:16:19  c6df7f37 buy       08:18:55
20170806 931226c9 3   20:39:59  30789ba9 buy       20:34:59
20170807 39e53bb2 9   03:43:39  ccc71cae buy       03:47:14
20170807 40f3f6de 8   05:56:22  fc967f51 buy       05:59:30

That’s it.

Here is all the code to simulate this:

trunID:{`$8#string x};
genID:{trunID each x?0Ng};
i:10000;
itemids:genID 1000;
userids:genID 100000;
sysTable:([] Date:i#20170805+til 100;
 ItemId:i?itemids;
 Qty:i?10;
 timestamp:00:00:00+i?100000);
sysTable:`Date`timestamp xasc sysTable;
i:10000000
userTable:([]UserId:i?userids;
 Date:i#20170805+til 100;
 EventType:i#`buy;
 ItemId:i?itemids;
 Qty:i?10;
 timestamp:00:00:00+i?100000);

k)ajr0:{.Q.ft[{d: z;$[&/j:-1<i:(x#z)binr x#y;y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y};
k)ajr:{.Q.ft[{d:x_z;$[&/j:-1<i:(x#z)binr x#y;y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y};
update timestamp:timestamp+00:05,
 timestampOriginal:timestamp from `userTable;
userTable:`Date`timestamp xasc userTable;

joined:ajr[`Date`ItemId`Qty`timestamp;sysTable;userTable]
show select from joined where not null UserId,00:05>timestampOriginal-timestamp;

A Short Interlude on KDB Load balancing

As you might know KDB is inherently single threaded. In other words, it can do at most one thing at a time in sequence.

This is a good thing. The benefits include:

  • Ordered and inherently sequential transactions that see a consistent view of the world.
  • Fewer cache misses since data is processed to completion instead of being rescheduled.
  • Programs are easier to reason about.

This post isn't really about how awesome being single threaded is; there are plenty of articles about that featuring Node.js and JavaScript, plenty of Stack Overflow answers on the subject, and an old paper about KDB.

OED

Synchronous: ADJECTIVE

  • Existing or occurring at the same time.

Asynchronous: ADJECTIVE

  • Not existing or occurring at the same time.
  • Computing Telecommunications
    Controlling the timing of operations by the use of pulses sent when the previous operation is completed rather than at regular intervals.

In general, in this single threaded world, you simply design components that perform some function and go out to other services that perform other functions. If you do everything over async IPC (inter-process communication), then you are golden, because all of your q processes respond to events as soon as they happen. In a simple example, you have two q processes: one is responsible for holding realtime data and its derived/calculated values, and the other is responsible for calculating those derived values. The updater can keep updating its table and send async events to the model calculator to recalculate. When the model recalculates, it can send an async request for the updater to include some values in its derived model.
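A minimal sketch of that async conversation; the handle modelHandle, the table name and the functions recalc/calcDerived/includeVals are all made up for illustration:

/on the realtime (updater) process, after appending new data:
neg[modelHandle] (`recalc;`myTable)    /fire and forget: ask the model process to recalculate
/on the model calculator process:
recalc:{[t] vals:calcDerived t; neg[.z.w] (`includeVals;vals)}   /reply asynchronously to the caller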

However, perfect systems like this don't get built. Instead, we have many q processes using synchronous IPC. There is a reason for this, though not a good one: it is significantly easier to reason about a synchronous request. In your mind you simply model all the requests one after another and then you do something with all of the requested values.

Unfortunately, if a request takes a long time to process, you can force a service to spend all its time honoring your request. Continuing the example above, suppose someone didn't understand the initial plan and decides to calculate all the values on the realtime service; that would cause the realtime service to spend more of its time processing and less of its time updating. We can of course turn off the ability to query a realtime service like this, but that is a bit like throwing the baby out with the bath water. Synchronous requests, especially for a client, are much easier to understand.

So let’s start with a quick analogy that will help explain what we want to do. Synchronous messages are like phone calls.

You call me, I answer the phone you ask a question, I answer.

Asynchronous messages are like text messages.

You text me. Sometime later I read the message. I might respond.

Ideally, what we would like is to hire a secretary who intercepts incoming calls, answers the phone, takes a message, sends it as a text, and keeps the caller on the line until the text is answered, at which point the secretary can respond to the caller. Since this job is pretty easy, we expect that we can have one secretary serving many callers, keeping them all on the line until their query has been resolved. Unfortunately, as of q version 3.5 we still cannot do this (though it has been promised in version 3.6).

So instead we will simulate this with another mode in KDBQ called multithreaded input mode. Multithreaded input mode allows you to answer queries concurrently (up to 1020) within one q process. However, it places many limitations on what kind of queries it can execute, and you don't want heavy calculations to actually execute concurrently, for the reasons described earlier. So we settle for the next best thing: a single master forwards each query to a slave, the slave executes the query and returns the result to the master, and the master returns the result to the client. This allows us to scale a particular service without meddling with the client code. It turns out a bit of engineering is required to make this work, and it does turn multithreaded input mode into a router, something the documentation says it is not built to do {YMMV}. (I have tested it and it seems to work.)

DISCLAIMER

This works on Linux for version 3.x and on versions 3.0-3.4 on Mac; it uses UNIX domain sockets, and Mac doesn't support abstract sockets. I have implemented something similar using the fifo system as well, but it has its own drawbacks and won't be described here. I'm sure something similar with files can be done on Windows systems, but I haven't spent the time working it out at this point.

Architecture

You have one master process. This master is run with a negative port number, which will put it into multithreaded input mode. It also runs with a timer turned on, so it will be checking that its slaves are alive every N milliseconds. I have set N to a thousand, so every second (this is probably too often, but it makes testing faster, since you can see the result of taking a slave offline within a second).

q master.q -p -5000 -t 1000

It runs with a master script called master.q

You also need to run some slave nodes that will listen on their own individual ports. For our example we will use 4001,4002,4003,4004.

q -p 4001

Now I will explain what is inside master.q

//the unix handles of the slave processes
handles:`:unix://4001`:unix://4002`:unix://4003`:unix://4004
h: {@[{hopen x};x;0Ni]} each handles;
han:handles!h;
han:(where han=0Ni)_han;
/ 
 we will define a function that updates all of
 open handles, so that none of them are stale. This will be executed
 in the main thread, which means that all of the slaves should 
 have returned by this point. So if they don't respond it means they
 are dead.
\
.z.ts:{ 
    alive:{@[{x (=;1b;1b)};x;0b]} each han;
    $[all[alive] & count[han]=count[handles];;
      [dead:where not alive; hclose each han dead;
       inactive:handles where not handles in key han;
       h:{@[{hopen x};x;0Ni]} each (dead,inactive);
       hd:(where hd=0Ni)_hd:(dead,inactive)!h;
       `han upsert hd;]]};
/
Because each concurrent handle is unique we can assign a slave 
by binning the request handles over the number of active slaves.
In other words we are round robining over the slaves.
\
.z.pg:{[m] (first value (.z.w mod count han)_han) m};
.z.ps:.z.pg;

The reason we have to do all this work is that any command that changes global state is illegal inside .z.pg in this mode and therefore can't be used there, and though hopen isn't illegal, it pauses all of the concurrency for all the threads because it effectively changes global state.
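For example, a handler like the following would signal 'noupdate in multithreaded input mode, because it amends a global from a socket thread (the queries log is made up):

.z.pg:{[m] queries,:enlist m; (first value (.z.w mod count han)_han) m}   /'noupdate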

As a consequence, a serious limitation of this approach is that a client that opens a connection to this multithreaded gateway while querying can lock up all the other concurrent requests. In practice this happens when the connection string query is used instead of hopen.

For example:

`::5000 "2+2"

is equivalent to

h:hopen `::5000; h "2+2"

but the first case will lock up the multithreaded gateway and the second won’t.

Secondly, this gateway does not support asynchronous requests with callbacks. That is, you can make the async request, but you won't be able to make sure the host calls you back. There are probably some ugly workarounds, like including your host and port in the function.
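One such workaround might look like this sketch, where the client ships its own address inside the query so that the slave can open a connection straight back; gw, the port 5001 and the callback function are all hypothetical and untested:

/on the client, which is itself listening on port 5001:
callback:{0N!`result`was,x}
neg[gw] ({[addr;q] neg[hopen addr] (`callback;value q)}; `$":",string[.z.h],":5001"; "2+2")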

This is a fully functional load balancer and allows clients to continue to write synchronous code without locking up your servers. Technically each of these slaves can forward the requests on to other machines, since the slaves are not actually limited.

How the Mighty have Fallen, or How I learned to love GREP.

Note: This post was written over a two week period in which I kept learning more about this problem. Many of the notes refer to later parts of the post, so that caveats can be explored. Also, some parenthetical notes are almost incomprehensible without reading many of the linked articles on the page. There is little I can do to summarize all of those articles; this post is already way too long.

This post is inspired by a friend of mine who was solving a particularly common “Big Data” problem. (I enjoyed the last two weeks immensely, and have a much greater appreciation for Ken Thompson, Andrew Gallant, Richard Stallman, and of course Arthur) 

[I use Big Data loosely; it usually relates to a problem being too large for the machine that you have access to. A great essay on this can be found in Programming Pearls Vol 1, Cracking the Oyster, where a big data problem is one that uses more than a megabyte of memory.]

Essentially, my friend had a data set that contained roughly 100 million rows, and each row had a column with a free text description. A human could easily read several records and identify which records belonged together in one category, but the sheer number of records prevented any such attempt. On the other hand, a computer using only a small set of keywords would easily misclassify many of the records. It would be easier to create more categories than there really were and then combine categories based on statistical analysis. My friend tried fuzzy matching using something I covered in an earlier post; however, that algorithm slows down drastically with larger datasets.

By the time he told me his problem he had already invented an on the fly hashing and categorization algorithm that I have still yet to fully grok (his algorithm might be faster but it is optimized for his specific problem).

Problem Statement: Assuming you have a set of rows with descriptions and a set of keyword/tokens, assign each row to one or more tokens.

In particular every token will have a set of rows that belong to it. Rows can be assigned to multiple tokens and tokens have many rows.   The total number of tokens was 1 million. The sample dataset would have 10 million rows [limited by the memory on one machine].

The first approach was to use pandas and use groupby methods. However, that was disastrously slow because Pandas strings are actually Python objects instead of being Numpy objects and are simply not optimized for speed.

This led me to move to pure Numpy. I started by constructing a toy problem. The toy problem would have a million rows. Each row would consist of only 5 randomly selected lowercase English letters separated by spaces. The goal would be to identify all the rows that contained each letter.

Here is that experiment:

In [1]:

import pandas as pd
import random
import string
import numpy as np
In [2]:
" ".join(random.sample(string.ascii_lowercase, 5))
Out[2]:
'i k a r m'
In [3]:
#million
I=1000000
tokens=list(string.ascii_lowercase)
df=pd.DataFrame({"d":[" ".join(random.sample(tokens, 5)) for i in range(I)], "x":range(I)})
In [4]:
df.head()
Out[4]:
d x
0 b c v l y 0
1 m l q u t 1
2 b f s i n 2
3 g p e k h 3
4 j k a b l 4
In [5]:
#convert the description column into a fixed size array. 
#being a bit conservative on the max size of this column is okay
d=df.d.as_matrix().astype('S1000')
In [6]:
#d is now just fixed strings
d
Out[6]:
array([b'b c v l y', b'm l q u t', b'b f s i n', ..., b'p f k n h',
       b'y p t o i', b'e g o a i'], 
      dtype='|S1000')
In [7]:
# We see 26 arrays with the indexes of the rows that have each char
list(map(lambda x: np.where(np.char.find(d, np.bytes_(x))>-1),tokens))
Out[7]:
[(array([     4,     24,     39, ..., 999967, 999990, 999999]),),
 (array([     0,      2,      4, ..., 999990, 999992, 999993]),),
 (array([     0,      9,     11, ..., 999978, 999980, 999991]),),
 (array([    12,     15,     18, ..., 999979, 999983, 999985]),),
 (array([     3,     13,     14, ..., 999991, 999995, 999999]),),
 (array([     2,      7,     13, ..., 999980, 999989, 999997]),),
 (array([     3,      6,      7, ..., 999992, 999994, 999999]),),
 (array([     3,      6,     10, ..., 999987, 999993, 999997]),),
 (array([     2,      8,      9, ..., 999993, 999998, 999999]),),
 (array([     4,      5,     10, ..., 999990, 999993, 999994]),),
 (array([     3,      4,      5, ..., 999988, 999996, 999997]),),
 (array([     0,      1,      4, ..., 999984, 999987, 999996]),),
 (array([     1,     12,     15, ..., 999964, 999975, 999979]),),
 (array([     2,      9,     10, ..., 999985, 999989, 999997]),),
 (array([    16,     21,     25, ..., 999996, 999998, 999999]),),
 (array([     3,      6,     12, ..., 999992, 999997, 999998]),),
 (array([     1,     15,     20, ..., 999986, 999988, 999989]),),
 (array([     5,     13,     17, ..., 999974, 999984, 999989]),),
 (array([     2,     14,     18, ..., 999990, 999994, 999996]),),
 (array([     1,      7,     19, ..., 999985, 999994, 999998]),),
 (array([     1,      7,      8, ..., 999993, 999994, 999995]),),
 (array([     0,     10,     11, ..., 999974, 999983, 999988]),),
 (array([     6,      9,     18, ..., 999987, 999990, 999991]),),
 (array([    17,     25,     28, ..., 999988, 999992, 999995]),),
 (array([     0,      5,      7, ..., 999991, 999995, 999998]),),
 (array([     8,     17,     18, ..., 999976, 999995, 999996]),)]
In [8]:
%%timeit
list(map(lambda x: np.where(np.char.find(d, np.bytes_(x))>-1),tokens))
 1 loop, best of 3: 22.5 s per loop

The toy problem took 22.5 seconds on my core i7 desktop from 2015. All the benchmarks were done on the same machine to prevent skewing of results.

I decided to abandon Numpy. 22 seconds for 26 tokens would not scale. At this rate finding the matches for a million tokens would take  (1 million / 26) * 22 seconds = 9.79344729 days. Multiplying by another 10 because my problem was only on 1 million rows would mean this problem was going to run for 90 days at least, you can’t even return most clothes after that much time.
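For the record, that back-of-the-envelope arithmetic in q:

q)((1e6%26)*22)%86400   /22 seconds per 26 tokens, scaled to a million tokens, in days
9.793447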

So I turned to my favorite language optimized for big data. KDBQ.

Here is basically the same code as Numpy with comments inline:

.Q.a is just all the lowercase ascii letters

/create a function g that will generate the random string
g:{raze " ",'5?.Q.a}
/create a table with a million rows using g and column d
t:([]d:g@'1000000#0)
/demonstrate the function on one letter “a” for the first 10 rows.
/This creates a boolean mask where 1 means that the row contains “a”
{first `boolean$ss[x;"a"]} each 10#t.d
0000000010b
/use where to save only the rows that have the letter
where {first `boolean$ss[x;"a"]} each 10#t.d
,9
/combine this function and apply it for each letter returning 26 rows with the indexes
/for the first 10 rows
{where {first `boolean$ss[y;x]}[x] each 10#t.d} each .Q.a
,9
2 4
1 3
`long$()
1 5 9
3 7 8
`long$()
,4
5 8 9
,7
1 7 9
,1
`long$()
5 8
,0
6 9
4 5 6
0 2 3
,2
`long$()
`long$()
0 2
..
/Time this function for a million rows
q)\t {where {first `boolean$ss[y;x]}[x] each t.d} each .Q.a
5335
/Try to parallelize the problem with 6 slave threads using peach
q)\t {where {first `boolean$ss[y;x]}[x] each t.d} peach .Q.a
3104
/Try to further parallelize the problem, no luck
q)\t {where {first `boolean$ss[y;x]}[x] peach t.d} peach .Q.a
3087

KDB has done a lot better than Numpy on the same problem, BUT this problem will still take 1.33547009 days * 10, because this was only a million rows. We get about 13 days, which is more reasonable. But we can do better.

I decided to take advantage of the KDBQ enumerated string AKA symbol type.

This led to the following code and timings:

q)g:{raze `$'5?.Q.a}
q)t:([]d:g@'1000000#0)
q)first t[`d]
`f`j`r`p`j
q)tokens:`$'.Q.a
q)\t {where {x in y}[x] each t.d} each tokens
3796
q)\t {where {x in y}[x] each t.d} peach tokens
2218
q)\t {where {x in y}[x] peach t.d} peach tokens
2288

Definitely an improvement, but nothing to write home about.

At this point, it was time to start looking to improve the algorithm. A quick StackOverflow search led me to the name of the problem: MultiString/MultiPattern Search. There are two dominant algorithms: Aho-Corasick and Rabin-Karp. Aho-Corasick was developed by Aho of AWK fame and Margaret Corasick (not much is written on the internet about Margaret Corasick, except that she was clearly one of the leading female pioneers in computer science). The algorithm is linear in the number of inputs, the number of tokens, and the number of matches. The Rabin-Karp algorithm, in true Rabin style, is technically not linear but average case linear, assuming a good hashing function is chosen; it involves using prime numbers so that you can roll through the string without looping back.

I was going to start implementing one of these algorithms, but as I was searching for a good implementation, I found that the reference implementation was in GNU grep. So out of curiosity, I tried running my toy problem using grep.

I used KDBQ to write out my table to t.txt and deleted the first line so that I just had the rows of strings, one per line. I then saved another file called tokens.txt where each new line was an ASCII letter. I then constructed the necessary grep to get the line numbers where a token appears.
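Roughly, the q side of that file preparation looked like this (a sketch; as noted, the header line that save writes out still has to be deleted by hand):

q)save `:t.txt                  /writes table t as text, one row per line, header first
q)`:tokens.txt 0: string .Q.a   /one lowercase ascii letter per line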

grep -n token filename

999918: x b w a b
999923: a v u p u
999924: y v i a p
999927: c k j a g
999930: a l n i p
999935: p a b e k
999944: f a a n t
999949: b u a h p
999951: n j x u a
….

This will give you the line numbers and the matches; using the cut command you can return only the line numbers.

Here is what that looks like:

$ grep -n "a" t.txt | cut -d: -f1

999918 
999923
999924
….

I then decided to go read the tokens.txt file line by line and grep for the line in the table of a million rows. Each result of line numbers is written to a file in the results folder with that token name. So if we want all the rows that match a token we simply look at the row numbers in that file.

time while read LINE; do grep -n $LINE t.txt| cut -d: -f1 > results/$LINE; done <tokens.txt

real 0m0.749s
user 0m0.852s
sys 0m0.044s

Less than a second and we haven’t even parallelized the operation. Adding & so that things run in the background and waiting for the process to finish we see the toy problem for what it is, A TOY.

$ time $(while read LINE; do grep -n $LINE t.txt| cut -d: -f1 > results/$LINE & done <tokens.txt; wait)

real 0m0.216s
user 0m1.424s
sys 0m0.112s

With parallelization the toy problem takes less than a fifth of a second. Estimating the run time for the original problem, we see that it will now take (1 million/26)*.216 seconds = 2.3 hours; multiplying by 10, this is now in the less-than-a-day category.

Curious and surprised that grep was reading off the disk faster than KDBQ was reading from memory, I decided to find out why grep was so FAST.

Summarizing one of the original authors of GNU grep:

  • Like KDBQ, GREP memory maps files
  • Like KDBQ, grep will often use the Boyer-Moore search algorithm that basically allows quick skipping of bytes where no match is possible.
  • Grep avoids looking at the lines and only reconstructs line breaks if there is a match

Mike Haertel's famous quote from that article is very enlightening.

The key to making programs fast is to make them do practically nothing. ;-)

So essentially KDBQ is slower than grep because it still thinks of the strings as independent rows and it isn't designed solely for ripping through strings. The tool was originally the brainchild of Ken Thompson of Unix fame. So is it surprising that after 40 years the tool that was optimized for string searching is still the king? Not really, but we thought Arthur could do better; we eagerly await K6, which promises to be an entire OS and will probably include something like grep. (I later realized that I could use Q in a different way to achieve the same result much faster, keep reading…)

A quick aside for those unfamiliar with grep. Grep is a Unix command line utility that searches files for strings that match on a line. Many options were added as grep developed. Grep gave you a way to search for regular expressions, that is, for things that are similar to a certain string. The extended regular expression engine was originally Aho's invention and was called egrep; it was merged into grep as the -E argument. fgrep gave you a way to search for exact (fixed string) matches much faster, which became -F. Similarly, -n allows you to print line numbers and -o allows you to print only the matching pattern.

Anyway, knowing something about the algorithms that allow grep to run so quickly, I decided that it was time to try a more realistic example. Firstly, I thought that longer words might further advantage grep, since it could skip more quickly through the file; secondly, I wondered how Linux would handle opening many background processes at once.

So to construct the more realistic dataset I went and found a file that contained the 20,000 most common English words.

I read the files into KDBQ and created my tokens and my table:

q)tokens:read0 `:20k.txt
/now instead of always taking 5 random tokens we take between 1 and 10
/random words without replacement, so that we get a unique set for each row
q)g:{" " sv (neg first 1+1?10)?tokens}
q)t:([]d:g@’1000000#0)
q)\t {where {first `boolean$ss[y;x]}[x] each t.d} peach 10#tokens
2453
q)\t {where {first `boolean$ss[y;x]}[x] peach t.d} peach 10#tokens
2498
q)\t {where {first `boolean$ss[y;x]}[x] each t.d} peach 100#tokens
12188
q)\t {where {first `boolean$ss[y;x]}[x] peach t.d} peach 100#tokens
12329
q)\t {where {first `boolean$ss[y;x]}[x] peach t.d} peach 1000#tokens
184171

What we see is that getting the list of indexes for 10 tokens takes about 2.5 seconds, with no significant difference from adding another peach. Getting the indexes for 100 tokens takes 12 seconds, which looks like an improvement, to about 1.2 seconds per 10 tokens. However, getting the indexes for 1000 tokens takes about 3 minutes for an average of about 2 seconds per 10 tokens. So KDBQ is definitely performing better on longer tokens, and tokens get longer deeper into the list so that is contributing to the speedup. However, we really are just using KDBQ to place the files on the filesystem for grep to pick them up.

q)save`:t.txt
/save only the top 100 tokens in a separate file
q)tokens:([]100#tokens)
q)save `:tokens.txt

Now we can see how grep does on this much more realistic problem.

Running grep on 100 tokens; we get the following timing result:

time $(while read LINE; do grep -n " $LINE " t.txt| cut -d: -f1 > results/$LINE & done <tokens.txt; wait)

real 0m4.460s
user 0m6.684s
sys 0m0.884s

So for 100 words it took 4.46 seconds, so less than 5 seconds. I decided to run it for the whole set of 20,000 words:

Here are the timings, experimenting with fgrep and with running and not running background processes:

It took 5 minutes to run on 20k tokens for a million line table.

time $(while read LINE; do grep -n " $LINE " t.txt| cut -d: -f1 > results/$LINE & done <20k.txt; wait)

real 4m54.079s
user 17m13.756s
sys 2m9.064s

Not running in parallel is a bit faster in terms of total CPU time, but is almost five times slower in wall clock time, which means you will be waiting a while.
time while read LINE; do grep -n " $LINE " t.txt| cut -d: -f1 > results/$LINE; done <20k.txt

real 23m41.263s
user 14m14.508s
sys 1m57.508s

Using fgrep was marginally faster because it searches for a literal string without any pattern matching support; if you need pattern matching then you can't use this, and the performance improvement was not all that significant. (This should have been a clue that something was wrong, see later in this post.)

time $(while read LINE; do fgrep -n " $LINE " t.txt| cut -d: -f1 > results/$LINE & done <20k.txt; wait)

real 4m29.731s
user 17m13.004s
sys 2m2.336s

Now multiplying through for the more realistic example we find that (1 million/20k)*5 minutes=4.1 hours, multiplying by 10 we get 40 hours. So it seems that our toy problem was underestimating the difficulty of the problem.

40 hours seemed like too much time to wait. So off I went in search of something better. Luckily, I came across ripgrep. Ripgrep aims to be a better version of grep written in Rust instead of C. [Rust is a language that aims to be a better version of C, so this is fitting. The name ripgrep, I'm sure, has something to do with R.I.P. grep.] The post on ripgrep taught me a lot about the underlying implementations of different search tools. It made me realize that some of the information I was reading in older posts was outdated. In particular, grep no longer used memory maps (--mmap), even if you asked for them explicitly. So grep was fast, but not as fast as it could be. Also, the newest grep had somehow increased its speed on non-ASCII characters, so some of the traditional speedups, like setting LC_ALL=C to force grep to use ASCII, don't actually increase the speed anymore, even though many StackOverflow answers still recommend them.

Here are some timings that explicitly contradict these results:

Searching for aardvark in a 10 million row file results in no speed improvement when switching between ASCII and UTF-8

$ time LC_ALL=en_US.UTF-8 grep -F aardvark -w -o -n t.txt > results/aardvark

real 0m0.339s
user 0m0.260s
sys 0m0.076s
$ time LC_ALL=C grep -F aardvark -w -o -n t.txt > results/aardvark

real 0m0.340s
user 0m0.276s
sys 0m0.064s

Also, many posts recommended using GNU parallel, but I found it did a poor job of actually distributing work onto all the cores of my CPU; it is probably much better suited to sending work to multiple machines. Instead, simply using background processes was going to be fine.

Once I read the article about ripgrep, it was time to test its bold claim that it was going to be faster than all the other tools. My more realistic example with 20k words became really easy for ripgrep, once I figured out that the 20k tokens can be split into smaller 200 line files with split. Split is a tool written by Richard Stallman that splits the original file into smaller files with the specified prefix. Here I use 'S'.

$ split -l200 20k.txt S

The small-token-files could be used to match against the large input and output a line number and a match on each line.

For example, in our toy problem, if we want to match on letters a-z, we can break it up into chunks of 2 letters and get 13 files.

file 1 -> corresponds to letters a,b
1:a
3:b
4:a
4:b

file 3 -> corresponds to letters e,f
1:e
15:f
74:e
80:e

So for each token-file we get a corresponding file with matches. Here is that command and timing:

$ time $(for i in S*; do rg -F -f $i -n -o -w t.txt>results/groupby$i & done; wait)

real 0m36.187s
user 3m23.104s
sys 1m15.308s

Now my more realistic problem seemed like a joke. It was taking less than a minute to go through a 10 million row file with 20 thousand tokens. I was starting to become very impressed with ripgrep. However, I decided that I first needed to try and find the older versions of grep that supported mmap to make the comparison fair.

After building from source no less than 8 versions of grep, I discovered some very curious things.

  • grep lost mmap support as of 2010, so any version after 2.6 essentially was not going to work, and my current Ubuntu only came with 2.25 (25 is larger than 6).
  • Old versions of grep were in general faster on large input patterns than new versions; they built the Aho-Corasick tree faster.
  • grep versions before 2.5 did not support printing only the match (-o), which was necessary for us to collect all the indexes back together at the end.
  • So grep versions 2.5.x were the only ones to support both the -o and --mmap options.

I ended up settling on grep 2.5.1 for a very strange reason. I was excited that grep 2.5.1 was going really fast on some test inputs, until I discovered that, when matching on multiple patterns at once and printing the line number, I would get the following output:

1:tenor
bleeding
fertilizer
colchester
2:mining
swap
everyone
miriam
persuasion
….

It seemed that grep had decided that it would be easier to just print the line number once per many matches. So I went to the next version grep 2.5.4, which printed the matches as expected.

1:tenor
1:bleeding
1:fertilizer
1:colchester
2:mining
2:swap
2:everyone
2:miriam
2:persuasion
….

However, it was an order of magnitude slower. So I ended up attaching a small AWK program that re-adds the line number to the lines where it was left implicit. Here is that AWK program:

awk -F: 'NF==2{line=$1} NF==1{$0=line FS $0}1'

This AWK program had almost no overhead when I tested it so it was much better than relying on grep to print all the line numbers for each match.

Once I had the right version of grep, things really started moving. I discovered that grep could match many more tokens at once than ripgrep.

In fact, my more realistic example with 20 thousand tokens ran in the same amount of time whether or not I split the token file; if anything, the overhead of starting many processes was slowing grep 2.5.1 down.

MOST REALISTIC PROBLEM (yet):

This led me to go and create the most realistic example I could without generating nonsense words myself. (I might do this later, but I was feeling a bit lazy and I think this problem is close enough.) I went here and got 355k words, then recreated my dataset. I had a file tokens.txt with 354,950 words, to be exact. I deleted about 40 words that started with a ' (single quote) because ripgrep had a bug (fixed within 14 hours of my posting it, amazing!) that prevented it from properly printing only the matches that match exactly that word [-o -w] (a very low priority bug, but I figured it was worth filing and skipping those words). I then created my 10 million row dataset, where each row had a random number of words, between 1 and 10, drawn from the tokens file, and saved it into t.txt.

I then created two very similar scripts that allowed ripgrep and grep2.5.1 to finally compete.

grepper.sh:

#!/bin/bash

FILE=$1

export LC_ALL=C
pid_count=0
for F in S*
do
  ~/Downloads/grep-2.5.1/src/grep --mmap -F -f $F -n -w -o $FILE | awk -F: 'NF==2{line=$1} NF==1{$0=line FS $0}1' > results/group$F &
  if (( ++pid_count > 4 )); then
    wait -n
    ((pid_count--))
  fi
done
wait

rger.sh:

#!/bin/bash
FILE=$1

pid_count=0
for F in S*
do
  /home/pinhus/Downloads/rg --mmap -F -f $F -n -w -o $FILE > results/group$F &
  if (( ++pid_count > 4 )); then
    wait -n
    ((pid_count--))
  fi
done
wait

The two programs behave very similarly and produce identical output. So I will describe them together.

The program takes a filename argument ($1) and keeps a count of how many background processes are running. For each pattern file S* it runs a literal match against the input file and prints the line number along with each match into a corresponding file in the results folder called group{patternfile}. It runs each match in the background and increments the count of running processes; once more than 4 are running it waits for one of them to finish, decrements the count, and moves on to the next pattern file. When the loop is done it waits for all remaining processes, so that every results file has been written out.

The main difference between them is that the grep version uses strictly ASCII characters whereas the ripgrep version uses UTF8 by default.

There is also a difference in how to optimally split the pattern file. After much experimentation, I found that ripgrep could comfortably handle up to about 400 lines at a time, whereas grep2.5.1 was optimal with 20k lines at a time.

Here are the respective split commands:

#grep split, assuming a 355k line token file
# -n l/N creates N chunks without splitting lines
split -n l/20 tokens.txt S

#ripgrep split assuming 355k lines in token file
split -n l/888 tokens.txt S

As you can see there are only 20 chunks created for the grep version and there are 888 files for the ripgrep version.

However, the real kicker is that the grep version now runs faster, because we have taken care of the overhead of having too many process handles at once.

Here are the timings:

time bash grepper.sh t.txt

real 1m27.812s
user 2m55.840s
sys 0m2.100s

time bash rger.sh t.txt

real 3m29.019s
user 25m19.516s
sys 1m55.148s

Long Live grep!

Scaling back up to the original problem of 1 million tokens, grep would now take about 4.5 minutes (call it 5 to be safe), and ripgrep about 10.5 minutes.

The reason that grep now beats ripgrep is that grep uses Aho-Corasick to do multiple matching, whereas ripgrep uses a clever regex-based approach. Instead of implementing Aho-Corasick, ripgrep joins the entire pattern file with | so that the regex engine matches any of those patterns. Ripgrep then goes one step further and tries to optimize this by searching for a common literal among all those patterns, so that it can skip the regex engine, which is much slower. If it finds a common literal string among the patterns, it submits it to its implementation of the teddy algorithm, which takes advantage of Intel's more recent SIMD instructions (single instruction, multiple data, AKA vectorized processing). The teddy algorithm checks the literal against the input 16 bytes at a time and, if it finds a match, drops into the slower regex engine to verify. This implementation forces our pattern files to be smaller, since a pattern file that is too big will not have a common literal or will match too often; in fact, pattern files that are too big throw an error because they overwhelm the regex engine. Because of this limitation, ripgrep has to scan the entire input file many more times than grep does. Additionally, ripgrep performs terribly if the original pattern file is not sorted. I suspected that ripgrep was benefitting from our sorted tokens file and finding common literals that it would not find in a randomly ordered one. To test this I used a round-robin split:

#ripgrep split, assuming 355k lines in token file
# replace the l with r, which round-robins the lines across the 888 files
split -n r/888 tokens.txt S

The difference was staggering, using a 100 thousand line input file:

#pattern file is sorted and broken into sorted chunks
time bash rger.sh t100k.txt

real 0m8.205s
user 0m59.716s
sys 0m1.400s

#pattern file is deliberately round robined across chunks
time bash rger.sh t100k.txt

real 1m11.450s
user 9m8.252s
sys 0m2.092s

The difference is between 8 seconds and 1.2 minutes on a 100 thousand row file, almost an order of magnitude. Sort your pattern file before using ripgrep. Also, because ripgrep uses a non-naive version of Boyer-Moore, it is able to skip many more bytes. (We eagerly await this optimization from Arthur; after all, SIMD is a natural fit for vectorized languages.)

On the other hand, because grep uses Aho-Corasick it can handle much larger pattern files (the chunks merely keep the tree it has to build and traverse smaller) and you don't need to sort your pattern files. This also lets grep read the input file many fewer times. Once again, algorithms beat brute force (kind of; see the last note, where I discover that many matches were missed because of a bug in 2.5.1. Most of the points stand, but I think you might be better off using ripgrep).

Now, to collect the outputs and create the final result, a token and all of its associated row numbers, we go back to KDBQ.

All of our output files live in the results directory and look something like this:

402:bicultural
630:bicapitate
697:bibby
861:biblicist
945:bicylindrical
951:bibliopolist
1200:bicalcarate

So we need a function that gets all the files from a directory:

tree:{ $[ x ~ k:key x; x; 11h = type k; raze (.z.s ` sv x,) each k;()]}

This is one of the utilities provided in the KDB Cookbooks/tips. It will recursively go down a directory and give you all the files. We just need 1 level, but it’s fast enough, so there is no need to write our own.

We create a table that will have two columns, an index column and a token column.

t:([]index:();token:())

We then create a small function that reads in each file, splits each line on the colon, and upserts the result into the table.

{`t upsert flip 0:[("IS";":");x]}

0: reads a text file; its left argument specifies the column types it expects ("IS" here, an int and a symbol) as well as the delimiter, in this case a colon.
flip arranges the resulting tuple so that it aligns with the table.
`t upsert inserts the rows into t.

By default 0: returns the data as a list of columns rather than rows, so we need to flip it:
q)("IS";":") 0: `:results/groupSaaa
72 165 567 698 811 838 1003 1136 12..
abiliment abience a1 4th aberuncator abbey abductors abdominohysterectomy 10..
q)flip ("IS";":") 0: `:results/groupSaaa
72i `abiliment
165i `abience
567i `a1
698i `4th
811i `aberuncator
838i `abbey

Running this on each file in the results directory like so:

q)\t {`t upsert flip 0:[("IS";":");x]} each tree `:results
4361
q)\t select index by token from t
567
q)select index by token from t
token| index ..
—–| ———————————————————————-..
1080 | 1283 127164 391580 443156 543775 561432 608795 862810 889780 1029079 1..
10th | 27747 57768 238851 293795 332990 487295 564854 732020 755630 834528 83..
1st | 37915 113165 175561 203936 213171 318590 382626 411531 473224 653353 6..
2 | 26642 67272 108161 122598 192496 211019 250924 281576 300950 344445 45..
2nd | 3275 50750 93226 99089 107105 208955 212762 234366 238134 403142 48216

Reading in all the output files and aggregating by token takes less than 5 seconds.

However, once I did this aggregation, I thought there must be a better way to take advantage of a KDB only solution.

PURE KDBQ:

If you knew that the only tokens you were interested in were words, and that the only word breaks in the description were spaces, then you could do the following:

/read in the raw 10 million row file
q)read0 `:t.txt
"cellulipetal outname kymograms splendiferousness preneural raploch noncontin..
"copernicus"
"overglanced"
"consists"
"epicure casern mythologization"
"angiosymphysis"
"shellfishes renniogen lactonic"
..
/define raw as this file
q)raw:read0 `:t.txt
/split the first row on spaces
q)" " vs first raw
"cellulipetal"
"outname"
"kymograms"
"splendiferousness"
"preneural"
"raploch"
"noncontingency"
..
/cast each word to a symbol
q)`$" " vs first raw
`cellulipetal`outname`kymograms`splendiferousness`preneural`raploch…
/apply this function to each row in raw and call it sraw
q)sraw:{`$" " vs x} each raw
/this takes about 19 seconds
q)\t sraw:{`$" " vs x} each raw
18744
/sraw looks like this
q)sraw
`cellulipetal`outname`kymograms`splendiferousness`preneural`raploch..
,`copernicus
,`overglanced
,`consists
`epicure`casern`mythologization
,`angiosymphysis
`shellfishes`renniogen`lactonic
..
/create an index column that tracks each row (add 1 so that it will agree with grep)
q)index:1+til count sraw
q)index
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29..
/join together each row number with every symbol in that row
q)flatrows:raze index {x,’raze y}’ sraw
q)\t flatrows:raze index {x,’raze y}’ sraw
7634
/flatrows looks like this:
q)flatrows
1 `cellulipetal
1 `outname
1 `kymograms
1 `splendiferousness
1 `preneural
1 `raploch
1 `noncontingency
..
/create a new table that has a rid and a token
q)t2:([]index:flatrows[;0]; token:flatrows[;1])
q)\t t2:([]index:flatrows[;0]; token:flatrows[;1])
4631
/t2 looks like this:
q)t2
index token
———————–
1 cellulipetal
1 outname
1 kymograms
1 splendiferousness
1 preneural
..
/aggregate and get all indexes for each token
/this takes less than 8 seconds
q)\t res2:select index by token from t2
7741
/res2 looks like
q)res2
token| index ..
—–| ———————————————————————-..
1080 | 1283 127164 391580 443156 543775 561432 608795 862810 889780 1029079 1..
10th | 27747 57768 238851 293795 332990 487295 564854 732020 755630 834528 83..
1st | 37915 113165 175561 203936 213171 318590 382626 411531 473224 653353 6..
2 | 26642 67272 108161 122598 192496 211019 250924 281576 300950 344445 45..
2nd | 3275 50750 93226 99089 107105 208955 212762 234366 238134 403142 48216
/total time without ever leaving kdb: 19+8+5+8=40 seconds!!

/If you are not interested in all the tokens you can easily subselect
/as an example, ask for a random subsample of 1000 tokens from res2
q)sampleTokens:neg[1000]?(0!res2)[`token]
/sampleTokens looks like this
q)sampleTokens
`bimillionaire`abarthrosis`krone`volleyers`correct`pitarah`aeroelastic`zilch`..
/I can ask for res2 where the token is in sampleTokens
q)select from res2 where token in sampleTokens
token | index ..
————| —————————————————————..
abaciscus | 19644 81999 125546 145360 159922 213205 223650 227560 353332 41..
abactinally | 92704 265473 480112 499419 510784 539605 617031 668847 752500 8..
abarthrosis | 25638 61103 215367 281385 305352 385693 553059 585997 598293 66..
abir | 96166 214361 258595 259398 309117 387755 419601 720156 746810 8..
abstemiously| 4611 27043 61983 72530 140138 152631 227466 272363 278554 31059..
..
/this takes a millisecond for 1000 tokens
q)\t select from res2 where token in sampleTokens
1

So if you know that everything is ASCII and that the words in the description are space-delimited, nothing beats KDBQ. Arthur, you win. I did think to tokenize my inputs, but I didn't think to wrap them with an index number until I was collecting the results.

A disclaimer: I suggest using ripgrep until grep reintroduces mmap for large files. Old versions of grep have bugs that are not documented all that well.

Also, if you need UTF-8, use ripgrep, and if you need regular expressions, use ripgrep. The lack of --mmap support in recent grep means that doing better with grep requires building an old version yourself and dealing with its bugs.

(It turns out that grep 2.5.1 has a bug where -w does not match word boundaries correctly, and 2.5.4 fixes that bug but introduces a performance penalty by printing the line number for every match, a task much better suited to AWK. To work around this and still get the same performance, you either need to add spaces around your patterns or turn off -w and deal with the overlapping matches that will inevitably occur.) I ended up using the command below, which mostly did away with the performance penalty but was still slower than ripgrep.

On the other hand, the bug I found in ripgrep while writing this post was fixed within 14 hours of me reporting it. A serious benefit of a tool that is actively supported.

time cat tokens.txt | parallel --pipe --block 1000k --round-robin grep2.5.1 -Ff - --mmap -on t.txt | awk -F: 'NF==2{line=$1} NF==1{$0=line FS $0}1' > results/test

real 4m41.146s
user 15m35.820s
sys 0m8.096s

I didn’t get a chance to fully optimize the block size, but it’s within reach of ripgrep.

GNU Parallel was used:

O. Tange (2011): GNU Parallel – The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.

Idempotent Cast

Idempotent:  denoting an element of a set that is unchanged in value when multiplied or otherwise operated on by itself.

Wikipedia gives an easier definition. A unary operation (or function) is idempotent if, whenever it is applied twice to any value, it gives the same result.

Or, more formally: f(f(x)) = f(x).

If you had to choose the top ten functions in programming that should be idempotent, casting would be near the top of the list.

Casting is how you can convert one type to another. A classic example is taking a user input string and reading it as a number. Nitpickers will argue that this should be called parsing and that casting is only when you have known inputs. I think this line of thinking is flawed and a poor abstraction. If I get an input I just want to call a function whose responsibility is to cast it to the right type. If this can’t be done the function should return an error which I will then return to the user. The user should be able to call my function without caring too much about what he puts in and I should try to make it easier by being smart enough to cast it to the appropriate type.

KDBQ has some built-in cast functions. However, almost none of them are idempotent, which is really annoying. If I get an input and I know how to cast it into the type I need, I shouldn't have to worry about whether it is already in that type. For example, I may need an integer: if you give me a string like "4" and I cast it to an integer, that should work, and if you give me 4 and I cast it, that too should work.
This came up recently when I needed a function that casts a string to a symbol. So here is an idempotent cast to symbol:

{`$raze string x}

Let’s break this apart:

string will take an argument and turn it into a string. However, string is atomic, so applied to the string "hello" it returns

(,"h";,"e";,"l";,"l";,"o")

with each character becoming its own one-character string. To get back the original we raze it, and now we have a string that we can cast to a symbol with the `$ cast verb.

If we get a symbol instead, for example `hello, string turns it into "hello", which we then cast back to the symbol.

We can do this because raze on a string that is already flat simply returns the same list.
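
To check that the verb really is idempotent, here is a quick q session (cast2sym is just an illustrative name):

q)cast2sym:{`$raze string x}
q)cast2sym "hello"
`hello
q)cast2sym `hello
`hello
q)cast2sym[cast2sym "hello"]~cast2sym "hello"
1b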

If we wanted, we could achieve the same result by checking whether the input is already a symbol and returning it as is, otherwise casting it to a symbol.

That function looks like this:

{$[11h~abs type x; x; `$x]}

This takes advantage of the triadic if ($) verb: the first argument must evaluate to a boolean (0b or 1b), the second is evaluated if it's true, and the third if it's false. This is how most programming languages would solve a problem like cast, essentially by creating code branches. In this case that is probably okay, but in general KDBQ pushes you toward designing functions that just work without creating separate code paths.
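
As a quick sanity check (again with a made-up name), the branching version behaves the same way on both kinds of input:

q)g:{$[11h~abs type x; x; `$x]}
q)g "hello"
`hello
q)g `hello
`hello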

 

There will definitely be another tutorial on control structures in Q, in the spirit of the McDonnell article in Vector. Readers familiar with most other programming languages will find it kind of weird that control structures are introduced so late. However, control structures actually handicap the abstract and general thinking which is the hallmark of clean code.

 

The problem with {subject} 101

Recently I saw this old (by internet standards) video of Elon Musk explaining what we should do about Global Warming in 12 minutes.

The talk doesn't spend much time defending the science behind global warming and instead works around the issue. First, Musk states that 97% of the scientific community believe that not changing global energy habits will result in catastrophic loss of life and wealth, and 3% don't. He then says that since oil is not a sustainable energy source, we will eventually move to a sustainable one. This is a tautology: in the long run we will need to be on a sustainable source of energy, otherwise it won't work for the long run. Since the choice becomes when, not if, to move to sustainable energy, he claims we might as well do it sooner. He then suggests that a properly functioning market economy should price in negative externalities so as not to create incentives for bad behavior, and says this is economics 101.

The phrase "this is Economics 101" is a very misleading one. Presumably, the topics get more sophisticated as you progress through the subject, and the simplifying assumptions made in 101 no longer hold. For example, knowing the cost of a particular negative externality is hard; it belongs to the class of problems known as free rider problems. Does Elon Musk know the price of the negative externality of continuing to use fossil fuels? It certainly isn't infinite. We can be pretty sure that we could not fulfill the world's current demand for energy with only clean sources even in 10 years, which means that this tax has to be gradual (he mentions this in the video, but why he chose 5 years as opposed to 7 or 3 is unclear). Furthermore, the tax collected strictly from fossil fuels must end up somewhere; we would like it to provide some public good, but we can't be sure it will. In many oil producing countries this revenue is a significant source of funding for the government, so levying a tax sufficient to shut down the industry is not in their best interest, and many social programs in those countries would have to be curtailed. One additional point: in the short run, any tax on energy is sure to harm poor people the most. Energy accounts for a significant percentage of the cost of many products, grocery items in particular, so when the price of energy goes up, those hit hardest are the people whose incomes are disproportionately spent on items influenced by the price of fuel, namely the poor.

I am sure that Elon could answer all of these questions or knows people who can. My issue is that complex problems are presented as simple problems with simple solutions, based on a rudimentary presentation of the subject. It would be like saying you just need to reach escape velocity in order to build a rocket; it's technically true, but it wouldn't have gotten Elon very far with SpaceX.

So the next time someone says "it's {subject} 101", ask yourself whether there are any simplifying assumptions that might be contradicted by {subject} 501.

An Impractical Storage Solution: That might solve a very practical problem

It was Pi day recently, and one of the interesting facts about Pi is that it is believed to be normal. Pi has been proved to be irrational and transcendental. A real number x is normal in base b if, in its representation in base b, all digits occur, in an asymptotic sense, equally often. Proving this for Pi has so far eluded mathematicians.

However, if we take this conjecture for granted, then any finite string of digits will be found somewhere in Pi. That means that, theoretically, given infinite computation power, you could compress any amount of information into two numbers: the first stores the length of the information you want to compress, and the second stores the starting point in Pi where that string can be found. We might want some more metadata: for example, if a particular piece of information occurs very far into the Pi sequence, the number recording where it can be found might itself be larger than the data, in which case we might store a pointer to that number, and so on. Ultimately this is futile, because there will be situations where storing the first pointer, its length, and the depth to which we need to follow the pointers is still longer than the data we need to store; at that point it might be worth gzipping the data and calling it a day. Of course, if you are committed, you might consider breaking the data into partitions and finding each subsequence in Pi separately; the odds are those shorter subsequences appear earlier.
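
Just to make the scheme concrete, here is a toy sketch in q (the digit string and the names are made up for illustration; a real implementation would need arbitrarily many digits of Pi):

q)pidigits:"1415926535897932384626433832795028841971"  / a few digits of Pi as a string
q)msg:"5897"                           / the "data" we want to store
q)addr:first pidigits ss msg           / where that string first occurs
q)(addr;count msg)                     / the two numbers we would actually store
9 4
q)msg~pidigits addr+til count msg      / "decompression" is just reading it back
1b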

An important point to note is that this type of storage mechanism does not violate information theory, even though in general the number of bits it takes to store or transmit anything is governed by Shannon's entropy, the minimum channel capacity required to reliably transmit the source as encoded binary digits. However, because there is actually infinite entropy in the digits of Pi, and Pi itself is deterministic, we can essentially convey the message as a function of Pi. Now, this whole post promised something practical, and the discussion so far has been theoretical.

One of the information theory problems in biology is how to account for all of the behaviors and advanced functions in advanced life forms. We know that learned behavior gets modeled in the brain after birth and has many places to hide (researchers try to catch this learning with relatively primitive imaging techniques). However, unlearned behavior, what we would call instincts and lower functions, all has to fit into about 700MB of space: the human DNA. The problem is that, given how much unlearned stuff goes on, it seems unlikely to fit into such a small amount of space. The DNA codes for how much saliva to release into your mouth cavity, as well as how to focus your eyes, or what to do during a sneeze…

So what is the solution to this seemingly paradoxical puzzle? I propose (I haven't seen this anywhere, so please correct me if there are better answers) that the solution is something very similar to the impractical storage solution. Essentially, the laws of nature are an infinite calculation engine, and though there is a bit of probability implicit in them, in the macroscopic universe the answers are deterministic. This means that DNA can simply be a mechanism that forces all of the behaviors necessary for survival to happen. Any DNA that didn't cause the infinite calculator to spit out a species that survived would die out. In this view DNA is not the total information content of our species; we need the observable universe there as well, which includes a lot of physics and chemistry (possibly all of it, though some of those subjects deal with conditions never occurring on Earth's surface).

Of course, I am exaggerating a bit in calling this natural computer infinite; Einstein limited its computing speed to the speed of light, which means no information travels between the individual components of the computer too quickly. Still, nature does do an impressive amount of computation.

This also explains some of the advantages of carrying a child in utero. A child in utero learns about the environment from its mother; there have been studies showing that a baby can recognize its mother's voice, in other words information about the environment seeps into the neural pathways of a fetus. Perhaps that is the further advantage of marsupials, which get a guided, sighted tour of the world, whereas newborn reptiles and birds rely predominantly on instinct, which limits the amount of knowledge they can have about the world.

Incremental Quantiles/Percentiles with KDB/Q

A friend of mine asked about incremental quantiles, which I took to mean: get incremental quantiles without doing much work.

Quantiles are an inherently controversial topic among statisticians. For evidence of this look at the quantile command in the R stats package which lists nine methods for calculating this quantity from a data set.

For those who don't know: quantiles and their various flavors (percentile, decile, quartile and so on) are a method for creating equal size buckets for data. This is how standardized exams can tell you that a particular score means you did better than some percent of the population.

Not knowing much about the topic myself, I decided to see what opinion the authors of KDB had.

KDB does not provide a built-in quantile method, although it does have a stats.q package which implements 4 of the 9 methods mentioned above. However, KDB does come with a built-in verb called xrank.

First, let's understand the problem a bit. The hint is that xrank is a special type of rank, which means that quantiling is somehow related to ranking. Ranking is the most general version of quantiling: suppose you had a list and you wanted to put each item in its own bucket and then label the buckets from smallest to greatest. That is exactly what rank does. Rank tells you where each item in the list would belong if the list were sorted. For example:

q)rank 43 4 77 54 23 0 67 57 27 38
5 1 9 6 2 0 8 7 3 4

xrank is the special case where you only want x buckets. This is easily achieved by ranking the list, multiplying each rank by x, and then using integer division by the number of elements in the list. Suppose we want 4 buckets on the previous list:

q) 4 xrank 43 4 77 54 23 0 67 57 27 38
2 0 3 2 0 0 3 2 1 1
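
Spelling out that recipe in a quick q session (L is just an illustrative name for the list above), the hand-rolled version agrees with xrank:

q)L:43 4 77 54 23 0 67 57 27 38
q)(4*rank L) div count L
2 0 3 2 0 0 3 2 1 1
q)(4 xrank L)~(4*rank L) div count L
1b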

Okay, so we see that we can easily group the items of a list into roughly equal buckets. We also learned that bucketing is basically sorting and then dropping a marker every (count of the list divided by number of buckets) items.

If stars are values and vertical bars are markers, bucketing is sorting the stars and then dropping vertical bars in between:

***********|***********|***********|***********

What is very nice about this particular opinion expressed by KDB is that no model is better than a wrong model. Specifically, the only way to bucket data is to look at all the data you have, sort it, and then insert the markers at equidistant points. Depending on the distribution, the buckets might or might not be spaced evenly with respect to the values; KDB does not care, because the values have already been sorted.

So, having this wonderful tool in hand, we proceed to look at the problem of incremental quantiles. Our first reaction is: can't be done! How can we possibly know where to put the markers if we don't use the whole data set? Ignore that and march on.

We find that there exists some interesting work on incremental sorting. This is a possible avenue, but it seems complicated.

In cases like these, I like to call upon my genie. My genie can somehow solve the problem so quickly that the method he/she (it's a gender neutral genie) uses is irrelevant. What does my genie return to me? Well, the genie tells me into which bucket I should place my current value, and it does this using all the data I have up to this point (my genie doesn't time travel, or at least she/he doesn't tell me about it).

Now I proceed with the following assumptions. If the data is negligibly small, then I don't need a genie; I can just use xrank as above. However, if the data is relatively large, much larger than the number of buckets, then with a very small error you can become the genie.

You will need:
2 lists

Yep that’s it! Everything else will be provided to you.

  1. We start by using our good old xrank on all the data until the data is too large.
  2. We use the first list to hold the "you must be this high to be in this bucket" value for each bucket. These are our partition markers.
  3. We use a second list to keep track of the total counts in each bucket. Our counts (kind of sounds like accounts).

Every time a new value comes in, we find its bucket using the bin verb, which does a binary search on the list to its left and returns the index of the bucket that the value on its right falls into. For example:

q)0 5 10 20 bin 17
2

Because 17 is less than 20, it goes into the previous bin, whose minimum height is 10.

We update the counts to say that the current bucket has another item.

Here is where the magic happens: we multiply the current bucket's count by the number of buckets; if that number exceeds the total number of observations by more than some percentage, we rebucket. Otherwise we are within our error band and ready for the next value.
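
As a rough numeric sketch of that check (the numbers and names below are made up and are not the ones used in the code at the bottom), with 4 buckets, 1000 observations and a 1% error band:

q)n:4; t:1000; e:0.01
q)c:250 251 252 247     / counts per bucket after the latest value
q)(n*c)>t*1+e           / does any bucket exceed its fair share by more than e?
0000b
q)(n*253)>t*1+e         / one more hit on the heaviest bucket and we would rebucket
1b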

One final note: if our error band is very small, we will have to do the expensive sort operation much more often. However, the law of large numbers is on our side; as long as we are sampling from the same distribution the buckets will stay right, and as soon as the distribution changes the buckets will be wrong and we will rebucket. The sensitivity to a change in distribution is determined only by the error band.

I really like this approach because it doesn't care how you get the values; all that matters is the values you have access to. Your buckets are your strongest bet on where things should land, and if a bucket gets heavier than the rest, you become suspicious of your bet.

At the bottom is all the code that implements this logic.

//optimbucket is an optimized bucketing strategy
// it trades accuracy on buckets for speed
// by keeping track of how equally buckets are sorting data
//it takes a dictionary with the following keys:
//key t total observed elements
//key n number of buckets
//key bkts an ascending list of floors of each bucket,
// length should be n
//key e acceptable error before expensive recalculation of buckets
//key c totals for each bucket,
// length should be n
//key s source for complete list of data in case of recalculation
//key b the bucket of the current element
// (could be empty)
// This is where the result is returned
optimbucket:{[d;v] d[`t]+:1; $[(d[`t]*d[`e]+1)>d[`n]*d[`c;d[`b]:|[0;d[`bkts] bin v]]+:1; d;fallback[d;v]]}

/helper functions:
/redobuckets gets the buckets from the actual source
redobuckets:{[d] g:value asc {(min x; count x)} each s group xrank[d`n;s:get d`s]; d[`bkts]:`s#g[;0]; d[`t]:sum d[`c]:g[;1];d}
/fallback is our method if buckets are wrong
fallback:{[d;v] d:redobuckets[d]; d[`b]:d[`bkts] bin v;d}
/Test this
//append and bucket, gets a value adds it to s the store
/and returns the bucket
apbk:{store,::x;`d set optimbucket[get `d;x]; (get `d)[`b]}

/gen generates data with pseudo normal distribution around 0
/x 10 uniform is usually enough
gen:{neg[(x%2)]+(sum x?x)%x}

/setup store for data
store:10?1.0
/setup dictionary
/we will do quartiles so n is 4
d:`t`n`bkts`e`c`s`b!(0;4;til 4;.01;0 0 0 0;`store;0N)
/initialize
d:redobuckets[d]
/gen some data and watch how d changes
data:gen each 1000#10
constantdata:1000#1
uniformdata:1000?1.0
apbk each data
d
apbk each constantdata
d
apbk each uniformdata
========================================
sample output
========================================
q)d
t | 10
n | 4
bkts| `s#0.01867974 1.218468 6.321394 8.083308
e | 0.01
c | 3 2 3 2
s | `store
b | -1
q)apbk each data
0 0 0 0 1 0 1 1 2 2 0 2 0 0 1 1 2 1 1 2 1 0 0 0 1 1 0 1 2 2 0 2 2 1 0 0 3 0 2..
q)d
t | 1010
n | 4
bkts| `s#-3 -1.1 -0.5 0.2
e | 0.01
c | 254 252 253 251
s | `store
b | 0
q)apbk each constantdata
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3..
q)d
t | 2010
n | 4
bkts| `s#-3 -0.5 1.9 2
e | 0.01
c | 501 501 501 507
s | `store
b | 3
q)apbk each uniformdata
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1..
q)d
t | 3010
n | 4
bkts| `s#-3 0.04688868 0.6175779 2
e | 0.01
c | 748 757 758 747
s | `store
b | 2

 

Prototype for Fuzzy Grouping

This is a sketch of how you can take free-text user input fields from a survey and try to do some kind of grouping of the data. I will describe my strategy, but first a warning: this code should not really be published, but the idea could be interesting.

WARNING: The following code is

  • not performance optimized, AT ALL
  • makes use of global variables as a hack
  • generates different quality groups based on the random order of the input list

Alright, you’re still reading.

I came across this problem when dealing with classification of service tickets. Each service ticket represents a unique real world event, and it would be nice to classify the events. In particular we are interested in finding groups of incidents that have something in common, such that we can avoid them in the future (optimistic little creatures we are). Each of these tickets comes with a free-text field that describes the resolution in a few sentences. Humans are pretty lazy, so they often copy from previous resolution descriptions, which at least makes it easier to group similar tickets (of course, it leaves open the possibility that the resolution description is irrelevant to the incident). However, humans are also careless, so when they copy and paste they may leave off some parts, or simply type something similar.

My strategy was to use the fuzzywuzzy library kindly developed and open-sourced by seatgeek. The library has several useful features.

Firstly it can find the ratio between two strings and it has several methods available. One has to know a little about their domain to determine which strategy to use. For example:

fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") ⇒ 96
fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 100
fuzz.token_sort_ratio(\
  "New York Mets vs Atlanta Braves", \
  "Atlanta Braves vs New York Mets") ⇒ 100
fuzz.token_set_ratio(\
 "mariners vs angels",\
 "los angeles angels of anaheim at seattle mariners") ⇒ 90

Secondly, given a particular strategy and a list of words, fuzzywuzzy can try to extract the best inexact matches, using both a hard limit on the number of items to return and a minimum ratio for something to be considered a match.

Given a dataset, pandas and fuzzywuzzy we can group some data.

Imports:

import pandas as pd
import numpy as np
import difflib
from fuzzywuzzy import fuzz, process

First let’s clean the data a bit:

df = pd.read_csv("dataset.csv")
df.userinput = df.userinput.apply(fuzz.utils.asciidammit)

We need to set up the list of items to group:

choices = list(df.userinput)

Unfortunately, this list is a global variable, so that as we progress, choices that have already been added to a group don't get added again. In theory, you can get a serious performance increase from removing them from the query, but that requires adding an extra bit of branching logic to the extract function (maybe later).

Next, we create an extract function that takes a single token and returns a Boolean array masking the elements that are not in its group. This way we can take advantage of pandas' apply functionality. We also remove from the choices variable the items we have already grouped.


def extract(s):
    global choices
    res = process.extractBests(s, choices,
                               scorer=fuzz.token_sort_ratio, score_cutoff=80, limit=100)
    r = [x[0] for x in res]
    choices = [x for x in choices if x not in r]
    return list(df.userinput.isin(r))

Now we create our masks

masks = df.userinput.apply(extract)

We can then use our masks to filter and see all the elements that are similar to a particular item. For example, this gives us everything that matched the first row:

df[masks[0]]

Here is how we can see how many are in each group:

masks.apply(np.count_nonzero)

Finally, I present the complete code that includes a method for running on small subsets of the original set until some portion of choices is covered.

choices = list(df.userinput)

def extract(s):
    global choices
    res = process.extractBests(s, choices, scorer=fuzz.token_sort_ratio, score_cutoff=80, limit=10000)
    r = [x[0] for x in res]
    choices = [x for x in choices if x not in r]
    return list(df.userinput.isin(r))

i = 1
ratio = 1/4.0
# keep doubling the number of seed rows until the resulting groups
# cover all but `ratio` of the choices
while len(choices) > ratio * len(df):
    i = i * 2
    choices = list(df.userinput)
    masks = df.userinput[:i].apply(extract)
masks.apply(np.count_nonzero);

Also, here is a possible branching version. It is not as clean as I would like; ideally it would return the group the token is a member of.

def extract(s):
    global choices
    if s in choices:
        res = process.extractBests(s, choices, scorer=fuzz.token_sort_ratio, score_cutoff=80, limit=10000)
        r = [x[0] for x in res]
        choices = [x for x in choices if x not in r]
        return list(df.userinput.isin(r))
    else:
        return list()

Async Calls in KDB/Q

Sometimes you don’t need a long post to explain something relatively complicated.

KDB/Q has something called inter-process communication (IPC), which allows two independent KDB/Q processes to talk to each other. This is built in, so we don't need to dig deep.

Usually every call in KDB/Q happens synchronously, meaning a verb[noun] will immediately be processed and no other work can get done during that time.

However, it is often much more useful to allow two processes to talk asynchronously, like email as opposed to a phone call; i.e., get it done when you can, please.

KDB/Q provides such a construct, but it’s still a bit tricky. So here is a version that works.

Start a KDB session with the -p command line flag, or run \p in the session, followed by a port number (5000 is the convention). This session is now listening on that port.

In a different terminal, start another session and create the handle that will talk to that listener. Assume the listener is on port 5001; we can use either the hostname or the IP address, and if we leave it blank it will assume localhost.

h: hopen `:hostname:port

If we use neg[h] "something that will run", it will be sent asynchronously, but we won't get a response with the result of what happened.
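
To see the difference in a quick illustrative session (using the handle created above):

q)h "1+1"          / synchronous: blocks and returns the result
2
q)neg[h] "1+1"     / asynchronous: returns immediately, no result comes back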

So I made a very simple utility that lets the process you send to reply asynchronously with a result, which gets assigned to a variable of your choosing.

async:{[h;v;n;r] neg[h] ({neg[.z.w] (set;x;y z)};r;v;n)}

The verb takes a handle (h), a verb (v), a noun (n) and a result symbol (r). It uses the handle to send an asynchronous request. The request asks the other process to apply the verb to the noun and send back the result and set it to the result symbol.

Here is a really simple example:

q)async[h;{x*x};4;`test]
q)test
16

We can also project this verb so that it creates a verb that has an (i)mplicit handle and result.

iasync:async[h;;;`result]

iasync [v; n]

That’s it. Enjoy.

 

UPDATE: I have found that a more useful async is one that allows passing in an arbitrary callback function. This still lets us use the verb set to perform the task done above, but allows other callbacks too.

The new async looks like this:

async:{[q;cb] neg[.z.w] (cb,value q)}

It simply says: place the query in a list of its own and the callback in a list of its own; the callback will then be called with the value of the query as evaluated on the service.

For example:

neg[h] (async;({0N!x+x};2);(set;`a))

This will set `a on the calling process to the value of the query, in this case 4, and display 4 on the service, since the 0N! runs there. We can also define a function on the calling process that takes some number of arguments, fill some of them in now, and have it called with the result when the callback is executed.

f:{0N!2+x+y}
q)neg[3i] (async;({0N!x+x};2);(`f;3))
9