Vector thinking for Max Profit

Occasionally I am lucky enough to run into a problem that doesn’t already have many vector solutions, so I can demonstrate how vector thinking can allow you to find an elegant solution. To build up to my insight let’s solve 2 easy variations and then we can get to the two more interesting puzzles.
The problem in question is called Best Time to Buy and Sell or more popularly simply Max Profit.
The theme is always the same you are given an ordered list of positive numbers that represent the evolution of the price of the security. You must find the maximum profit you could have made given this price series. You can only hold 1 unit at a time and you can not short, which means you must buy before you can sell (you can only profit from the price evolving upward).
The simplest and most well know variation is the 1 transaction problem (leetcode problem 121): you can only buy and sell at most 1 time what is the max profit that can be earned from the following series:
example series:
3 2 5 1 3 2 4 9 5 12 3 5
The key insight into this problem is that we are looking for the largest difference between the current element and the smallest element we have seen.
To motivate this insight, let’s look at ever larger versions of this problem starting with just one price to detect the pattern.
3 -> 0
Given just one price we can only buy and sell on the same day, so effectively max profit is 0
3 2 -> 0
Given these two points we can still only earn 0, since we must sell after we buy.
3 2 5 -> 3
Now that we have a higher number after a lower number we can plainly see that of the two options buy at 3 or buy at 2, buy at 2 and sell at 5 is better.
3 2 5 1 -> 3
The answer is still 3, since we can’t sell higher than 5, and even if we bought at 1 at the end there are no days left to sell
3 2 5 1 3 -> 3
We still do best by buying at 2 and selling at 5, buying at 1 and selling at 3 only earns 2.
3 2 5 1 3 2 -> 3
Plainly the 2 at the end doesn’t improve
3 2 5 1 3 2 4 -> 3
We can now either buy at 2 and sell at 5 or buy at 1 and sell at 4, but our max profit is still 3.
3 2 5 1 3 2 4 9 -> 8
Finally, the max profit changes! We can now buy at 1 and sell at 9. As new elements are added the new reward will be a function of the lowest element seen thus far.
We then get our solution:
{max x-mins x} 3 2 5 1 3 2 4 9 5 12 3 5
11, where we buy at 1 and sell at 12.
Or in python we could say something like:

import itertools
from operator import sub
def max_profit(p):
    return max(map(sub,p,itertools.accumulate(p,min)))

To unpack this a bit, mins returns the min element up to that point:
mins[3 2 5 1 3 2 4 9 5 12 3 5]
3 2 2 1 1 1 1 1 1 1 1 1
we can then do element-wise subtraction to see what the profit is from selling at each point
0 0 3 0 2 1 3 8 4 11 2 4
taking the running max we get the max profit so far,
0 0 3 3 3 3 3 8 8 11 11 11
we can see that this series is exactly the results that we got when we were looking at the max profit for the prefixes of the series.

Easy enough, let’s look at version 2 of this problem (leetcode 122):
unlimited number of transactions: We are now allowed to buy and sell as often as we like provided we never own more than 1 share of the stock and we still must sell after we bought. To make things easier we can assume that we can buy and sell on the same day, in practice this doesn’t matter as any day we do this we can consider that we did nothing.
Keeping the same price series as before:
3 2 5 1 3 2 4 9 5 12 3 5
We now notice that we should take advantage of every single time the price goes up. Again let’s look at a smaller example to get some intuition.
3 2 5 -> 3
Nothing fancy here, we buy at 2 and sell at 5, purchasing at 3 doesn’t improve our profit since selling at 2 would incur a loss.
3 2 5 1 3 -> 5
buy at 2 sell at 5, buy at 1 sell at 3
3 2 5 1 3 2 4 -> 7
just add one more purchase at 2 and sell at 4,
3 2 5 1 3 2 4 9 -> 12
Here it becomes interesting, we can look at two interpretations
buy at 2 sell at 5 (3),buy at 1 sell at 3 (2), buy 2 sell at 9 (7) for a total of 12
or we can say:
buy at 2 sell at 5 (3),buy at 1 sell at 3 (2), buy at 2 sell at 4 (2),buy at 4 sell at 9 (5) for a total of 12.
The two approaches are identical since addition commutes, it doesn’t matter how you get from 2 – 9 you will always earn 7.
Which means that we can simply add up the positive steps in the price series. That will be the maximum profit for an unlimited number of transactions:
Our code is simply:
{(x>0) wsum x:1 _ deltas x} 3 2 5 1 3 2 4 9 5 12 3 5
Deltas gives us the adjacent differences. We drop the first delta since we cannot sell before we have bought. If the delta is positive >0 we want to include it in our sum. Using (w)eighted(sum) we weight the >0 with a 1 and less then 0 or =0 are given a weight of 0.

Now that we covered the case with 1 transaction and unlimited transactions, we should feel confident that we can tackle 2 transactions. Lucky for us, that’s exactly version III of the problem (leetcode 123):
Getting insight into how to solve this variation should start with thinking about how we could solve this naively. The correct heuristic to reach for is divide and conquer. Suppose that for every step along the price evolution we knew what the max profit was before and the max profit was after. Then we could sum those two max profits and take the largest combination.

assume max_profit function as before:

max([max_profit(p[0:i])+max_profit(p[i:]) for i in r)])

For our example this gives us 15.

What we might notice is that we are recomputing the max profit over for larger and larger prefixes, and for smaller and smaller suffixes.
We know that the prefixes were already computed simply by taking the rolling max instead of the overall max in max_profit. The only thing we’re missing is how to calculate the max_profit of suffixes.
Now we notice a symmetry – while the prefixes are governed by the rolling minimum, the suffixes are bounded by the rolling maximum from the left, i.e. the largest element available at the end to sell into.

To get a sense of this, look at the solutions to the suffixes
3 5 -> 2 (buy at 3 sell at 5)
12 3 5 -> 2 (buy at 3 sell at 5)
5 12 3 5 -> 7 (buy at 5 sell at 12)
9 5 12 3 5 -> 7 (buy at 5 sell at 12)
4 9 5 12 3 5 -> 8 (buy at 4 sell at 12)
Which leads to this solution in q:

{max reverse[maxs maxs[x]-x:reverse x]+maxs x-mins x}

maxs x -mins x / is familiar from earlier
If we want to walk backwards through a list, we can reverse the list and apply the logic forwards then reverse the result, which is the same as applying the logic to the suffixes.
So all we need to do is to subtract each element from the running maximum then take the running max of this result to get the max profit for the suffix so far.
In Python this simply becomes:

from itertools import accumulate as acc
from operator import add,sub
def max_profit_2trans(p):
    before=list(acc(map(sub,p,acc( p,min)),max))
return max(map(add,b,a))

Alright, finally we should be able to tackle the most general version of this problem (leetcode 188). You are given both a series of prices and allowed to make up to k transactions. Obviously, if k is 1, we solved that first. We just solved k is 2. You can verify for yourself, but if k is larger than half the length of the price series the max is the same as the second problem we solved. Since every pair of days can allow for at most one transaction, effectively k is unlimited at that point.
We need to solve the k>2 but less than k<n/2. The standard CS technique that comes to mind is some kind of dynamic programming approach. To quote MIT’s Erik Demaine CS6006 dynamic programing is recursion plus memoization.
Let’s setup the recursive solution and assume we are at a particular (i)ndex in our price series, with k transactions left.
Let’s start with the base cases:
If k equals 0 we return 0
If i is greater than the last index, i.e. there are no elements left in the list: we return 0.
Otherwise the solution to the problem is simply the maximum of 2 options:

do nothing at this step:

0+the function increment i

do something:

If we are (h)olding a share we can sell at this step which adds the current price +the result of this function with one less k and i incremented by 1.
Otherwise we buy at this step, which is minus the current price (we spend money to buy) plus the result of this function with i incremented and we are now holding a share.

Here is the code for this in q:

p:3 2 5 1 3 2 4 9 5 12 3 5
cache:(0b,'0,'til[count p])!count[p]#0
f:{[h;k;i]$[i=count p;0;(h;k;i) in key cache;cache[(h;k;i)];
:cache[(h;k;i)]:.z.s[h;k;i+1]|$[h;p[i]+.z.s[0b;k-1;i+1];.z.s[1b;k;i+1]-p i]]}

We can test this and see that it results in the right answer for k=0,1,2
f[0b;0;0] -> 0 no surprise with 0 transaction no profit is possible
f[0b;1;0] -> 11 the original problem
f[0b;2;0] -> 15 buy at 1, sell at 9, buy at 5 sell at 12

However, we know that a vector solution is possible for k=1,2,infinity, so we might hope there is a vector solution for when k is 3 and above.
We can analyze the intermediate results of the cache to get some sense of this.

code to look at the table

t:(flip `h`k`j!flip key cache)!([]v:value cache)
exec (`$string[j])!v by k from t where not h 

output table:

 | 11 10 9 8 7 6  5  4  3  2  1  0 
-| --------------------------------
0| 0  0  0 0 0 0  0  0  0  0  0  0 
1| 0  2  2 7 7 8  10 10 11 11 11 11
2| 0  2  2 9 9 12 14 14 15 15 15 15
3| 0  2  2 9 9 14 16 16 17 17 18 18
4| 0  2  2 9 9 14 16 16 18 18 20 20

What we might notice is that the first row is the running maximum from selling at the prefixes. Then it’s the combination of the previous row and 1 additional transaction. In other words, each row allows you to spend money from the previous row and sell at the current price.
Putting this into action we get the following:

{[k;p]last {maxs x+maxs y-x}[p]/[k;p*0]}

We can look at intermediate rows by writing the scan version of the code and we know it will converge to the unlimited transaction case:

{[p]{maxs x+maxs y-x}[p]\[p*0]} p
0 0 0 0 0 0 0 0  0  0  0  0 
0 0 3 3 3 3 3 8  8  11 11 11
0 0 3 3 5 5 6 11 11 15 15 15
0 0 3 3 5 5 7 12 12 18 18 18
0 0 3 3 5 5 7 12 12 19 19 20
0 0 3 3 5 5 7 12 12 19 19 21

What’s really nice about this code is that it actually describes the problem quite well to the computer and in effect finds exactly what we want.
p*0 / assume we have 0 transactions and therefore 0 dollars at each step.
y-x / subtract the current price(x) from the current profit(y) element wise
maxs y-x/ find the rolling max revenue
x+maxs y-x / add back the current selling price
maxs x + maxs y-x / find the rolling max profit
repeating this k times finds the max profit of the k’th transaction.

Here is the python code that implements this logic:

import itertools
from operator import add,sub
def max_profit(k: int, prices: List[int]) -> int:
        if k==0 or len(prices)<2:
            return 0
        maxs=lambda x: itertools.accumulate(x,max)
        addeach=lambda x,y: map(add,x,y)
        subeach=lambda x,y: map(sub,x,y)    
        for i in range(k):
        return list(z)[-1]

I think this post covers this puzzle, let me know if there are interesting variations I haven’t explored.


Stable Marriage in Q/KDB

A friend of mine is applying to Medical Residency programs, which motivated me to investigate the matching algorithm used to match residents with hospitals. NRMP the organization behind the matching process has a dedicated video explaining how the algorithm works. The simplest version of the problem can be thought to consist of n guys and n gals who are looking to pair off. Each gal has a ranked list of guys and each guy a ranked list of gals. We want to find a pairing such that no guy and gal can pair off and be better off. We call this stable, since any gal who wants to leave can’t find a guy who isn’t currently happier with someone else. It takes two to tango, or in this case disrupt a stable pairing.

Gale-Shapley published a result on this problem in 1962 showed a few amazing properties:

  1. A stable pairing always exists
  2. There is an algorithm to get to this pairing in O(n^2)
  3. It is truthful from the proposing side, i.e. if men are proposing then they can’t misrepresent their preferences and achieve a better match
  4. The proposing side gets the best stable solution and reviewing side gets the worst stable solution
  5. The reviewing side could misrepresent their preferences and achieve a better outcome

Knuth wrote about this problem pretty extensively in a French manuscript. The University of Glasgow has become a hub for this problem, Professor Emeritus Robert Irving wrote the book on the topic and his student David Manlove has written another. New variations and approaches keep coming out, so in a way this problem is still an active area of research. For this post I want to look at only the simplest version and describe a vectorized implementation.

First let’s walk through an example by hand (note: lowercase letters are men)

Men ranking women

d| `D`B`C
b| `B`C`D
c| `C`D`B

Women ranking men

D| `b`c`d
B| `b`c`d
C| `d`b`c

First we pair off d,D
Then we pair off b,B
Finally we pair c,C

If we had started from the women:

First we pair D,b
Then we pair B,b and drop D to be unpaired, because b prefers B to D
Then we pair D,c
Finally we pair C,D

So we see there are two stable pairings. B always pairs with b, but either the men (d,c) or the women (D,C) will get their worst choice.

The algorithm:

Propose a pairing and engage the pair:

  • If the reviewing party (woman) prefers the current proposing party to her current engagement:
    1. Jilt the current party (A jilted party goes back into the pile of unengaged proposing parties, with all the parties he has proposed to so far crossed off)
    2. Become engaged to the proposing party (A reviewing party always prefers any partner to being alone)
  • If the current proposal did not work:
    1. Continue down the ranked list of the proposing party until they find a match, i.e. go back to ste

Here is how we can implement the naive version of this algorithm in Q/KDB:

Here is how we can implement the naive version of this algorithm in Q/KDB:
i:3 /number of each party
c:0N /number of ranked individuals on each list 
     /(null takes a random permutation, convenient since we want everyone to rank everyone) we can also use neg[i]
m:neg[i]?`1 /choose i random symbols with 1 char to represent each possible man
w:upper m /assign same symbols but upper case as list of all possible women
dm:m!(#[i;c]?\:w) /create a dictionary of male preferences
       / each man chooses a random permutation of women
dw:w!#[i;c]?\:m /each women chooses a random permutation of men
t:1!([]w:w;m:`;s:0W) /our initial pairing table, each women is unpaired, 
	          /so her current score is infinity, 
	          /the table is keyed on w, so that a women can only have 1 match
 /ifin returns the position in the list like find(?), 
 /but returns infinity if not found instead of length of list
 /ifin[`a`b`c;`b] --> 1 
 /ifin[`a`b`c;`d] --> 0W 
The algorithm
 if[null cm:first key[dm] except (0!t)`m;:t];
 im:first where t[([]w:k);`s]>s:dw[k:dm cm]iifin\: cm;
 t upsert (k im;cm;s im)};

f takes a pairing table of the current pairings. It then looks for the first unmatched man. We call this (c)urrent (m)an, if current man is null, we are done all men are matched, return t, our pairings.

Otherwise, find the first woman on his list of ordered rankings whose happiness (s)core he can improve. We are using iifin to see where his rankings score from the women’s side, and by choosing the first such match, we are preferencing his choice. We then insert this pair into the pairing (t)able, along with the happiness (s)core.
f over t /repeat the process of adding pairs into the table until all men are matched.

Before we look at the vectorized (batched) version, there are some simple optimizations we can make. For example, since we notice that we will be calculating how much a woman likes a man back when he chooses her, we can pre-tabulate all pairs in advance.

Here is the code that creates a table that will have every man and woman pairing with both his and her score next to it:

/first convert the dictionary to a table that has each man and his rankings
/along with a column which ranks the match, 
tm:{([]m:key x;w:value x;rm:(til count ::)each value x)} dm
/looks like 
m w     rm   
m E M P 0 1 2
e P M E 0 1 2
p M E P 0 1 2
/then ungroup looks like this:
tm:ungroup tm 

m w rm
m E 0 
m M 1 
m P 2 
e P 0 
e M 1 
e E 2 
p M 0 
p E 1 
p P 2 

repeat the procedure with women's table. 

tw:ungroup {([]w:key x; m:value x;rw:(til count ::) each value x)} dw

/now left join each match 
ps:tm lj `w`m xkey tw
m w rm rw
m E 0  2 
m M 1  0 
m P 2  1 
e P 0  2 
e M 1  2 
e E 2  1 
p M 0  1 
p E 1  0 
p P 2  0 

Now that we have pre-computed all the scores we can look at the vectorized version:

First, we notice that we can technically make proposals from all the men in one batch. There is one caveat, if two or more men rank one woman as their top choice during a matching batch the tie is broken by the woman. So we can grab each man’s current first choice and break all the ties with that woman’s ranking.

Second, we notice that any woman who is currently engaged will stay engaged unless a man she desires more becomes available. This means we can eliminate all women who are already engaged to someone they prefer more from each man’s ranking. In other words, if Alice is currently engaged to Bob, and she prefers only Carl to Bob, Sam doesn’t stand a chance and we can eliminate Alice from Sam’s ranking completely.

These two tricks are all it takes to vectorize(batch) the solution.

Now let’s look at the code:

Here is the algorithim in one block:

imin:{x?min x} /get the index of min element
smVec:{[ps;t]ms:exec last s by w from t;
 update check:(check+1)|min rm by m from ps where not m in (0!t)`m,rw<ms[w];
 p:select m:m imin rw, s:min rw by w from ps where rm=check;t upsert p}

Let’s break it down statement by statement:

/ smVec takes a data table which are our (p)recomputed (s)tatistics 
/ it also takes the pairs table t, which is the same as above
 ms:exec last s by w from t;
/ms is our current best score for each woman who is currently engaged
 update check:(check+1)|min rm by m from ps where not m in (0!t)`m,rw<ms[w];
/ For each man we keep a counter of who we should check next, 
/ we increment this each time he is not yet engaged (not m in (0!t)`m) 
/ so we look at his next viable candidate, 
/ also we remove any candidates that he cannot satisfy (rw<ms[w]). 
/ He needs a woman who prefers him to her current match.
/ So we take the max of either the next candidate 
/ or the next one whose score would go up if he matched.
 p:select m:m imin rw, s:min rw by w from ps where rm=check;
/ now we simply select all the ones to check and break ties by min rw,
/ ie the woman's rank.
 t upsert p}
/lastly update/insert current matches into the pairings table

We need a stopping condition so that we can stop iterating:

Originally, I thought we need a condition that said something like “as long as there are unmatched men, keep iterating. {not all key[dm] in (0!x)`m}”

pairings:smVec[update check:-1 from ps]/[{not all key[dm] in (0!x)m};t]

As it turns out, once we remove all the clutter and always append only viable matches this turns out to be unnecessary. We can simply iterate until convergence which is the default behavior of over(/). So running the function looks like this:

pairings:smVec[update check:-1 from `ps] over t

We are done, but we can extend this one step further by allowing for incomplete preference rankings. The canonical way to treat incomplete ranking is to assume that a person would rather be a widower than be married to someone he didn’t rank.

To do this, we simply add the person matching with himself at the end of every ranking and add a shadow woman which is just the man ranking himself first. This means that once we get to the end of his ranked list he will automatically match with himself. With this small addition to the data, the complete setup function looks like this:

setup:{[i;c]c:neg c;m:neg[i]?`4;w:upper m;
 dw,:(m!0N 1#m);t:1!([]w:w,m;m:`;s:0W);
 tm:{([]m:key x;w:value x;rm:(til count ::)each value x)} dm;
 tw:{([]w:key x; m:value x;rw:(til count ::) each value x)} dw;
 ps:ej[`m`w;ungroup tm;ungroup tw];
 `dm`dw`t`ps set' (dm;dw;t;ps)}
setup[1000;0N] /ON means rank every member. 
pairings:smVec[update check:-1 from `ps] over t

Now we can verify that we have the right solution by comparing to the official python implementation

/write out some test data for python
pairings:smVec[update check:-1 from `ps] over t
`:m.csv 0: "," 0: ungroup {([]m:key x;w:-1 _' value x)}dm
`:w.csv 0: "," 0: ungroup {([]w:key x;m:value x)}count[dm]#dw

/python solve: (we need to increase the recursion limit in python because the library can't solve large problems without it, also I wasn't able to run problems bigger than 2000 on my machine without a segfault)
import pandas as pd
from import StableMarriage
import sys
pd.DataFrame([(k, v) for k, v in s.items()], columns=["m","w"]).to_csv("mwsol.csv", index=False)

pysol:("SS";enlist ",") 0: `mwsol.csv
pysol:pysol lj update mm:m from pairings
0=count select from pysol where mm<>m

If we join on the women and our men match the python men, then we have the same answer.

Lastly, a note on performance: we get a 3x performance boost from vectorization
The naive version solves the
\t pairs:f over t
5000 -> 185771 (3 min)
2000 -> 15352 (15 sec)
1000 -> 2569 (2.5 sec)
\t smVec[update check:-1 from `ps] over t
5000 -> 59580 (1 min)
2000 -> 4748 (5 sec)
1000 -> 1033 (1 sec)

The python version takes 20 seconds when n is 1000 and I couldn’t test higher.
Which is about 7x slower than naive in q, and 20x slower than vectorized.

Further the python library doesn’t allow for incomplete preference lists. (we can create complete lists that have sentinels values where the list ends, but the memory usage is prohibitively expensive for large problems). Lastly, the python library runs into the recursion limit, which we avoided by iterating over a state data structure instead of using the stack.

Meanwhile in practice there are about 40k med students a year and each only ranks less than 100 choices. Which means we can solve that problem really quickly.
\t pairings:smVec[update check:-1 from `ps] over t
1465 (1.5 seconds)

In the next post, I will describe how to extend this solution to apply the stable marriage problem to resident hospital matching problem where each hospital may have more than 1 slot.

It’s Flooding on all sides!

This post is inspired by a happy coincidence, I was doing p2 of day 14 of advent of code for 2017 and I had never formerly encountered the flood fill algorithm. This confluence of events inspired me to come up with a new way to solve this classic problem. 

You are given a boolean matrix and told that points are neighbors if they are either next to each other horizontally or vertically, but not diagonally. If point (a) and point (b) are neighbors and point (c) is a neighbor of point (b) then (a),(b) and (c) are in the same neighborhood. A point with no neighbors is a neighborhood of 1.  You are asked how many such neighborhoods exist. 

Shamelessly borrowing the picture from adventofcode:

|      |   
V      V   

You can see that this piece of the matrix, has 9 neighborhoods. Points that are off in the boolean matrix are simply ignored. 

This problem is a classic because it is apparently how that bucket fill tool works in every paint program. Essentially you keep filling all the pixels that are neighbors until you are done. 

Now the classical way to solve this problem is flood fill. Essentially the problem is broken into two pieces:

  • An outer loop that visits every point
  •  An inner loop that looks for neighbors using breadth first search or depth first search

Here I will annotate the Q code that achieves this result.

ff:{[m] /this is flood fill function
	/ returns a table with all the distinct neighborhoods
	/m is a matrix
	/c is the number of rows
	/columns this matrix has, we assume it's square
	c:count m;
	/we flatten m; m is now a list of length c*c	
	m:raze m; 
	/we create t, a table that is empty
          /and has columns id, and n 
	 /id will be the id of the neighborhood
	 /n will be a list of neighbors in that hood
	/gni is a helper function that get neighbor indexes
	/gni will accept the index as an offset 
	/ from the flattened matrix 
	/and will return coordinates as flat offsets.
	/internally it needs to represent 
             /the coordinates as a pair,
	/ we use standard div and mod to represent the pair 
	gni: { [x;y]
	  /x is the size of the matrix it will be fixed to c
	  /y is the point
	  /s whose neighbors we want to find
	  /ny are the new col coordinates
	      / found by dividing the point by c
	      / we add the offsets 
	      / but makes sure we are within boundaries of matrix
	      / min 0  and max c-1
    ny:0|(x-1)&0 -1 1 0+/:y div x;
		  /repeat this process to calculate row coordinates
    nx:0|(x-1)&-1 0 0 1+/:y mod x;
      /get back original flattened index
       /by multiplying ny by c and adding nx
      /flatten list and only return distinct points
	 distinct raze nx+x*ny
	 /project function onto c, since it will be constant 
	 / for the duration of the parent function 

   /f is our breadth first search function
   /we start by adding i, which is our first point, 
   /we then find all the gni of i,
   /we remove any gni[i] that are already present in i
   /and intersect the new points that are on (=1)  
   /by intersecting with j
   /we project this function on gni, 
   / which is called gni inside as well.
   /we also project on the points that are on 
   /(are equal to 1) aka (where m)
   / and which are named j inside the function)
	 f:{[gni;i;j](),i,j inter gni[i] except i}[gni;;where m];
   /repeating f over and over will yield
   /  all the neighbors of a point that are on (=1)

   /now we get to the outer loop, 
    /which will check 
    /if we have seen the current point  (p)
   /if we have seen it, we ignore it,
    /and return our current table (t)
   /otherwise, we call f on that point and return 
    /the point and all of it's neighbors
   /this will end up with our new table t, 
   /we only call this function on the points 
   	/that are turned on, 
   /which means we can skip checking if the point is on  
  /we will repeat this for all the points equal to 1      
	t:{[f;m;t;p]$[not p in exec n from t;
						t upsert (p;f over p);
						t]}[f;n]/[t;where m];

The entire code without comments looks like this, and is pretty short.

	c:count m;m:raze m; t:([id:()]n:());
	gni:{distinct raze (0|(x-1)&-1 0 0 1+/:y mod x)+x* 0|(x-1)&0 -1 1 0+/:y div x}[c];
	f:{[gni;i;j](),i,k where j k:gni[i] except i}[gni;;m];
	t:{[f;m;t;p]$[not p in raze exec n from t;t upsert (p;f over p);t]}[f;n]/[t;where m]

With that out of the way, I now want to explain an alternate way of achieving something essentially identical. 

A common technique in math and engineering fields is if you can’t solve a problem, find a similar problem that is easier that you can solve and solve that first.

So I pretended that all I needed to do was to find runs, which is a set of consecutive 1s in a row. I figured if I do this for all the rows and all the columns, I’ll have a starter set of neighborhoods and I can figure how to merge them later. 

Finding a run is really simple:

  1. set the neighborhood id to 0
  2. start at the beginning of first row
  3. look at each element,
  4. if you see a 1 add element to the list that corresponds to your current neighborhood id
  5. if you see a zero increment the neighborhood id
  6. if next row, go to the next row, increment neighborhood id, otherwise go to step 8 
  7. go to step 3
  8. remove neighborhood ids with no elements in them. 

In q, you can take a bit of a shortcut by asking each row, for the indexes with a 1 in them. So for example:

q)where 1 0 1 1 0
0 2 3
q) deltas where 1 0 1 1 0
0 2 1

One thing we notice is that consecutive differences  of 1 represent runs and those with any other number represent cut points, and if we cut at those points, we would get  (,0; 2 3) which is a list that contains two lists, each inner list is a run in that row. We can repeat this for the columns and create all the runs that exist row wise or column wise. 

Now we just need some logic to merge these runs. Here is an annotated version of the code that implements this method of finding all the neighborhoods. 

rr:{[m] /This will return all the points in the matrix that are on and their assigned neighborhood id
	/m is the square matrix
	c:count m; /c is the length of a side of the matrix 
	m:raze m; /m is now the flattened list that corresponds to the matrix
	/runs is a function that will return a list of lists, 
	/each inner list corresponds to points that are adjacent
	/Though the mechanism to calculate adjacency is in terms of y
	/It will return the Xs that correspond to those Ys. 
	/eg if we have a row 0 1 1 0 1 0 1 1, and the points are labeled `a`b`c`d`e`f`g`h
	/runs[0 1 1 0 1 0 1 1;`a`b`c`d`e`f`g`h] => (`b`c;,`e;`g`h)
	runs:{(where 1<>-2-': y) _  x}; 
	/now we want to apply runs to every row and column
	/n will be our initial list of neighborhoods,
	/ each neighborhood is a list that will contain adjacent elements either by row or by column
	/ elements are given a unique id
	/i will be our unique id , it is the index of the point that is turned on (=1)
	i:where m;
	/j groups all 'i's that are in the same row, by using integer division
	/so j is a dictionary whose keys are the row index and whose values are the 'i's 
	/we need to do this grouping to ensure that only points in the same row are considered adjacent
	/ otherwise we might believe that a diagonal matrix has runs (because we only see the column indexes)
	j:i group i div c;
	/since runs returns x, but uses 'y's to do the splitting, we provide j mod c as y
	/j mod c corresponds to the column location of each i
	/we then  apply runs on (')each j and j mod c. 
	/finally we flatten the result with a raze so that we just a have the list of row neighborhoods
	n:raze runs'[j;j mod c]; 
	/now we will repeat with the columns
	/we want to permute the list i so that the elements appear in column order
	/we want to preserve the labels we have in i
	/so we generate a list that is the length of m (c*c),
	/we reshape (#) to be a square matrix, and transpose it along the diagonal (flip)
	/we flatten(raze) it back out and intersect it with i, 
	/this preserves the column order we achieved from doing the flip, 
	/but only keeps the elements that are in i
	/k are the elements of i permuted along columns
	/eg if we have a matrix 
	//           0 0 1 
  	//           1 0 0
  	//           0 1 0
  	// i would be 2 3 7
  	// and k would be 3 7 2
	k:(raze flip (c;c)#til c*c) inter i);
    /now we will create the j  again which is our grouping 
    /but this time according to the column index so we will group by the integer remainder (mod)
    /this will separate our indexes into columns
    j:k group (k mod c);
    /then we will apply our runs function on each column
    / we use the div to get the row index of each element, 
    / so that runs can tell if two points are vertically adjacent
    / we flatten the result and append it to n
    n,:raze runs'[j;j div c];
    / at this point we have a list of all possible neighborhoods, but there are overlaps
    / so we create a table t, which will assign an id for each neighborhood contender
    / ungroup flattens the table so that every element belongs to a neighborhood id
    // here is a sample of what the table looks like before and after ungroup
    / before ungroup:
	/				n         id
	/				------------
	/				0 1       0 
	/				,3        1 
	/				,5        2 
	/				8 9 10 11 3  
	/after ungroup
	/				id n 
	/				-----
	/				0  0 
	/				0  1 
	/				1  3 
	/				2  5 
	/				3  8 
	/				3  9 
	/				3  10
	/				3  11
    t:ungroup `id xkey update id:i from ([]n);
    / f is our merging function.
    / It will merge neighborhoods together, 
    / by giving them the same id if there are points in common.
    / First we will update/create a column p which is the minimum id for a given element (n)
    / Then we will update all of the ids the with minimum p and rename that the new id
    / If two neighborhoods share a point in common, that point has two neighborhood id's
    / we will assign it primary id (p) which is the min of those two ids
    / we then group by id and assign the id to be the min of the primary ids (p)
    f:{[t]update id:min p by id from update p:min id by n from t};
    /On each iteration of this function we will merge some neighborhoods, 
    / we repeat until we can't merge anymore
	f over t}

The uncommented code is much shorter and looks like this:

c:count m;m:raze m; runs:{(where 1<>-2-': y) _ x};
n:raze runs'[j;(j:i group (i:where m) div c) mod c];
n,:raze runs'[j;(j:k group (k:(raze flip (c;c)#til c*c) inter i) mod c) div c];
t:ungroup `id xkey update id:i from ([]n);
f:{[t]update id:min p by id from update p:min id by n from t};
f over t}

At this point we are done and we have solved the puzzle in two slightly different ways. Looking at the performance of both methods, reveals that they are both O(n) essentially linear in the number of elements. However, the row method is actually faster, because everything is done as nice straight rows, so there is not much jumping around, in other words it is more cpu cache friendly. On my late 2011 Mac: for a 128 by 128 matrix
\t rr m
176 milliseconds
\t ff m
261 milliseconds

So it is about 66% faster

As a bonus, my father suggested that I expand this code into n dimensions. This is rather easy to do and I include the code here without too many comments. The trick is taking advantage of KDB’s vs (vector from scalar) function that can rewrite an integer as a number in another base:
so for example 3 vs 8 => (2;2) which are the coordinates of the point 8 in a 3 by 3 matrix. The point 25 is 2 2 1, in other words it’s in the last layer of a 3 dimensional cube of size 3, last column and first row. 
This means we can easily go back and forth between integer representation of the matrix and coordinate representation. Everything else is pretty much the same. 

c:count m;r:1; while[(count m:raze m)<>count[m];r+:1]; t:([id:()]n:());
gniN:{[r;c;n]a:neg[a],a:a where 1=sum each a:(cross/)(r;2)#01b;
distinct raze -1 _ flip c sv flip (c-1)&0|a +\: c vs n,-1+prd r#c}[r;c];
f:{[gni;i;j](),i,j inter gni[i] except i}[gniN;;where m];
ps:where m;
while[count ps;t:t upsert (first ps;k:f over first ps); ps:ps except k];
/functional version is slower
/t:{[f;t;j]q+::1;$[not j in raze exec n from t;t upsert (j;f over j);t]}[f]/[t;where m]

c:count m;r:1; while[(count m:raze m)<>count[m];r+:1]; runs:{(where 1<>-2-': y) _ x};
j:(c sv (til r) rotate\: c vs til prd r#c) inter\: where m;
g:til[r] except/: d:reverse til r;
n:raze raze each runs''[j@'k;@'[v;d]@'k:('[group;flip]) each @'[v:flip c vs j;g]];
t:ungroup `id xkey update id:i from ([]n);
f:{[t]update id:min p by id from update p:min id by n from t};
f over t}

Controlling the Tower

I recently came across this problem and it looked interesting enough, so I solved it. It’s from Topcoder, which is nice because they provide test cases so you can be sure that you have solved it.

Unfortunately, Topcoder has dracanion copyright laws, so I will paraphrase the problem in case the link to the description dies. You want to control some territory, the territory has t (an integer) towers and each tower is a certain number of stories s (an integer). To control the territory you must control the majority of the active towers. A tower is active if someone controls the majority of stories in a tower. If a tower has no majority controller, it is inactive and does not count. What is the minimum number of stories you must win to guarantee control of the territory?

The first thing that helped me was converting the test cases into the following two lists and created a table:


q)select results, inputs from asc t  / just to give a visual sample

results inputs
1      1
2      2
2      3
3      4
3      5
4      6
24     47
2      1 1
2      1 1 1
4      1 1 1 1 1 1
699  1 1 100 100 100 100 200 200

That way, I could write a function and start testing and seeing if I was making progress.

If there is only one tower, then I obviously need to take it, which means I need to take the majority.

So that is pretty easy:

f{[x] if[count[x]=1; :sum 1+floor x%2]; :0} /one tower,
/ otherwise return 0 this is wrong but we want to return a number

I can run my function and see which cases I still haven’t solved:

q) select myResult, results, inputs from (update myResult:f each inputs from t ) where myResult<>results

This gave me a way to check if I was making progress.

Next, we know that we need to win the majority of the towers and we need to win them in the most expensive way possible in order to guarantee a victory because we don’t get to chose how our stories our distributed. So we sort the towers desc by height and fill up half of them. Then comes the tricky part, the opposition can have a simple majority in all the other towers. In that case, there will be one final tower that decides everything.

For example, suppose there are 5 towers, 6 6 5 5 4. We need to win three towers. So we win the first 2 by getting 12 stories. Then we only need one more tower but the opposition can strategically win the shortest towers. It’s important to note that winning a 4 story tower and a 5 story tower takes 3 either way. So they can do better by winning the two 5 story towers and simply stop me from winning the 4 story tower by taking 2 stories.

This explains the following slightly convoluted algorithm:

if[count[x]=1;:sum 1+floor x%2]; /one tower
t:floor count[x]%2; / majority of towers, one less if there are an odd number
v:sum t#x:desc x; /most expensive way to win t towers
r:reverse (t-count[x])#x; /remaining towers
o:sum 1+floor %[;2] t#r; /opposition needs to take at least t towers
o+:sum floor 0.5+%[;2] d _ r; /oppositions needs to prevent the rest
o-:(any(d _ r)mod 2)*(any not(d#r)mod 2) ;
/subtract if opposition wins during preventing
/ and you can strategically exchange an odd tower with an even one

If anyone has a simpler, more straightforward approach, or wants to tell me how my approach is wrong, I would love to hear about it.

Commutative Business Days

The concept of business days is pretty straightforward. For most people, business days are the 5 days a week we commute to work.

Questions like:

  1. What is the next business day?
  2. What was the previous business day?
  3. List the next 5 days.

are all pretty straightforward and I expect that the average high school graduate can answer it with the aid of a calendar (assuming it includes holidays).

For some reason, the implementations I’ve seen have been pretty hideous. So here is, in my opinion, an elegant solution that also demonstrates some neat features of KDB/Q.

The first step is to get all the business days in the year. (Depending on your region/industry you might have different days. I’ll use USA business days M-F with 10 government holidays (see US court business days).

/1 calendar year
year:2018.01.01+til 365
holidays:2018.01.01 2018.01.15 2018.02.19 2018.05.28 2018.07.04 2018.09.03 2018.10.08 2018.11.12 2018.11.22 2018.12.25
yearNoWeekend:year where in[;2 3 4 5 6] year mod 7
buisnessYear:yearNoWeekend except holidays
Once you have the buisnessYear for the relevant period

we need to be aware that the calendar will not work correctly for data we have not loaded

We can create a data structure that will allow us to answer all these types of questions.
The data structure we will be using is a sorted dictionary.
We will have a dictionary, nDay (next day) and a pDay function (previous day).
We can look up any date in our dictionary/function to find the next or previous date.
We use a sorted dictionary so that when we look up non buisness days,
q will return the value of the closest next/previous date.
/create our dictionary and function,
/fill the last day in the next day list with the first day from 2019
nDay:`s#buisnessYear!`s#2019.01.02^next buisnessYear
pDay:{[nDay;d]p:nDay?d; ?[null p;?[nDay] nDay d;p]}[nDay]

/let’s find the next day July 3rd
nDay 2018.07.03
/as expected returns 2018.07.05
nDay 2018.07.04
/as expected also returns 2018.07.05

/let’s find the previous day
pDay 2018.07.05
/as expected returns 2018.07.03
pDay 2018.07.04
/as expected returns 2018.07.03

/let’s find 5th day after a given date
/as expected we get 2018.07.09
/lets find 5th day previous to a given date
/as expected we get 2018.06.25

/We can ask to get the next 5 days
/we get the given day and next 5 days
/2018.07.01 2018.07.02 2018.07.03 2018.07.05 2018.07.06 2018.07.09
/We can ask to get the previous 5 days
/we get the given day and previous 5 days in reverse order
/2018.07.01 2018.06.29 2018.06.28 2018.06.27 2018.06.26 2018.06.25

/We can ask for the next days for 5 dates
nDay 2018.01.14 2018.02.18 2018.05.27 2018.09.02 2018.10.07
/ we get 2018.01.16 2018.02.20 2018.05.29 2018.09.04 2018.10.09

/We can ask for the previous days for 5 dates
pDay 2018.01.16 2018.02.20 2018.05.29 2018.09.04 2018.10.09
/we get 
2018.01.12 2018.02.16 2018.05.25 2018.08.31 2018.10.05

Bonus Section, and the real reason for the title of the essay:

Here is a final piece that might strike you as a bit weird until you untangle it.

In General:

0b=(pDay nDay d)~(nDay pDay d) 

In other words, these functions don’t commute. This seems strange since nDay is simply addition by one and pDay is simply subtraction by one.

This was used by the mathematician Peano to show how you can get all the natural numbers. Simply start with 1 and keep adding 1. If you want whole numbers start with 0. If you want integers create the concept of subtraction and you can count your way to negative infinity.

This is called Peano arithmetic and it commutes. You can always get back to where you started by simply adding or subtracting one appropriately. The order in which you do so does not matter.

However, business days exhibit a certain property that prevents them from commuting. Namely, although any calendar day has a next business day, not all days are business days. Which means that non-business days are not represented intuitively. For example, the next business day of Saturday is Monday and previous business day of Monday is Friday, but the previous business day of Saturday is Friday and so the next business day of Friday is Monday. Depending on whether we started with pDay or nDay we get different results.

The missing weekend from the business days illustrates the reason why, in general, floating arithmetic does not commute. There are numbers that cannot be represented by floating point arithmetic which means that sometimes the order of additions and subtractions will make a difference to the final result.


Backtracking In Q/ word ladder

This is a variation on the word ladder puzzle. Here is the puzzle:

You are presented with a set of strings (assume they are unique). Each string can be of arbitrary length. Your goal is to find the length of the longest chain.
A chain can start from any string. Each following link in the chain must be exactly one letter shorter than the previous link, any character from any position from the previous link can be dropped, this new link must appear in the set of words given.

For example, if you are given the set  of strings (“a”,”ab”,”abc”,”abdc”, “babdc”,”dd”, “ded”)

“ded” -> “dd”
“babdc” -> “abdc” -> “abc” -> “ab” -> “a”

are valid chains. The longest chain’s length is 5.

First I will present python code that my friend wrote, it is very concise and has a kdb flavor (at least in my mind).


Python Code:

def word_list(array):
  # Storing the solutions for each word in a dictionary
  dict_array = {word:0 for word in array}

  # Sorting the word list by lengths
  array.sort(key = lambda s: len(s))

  # Finding solutions to each word by using the previous solutions
  for word in array:
    options = [0]
    for i in range(len(word)):
      word_minus_i = "".join([word[k] for k in range(len(word)) if k != i])
      #check if the word appears in list of words
      if word_minus_i in dict_array:
        # If it does add a link to that word
    dict_array[word] = max(options)
  return max(dict_array.values())+1

Now to a KDB version:

I had two versions of the code both rely on the same insight as the python code

The difference is in the post processing.

The key insight into this problem is that if you hash all the strings. Then if you create every possible one letter drop you can check if that string is in the legal set of strings, if it is you can record which longer string it maps to.

At that point you have a list of every connection between two strings. Since checking the hash takes constant time and generating every 1 letter missing string takes 1- number of letters in the string* number of unique strings. This processing takes linear time.

At that point, we have a graph, and at first I wanted to reuse the code from the previous post. I would generate all the connected components, the use those nodes to figure out how many different length strings there were. The connected component with the largest number of distinct length strings is going to be the longest chain and the number of distinct lengths is the total chain length. (this code is [3-6]x faster for small inputs (10,000, 100,000 words) than the python but is about the same speed once we get up to a set of 1 million words)

The second version, uses a quicker technique to just calculate the depth of the graph from every node.

Let’s start from the first version:
Let’s generate 10,000 strings (.Q.a is “abcdefghijklmnopqrstuvwxyz”)
We want strings of length between 1 and 20 (1+til 20)
We want ten thousand strings, so we take 10000 of the list 1 2 3 .. 20
We then sample the number from the alphabet (3?.Q.a) gives us a 3 letter string
We do this for each right (\:) number in the list
We only want the distinct words

q)words:distinct words:(10000#1+til 20)?\:.Q.a

Here we are sampling 10 of these words:


So we need a function that will generate the indexes of all the possible 1 letter drops of a word.
The most intuitive way I know to do this, is to create equal length boolean string with one 0 and rotate that 0 around the whole string. So for example in the 3 letter word case we have:
We then find the indexes where there is a 1 and that would gives all the 1 missing indexes.

/0b,(x-1)#1b creates a string of length x with 1 0 value
/rotate\: says to rotate this string for each right argument (which is just the range from 0 to x)
/where finds the indexes that are 1
allDrops:{where each (til x) rotate\: 0b,(x-1)#1b}

Lets do a little processing on the list of strings, and put them into a table:

/sort the words by count of the number of letters and only store the distinct words in hwords

q)hwords:distinct words iasc count each words

/create a table, with the distinct words, the number of letters in the word, and a symbolic version of the word for fast lookup, with a unique attribute so that KDB knows to hash this list

q)show t:([]wsym:`u#`$hwords;w:hwords;c:count each hwords);
wsym w c
s ,”s” 1
t ,”t” 1
j ,”j” 1
n ,”n” 1

/wsym contains a symbolic version of the word
/w is just the original word
/c is the length of the word

We then want to populate a dictionary with all the indexes, so we select the distinct counts of the words and run them through the allDrops function

q)drops:n!allDrops each n:exec distinct c from t /we select the distinct counts run the function and key the distinct counts to the result from the function.
1 | ,`long$()
2 | (,1;,0)
3 | (1 2;0 1;0 2)
4 | (1 2 3;0 1 2;0 1 3;0 2 3)
5 | (1 2 3 4;0 1 2 3;0 1 2 4;0 1 3 4;0 2 3 4)

We do this mostly because we know that we will need these indexes many times, but there are relatively few distinct lengths of words.

Now we calculate all the links:

/drops c will give us a list of lists of all the 1 letter missing possible indexes for each word.
/To show what this does let’s select by c so that we can see what it does for each c
select d:drops first c by c from t
c | d ..
–| ————————————————————————-..
1 | ,`long$() ..
2 | (,1;,0) ..
3 | (1 2;0 1;0 2) ..
4 | (1 2 3;0 1 2;0 1 3;0 2 3)

Because q will apply a particular list on indexes and preserve the shape, we just need to tell it, that each word should be projected onto all of these lists. This is done with the @’ this will pair off each list of lists with the words. Here is what that looks like , again grouping by c so that we can see what that looks like:

q)select by c from update d:(w)@’drops c from t
c | wsym w d ..
–| ————————————————————————-..
1 | x ,”x” ,”” ..
2 | cw “cw” (,”w”;,”c”) ..
3 | fvw “fvw” (“vw”;”fv”;”fw”) ..
4 | yoku “yoku” (“oku”;”yok”;”you”;”yku”)

We see that one letter words, become empty strings, two letter words become a list of words each 1 letter long. three letter strings become a list of 2 letter words.
Now we just need to find where these words are in our original table. If they are not there they will get the 1+last index, which is how q lets you know that a value is missing in a lookup. We will also cast all these strings to symbols so that we can do faster lookups (`$):

/col is the index of the original string and row are all the matches. We are modeling the graph using an adjacency matrix idea.
q)show sparseRes:update col:i, row:t[`wsym]?`$(w@’drops c) from t
wsym w c col row
h ,”h” 1 0 864593
r ,”r” 1 1 864593
f ,”f” 1 2 864593
w ,”w” 1 3 864593
/We see that no 1 letter string has any matches which is why all of the indexes are last in the table, this makes sense since one letter strings can’t match anything but the empty string and all of our strings have at least one character
/Again selecting by c so we can see a variety of lengths,
q)select by c from update col:i, row:t[`wsym]?`$(w@’drops c) from t
c | wsym w col row ..
–| ————————————————————————-..
1 | x ,”x” 25 ,864593 ..
2 | cw “cw” 701 3 11 ..
3 | fvw “fvw” 17269 541 275 357 ..
4 | yoku “yoku” 64705 864593 7279 2368 3524 ..
5 | vvbud “vvbud” 114598 864593 46501 864593

As we can see the row column is a list of matches and also contains false matches that go to the end of the table we would like to clean that up. We can do that using ungroup.
Ungroup will take a keyed table, and generate a row for each item in the list of records.
In our case it will create a col, row  record for each item in the row column.

q)show sparseRes:ungroup `col xkey sparseRes
col wsym w c row
0 h h 1 864593
1 r r 1 864593
2 f f 1 864593

Now we want to delete all of the fake links, so that will be wherever the row is the length of the original table:

q) show sparseRes:delete from sparseRes where row=count t
col wsym w c row
26 bj b 2 2
26 bj j 2 10
27 oj o 2 2
27 oj j 2 12

At this point we are done processing the data.  We have found all the connections between the words, all that is left is to find the connected components, which we know how to do from the previous post. I use a slightly different function that calculates the connected components on a sparse matrix. That is represented as a table with columns row and col.

neighbors: exec col from m where row in j; /now we are searching a table instead of a matrix
f:{n:exec col from y where row in .[_;x]; /new neighbors
x[0]:count x[1];x[1]:distinct x[1],n; /update the two pieces of x
x}[;m]; /project this function on the sparseMatrix
last f over (0; neighbors)};

points:til max m[`row]; /
while[count points;
i:first points;
connected,:enlist n:`s#n:distinct asc i,findConnectedSparse[i;m]; /add point in case it is island
points:points except n];

Given these two functions that work on sparse adjacency matrixes. The final step is:

q)max {count select distinct c from x} each t allComponentsSparse sparseRes

Let’s unpack this a bit:

sparseRes is our table of links.

allComponentsSparse will return all of the connected components in this graph.

`s#0 28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 323 3..
`s#1 62 64 77 79 98 115 126 131 144 154 162 166 173 189 191 208 214 258 267 2..
`s#2 29 40 66 91 108 112 114 125 135 154 167 168 184 211 226 245 267 281 291 ..
`s#3 69 77 94 102 109 110 119 129 159 183 192 196 212 214 216 220 224 247 259..

apply the table t to this will give us a table that is indexed on the connected components


So for each connected component we are getting a table that has only those strings that match.
let’s look at the first one:

q)first t allComponentsSparse sparseRes
wsym w c
q ,”q” 1
aq “aq” 2
oq “oq” 2
qw “qw” 2
wq “wq” 2
jq “jq” 2

We can see immediately a problem, we only need one 2 letter word but we are getting every possible match. Since we are only interested in the length we can simply count the number of distinct c.

q)count select distinct c from first t allComponentsSparse sparseRes

So we see that the first connected component has 4 links in the chain.

Doing this for each connected component and then selecting the max ensures that we have found the length of the longest chain.

q)max {count select distinct c from x} each t allComponentsSparse sparseRes

That completes the first method.

Here is the whole function together

hwords:distinct words iasc count each words;
t:([]wsym:`u#`$hwords;w:hwords;c:count each hwords);
allDrops:{where each (til x) rotate\: 0b,(x-1)#1b};
drops:n!allDrops'[n:1+til max t[`c]];
sparseRes:ungroup `col xkey update col:i, row:t[`wsym]?`$(w@’drops c) from t;
sparseRes:delete from sparseRes where row=count t;
max {count select distinct c from x} each t allComponentsSparse sparseRes}

neighbors: exec col from m where row in j;
f:{n:exec col from y where row in .[_;x]; /new neighbors
x[0]:count x[1];x[1]:distinct x[1],n; /update the two pieces of x
x}[;m]; /project this function on the sparse matrix
last f over (0; neighbors)};

points:til max m[`row];
while[count points;
i:first points;
connected,:enlist n:`s#n:distinct asc i,findConnectedSparse[i;m]; /add point in case it is island
points:points except n];

Now for method 2, instead of relying on the previous algorithm to find all the connected components, we just want to keep track of how deep each chain is.

We can do this by creating a new table q which only has columns, row and col and is grouped by row. Since we created the original links by going from columns to rows, we know that the links travel up from shorter strings(row) to longer ones(col)

q)show q:select col by row from sparseRes
row| col ..
—| ————————————————————————..
0 | 28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 323 3..
1 | 62 64 77 79 98 115 126 131 144 154 162 166 173 189 191 208 214 258 267 2..
2 | 29 40 66 91 108 112 114 125 135 154 167 168 168 184 211 226 245 267 281 ..
3 | 69 69 77 94 102 109 110 119 129 159 183 192 196 212 214 216 220 224 247 ..
4 | 38 48 60 74 86 92 95 115 123 131 132 135 145 184 195 197 213 238 246 249..
5 | 43 49 119 124 137 142 143 180 219 226 231 239 246 277 289 303 308 319 32..
6 | 28 31 72 73 88 100 104 126 141 145 161 175 190 198 235 240 248 253 289 3..

Now since this table is keyed on the row column, we can index into the table using a index table that has the row numbers and keep doing this until there are no more rows to return. The number of times we can index before we return nothing, is the length of the chain from that row.

First lets demonstrate indexing:

q)indexTable:([]row:0 1)

q)q ([]row:0 1)
28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 323 342 36..
62 64 77 79 98 115 126 131 144 154 162 166 173 189 191 208 214 258 267 272 27..

We get the two rows that match from q.

If we flatten this list and rename this column row. We can reindex into q,

q)q select row:raze col from q ([]row:0 1)
452 549 634
391 391
587 786 803
395 493 759
524 629
704 834
613 823

We see the next level of neighbors, we can keep doing this. Again we can take advantage of over and just count how long before we return an empty table.

We will do this by creating a function that takes a dictionary and updates it, that way we can store both the intermediate results and the number of times we went down and all the nodes we visited.

the keys will be cur, which is the list of nodes we wish to visit next, depth, which is how deep we gone so far, and visited a list of all the nodes we saw. Then  we can write a function dive which will dive one level down into the table.

dive:{ $[count x[`cur]:select row:raze col from y x[`cur];
/index the table on the current neighbors we are exploring.
/We set cur to be the next level of neighbors
/if there are any new neighbors:
/we will add 1 to depth, and add the current new neighbors to visited and return x
[x[`depth]+:1;x[`visited],:exec row from x[`cur];
/otherwise we will return x

We then create a table of all the nodes we would like to visit with the initialized keys:

q)show n:update visited:count[n]#() from n:([]depth:0;cur:exec row from q);
depth cur visited
0 0
0 1
0 2
0 3
0 4

lets’ run dive on one the first of n:

q)first n
depth | 0
cur | 0
visited| ()
q)dive over first n
depth | 4
cur | +(,`row)!,()
visited| 28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 3..

We see that we get back a dictionary of all the nodes visited as well as, the maximum depth achieved. If we do this for all n, we can select the max depth and we are done.

q)exec max depth from (dive/)each n}

Notice, we are storing all the visited nodes, so we can reconstruct what the path actually was, but we are not actually using it. So the entire second version looks like this:

hwords:distinct words iasc count each words;
t:([]wsym:`u#`$hwords;w:hwords;c:count each hwords);
allDrops:{where each (til x) rotate\: 0b,(x-1)#1b};
drops:n!allDrops'[n:1+til max t[`c]];
sparseRes:ungroup `col xkey update col:i, row:t[`wsym]?`$(w@’drops c) from t;
sparseRes:delete from sparseRes where row=count t;
q:select col by row from sparseRes;
dive:{ $[count x[`cur]:select row:raze col from y x[`cur];
[x[`depth]+:1;x[`visited],:exec row from x[`cur];:x]
n:update visited:count[n]#() from n:([]depth:0;cur:exec row from q);
exec max depth from (dive/)each n}

On my computer, I was able to find the length of the longest chain for a list of million words in under 15 seconds, using this second version and in a minute using the first version. If I was using less than one hundred thousand words, then  the two versions were about the same.

The python code took about a minute for a million words, (without using anything fancy, I’m sure it can be improved by replacing pieces with numpy components or using the latest version of python 3.7. where dictionary look ups are faster).

The real benefit here from KDB, is that reconstructing the path is super straightforward since we have the path indexes and can use them to index into the word table.



Graph Algo, Finding Friends

This is a rather standard interview question:

Suppose that you have points and the points are connected by edges. Imagine that the points are numbered, then we can represent whether there exists a connection between two points by writing all the pairs of points.

We can also represent this graph, as matrix, where 1 means that there exists a connection between i and j and 0 otherwise. We call this matrix the adjacency matrix since it tells us which points are connected.adjacencymatrix_1002

Given such a matrix, we want to find all the connected components. A connected component is either just one node if it has no connections or all the nodes that can be reached by traveling along the edges.

In the picture above, each of these graphs has only one connected component, but together there are three distinct connected components, since there is no way to reach from the first graph to the second by traveling along an edge.

First I will show my original solution in KDB. Then a slight improvement.
First note that the matrix must be symmetric, since if point 1 is connected to point 2, then point 2 is connected to point 1.

I wanted to write a function that would get all the points that were connected to a single point.

If I think of a particular row in this matrix, then all the 1s represent the neighbors of this point. If I then find all of their neighbors, and so on, I will have all the points that can possibly be reached.

First imagine I have a matrix, a:
To create it, I first create 100 zeros (100#0)
Then I pick six places to add connections, (6?100)
I replace the zeros with 1s.
I add this matrix to itself flipped, so that it is symmetric (a+flip a)
Divide by two so that any place that 1s overlapped are either 1 or .5, (.5*)
Then I cast it to an integer so that they are all 1s or 0s (`long$)

q)a:`long$.5*a+flip a:10 10#@[100#0;6?100;:;1]
0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0
0 1 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 1 0

Okay given a, I can find the neighbors of any of the 10 points.

0 0 0 0 0 1 1 0 0 0
q)where a[1]
5 6

So we get a list of points that are neighbors.

We need to create a function that will keep asking for neighbors and discard any duplicates.

We can find all the neighbors of the neighbors by selecting the neighbor rows of matrix  and then asking where for each row:

q)where each a 5 6

We get a list of neighbors.

We can then raze this list to flatten it:

q)raze where each a 5 6
1 1

Then we can ask for the distinct items:

q)distinct raze where each a 5 6

Now we would like to add this list to our original neighbors and check if we can find more neighbors:

q)distinct raze where each a 5 6 1
1 5 6

Running this again we get:

q)distinct raze where each 1 5 6
5 6 1

So there are no more neighbors to find, just the order changes and we will flip back and forth.

Anytime, we need to apply a function like this we can make use of the adverb in q called over. Over will apply the function until the results are the same for two successive iterations.

So my function to find all connected with comments looks like:

neighbors: where m[i]; /find the initial list of neighbors
f:{distinct raze x,where each y x}[;m]; /create the function where second argument     is fixed to be the adjacency matrix.
f over neighbors} /run the function over the initial neighbors and keep adding neighbors until there are no more to add.

/testing on a:
5 6 1

As expected this gives us all points connected to point 1, including point 1 itself.
To list all of the connected components we need to run this on each of the points:

q)findAllConnected[;a] each til count a
3 0
5 6 1
8 2 9
0 3
7 4
1 5 6
1 5 6
4 7
2 9 8
8 2 9


We get 10 rows specifying the connections. Now we need find the unique set.

The easiest way to do this is to sort each one and ask for the distinct lists:

q)distinct asc each findAllConnected[;a] each til count a
`s#0 3
`s#1 5 6
`s#2 8 9
`s#4 7

Which gives us 4 connected components.

We can look for the islands, by checking if there are any elements not on any of these lists. If so, they must unconnected to anything else.

Items not in the list are islands (only one node no connections)

q)connected:distinct asc each findAllConnected[;a] each til count a
q)islands:(til count a) except raze connected

q)count each connected /size of the components

We get an empty list, which means there are no islands to report.

Problem solved.

For those who have followed, you can see that we are actually doing a lot of duplicate work, since we keep asking for neighbors for points that we have already seen. To solve this we simply keep track of how many neighbors we have found so far and only query for the neighbors we have yet to see.

Instead of x being a list of neighbors, it is now a list with two items. The first is the number of neighbors found so far and the second is the list of neighbors.

We can use dot(.) application of drop(_) to remove the neighbors we have seen so far from the neighbors list. (.[_;x]):


neighbors: where m[i];

f:{n:raze where each y .[_;x]; /new neighbors

x[0]:count x[1];x[1]:distinct x[1],n; /update the two pieces of x

x}[;m]; /project this function on the matrix

last f over (0; neighbors)} /apply the function on the neighbors, with 0 currently visited

/running on a:

5 6 1

/large sparse matrix to test with

q)c:7h$.5*c+flip c:1000 1000#@[1000000#0;1000?1000000;:;1]

/Time both on my 2012 macbook
q)\t distinct asc each findAllConnected[;c] each til count c
q)\t distinct asc each findAllConnectedFast[;c] each til count c

A 10x improvement, the keen observer will notice that we don’t have to run the function on all the points in the first place.  We can go down the points and skip all the points that we have already seen. Here is what that looks like:


points:til count m;


while[count points;

i:first points;

connected,:enlist n:`s#n:distinct asc i,findAllConnectedFast[i;m]; /add point in case it is island

points:points except n];


q)allComponents a
`s#0 3
`s#1 5 6
`s#2 8 9
`s#4 7

q)\t allComponents c

This results in a 1000x increase speed up. Because we are only calling the function the minimum number of times.

Stable Sorting

Stable Sorting preserves the order of a column even after sorting on a second column.

KDB provides stable sorting by default. This is a great thing, and it allows you to easily implement grouped attributes on a table.

A table that looks like:

q)t:([]time:09:30:00+`time$til 9;sym:9#`AAPL`MSFT`IBM;px:100*9?1.0)
time sym px
09:30:00.000 AAPL 81.12026
09:30:00.001 MSFT 20.86614
09:30:00.002 IBM 99.07116
09:30:00.003 AAPL 57.94801
09:30:00.004 MSFT 90.29713
09:30:00.005 IBM 20.11578
09:30:00.006 AAPL 6.832366
09:30:00.007 MSFT 59.89167
09:30:00.008 IBM 4.881728

which is sorted by time and sort it by symbol so that all the symbols that are the same are together, while still preserving the time attribute.

q)`sym xasc t
time sym px
09:30:00.000 AAPL 81.12026
09:30:00.003 AAPL 57.94801
09:30:00.006 AAPL 6.832366
09:30:00.002 IBM 99.07116
09:30:00.005 IBM 20.11578
09:30:00.008 IBM 4.881728
09:30:00.001 MSFT 20.86614
09:30:00.004 MSFT 90.29713
09:30:00.007 MSFT 59.89167

While this seems pretty cool, I want to argue that this behavior is not only good, it is also simpler than the alternative.

In many languages, perl,java,python there is something called a comparator interface. Which usually asks the user to define a function

F(a,b) => if a comes before b;
if a comes after b:
if they are equal

The search then uses some algo to swap the items.

However, we can simplify this interface. We can call it swapporator, it only allows two options, swap, True or False. The items are presented in the order of the original list to the swap function. This will automatically be stable.




Another Dependency Management Scheme

This post is based on an accumulator pattern I encountered when converting recursive functions to be tail recursive.

For a long time I have been looking for a way to cleanly handle errors that arise in functions. It’s well known among programmers that only about 20% of code does any real work, the rest is there to take care of  IO and error conditions. That’s one of the afterthoughts that Joe Morasco has in How to learn a new programming language.

It takes me a few hours to recall how to make the linked list
work and get something on the air. To allow someone else to play the
game unsupervised (which means doing some error handling) usually
takes me a few days of programming.

KDBQ lets you omit a lot of ceremony and boilerplate so you can get to work faster. Often a function that does something useful, ie real work, is a few lines. However, you still need to do the error handling somewhere and it’s best if we can keep our functions doing mostly work. If there are specific error conditions that need to be handled then by all means handle it, otherwise assume that the input you get is what you expect. (I know this sounds crazy to all those paranoid defensive programmers, but frankly you might like where this post goes.)

The only ceremony that I insist on is that every function you write takes a dictionary as the input. I find this to be pretty liberating in general as you can easily add functionality over time without breaking an API and it allows for the function to implement sensible default values while allowing customization.

What I eventually settled on was something that would allow me to create dependent functions by ordering them.

I think of them as “getters.” Each function “gets” a value using a dictionary of configurations.

A list of getters can be easily seen on a page and read. If you know what each getter is doing then you can keep reading, otherwise you can go look at the definition.

I then would create a getter dictionary, each key in the getter dictionary will eventually be assigned a value. The value is either the value that was already in the input or the value of applying the current getter to the dictionary.

To make this a bit clearer, suppose that I want to create a function that gives my pet animal a name.

I have the following dependencies:

What pet animal am I getting? Turtle, Elephant, Dog, Snake

getAnimalType:{[args] first 1?`Turtle`Elephant`Dog`Snake}

Is the animal a boy or a girl? (obviously turtles and elephants are more likely to be girls)

   d:`Turtle`Elephant`Dog`Snake!0.7 0.7 0.5 0.5;
    $[first d[args[`AnimalType]]>1?1.0;`girl;`boy]}

Choose a name based on those two criteria.

getName:{[args] $[`boy~args[`BoyorGirl];`Kunee;`Kuni]} 
/(we like the name IPA transcription kʊniː)

So here is the basic idea: We combine these three separate verbs into a larger verb:

  :setIfMissing /[args;key getters; value getters]}

We can read this function by looking at what keys are being set in the getters dictionary. The keys tell you what this function plans on using or setting; the values tell you how the function will set them if the key is missing in the dictionary you pass in.

So for example if you don’t know what you want you can simply pass an empty dictionary to getPetAnimalName which will give you back something like:

AnimalType| Turtle
BoyorGirl | girl
Name | Kuni

Now suppose you are set on a particular pet animal you can pass a dictionary that has the animal preset.

AnimalType| Elephant
BoyorGirl | girl
Name | Kuni

So far we have simply seen that this construct of setIfMissing allows us to push in default arguments and have default verbs that use our preset arguments. Now we can also show that we also get basic error handling for free.

Suppose that we break the function that chooses the animal type. We would expect that the function will not return boy or a girl if we pass nothing, but should work as usual if we do pass it an animal type.

/break function getAnimalType
q)getAnimalType:{[args] break;first 1?`Turtle`Elephant`Dog`Snake}
q)getPetAnimalName[()!()] /we get back as much as was possible 
Name | Kuni
q)getPetAnimalName[presetAnimalDict] /works as before
AnimalType| Elephant
BoyorGirl | girl
Name | Kuni

Mechanics of how the verb setIfMissing operates:

The premise of this verb, as its name suggests, is that it sets the key in the dictionary that was passed in if that key was missing. If that key exists, it does nothing. To set a key we call the function that was passed in the dictionary we have so far. This allows functions to depend on values from previous functions or from arguments that are passed in. If a function throws an error, it catches the error and tries to set the next key that is given to it.

I have created essentially two different versions of this verb. They follow two different philosophies.

Simple Version: with primitive error handling and logging

The first version is the most simple and just logs what keys it is setting and what keys it failed to set. Here it is, pretty short:

log:{neg[2] (string .z.Z)," : ",x,"\t",y;}
setIfMissing:{[a;k;f]  log["INFO";"setting key: ",string k];
  .[{[a;k;f]$[ k in key a;a; @[a;k;:;f[a]]]};
     {[a;k;e] log["ERROR";
   "Fail to set key: ",(string k)," Error: ",e]];a}[a;k]]};

Here I have fully elucidated every step:

/first we create a logging utility
 neg[2] /this is the handle to standard error 
 (string .z.Z) /we convert the time into a string
 ," : ",x,"\t",y; / and append the type of error to the message
/so for example 
/.log["INFO";"HELLO WORLD"] 
/prints 2017.09.17T09:06:14.309 : INFO HELLO WORLD

/now we define setIfMissing
setIfMissing:{[a;k;f] /a is the dictionary,k is the key, f is the function
 .log["INFO";"setting key: ",string k]; /log that we will be setting k
 .[ /protected execution will trap any errors
 {[a;k;f] /a,k,f are the same as the beginning
 $[ k in key a; /check if k is in the dictionary
 a; /if it is return a
 @[ /general apply, allow you to modify contents of a container
 a; /a is our container or dictionary in this case
 k; /k is the key or index in the container we will modify
 :; /ammend (set) is the function we will use 
 f[a] /is the value we will set, which is the verb f applied to a
 / this is what allows f to depend on any set items in the dictionary
 ] /this ends general apply 
 ] /this ends the conditional of whether k was in a
 }; / this ends the first function in the protected execution
 (a;k;f); /These are the arguments to the protected execution
 {[a;k;e] /a is still the dictionary, k is the key and e is the error
 .log["ERROR";"Fail to set key: ",(string k)," Error: ",e]; /log error
 a} /return the dictionary, and ends the trap function
 / this allows us to chain functions even if there is an error
 [a;k] /pass in the value of a and k, 
 /otherwise the error handler only sees the error
 ] /ends the protected execution
 }; /ends setIfMissing

Fancier version with introspection and profiling:

The second version of setIfMissing is likely to please those defensive programmers that I mentioned earlier.  I will not go through and elucidate this entire function. Instead I will point out the salient differences in this significantly more involved version of setIfMissing.

The first difference is that there no longer is text logging. Instead the dictionary that you pass in gets updated with all of the information about what happened. There are two tables that get added simErrors and simProfiling. So for example here is what our examples from before look like:

            | ::
simErrors   | (+(,`failed)!,`AnimalType`BoyorGirl`Name)!+(,`reason)!,("break"..
simProfiling| (+(,`simKey)!,,`initialized)!+(,`time)!,,2017.09.17T09:53:24.471
AnimalType | `Elephant
 | ::
simErrors | (+(,`failed)!,())!+(,`reason)!,()
simProfiling | (+(,`simKey)!,`initialized`AnimalType`BoyorGirl`Name`WelcomeM..
BoyorGirl | `girl
Name | `Kuni

We can look at those tables

failed    | reason 
----------| ------------
AnimalType| "break" 
BoyorGirl | "AnimalType"
Name      | "BoyorGirl" 
simKey     | time 
-----------| -----------------------
initialized| 2017.09.17T09:55:59.166
failed| reason
------| ------
simKey     | time 
-----------| -----------------------
initialized| 2017.09.17T10:09:04.062
AnimalType | 2017.09.17T10:09:04.062
BoyorGirl  | 2017.09.17T10:09:04.062
Name       | 2017.09.17T10:09:04.062

As we can see, we now get all of the logging information from the object itself. This is very powerful because it allows us to do queries against this data later to either improve performance by using the profiling table or to see where the errors are coming from.

The second feature takes advantage of the simErrors table. Here is where we become super defensive,  instead of naively trying to run every function against every key, we can preemptively fail if we know a function depends on an earlier key that failed. This is the reason, we now no longer return a name if boyOrGirl failed. Since getName explicitly depends on boyOrGirl being set, so we can premtively bail and not even bother running the query.  Therefore, you can see that reason given will either be the error message from running the function or the keys that were required. To demonstrate this a bit further, let’s add another function to our getPetAnimalName a welcome message:

  "Welcome Home, ", string args[`AnimalType]," ", args[`Name]}
 :setIfMissing /[args;key getters; value getters]}

Now if we run getPetAnimalName:

failed        | reason 
--------------| -----------------
AnimalType    | "break" 
BoyorGirl     | "AnimalType" 
Name          | "BoyorGirl" 
WelcomeMessage| "AnimalType,Name"

As we can see all keys that were required were mentioned as the reason for the failure. In particular WelcomeMessage failed because both AnimalType and Name were missing. We can do this by inspecting the function before calling it, if any of the keys in the failed column of simErrors table appear in the function, we report them and bail.

Finally, the last feature of setIfMissing is that we can pass the dictionary by name or by value. So far in all the examples we have passed the dictionary by value and gotten a new value back with some keys set. However, if we call setIfMissing with a name of a dictionary it will return the name and modify the dictionary in place.

AnimalType| Elephant
AnimalType    | `Elephant
              | ::
simErrors     | (+(,`failed)!,())!+(,`reason)!,()
simProfiling  | (+(,`simKey)!,`initialized`AnimalType`BoyorGirl`Name`WelcomeM..
BoyorGirl     | `girl
Name          | `Kuni
WelcomeMessage| "Welcome Home, Elephant Kuni!"

Finally here is the code for the introspecting and more defensive setIfMissing:

 if[not `simErrors in key a[];
 $[neg[11h]~type a;
 a set a[],simT;
 $[ k in key a;
 ({[k;d]d _ k}[k];{[k;d]d upsert (k;.z.Z)}[k])]; 
 [$[any c:@[0!a[`simErrors];`failed] in `$1_'s where(s:-4!last value f) like "`*";
 'raze "," sv string @[0!a[`simErrors];`failed] where c;
 ({[k;d]d _ k}[k];{[k;d]d upsert (k;.z.Z)}[k];{y;x}[res])]]
 {[a;k;e]  @[a;`simErrors;upsert;(k;e)]}[a;k]]};

/#### Short and Sweet Documentation:
How to use:
/Create a list of getter function that take a dictionary as inputs

/Create a meta function that composes these getters
 :kwargs:setIfMissing/[kwargs;key getters;value getters]}

/Apply the meta function to either a dictionary or a name of a dictionary
sigil | ::
simErrors | (+(,`failed)!,())!+(,`reason)!,()
simProfiling| (+(,`simKey)!,`started`A`B`C)!+(,`time)!,2017.09.14T13:44:24.05..
A | 5
B | 42
C | 47
sigil | ::
simErrors | (+(,`failed)!,())!+(,`reason)!,()
simProfiling| (+(,`simKey)!,`started`A`B`C)!+(,`time)!,2017.09.14T13:45:02.09..
A | 5
B | 42
C | 47
sigil | ::
A | 10
B | 14
simErrors | (+(,`failed)!,())!+(,`reason)!,()
simProfiling| (+(,`simKey)!,`started`A`B`C)!+(,`time)!,2017.09.14T13:47:59.24..
C | 24

/We can look at profiling information
simKey | time 
-------| -----------------------
started| 2017.09.14T13:45:02.099
A | 2017.09.14T13:45:02.099
B | 2017.09.14T13:45:02.099
C | 2017.09.14T13:45:02.099

/We get automatic error handling and prememptive failiure
/let's add an error into getA
sigil | ::
simErrors | (+(,`failed)!,`A`C)!+(,`reason)!,("broken";,"A")
simProfiling| (+(,`simKey)!,`started`B)!+(,`time)!,2017.09.14T13:51:06.419 20..
B | 42
failed| reason 
------| --------
A | "broken"
C | ,"A" 
/If the value is set the meta function will still work.
sigil | ::
A | 10
B | 14
simErrors | (+(,`failed)!,())!+(,`reason)!,()
simProfiling| (+(,`simKey)!,`started`A`B`C)!+(,`time)!,2017.09.14T13:53:46.03..
C | 24

 The sim prefix is used in the dictionary


Mechanics of the As-Of-Join Primitive

Recently, I had the necessity to perform an operation that didn’t exist in KDB, but whose scent reminded me of a primitive provided in KDB – As Of Join (aj). I will sketch the problem that motivated me. Then I will describe what the As Of Join operation does and how it is implemented in KDB. Finally, I will describe the new verb I created that solves the original problem.

Primal Scenario:

We have the following tables produced by two systems in the same company. A table of user interactions and another table from an operational system recording transactions. To make this a bit more concrete, our operational system manages inventory in a particular warehouse and so has the following columns: Date,ItemID,Quantity,TimeStamp. Here are 10 sample rows and code to generate it:

For the sake of making things a bit narrower, 
I am truncating a GUID with the following helper functions
q)trunID:{`$8#string x}
q)genID:{trunID each x?0Ng}
q)sysTable:([] Date:20170805+til 10;
      ItemId:genID 10;
q)show sysTable
Date     ItemId   Qty timestamp
20170805 72de67b9 2   14:27:23
20170806 02e583e3 6   23:43:36
20170807 cffa7aa8 4   07:42:14
20170808 327b47a8 1   26:28:04
20170809 d1b9ccd1 0   11:48:17
20170810 6cb019b3 2   04:59:38
20170811 bb466f80 9   17:54:00
20170812 70112cc0 6   00:54:33
20170813 c9d56ab5 0   22:19:31
20170814 96b75d97 8   10:20:35

The user interactions table keeps track of user related events. It has the following columns: UserId, Date, EventType, ItemId, Qty, timestamp. Here are 10 sample rows and code to generate it:

q)userTable:([]UserId:genID 10;
           Date:20170805+til 10;
           ItemId:genID 10;
q)show userTable

UserId   Date     EventType ItemId   Qty timestamp
88c5bbf5 20170805 buy       b19c6770 6   25:51:08 
42eeccad 20170806 buy       f12c9963 6   01:31:20 
c5510eba 20170807 buy       9ee93847 1   22:36:58 
5f344195 20170808 buy       7bfa7efa 8   20:50:09 
9a4f68c6 20170809 buy       24e2f268 5   16:27:03
26ba3a3e 20170810 buy       94922173 4   17:13:45 
c046e464 20170811 buy       4dec3017 9   13:16:27 
0e09baa4 20170812 buy       f5f7b90e 2   24:29:14 
a9f44e57 20170813 buy       0cb8391a 7   07:38:13 
1e1f2a49 20170814 buy       2df03fb0 0   23:10:12 
We will remove all rows that don’t have EvenType=`buy, so we can ignore this column from now on, but I left it in as an example of the kind of data cleaning you might perform. Also for the rest of this tutorial, I create a universe of itemIDs and userIds that I sample from this creates a more realistic simulation where itemIds will overlap in the two tables.  itemids:genID 1000; userids:genID 100000

We want to perform some analytics to figure out which warehouses we should use for different products based on user geography. To do this we need to join the inventory data to the user data. Unfortunately, there is nothing that links these two tables together, these systems are independent even though they share the same real world events.  You work with what you’ve got. Since these events are connected, if we see a transaction in the user table that has the same date, same ItemId and same Qty we can be pretty sure that it is the same event. This can be performed with a simple left-join like so:

/This join takes less than half a second on 1 million rows 
/in both tables
q)\t userTable lj `ItemId`Date`Qty xkey sysTable

However, our company actually has many, many users so that just using Date and ItemId and Qty is going to get many possible matches of which KDB will always choose the first. This is exacerbated by the fact that most times Qty is 1. So we need to incorporate the timestamp column to get better matching results.  We would like to match two rows if Date ItemId and Qty match and the timestamps are sufficiently close.  In particular, we would like to take the first match from the userTable where everything else being equal the timestamps are within 5 minutes of each other (assume that if they are outside this window, some other system must be responsible for that row).

The As Of Join Primitive for TimeSeries analysis

The idea behind As Of Join is that you can look at the state of universe in another table as of a certain time. The most common example involves joining a trades table to a quote table. I won’t belabour that example. Instead, I’ll translate this into our fictional internet store company.  I want to know what the inventory was as of a particular time when a product was ordered. I have an inventory table:

       (i#til 10)+(2017.08.05D00:00:00.000+i?1000000000000000))
/let's sort it in asc order of the DateTimeStamp
q)invTable:`DateTimeStamp xasc invTable
q)show invTable
ItemId   Qty DateTimeStamp                
30c86bca 375 2017.08.05D00:00:00.075460419
73d8673e 772 2017.08.05D00:00:00.114482831
2e6d75b2 166 2017.08.05D00:00:00.330689365
6c55cdcf 860 2017.08.05D00:00:00.336812816
a3715f66 364 2017.08.05D00:00:00.457884749
9abd4b9f 396 2017.08.05D00:00:00.726431612
89e2e90e 37  2017.08.05D00:00:00.795442611
c7891977 999 2017.08.05D00:00:00.817328697

Given  an ItemId and DateTimeStamp we would like to know how much of that Item we had as of that Time.

In other words, we want the Qty for that ItemId where the time is as close as possible to the given time but no greater.

To make the problem a bit more interesting, we can imagine that we have a table of all the products and we want all the inventory as of 2017.08.07D00.00.00. We can do this pretty easily:

/First generate a table with all of the products and that timestamp;
q)show queryTable
q)show queryTable
ItemId   DateTimeStamp                
d44f9458 2017.08.07D00:00:00.000000000
aa25ba3a 2017.08.07D00:00:00.000000000
beec200c 2017.08.07D00:00:00.000000000
6e9b637f 2017.08.07D00:00:00.000000000
70c3756c 2017.08.07D00:00:00.000000000
a8f4636b 2017.08.07D00:00:00.000000000
befc6ca7 2017.08.07D00:00:00.000000000

 Then we put the columns we want to join 
 in order from least frequent to most frequent
 with the time column last. 
 Then the two tables, query table first
ItemId   DateTimeStamp                 Qty
d44f9458 2017.08.07D00:00:00.000000000 143
aa25ba3a 2017.08.07D00:00:00.000000000 774
beec200c 2017.08.07D00:00:00.000000000 925
6e9b637f 2017.08.07D00:00:00.000000000 711
70c3756c 2017.08.07D00:00:00.000000000 447
a8f4636b 2017.08.07D00:00:00.000000000 190
If we want to see when the time is that is matching 
we can use it's sister command
ItemId DateTimeStamp Qty
d44f9458 2017.08.06D03:43:09.477791637 143
aa25ba3a 2017.08.06D03:45:38.244558871 774
beec200c 2017.08.06D08:46:36.299622582 925
6e9b637f 2017.08.06D19:46:02.531682106 711
70c3756c 2017.08.06D07:43:15.845127668 447
a8f4636b 2017.08.06D03:44:03.353382316 190

We can verify that all the times are before 
 midnight on the 7th of august

Let’s now look at the internal mechanics of how this actually works.

First KDB finds all the candidate matches based on all the column till the last one. Then the aj verb makes use of the binary search function (bin) in KDB. Bin takes two arguments of the same type, the first is a sorted list in ascending order, the second is either a list or an item. For each second argument it returns the index that is equal or less in the first argument. A simple example will illustrate this:

q)0 1 2 3 4 5 6 7 bin 4
q)0 1 2 3.5 4 5 6 7 bin 3.0
q)0 1 2 3.5 4 5 6 7 bin 0.1 0.2 1.9 3.6
0 0 1 3

So assuming the candidate matches are sorted, picking out the latest as of the query time becomes as simple as performing binary search on each row we are trying to match against the candidate matches, which is exactly how KDB does it.

Another way of thinking about “bin” is that it returns the last element from a set of candidate matches, where match means being before the current element.

With this view in mind, it should be clear that we can’t use As Of Join to solve the Primal Scenario. Since as-of-join essentially gives you the last row that is before the time you are asking, whereas we are looking for the first row.  If we consider shifting the timestamps so that all the candidates within the tolerance are before the matching row’s timestamp we would still get the last match instead of the first one.

Luckily, there is another function called binr that returns the first index that is equal or greater to the query. Which leads us to:

q)0 1 2 3 4 5 6 7 binr 4
q)0 1 2 3.5 4 5 6 7 binr 3.0
q)0 1 2 3.5 4 5 6 7 binr 0.1 0.2 1.9 3.6
1 1 2 4

Since binr returns the first index that is equal or greater, we can think of it as returning the first match from a set of candidates where matches are determined as being greater or equal to the element being matched. This bin right function, so named because it returns the binary search from the right, is now enough to go and solve the primal scenario.

A New Verb AJR as of join right

Let’s take a quick look how aj is implemented:

k){.Q.ft[{d:x_z;$[&/j:-1<i:(x#z)bin x#y;y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y}

Sure enough there is the bin verb right in the middle of the verb.

So we can create another verb called ajr and define it like this:

k){.Q.ft[{d:x_z;$[&/j:-1<i:(x#z)binr x#y;y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y}

A quick note, the easiest way is to simply go into k mode by typing one backslash. But in a script it’s easier to just to write:

k)ajr:{.Q.ft[{d:x_z;$[&/j:-1<i:(x#z)binr x#y;
     y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y}


"k" "ajr:{.Q.ft[{d:x_z;$[&/j:-1<i:(x#z)binr x#y;
     y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y}"

However, you define this verb, you can now answer the question, what was the inventory like immediately after 2017.08.07D00:00:00.000000000 at the first recorded point.

ItemId   DateTimeStamp                 Qty
d44f9458 2017.08.07D00:00:00.000000000 38 
aa25ba3a 2017.08.07D00:00:00.000000000 5 
beec200c 2017.08.07D00:00:00.000000000 723
6e9b637f 2017.08.07D00:00:00.000000000 783
70c3756c 2017.08.07D00:00:00.000000000 56 
a8f4636b 2017.08.07D00:00:00.000000000 112
befc6ca7 2017.08.07D00:00:00.000000000 871
c8bc95e4 2017.08.07D00:00:00.000000000 586

/Similarly we can define the ajr0 verb as follows:
/ and that way we can see what time it matched
q)k)ajr0:{.Q.ft[{d: z;$[&/j:-1<i:(x#z)binr x#y;
    y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y}
ItemId   DateTimeStamp                 Qty
d44f9458 2017.08.12D15:17:13.710407838 38 
aa25ba3a 2017.08.11D22:32:05.314057246 5 
beec200c 2017.08.13D03:55:10.402670877 723
6e9b637f 2017.08.11D19:32:36.058830770 783
70c3756c 2017.08.12D18:53:29.659736235 56 
a8f4636b 2017.08.15D10:47:32.423256268 112
befc6ca7 2017.08.09D03:48:48.974271199 871
c8bc95e4 2017.08.10D12:42:14.533837072 586


So returning back to our primal scenario, all we need to do is add the tolerance to the userTable timestamp. Then we can use this verb to join on all the previous columns and and take the first match within the tolerance.

first I will create another column 
that is 5 minutes offset into the future to match on
q)update timestamp:timestamp+00:05,
  timestampOriginal:timestamp from `userTable
q)show userTable
UserId   Date   EventType ItemId Qty timestamp timestampOriginal
1faaa20e 20170805 buy 992098f0 7 19:58:02 19:53:02 
a419e88a 20170806 buy 327d41d0 8 15:12:55 15:07:55 
79cdd3d7 20170807 buy bc8ff04f 0 00:38:48 00:33:48 
4155154c 20170808 buy b7318a04 3 04:21:50 04:16:50 
a22652c8 20170809 buy 9e35c5a0 7 09:01:45 08:56:45 
277f1b93 20170810 buy 8cc964a4 5 07:56:52 07:51:52 
a7966916 20170811 buy c2c5bfaf 9 22:59:05 22:54:05 

Then we will perform the new AS OF JOIN RIGHT
Date ItemId Qty timestamp UserId EventType timestampOriginal
20170805 2fc6d170 5 00:50:21 a5732847 buy 07:45:49 
20170805 78bc9724 5 00:54:57 3f039eb6 buy 04:12:17 
20170805 3135e6a8 8 00:57:40 705b75d5 buy 05:03:23 
20170805 4a4b9e36 9 01:42:09 6987be4e buy 02:00:52 
20170805 a6b5e1f3 5 01:59:56 3850deb3 buy 05:53:19 
20170805 d9d0a16c 9 02:12:31 620bda64 buy 10:39:09 
20170805 7c2652a7 3 02:30:49 7af80798 buy 04:19:24 
20170805 c90c6086 5 02:42:07 d4256817 buy 04:55:00 
20170805 be19e79f 6 02:43:57 5c0d3d58 buy 09:38:14 
20170805 264e5489 9 03:36:12 82b8912f buy 04:20:57 
20170805 0ba0bfd2 0 04:01:53 4f996c1d buy 16:11:22 
20170805 cb3e189d 0 04:07:03 675cd3b7 buy 08:43:38 
20170805 f2c41bca 6 04:36:03 855525d7 buy 09:33:50 
20170805 8eec99e2 2 04:36:26 15cab5af buy 05:07:27 
20170805 34894f41 9 04:42:40 ba2ca22d buy 06:11:16 

/It will match to the first at or after the timestamp,
 since we pushed the timestamp up 5 minutes it will be from within 
the tolerance to possibly the last index if no matches.
The issue is that we can now be outside the tolerance 
on the other side, to prevent this we can simply filter
We filter where the matches exist and 
timestamps are within the tolerance
q)select from joined where not null UserId,00:05>timestampOriginal-timestamp
Date ItemId Qty timestamp UserId EventType timestampOriginal
20170805 ac2dbea6 1 07:21:21 99303d09 buy 07:20:27 
20170805 b842007c 2 11:05:40 5171565d buy 11:04:20 
20170805 2d948578 7 19:10:14 75344162 buy 19:13:26 
20170805 c061f673 5 23:20:43 40cf305a buy 23:20:55 
20170805 9396fa95 2 24:25:48 d7482537 buy 24:22:43 
20170806 a73e1906 0 04:45:55 5c7ebb13 buy 04:43:15 
20170806 49ad69e9 6 08:16:19 c6df7f37 buy 08:18:55 
20170806 931226c9 3 20:39:59 30789ba9 buy 20:34:59 
20170807 39e53bb2 9 03:43:39 ccc71cae buy 03:47:14 
20170807 40f3f6de 8 05:56:22 fc967f51 buy 05:59:30 

That’s it.

here is all the code to simulate this:

trunID:{`$8#string x};
genID:{trunID each x?0Ng};
itemids:genID 1000;
userids:genID 100000;
sysTable:([] Date:i#20170805+til 100;
sysTable:`Date`timestamp xasc sysTable;
 Date:i#20170805+til 100;

k)ajr0:{.Q.ft[{d: z;$[&/j:-1<i:(x#z)binr x#y;y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y};
k)ajr:{.Q.ft[{d:x_z;$[&/j:-1<i:(x#z)binr x#y;y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y};
update timestamp:timestamp+00:05,
 timestampOriginal:timestamp from `userTable;
userTable:`Date`timestamp xasc userTable;

show select from joined where not null UserId,00:05>timestampOriginal-timestamp;