Timeouts are a hack.
Timeouts can be worse than a hack.
Timeouts with Retry logic are almost always a recipe for disaster.
Because Timeouts are at best a hack they should be avoided.

Today, I was upgrading our micro-service backtester infra. At the core layer, there is a piece of code that is responsible for connecting services to other services. This code has clearly been a source of trouble in the past. Signs of wear and tear include: copious use of globals, lots of debugging statements, and additions of randomness in choosing values. Any micro service based architecture is always going to get stressed in one place and that is at the edges between services. Having verbose debugging on it’s own is a good thing, but almost always is an after thought. Adding globals is a good way to allow inspection of a live system. Adding randomness when choosing defaults for retries and timeouts, well that’s a whole other level.

What went wrong? At the face of it, inter process requests are simple, especially when they can rely on tcp. A tcp connection is instantiated an asynchronous request is made and a reply is expected. On a sunny day, this works! Suppose the service you are requesting is either busy or dead? How would you tell the difference. Well if it’s dead, you can’t instantiate that tcp connection. One option is on learning that the service you rely on is dead you should just die as well. But what if there was a network partition and if you had only tried one more time, you could have recovered. Enter retry logic. What if the first attempt you made to connect somehow got lost, you should probably just give up after some time and try again later. What if you connected, but subsequently the service died. Well if you don’t hear from the service within some reasonable amount of time, you should try to reach another viable service. Well between all of these many retries and timeouts, if a bunch of services all try to reach one instance in particular that instance can freeze. Then they will all try again and this cycle will persist. So adding a bit of randomness can help keep a bunch of clients from locking one service up. Essentially each client agrees to try again, but in an unspecified amount of time, so that a coordinated request is unlikely. However, this just defers the problem, because if the amount of time that it takes to service all the requests exceeds the maximum amount of randomness inserted into the configuration. The service will still feel like everyone is asking it to do something at the same time. So today, when I restarted the system, every single service piled in to request the same piece of data from one service, the service got overwhelmed, then they all tried again at random intervals, but the service was still dealing with overflow from previous requests, which it could no longer honor, since they had timed out the previous query. REPEAT.

This leads, me to my first point, timeouts are a hack. Sure, there is probably some case where it is the proper thing to do. However, if you are willing to retry the same connection 3 times for 10 seconds, you are better off trying once for 30 seconds, especially if you are on TCP, where ordering of messages is guaranteed. If you are going to try different sites/connections, you are still better off trying less times and waiting longer. If you are patient, your probability of success is the same.

Suppose during a pandemic you go to buy toilet paper. When you get there you see a line going out the door, you are not sure if there will be toilet paper left when it is your turn. You balk and then come back 15 minutes later. If you couldn’t do any useful work in those 15 minutes, you might as well have waited on line. The supply of toilet paper does not depend on how many times you balk and rejoin the line. Counting the members on the line is a much better proxy, than a timeout. Asking the other members, how much toilet paper they expect to buy is even better. You might even be able to form a consensus among your peers on whether you expect there to be toilet paper. Perhaps, they have recorded their interactions with the store and know what quality of service you can expect.

All of these alternatives have analogues in a microservice infrastructure, I will not go through them here, they deserve their own post.

We are committed to the timeout/retry gambit. As far as I know the best timeout/retry combo is implemented in tcp. It uses Exponential backoff simply put, when you encounter a failure to reach the service, you try again but each time you wait a multiple(>1) of the period you waited last time before retrying again. We still need randomness. Otherwise, we will have just scheduled collisions exponentially far apart. Instead each of the shoppers multiplies their wait period by a random number between one and two raised to the power of retries. This spreads out the requests evenly each time increasing the amount of time that the requests are spread over accounting for the evidence that repeat failure indicates there are many clients.

Fifo Allocation in KDB

One of the most common questions in trading is how to attribute profits and losses. There are many ways of doing this each have their own advantages one of the simplest is first in first out allocation. Here is the basic concept illustrated with an example, I buy 7 apples for 10 dollars, sell 4 for 11 dollars, buy 2 more for 12 and sell 3 for 11 and then sell 2 more for 12. We can think of having a list of purchases 7,2 and list of sales 4,3,2. Each sale and purchase are associated with a price. No matter the allocation strategy, we know the profit can be calculated by a weighted sum of the sales ((4*11)+(3*11)+(2*12)) minus a weighted sum of the purchases ((7*10)+(2*12)) –> 7 dollars. Allocation tries to answer the question of which trades were bad. First in first out will allocate the first 7 sales to the first 7 purchases (7 dollars profit), leaving 2 more purchased and 2 more sold and allocate those to each other (0 profit). This tells us that all the profit was made from the first 7 purchased. However, a last in first out strategy might instead see this ((4*11)-(4*10)+((2*11)-(2*12))+((1*11)-(1*10))+((2*12)-(2*10)) which suggests that the second purchase at 12 dollars was a bad trade.

We leave the discussion for more elaborate allocation methods and their respective merits for another time. In this post we will focus on the mechanics of efficiently calculating fifo allocations. Luckily, Jeff Borror covers an elegant implementation in Q for Mortals, which I will only briefly repeat here. The idea is that you can represent the buys and sells as a matrix, where each cell corresponds to the amount allocated to that purchase and sale. By convention rows will correspond to purchases and columns to sales. So in our example, we can write the allocation as

 | 4 3 2
-| -----
7| 4 3 0
2| 0 0 2

I left the corresponding order quantities in the row and columns as headers but they are actually implied. Jeff also gives us the algorithm that produces this matrix.

  • First we calculate the rolling sums of the purchase and sales
    • 7 9 for purchases
    • 4 7 9 for sales
  • We then take the cross product minimum
    • 4 7 7
      4 7 9
  • We then take the differences along the columns
    • 4 7 7
      0 0 2
  • We then take the differences along the rows
    • 4 3 0
      0 0 2
  • We are done, as a bonus here is the code in Q
    • deltas each deltas sums[buys] &\: sums[sells]

Not only is this rather clever, there is a certain logic that explains how to come to this result. The cumulative sums tells you how much max inventory you have bought or sold till this point. The minimum tells you how much you can allocate so far assuming you haven’t allocated anything. The differences down the columns subtracts the amount you have already allocated to the previous sales. The differences along the rows tells you how much you have already allocated to the previous purchases. Since you can only allocate what hasn’t yet been claimed.

Having read this far, you should feel confident you can easily do fifo allocations in KDB. I know, I did. There are even stack overflow answers that use this method. There is one problem, that occurs the moment you start dealing with a non trivial number of orders. This method uses up n^2 space. We are building a cross product of all the buys and sells. We know that the final matrix will contain mostly zeros, so we should be able to do better. We can use the traditional method for doing fifo allocation. Keep two lists of buys and sells allocated thus far and keep amending the first non zero element of each list and created a list of allocations triplets, (buy, sell, allocated). Although, this is linear implementing this in KDB is rather unKdb like. For incidental reasons, amending data structures repeatedly which is what this algorithm entails is best done by pointers, which in KDB means using globals and pass by reference semantics. It’s not long, but it’s not pretty.

fastFifo:{[o]/takes orders signed to neg are sell
signed:abs (o unindex:group signum o);/split into buys and sells
if[any 0=count each signed 1 -1;:([]row:();col:();val:())];/exit early if there are no buys or no sells
k:(/Brains of the calculation called k (for allokate)
.[{[a;i;b;s]/args are allocations triplets(a), index(i), buys(b), sells(s)
if[0=last[get s]&last[get b];:(a;i;b;s)];/if either there are no more buys or sells to allocate, return the current allocations
l:a[;i];l[2]:$[c:b[l 0]<=s[l 1];b[l 0];s[l 1]];/l is the current allocation, if (condition c) is buys are bigger allocate the sell qty, otherwise the buy qty
(.[a;(til 3;i);:;l+(0 1 0;1 0 0)c];i+:1;@[b;l 0;-;l 2];@[s;l 1;-;l 2])}]/depedning on the c increment either buy or sell pointer
);/end definition of k
`.fifo.A set (3;count[o])#0;/create global list of allocation triplets, size of total orders is at least 1 more than max necessary size, consider a case where you have 10 sells for 1 qty, 1 buy for 10, you will have 9 allocations
`.fifo.B set signed 1;/Set the buys
`.fifo.S set signed -1;/Set the sells
res:{delete from x where 0=0^val} update row:unindex[1] row, col:unindex[-1] col, next val from flip `row`col`val!get first k over (`.fifo.A;0;`.fifo.B;`.fifo.S);/delete the cases of 0 allocation, return the original indexes of the orders after apply k until the allocations stop changing, return the allocations as a sparse representation of the matrix
delete A,B,S from `.fifo;/delete the global variables to clean up
res}/return the result

This algorithm, has sufficient performance that we could stop there. However, we could ask is there a way to get all the benefits of the elegant array solution without paying the space cost. The answer is that we can, the trick is that as we have noticed most of the matrix will actually be filled with zeros. In particular, we can see that the matrix will essentially traverse from the top left hand corner to the bottom left hand corner. If we could split the problem into small pieces and then stitch the solutions together we would have the original path of allocations.

I will now briefly sketch out how we can split this problem into pieces and then I will present an annotated version of the code.

If we just look at the first 100 buys and first 100 sells. We can simply apply Jeff’s algorithm. If we wanted to apply it to the next 100 buys and next 100 sells, we would find that we have a problem. We need to know three things, we need to know the index of the buys and sells we are currently up to and any remaining buys and sells that we have not allocated to yet in the previous iteration. Strictly speaking we can only have unallocated quantities on one side, but it is easier to simply keep track of both and letting one list be empty each time.

Here is the annotated code:

fifoIterative2:{[o]/takes orders signed, neg are sell
signed:abs o unindex:group signum o;/split into buys and sells
iterate:{[a;b;s]/Brains of the calculation called iterate takes accumulator (a), chunks of (b), chunk of sells (s), accumulator has three slots, slot 0 is for the current count where we are up to in the buys and sells, slot 1 keeps track of the unallocated qty from the previous iteration, slot 3 keeps track of the currrent allocations
if[any 0=sum each (b:a[1;`b],:b;s:a[1;`s],:s);:a];/concatenate the unallocated qty to the respective places and check if one is an empty list, if so exit early and return the accumulator
t:sm m:deltas each deltas sums[b]&\:sums s;/apply Jeff’s algorithm and shrink matrix to sparse rep
c:(sum .[0^m;] ::) each 2 2#(::),i:last[t][`col`row];/Sum the last col, and last row allocated so that we can calculate the amount allocated to the last buy and sell respectively
a[1]:`s`b!.[i _’ (s;b);(::;0);-;c];/drop all the buys and sells that were fully allocated and subtract the amount allocated on the last buy and sell, from the first of what’s left assign this to slot 1, which holds the remainder
a[2],:update row+a[0;`b], col+a[0;`s] from t;/Append to slot 2 the current allocations, updating the indexes using the counts from slot 0
a[0]+:(`s`b!(count[s];count[b]))-count each a[1];/update the count, adding to the count the currently allocated and subtracing the remainder
a};/return a
k:max div[;200] count each signed;/calculate the max  number of rows so that each buys and sells has at most 200 elements, configurable to the maximize efficiency of the machine
res:last iterate/[(`s`b!(0;0);`s`b!(();());sm 1 1#0);(k;0N)#signed 1;(k;0N)#signed -1];/reshape the buys and sells and iterate over the chunks, select the last element which is the allocations
update row:unindex[1]row,col:unindex[-1] col from res}/reindex the buys and sells into the original indexes provided by the input o

This code actually performs much faster than the iterative version and is shorter. This code essentially has no branching except where there are no more allocations on one side and exits early.

If anyone has a better way of computing fifo allocation let me know in the comments.

Below are some timings and graphs of memory usage. The code can be found here.

The Is-Ought Distinction

As a topic, the is-ought (or fact-value) distinction has been beaten to death. It is so straightforward that it is surprising to see a very intelligent person like Sam Harris completely misunderstand it.

First, let’s go over the is-ought distinction. Famously pointed out by Hume, the idea is that no set of ‘is’ statements can lead to an ‘ought’.

For example:

The person is dying.
This medicine on the counter can save her.
-> the person ought to take the medicine

This simply doesn’t follow unless you believe the person ‘ought’ to live.

Now, to show that I don’t misunderstand Sam, I’m going to steelman his position as is done in the rationalist community.

We start from a naturalist claim. Consciousness is a natural phenomenon that each of us experiences and cannot really ignore. Once we acknowledge that consciousness exists and that we are conscious beings, we must acknowledge the reality that we are programmed to avoid suffering. To illustrate this point, Sam will note that we will pull our hand off a fire. In other words, consciousness is the proximate cause of human behaviour (as opposed to the root cause, which is the big bang or whatever happened at the beginning of the universe). That cause works in predictable ways that science can study. Therefore, we can posit that the subject of a science of morality is to figure out how to achieve the types of goals that are built into the structure of consciousness in the most effective way.

The only axiom needed is that navigating away from the worst misery and suffering is a good place to start such a science.

Sam compares this field to the study of medicine, which most acknowledge is a science. The goal of medicine is to figure out what substances will cause a human to live a healthier life. So medicine generally concerns itself with efficacy of taking pain away from living things.

The only axiom underpinning medicine is that it is worthwhile to study how to keep living things, specifically human beings, healthy.

Therefore, there is nothing more special about the science of morality than the science of medicine.

Now, I will show you why Sam is wrong. What Sam seems to miss is that the ‘ought’ itself is the moral claim and is not part of the science. Simply put, medicine does not tell you that you ‘ought’ to give the dying man this pill that will save him. It only tells you that doing so ‘will’ save his life. This explains why we need the hippocratic oath in addition to the study of medicine. It also explains how Nazi doctors, who performed extremely immoral scientific studies on people, were in fact practicing a form of science. If Sam wants to study what will cause the greatest well being in conscious minds, that’s fine. There is nothing that his science will compel us to do, until we bring our values to the table.

For instance, if the person in the original syllogism who was dying was a ‘bad’ person, we arguably ought not to give her the medicine. However, the scientific fact that medicine would save her life remains. Or suppose the person is really old and wants to die, she might choose not to take the medicine. Again the scientific fact about the medicine saving her life remains. The same is true of Sam’s science of morality. It will tell you what things maximize well being without telling you what to do about them. It might be true that acting in a way that causes humanity to suffer will indeed decrease the general well being of conscious agents (indeed this seems almost tautological) and is against the desire of each agent. However, it does make a rational agent compelled to act one way or another. For instance, we can easily imagine a situation where one conscious agent wants to hurt another, because they are a psychopathic serial killer. We can imagine putting the psychopathic serial killer on drugs that will cause them to stop wanting to kill the victim. This will add to the well being of the victim and, if the drugs are really good, they might also add to the well being of the killer. However, during that intervention, forcing the drugs down the throat of the killer, will certainly add to the suffering of the killer, who wants nothing more than to strangle his victim. I am sure what the moral answer should be, but the moral science will permit me no such claims. It will only be able to measure all of the well being of all the agents and tell me how all the agents respond to different treatments.

Sam might be right, that we generally understand that the goal of medicine is helping people lead healthier lives. However, that doesn’t acknowledge where the science is a science and where it is just the overwhelming social consensus of what to do with the knowledge. He can even claim that such a distinction is not useful in medicine. although the entire field of bioethics would seem to discredit such a claim. It is also true, that any study of morality must rely on a solid foundation of facts. Those facts might belong to the field of science of morality, which will tell you the consequences of actions on conscious beings. You cannot make an informed moral decision without the facts. So the facts are a necessary, but insufficient condition for moral claims.

If Sam were right, then a science of Sadism would compel me to be a sadist. Yet, we might want to study what causes suffering and how to cause the greatest amount. This field might spawn other useful subfields such as dentistry. The choice of the field of study is in itself a value judgement, in so far as it prioritizes time across an infinite array of fields and focuses on some particular domain. For example, we can agree that the government should sponsor basic research in astronomy, but not astrology. People can disagree with that statement in many ways, but none of those claims and statements are about facts. The facts remain, that astrology studies occult things and astronomy studies the cosmos. Choosing between these disciplines is a valid choice and a sufficiently functional society can choose to allocate its resources into either field.

There is a separate argument to have with Sam’s choice of axiom. We can imagine a drug that will give everyone the conscious feeling of well being, and yet it might be immoral to force everyone to take it.

I know that Sam is very smart and so I have tried as hard as possible to defend his position in a strong way. If anyone can help me make his case stronger, that is welcome.

It’s Flooding on all sides!

This post is inspired by a happy coincidence, I was doing p2 of day 14 of advent of code for 2017 and I had never formerly encountered the flood fill algorithm. This confluence of events inspired me to come up with a new way to solve this classic problem. 

You are given a boolean matrix and told that points are neighbors if they are either next to each other horizontally or vertically, but not diagonally. If point (a) and point (b) are neighbors and point (c) is a neighbor of point (b) then (a),(b) and (c) are in the same neighborhood. A point with no neighbors is a neighborhood of 1.  You are asked how many such neighborhoods exist. 

Shamelessly borrowing the picture from adventofcode:

|      |   
V      V   

You can see that this piece of the matrix, has 9 neighborhoods. Points that are off in the boolean matrix are simply ignored. 

This problem is a classic because it is apparently how that bucket fill tool works in every paint program. Essentially you keep filling all the pixels that are neighbors until you are done. 

Now the classical way to solve this problem is flood fill. Essentially the problem is broken into two pieces:

  • An outer loop that visits every point
  •  An inner loop that looks for neighbors using breadth first search or depth first search

Here I will annotate the Q code that achieves this result.

ff:{[m] /this is flood fill function
	/ returns a table with all the distinct neighborhoods
	/m is a matrix
	/c is the number of rows
	/columns this matrix has, we assume it's square
	c:count m;
	/we flatten m; m is now a list of length c*c	
	m:raze m; 
	/we create t, a table that is empty
          /and has columns id, and n 
	 /id will be the id of the neighborhood
	 /n will be a list of neighbors in that hood
	/gni is a helper function that get neighbor indexes
	/gni will accept the index as an offset 
	/ from the flattened matrix 
	/and will return coordinates as flat offsets.
	/internally it needs to represent 
             /the coordinates as a pair,
	/ we use standard div and mod to represent the pair 
	gni: { [x;y]
	  /x is the size of the matrix it will be fixed to c
	  /y is the point
	  /s whose neighbors we want to find
	  /ny are the new col coordinates
	      / found by dividing the point by c
	      / we add the offsets 
	      / but makes sure we are within boundaries of matrix
	      / min 0  and max c-1
    ny:0|(x-1)&0 -1 1 0+/:y div x;
		  /repeat this process to calculate row coordinates
    nx:0|(x-1)&-1 0 0 1+/:y mod x;
      /get back original flattened index
       /by multiplying ny by c and adding nx
      /flatten list and only return distinct points
	 distinct raze nx+x*ny
	 /project function onto c, since it will be constant 
	 / for the duration of the parent function 

   /f is our breadth first search function
   /we start by adding i, which is our first point, 
   /we then find all the gni of i,
   /we remove any gni[i] that are already present in i
   /and intersect the new points that are on (=1)  
   /by intersecting with j
   /we project this function on gni, 
   / which is called gni inside as well.
   /we also project on the points that are on 
   /(are equal to 1) aka (where m)
   / and which are named j inside the function)
	 f:{[gni;i;j](),i,j inter gni[i] except i}[gni;;where m];
   /repeating f over and over will yield
   /  all the neighbors of a point that are on (=1)

   /now we get to the outer loop, 
    /which will check 
    /if we have seen the current point  (p)
   /if we have seen it, we ignore it,
    /and return our current table (t)
   /otherwise, we call f on that point and return 
    /the point and all of it's neighbors
   /this will end up with our new table t, 
   /we only call this function on the points 
   	/that are turned on, 
   /which means we can skip checking if the point is on  
  /we will repeat this for all the points equal to 1      
	t:{[f;m;t;p]$[not p in exec n from t;
						t upsert (p;f over p);
						t]}[f;n]/[t;where m];

The entire code without comments looks like this, and is pretty short.

	c:count m;m:raze m; t:([id:()]n:());
	gni:{distinct raze (0|(x-1)&-1 0 0 1+/:y mod x)+x* 0|(x-1)&0 -1 1 0+/:y div x}[c];
	f:{[gni;i;j](),i,k where j k:gni[i] except i}[gni;;m];
	t:{[f;m;t;p]$[not p in raze exec n from t;t upsert (p;f over p);t]}[f;n]/[t;where m]

With that out of the way, I now want to explain an alternate way of achieving something essentially identical. 

A common technique in math and engineering fields is if you can’t solve a problem, find a similar problem that is easier that you can solve and solve that first.

So I pretended that all I needed to do was to find runs, which is a set of consecutive 1s in a row. I figured if I do this for all the rows and all the columns, I’ll have a starter set of neighborhoods and I can figure how to merge them later. 

Finding a run is really simple:

  1. set the neighborhood id to 0
  2. start at the beginning of first row
  3. look at each element,
  4. if you see a 1 add element to the list that corresponds to your current neighborhood id
  5. if you see a zero increment the neighborhood id
  6. if next row, go to the next row, increment neighborhood id, otherwise go to step 8 
  7. go to step 3
  8. remove neighborhood ids with no elements in them. 

In q, you can take a bit of a shortcut by asking each row, for the indexes with a 1 in them. So for example:

q)where 1 0 1 1 0
0 2 3
q) deltas where 1 0 1 1 0
0 2 1

One thing we notice is that consecutive differences  of 1 represent runs and those with any other number represent cut points, and if we cut at those points, we would get  (,0; 2 3) which is a list that contains two lists, each inner list is a run in that row. We can repeat this for the columns and create all the runs that exist row wise or column wise. 

Now we just need some logic to merge these runs. Here is an annotated version of the code that implements this method of finding all the neighborhoods. 

rr:{[m] /This will return all the points in the matrix that are on and their assigned neighborhood id
	/m is the square matrix
	c:count m; /c is the length of a side of the matrix 
	m:raze m; /m is now the flattened list that corresponds to the matrix
	/runs is a function that will return a list of lists, 
	/each inner list corresponds to points that are adjacent
	/Though the mechanism to calculate adjacency is in terms of y
	/It will return the Xs that correspond to those Ys. 
	/eg if we have a row 0 1 1 0 1 0 1 1, and the points are labeled `a`b`c`d`e`f`g`h
	/runs[0 1 1 0 1 0 1 1;`a`b`c`d`e`f`g`h] => (`b`c;,`e;`g`h)
	runs:{(where 1<>-2-': y) _  x}; 
	/now we want to apply runs to every row and column
	/n will be our initial list of neighborhoods,
	/ each neighborhood is a list that will contain adjacent elements either by row or by column
	/ elements are given a unique id
	/i will be our unique id , it is the index of the point that is turned on (=1)
	i:where m;
	/j groups all 'i's that are in the same row, by using integer division
	/so j is a dictionary whose keys are the row index and whose values are the 'i's 
	/we need to do this grouping to ensure that only points in the same row are considered adjacent
	/ otherwise we might believe that a diagonal matrix has runs (because we only see the column indexes)
	j:i group i div c;
	/since runs returns x, but uses 'y's to do the splitting, we provide j mod c as y
	/j mod c corresponds to the column location of each i
	/we then  apply runs on (')each j and j mod c. 
	/finally we flatten the result with a raze so that we just a have the list of row neighborhoods
	n:raze runs'[j;j mod c]; 
	/now we will repeat with the columns
	/we want to permute the list i so that the elements appear in column order
	/we want to preserve the labels we have in i
	/so we generate a list that is the length of m (c*c),
	/we reshape (#) to be a square matrix, and transpose it along the diagonal (flip)
	/we flatten(raze) it back out and intersect it with i, 
	/this preserves the column order we achieved from doing the flip, 
	/but only keeps the elements that are in i
	/k are the elements of i permuted along columns
	/eg if we have a matrix 
	//           0 0 1 
  	//           1 0 0
  	//           0 1 0
  	// i would be 2 3 7
  	// and k would be 3 7 2
	k:(raze flip (c;c)#til c*c) inter i);
    /now we will create the j  again which is our grouping 
    /but this time according to the column index so we will group by the integer remainder (mod)
    /this will separate our indexes into columns
    j:k group (k mod c);
    /then we will apply our runs function on each column
    / we use the div to get the row index of each element, 
    / so that runs can tell if two points are vertically adjacent
    / we flatten the result and append it to n
    n,:raze runs'[j;j div c];
    / at this point we have a list of all possible neighborhoods, but there are overlaps
    / so we create a table t, which will assign an id for each neighborhood contender
    / ungroup flattens the table so that every element belongs to a neighborhood id
    // here is a sample of what the table looks like before and after ungroup
    / before ungroup:
	/				n         id
	/				------------
	/				0 1       0 
	/				,3        1 
	/				,5        2 
	/				8 9 10 11 3  
	/after ungroup
	/				id n 
	/				-----
	/				0  0 
	/				0  1 
	/				1  3 
	/				2  5 
	/				3  8 
	/				3  9 
	/				3  10
	/				3  11
    t:ungroup `id xkey update id:i from ([]n);
    / f is our merging function.
    / It will merge neighborhoods together, 
    / by giving them the same id if there are points in common.
    / First we will update/create a column p which is the minimum id for a given element (n)
    / Then we will update all of the ids the with minimum p and rename that the new id
    / If two neighborhoods share a point in common, that point has two neighborhood id's
    / we will assign it primary id (p) which is the min of those two ids
    / we then group by id and assign the id to be the min of the primary ids (p)
    f:{[t]update id:min p by id from update p:min id by n from t};
    /On each iteration of this function we will merge some neighborhoods, 
    / we repeat until we can't merge anymore
	f over t}

The uncommented code is much shorter and looks like this:

c:count m;m:raze m; runs:{(where 1<>-2-': y) _ x};
n:raze runs'[j;(j:i group (i:where m) div c) mod c];
n,:raze runs'[j;(j:k group (k:(raze flip (c;c)#til c*c) inter i) mod c) div c];
t:ungroup `id xkey update id:i from ([]n);
f:{[t]update id:min p by id from update p:min id by n from t};
f over t}

At this point we are done and we have solved the puzzle in two slightly different ways. Looking at the performance of both methods, reveals that they are both O(n) essentially linear in the number of elements. However, the row method is actually faster, because everything is done as nice straight rows, so there is not much jumping around, in other words it is more cpu cache friendly. On my late 2011 Mac: for a 128 by 128 matrix
\t rr m
176 milliseconds
\t ff m
261 milliseconds

So it is about 66% faster

As a bonus, my father suggested that I expand this code into n dimensions. This is rather easy to do and I include the code here without too many comments. The trick is taking advantage of KDB’s vs (vector from scalar) function that can rewrite an integer as a number in another base:
so for example 3 vs 8 => (2;2) which are the coordinates of the point 8 in a 3 by 3 matrix. The point 25 is 2 2 1, in other words it’s in the last layer of a 3 dimensional cube of size 3, last column and first row. 
This means we can easily go back and forth between integer representation of the matrix and coordinate representation. Everything else is pretty much the same. 

c:count m;r:1; while[(count m:raze m)<>count[m];r+:1]; t:([id:()]n:());
gniN:{[r;c;n]a:neg[a],a:a where 1=sum each a:(cross/)(r;2)#01b;
distinct raze -1 _ flip c sv flip (c-1)&0|a +\: c vs n,-1+prd r#c}[r;c];
f:{[gni;i;j](),i,j inter gni[i] except i}[gniN;;where m];
ps:where m;
while[count ps;t:t upsert (first ps;k:f over first ps); ps:ps except k];
/functional version is slower
/t:{[f;t;j]q+::1;$[not j in raze exec n from t;t upsert (j;f over j);t]}[f]/[t;where m]

c:count m;r:1; while[(count m:raze m)<>count[m];r+:1]; runs:{(where 1<>-2-': y) _ x};
j:(c sv (til r) rotate\: c vs til prd r#c) inter\: where m;
g:til[r] except/: d:reverse til r;
n:raze raze each runs''[j@'k;@'[v;d]@'k:('[group;flip]) each @'[v:flip c vs j;g]];
t:ungroup `id xkey update id:i from ([]n);
f:{[t]update id:min p by id from update p:min id by n from t};
f over t}


This next article was inspired by two sources:

An Advent of Code puzzle from 2017 

And a playing with J article:

An evolute is a matrix whose cells are numbered to spiral out from the center.

17  16  15  14  13
18   5   4   3  12
19   6   1   2  11
20   7   8   9  10
21  22  23---> ...

The first question asks us to find the Manhattan distance between some positive integer cell and the 1st cell.

I see two ways to do this:

  1. Build the evolute, find the location of the integer and then calculate the distance
  2. Calculate the coordinate with respect to the center directly and take the sum of the absolute value of the coordinates. 

I will start with the second method because it is simpler. If we look at the structure of the evolute we will see that each new layer of the matrix will have an odd square in it’s bottom right corner. 1 9 25 49 ….

That means we can locate the layer by finding the nearest odd square root. Here is the code for that:

f:{j:floor sqrt x; j - not j mod 2}
q)f 10
q)f 25
q)f 24

Next we can find the grid coordinate or how far we are from the center for that corner. I plugged in max x so we can use the function on lists of numbers. The ‘?’ verb is being used to find index which corresponds to the steps away from the center.

g:{(1+2*til max x)?f x} 
q)g 10
q)g 25
q)g 24

Now all we need to do is figure out where we are on the layer.

We can be in one of 5 places:

  1. Exactly at the end of a layer
  2. Between the bottom right corner and the top right corner 
  3. Between the top right corner and the top left corner
  4. Between the top left corner and the bottom left corner
  5. Between the bottom left corner and the end of the layer

We can divide by the size of the layer to calculate this. We can also find the remainder to know how far between we are. This gives us the following code:

place:{(x-j*j) div 1+j:f x}
offset:{(x-j*j) mod 1+j:f x}
q)place 1+til 25
0 0 1 1 2 2 3 3 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 0
q)offset 1+til 25
0 1 0 1 0 1 0 1 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0

This allows us to write the full function to calculate the coordinate. All we need to do is to check which of those conditions are met fill in the coordinates from g, plus the appropriate offset and correct sign. 

f:{j:floor sqrt x; j - not j mod 2};
g:{(1+2til max x)?y}; place:{(x-yy) div 1+y};
offset:{(x-yy) mod 1+y}; j:f x; brc:x=jj;
?[brc;flip (grid;neg[grid]);
?[0=p;flip (grid+1;neg[grid]+o);
?[1=p;flip (grid+1-o;grid+1);
?[2=p;flip (neg[grid+1];1+grid-o);
flip (neg[grid+1]+o;neg[1+grid])]]]]
q)coordinate 1+til 9
0 0
1 1
1 1
0 1
-1 1
-1 0
-1 -1
0 -1
1 -1

At this point we can find the coordinate of any cell number. Since we can do this for all the points, we should be able to create the evolute for any dimension by generating all the coordinates and then shifting them to the appropriate column/row number and inserting them into that matrix. We can do this cleverly by sorting the row and column separately, since we want a particular orientation and sorting each of them corresponds to either a horizontal transposition or a vertical transposition. 

evol:{t:`row`col!(x div 2)+flip cordinate j:1+til x*x; (x;x)#exec val from `col xdesc `row xasc flip update val:j from t} 
q)evol 7
37 36 35 34 33 32 31
38 17 16 15 14 13 30
39 18 5 4 3 12 29
40 19 6 1 2 11 28
41 20 7 8 9 10 27
42 21 22 23 24 25 26
43 44 45 46 47 48 49

Now that we can create a evolute and we see that given an evolute we can find the coordinates, we also see that permuting the row and column indexes gives us different orientations. This leads us to see how Joey Tuttle/Eugene McDonnell created an evolute from  scratch. Let’s raze an evolute:

q)raze evol 5
17 16 15 14 13 18 5 4 3 12 19 6 1 2 11 20 7 8 9 10 21 22 23 24 25

Now let us permute it by the magnitude and then look at the differences:

q)iasc raze evol 5
12 13 8 7 6 11 16 17 18 19 14 9 4 3 2 1 0 5 10 15 20 21 22 23 24
q)deltas iasc raze evol 5
12 1 -5 -1 -1 5 5 1 1 1 -5 -5 -5 -1 -1 -1 -1 5 5 5 5 1 1 1 1
q)deltas iasc raze evol 3
4 1 -3 -1 -1 3 3 1 1

We now see a pretty clear pattern. We are repeating a pattern of 1,neg x,-1,x. 
Each time we are taking 2 elements from the pattern. We take an increasing number of them and continue until we take x elements in shot. Here is Eugene’s explanation translated into q

g:{(-1;x;1;neg x)}
k:{(f x)#g x}
h:{1+til x}
j:{-1 _ raze 2#'h x}
l:{-1 rotate raze (j x)#'k x}
m:{iasc sums l x}
n:{(x;x)#m x}
q)f 5
q)g 5
- -1 5 1 -5
q)k 5
-1 5 1 -5 -1 5 1 -5 -1
q)h 5
1 2 3 4 5
q)j 5
1 1 2 2 3 3 4 4 5
q)l 5
-1 -1 5 1 1 -5 -5 -1 -1 -1 5 5 5 1 1 1 1 -5 -5 -5 -5 -1 -1 -1 -1
q)m 5
24 23 22 21 20 9 8 7 6 19 10 1 0 5 18 11 2 3 4 17 12 13 14 15 16
q)n 5
24 23 22 21 20
9 8 7 6 19
10 1 0 5 18
11 2 3 4 17
12 13 14 15 16

Combining this into function we get:

evolute:{(x;x)#iasc sums -1 rotate raze (-1 _ raze 2#'1+til x)#'(-1+2*x)#(1;neg x;-1;x)}

We can shorten it just a bit and make it a bit faster by seeing that grabbing parts j,k,l can make use of how kdb overloads ‘where’. In KDB ‘where’ gives you the indexes of the 1s in a boolean mask. For example:
q)where 101b
0 2
However, a little thought reveals that we are returning the index of the element the number at that index. So we return one 0, zero 1,one 2. Generalizing this, we can think that where of 1 2 3 should return one 0,two 1s and three 2s and indeed KDB does this.
q) where 1 2 3
0 1 1 2 2 2
Knowing this, we can rewrite j k and l, which make heavy use of each and remove all the razing. 
I am going to show how I built this up. 

q){((x-1)#2),1} 5
2 2 2 2 1
q){1+where ((x-1)#2),1} 5
1 1 2 2 3 3 4 4 5
q){(where 1+(where ((x-1)#2),1))} 5
0 1 2 2 3 3 4 4 4 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 8
q){(where 1+(where ((x-1)#2),1)) mod 4} 5 /so that we cycle
0 1 2 2 3 3 0 0 0 1 1 1 2 2 2 2 3 3 3 3 0 0 0 0 0
q){(1;neg x;-1;x)(where 1+(where ((x-1)#2),1)) mod 4} 5
1 -5 -1 -1 5 5 1 1 1 -5 -5 -5 -1 -1 -1 -1 5 5 5 5 1 1 1 1 1
q){-1 rotate (1;neg x;-1;x)(where 1+(where ((x-1)#2),1)) mod 4} 5
1 1 -5 -1 -1 5 5 1 1 1 -5 -5 -5 -1 -1 -1 -1 5 5 5 5 1 1 1 1

The last step recreates l. So the whole function using where looks like this:

evolute:{(x;x)#iasc sums -1 rotate (1;neg x;-1;x)(where 1+(where ((x-1)#2),1)) mod 4}

Part 2 of the advent of code is even more interesting. It describes us filling the spiral by summing the neighbors, the center square starts at 1. Here is the description from advent.

Then, in the same allocation order as shown above, they store the sum of the values in all adjacent squares, including diagonals.
So, the first few squares' values are chosen as follows:
Square 1 starts with the value 1.
Square 2 has only one adjacent filled square (with value 1), so it also stores 1.
Square 3 has both of the above squares as neighbors and stores the sum of their values, 2.
Square 4 has all three of the aforementioned squares as neighbors and stores the sum of their values, 4.
Square 5 only has the first and fourth squares as neighbors, so it gets the value 5.
Once a square is written, its value does not change. Therefore, the first few squares would receive the following values:
147  142  133  122   59
304    5    4    2   57
330   10    1    1   54
351   11   23   25   26
362  747  806--->   ...

What is the first value written that is larger than your puzzle input?

Now given our earlier coordinate function, we just need to add elements in into empty matrix of 0s and sum the neighbors at each step to determine what to add.  First, let’s write the simple pieces of finding the neighbors and summing them. Then we will conquer the composition.

/To find the neighbors according to our rule, 
/ we basically find the coordinates for (1..9)
/ then we can add to x pair so it's centered around that one
neighbors:{x+/:coordinate 1+til 9}
q)neighbors (1;0)
0 1
1 1
1 2
0 2
-1 2
-1 1
-1 0
0 0
1 0
/We need to shift our coordinate system depending on the size of the matrix
shift:{i:div[count x;2]; (i+y)}
q)shift[3 3#0;coordinate 5]
0 2
/calculate the sum of the neighbors
sumN:{sum over .[x] each neighbors y}
q)(1+evolute 5)
17 16 15 14 13
18 5 4 3 12
19 6 1 2 11
20 7 8 9 10
21 22 23 24 25
q)(1+evolute 5)[2;3]
/Manually sum the neghbors of 2
q)sum 4 3 12 1 2 11 8 9 10
q)sumN[1+evolute 5;(2;3)]

Okay, we are going to start with a matrix that has single 1 in the center.

center1:{.[(x;x)#0;(a;a:x div 2);:;1]} /we are ammending a 1 at the x div 2
q)center1 5
0 0 0 0 0
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
0 0 0 0 0

Now that we see how easy it is to amend an element at a particular index, we can combine this with the notion of a fold(over). The idea is that we will keep updating this matrix with new elements based on the neighbor sum:

cumEvolute:{{.[x;y;:;sumN[x;y]]}/[j;shift[j:center1[x]] coordinate 1+til (x*x)]}
q)cumEvolute 5
362 351 330 304 147
747 11 10 5 142
806 23 1 4 133
880 25 1 2 122
931 26 54 57 59
/If we want to rotate it we can simply reverse flip
q)reverse flip cumEvolute 5
147 142 133 122 59
304 5 4 2 57
330 10 1 1 54
351 11 23 25 26
362 747 806 880 931

Using cumEvolute we can find when we generate the first integer bigger than the input, by creating a stopping condition and counting the number of iterations.

stopCum:{{$[z<max over x;x;.[x;y;:;sumN[x;y]]]}/[j;shift[j:center1[x]] coordinate 1+til (x*x);y]}
0 0 0 0 0
0 11 10 5 0
0 23 1 4 0
0 0 1 2 0
0 0 0 0 0
q)max over stopCum[5;20]
q)max over stopCum[10;100]

Controlling the Tower

I recently came across this problem and it looked interesting enough, so I solved it. It’s from Topcoder, which is nice because they provide test cases so you can be sure that you have solved it.

Unfortunately, Topcoder has dracanion copyright laws, so I will paraphrase the problem in case the link to the description dies. You want to control some territory, the territory has t (an integer) towers and each tower is a certain number of stories s (an integer). To control the territory you must control the majority of the active towers. A tower is active if someone controls the majority of stories in a tower. If a tower has no majority controller, it is inactive and does not count. What is the minimum number of stories you must win to guarantee control of the territory?

The first thing that helped me was converting the test cases into the following two lists and created a table:


q)select results, inputs from asc t  / just to give a visual sample

results inputs
1      1
2      2
2      3
3      4
3      5
4      6
24     47
2      1 1
2      1 1 1
4      1 1 1 1 1 1
699  1 1 100 100 100 100 200 200

That way, I could write a function and start testing and seeing if I was making progress.

If there is only one tower, then I obviously need to take it, which means I need to take the majority.

So that is pretty easy:

f{[x] if[count[x]=1; :sum 1+floor x%2]; :0} /one tower,
/ otherwise return 0 this is wrong but we want to return a number

I can run my function and see which cases I still haven’t solved:

q) select myResult, results, inputs from (update myResult:f each inputs from t ) where myResult<>results

This gave me a way to check if I was making progress.

Next, we know that we need to win the majority of the towers and we need to win them in the most expensive way possible in order to guarantee a victory because we don’t get to chose how our stories our distributed. So we sort the towers desc by height and fill up half of them. Then comes the tricky part, the opposition can have a simple majority in all the other towers. In that case, there will be one final tower that decides everything.

For example, suppose there are 5 towers, 6 6 5 5 4. We need to win three towers. So we win the first 2 by getting 12 stories. Then we only need one more tower but the opposition can strategically win the shortest towers. It’s important to note that winning a 4 story tower and a 5 story tower takes 3 either way. So they can do better by winning the two 5 story towers and simply stop me from winning the 4 story tower by taking 2 stories.

This explains the following slightly convoluted algorithm:

if[count[x]=1;:sum 1+floor x%2]; /one tower
t:floor count[x]%2; / majority of towers, one less if there are an odd number
v:sum t#x:desc x; /most expensive way to win t towers
r:reverse (t-count[x])#x; /remaining towers
o:sum 1+floor %[;2] t#r; /opposition needs to take at least t towers
o+:sum floor 0.5+%[;2] d _ r; /oppositions needs to prevent the rest
o-:(any(d _ r)mod 2)*(any not(d#r)mod 2) ;
/subtract if opposition wins during preventing
/ and you can strategically exchange an odd tower with an even one

If anyone has a simpler, more straightforward approach, or wants to tell me how my approach is wrong, I would love to hear about it.

Commutative Business Days

The concept of business days is pretty straightforward. For most people, business days are the 5 days a week we commute to work.

Questions like:

  1. What is the next business day?
  2. What was the previous business day?
  3. List the next 5 days.

are all pretty straightforward and I expect that the average high school graduate can answer it with the aid of a calendar (assuming it includes holidays).

For some reason, the implementations I’ve seen have been pretty hideous. So here is, in my opinion, an elegant solution that also demonstrates some neat features of KDB/Q.

The first step is to get all the business days in the year. (Depending on your region/industry you might have different days. I’ll use USA business days M-F with 10 government holidays (see US court business days).

/1 calendar year
year:2018.01.01+til 365
holidays:2018.01.01 2018.01.15 2018.02.19 2018.05.28 2018.07.04 2018.09.03 2018.10.08 2018.11.12 2018.11.22 2018.12.25
yearNoWeekend:year where in[;2 3 4 5 6] year mod 7
buisnessYear:yearNoWeekend except holidays
Once you have the buisnessYear for the relevant period

we need to be aware that the calendar will not work correctly for data we have not loaded

We can create a data structure that will allow us to answer all these types of questions.
The data structure we will be using is a sorted dictionary.
We will have a dictionary, nDay (next day) and a pDay function (previous day).
We can look up any date in our dictionary/function to find the next or previous date.
We use a sorted dictionary so that when we look up non buisness days,
q will return the value of the closest next/previous date.
/create our dictionary and function,
/fill the last day in the next day list with the first day from 2019
nDay:`s#buisnessYear!`s#2019.01.02^next buisnessYear
pDay:{[nDay;d]p:nDay?d; ?[null p;?[nDay] nDay d;p]}[nDay]

/let’s find the next day July 3rd
nDay 2018.07.03
/as expected returns 2018.07.05
nDay 2018.07.04
/as expected also returns 2018.07.05

/let’s find the previous day
pDay 2018.07.05
/as expected returns 2018.07.03
pDay 2018.07.04
/as expected returns 2018.07.03

/let’s find 5th day after a given date
/as expected we get 2018.07.09
/lets find 5th day previous to a given date
/as expected we get 2018.06.25

/We can ask to get the next 5 days
/we get the given day and next 5 days
/2018.07.01 2018.07.02 2018.07.03 2018.07.05 2018.07.06 2018.07.09
/We can ask to get the previous 5 days
/we get the given day and previous 5 days in reverse order
/2018.07.01 2018.06.29 2018.06.28 2018.06.27 2018.06.26 2018.06.25

/We can ask for the next days for 5 dates
nDay 2018.01.14 2018.02.18 2018.05.27 2018.09.02 2018.10.07
/ we get 2018.01.16 2018.02.20 2018.05.29 2018.09.04 2018.10.09

/We can ask for the previous days for 5 dates
pDay 2018.01.16 2018.02.20 2018.05.29 2018.09.04 2018.10.09
/we get 
2018.01.12 2018.02.16 2018.05.25 2018.08.31 2018.10.05

Bonus Section, and the real reason for the title of the essay:

Here is a final piece that might strike you as a bit weird until you untangle it.

In General:

0b=(pDay nDay d)~(nDay pDay d) 

In other words, these functions don’t commute. This seems strange since nDay is simply addition by one and pDay is simply subtraction by one.

This was used by the mathematician Peano to show how you can get all the natural numbers. Simply start with 1 and keep adding 1. If you want whole numbers start with 0. If you want integers create the concept of subtraction and you can count your way to negative infinity.

This is called Peano arithmetic and it commutes. You can always get back to where you started by simply adding or subtracting one appropriately. The order in which you do so does not matter.

However, business days exhibit a certain property that prevents them from commuting. Namely, although any calendar day has a next business day, not all days are business days. Which means that non-business days are not represented intuitively. For example, the next business day of Saturday is Monday and previous business day of Monday is Friday, but the previous business day of Saturday is Friday and so the next business day of Friday is Monday. Depending on whether we started with pDay or nDay we get different results.

The missing weekend from the business days illustrates the reason why, in general, floating arithmetic does not commute. There are numbers that cannot be represented by floating point arithmetic which means that sometimes the order of additions and subtractions will make a difference to the final result.


Backtracking In Q/ word ladder

This is a variation on the word ladder puzzle. Here is the puzzle:

You are presented with a set of strings (assume they are unique). Each string can be of arbitrary length. Your goal is to find the length of the longest chain.
A chain can start from any string. Each following link in the chain must be exactly one letter shorter than the previous link, any character from any position from the previous link can be dropped, this new link must appear in the set of words given.

For example, if you are given the set  of strings (“a”,”ab”,”abc”,”abdc”, “babdc”,”dd”, “ded”)

“ded” -> “dd”
“babdc” -> “abdc” -> “abc” -> “ab” -> “a”

are valid chains. The longest chain’s length is 5.

First I will present python code that my friend wrote, it is very concise and has a kdb flavor (at least in my mind).


Python Code:

def word_list(array):
  # Storing the solutions for each word in a dictionary
  dict_array = {word:0 for word in array}

  # Sorting the word list by lengths
  array.sort(key = lambda s: len(s))

  # Finding solutions to each word by using the previous solutions
  for word in array:
    options = [0]
    for i in range(len(word)):
      word_minus_i = "".join([word[k] for k in range(len(word)) if k != i])
      #check if the word appears in list of words
      if word_minus_i in dict_array:
        # If it does add a link to that word
    dict_array[word] = max(options)
  return max(dict_array.values())+1

Now to a KDB version:

I had two versions of the code both rely on the same insight as the python code

The difference is in the post processing.

The key insight into this problem is that if you hash all the strings. Then if you create every possible one letter drop you can check if that string is in the legal set of strings, if it is you can record which longer string it maps to.

At that point you have a list of every connection between two strings. Since checking the hash takes constant time and generating every 1 letter missing string takes 1- number of letters in the string* number of unique strings. This processing takes linear time.

At that point, we have a graph, and at first I wanted to reuse the code from the previous post. I would generate all the connected components, the use those nodes to figure out how many different length strings there were. The connected component with the largest number of distinct length strings is going to be the longest chain and the number of distinct lengths is the total chain length. (this code is [3-6]x faster for small inputs (10,000, 100,000 words) than the python but is about the same speed once we get up to a set of 1 million words)

The second version, uses a quicker technique to just calculate the depth of the graph from every node.

Let’s start from the first version:
Let’s generate 10,000 strings (.Q.a is “abcdefghijklmnopqrstuvwxyz”)
We want strings of length between 1 and 20 (1+til 20)
We want ten thousand strings, so we take 10000 of the list 1 2 3 .. 20
We then sample the number from the alphabet (3?.Q.a) gives us a 3 letter string
We do this for each right (\:) number in the list
We only want the distinct words

q)words:distinct words:(10000#1+til 20)?\:.Q.a

Here we are sampling 10 of these words:


So we need a function that will generate the indexes of all the possible 1 letter drops of a word.
The most intuitive way I know to do this, is to create equal length boolean string with one 0 and rotate that 0 around the whole string. So for example in the 3 letter word case we have:
We then find the indexes where there is a 1 and that would gives all the 1 missing indexes.

/0b,(x-1)#1b creates a string of length x with 1 0 value
/rotate\: says to rotate this string for each right argument (which is just the range from 0 to x)
/where finds the indexes that are 1
allDrops:{where each (til x) rotate\: 0b,(x-1)#1b}

Lets do a little processing on the list of strings, and put them into a table:

/sort the words by count of the number of letters and only store the distinct words in hwords

q)hwords:distinct words iasc count each words

/create a table, with the distinct words, the number of letters in the word, and a symbolic version of the word for fast lookup, with a unique attribute so that KDB knows to hash this list

q)show t:([]wsym:`u#`$hwords;w:hwords;c:count each hwords);
wsym w c
s ,”s” 1
t ,”t” 1
j ,”j” 1
n ,”n” 1

/wsym contains a symbolic version of the word
/w is just the original word
/c is the length of the word

We then want to populate a dictionary with all the indexes, so we select the distinct counts of the words and run them through the allDrops function

q)drops:n!allDrops each n:exec distinct c from t /we select the distinct counts run the function and key the distinct counts to the result from the function.
1 | ,`long$()
2 | (,1;,0)
3 | (1 2;0 1;0 2)
4 | (1 2 3;0 1 2;0 1 3;0 2 3)
5 | (1 2 3 4;0 1 2 3;0 1 2 4;0 1 3 4;0 2 3 4)

We do this mostly because we know that we will need these indexes many times, but there are relatively few distinct lengths of words.

Now we calculate all the links:

/drops c will give us a list of lists of all the 1 letter missing possible indexes for each word.
/To show what this does let’s select by c so that we can see what it does for each c
select d:drops first c by c from t
c | d ..
–| ————————————————————————-..
1 | ,`long$() ..
2 | (,1;,0) ..
3 | (1 2;0 1;0 2) ..
4 | (1 2 3;0 1 2;0 1 3;0 2 3)

Because q will apply a particular list on indexes and preserve the shape, we just need to tell it, that each word should be projected onto all of these lists. This is done with the @’ this will pair off each list of lists with the words. Here is what that looks like , again grouping by c so that we can see what that looks like:

q)select by c from update d:(w)@’drops c from t
c | wsym w d ..
–| ————————————————————————-..
1 | x ,”x” ,”” ..
2 | cw “cw” (,”w”;,”c”) ..
3 | fvw “fvw” (“vw”;”fv”;”fw”) ..
4 | yoku “yoku” (“oku”;”yok”;”you”;”yku”)

We see that one letter words, become empty strings, two letter words become a list of words each 1 letter long. three letter strings become a list of 2 letter words.
Now we just need to find where these words are in our original table. If they are not there they will get the 1+last index, which is how q lets you know that a value is missing in a lookup. We will also cast all these strings to symbols so that we can do faster lookups (`$):

/col is the index of the original string and row are all the matches. We are modeling the graph using an adjacency matrix idea.
q)show sparseRes:update col:i, row:t[`wsym]?`$(w@’drops c) from t
wsym w c col row
h ,”h” 1 0 864593
r ,”r” 1 1 864593
f ,”f” 1 2 864593
w ,”w” 1 3 864593
/We see that no 1 letter string has any matches which is why all of the indexes are last in the table, this makes sense since one letter strings can’t match anything but the empty string and all of our strings have at least one character
/Again selecting by c so we can see a variety of lengths,
q)select by c from update col:i, row:t[`wsym]?`$(w@’drops c) from t
c | wsym w col row ..
–| ————————————————————————-..
1 | x ,”x” 25 ,864593 ..
2 | cw “cw” 701 3 11 ..
3 | fvw “fvw” 17269 541 275 357 ..
4 | yoku “yoku” 64705 864593 7279 2368 3524 ..
5 | vvbud “vvbud” 114598 864593 46501 864593

As we can see the row column is a list of matches and also contains false matches that go to the end of the table we would like to clean that up. We can do that using ungroup.
Ungroup will take a keyed table, and generate a row for each item in the list of records.
In our case it will create a col, row  record for each item in the row column.

q)show sparseRes:ungroup `col xkey sparseRes
col wsym w c row
0 h h 1 864593
1 r r 1 864593
2 f f 1 864593

Now we want to delete all of the fake links, so that will be wherever the row is the length of the original table:

q) show sparseRes:delete from sparseRes where row=count t
col wsym w c row
26 bj b 2 2
26 bj j 2 10
27 oj o 2 2
27 oj j 2 12

At this point we are done processing the data.  We have found all the connections between the words, all that is left is to find the connected components, which we know how to do from the previous post. I use a slightly different function that calculates the connected components on a sparse matrix. That is represented as a table with columns row and col.

neighbors: exec col from m where row in j; /now we are searching a table instead of a matrix
f:{n:exec col from y where row in .[_;x]; /new neighbors
x[0]:count x[1];x[1]:distinct x[1],n; /update the two pieces of x
x}[;m]; /project this function on the sparseMatrix
last f over (0; neighbors)};

points:til max m[`row]; /
while[count points;
i:first points;
connected,:enlist n:`s#n:distinct asc i,findConnectedSparse[i;m]; /add point in case it is island
points:points except n];

Given these two functions that work on sparse adjacency matrixes. The final step is:

q)max {count select distinct c from x} each t allComponentsSparse sparseRes

Let’s unpack this a bit:

sparseRes is our table of links.

allComponentsSparse will return all of the connected components in this graph.

`s#0 28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 323 3..
`s#1 62 64 77 79 98 115 126 131 144 154 162 166 173 189 191 208 214 258 267 2..
`s#2 29 40 66 91 108 112 114 125 135 154 167 168 184 211 226 245 267 281 291 ..
`s#3 69 77 94 102 109 110 119 129 159 183 192 196 212 214 216 220 224 247 259..

apply the table t to this will give us a table that is indexed on the connected components


So for each connected component we are getting a table that has only those strings that match.
let’s look at the first one:

q)first t allComponentsSparse sparseRes
wsym w c
q ,”q” 1
aq “aq” 2
oq “oq” 2
qw “qw” 2
wq “wq” 2
jq “jq” 2

We can see immediately a problem, we only need one 2 letter word but we are getting every possible match. Since we are only interested in the length we can simply count the number of distinct c.

q)count select distinct c from first t allComponentsSparse sparseRes

So we see that the first connected component has 4 links in the chain.

Doing this for each connected component and then selecting the max ensures that we have found the length of the longest chain.

q)max {count select distinct c from x} each t allComponentsSparse sparseRes

That completes the first method.

Here is the whole function together

hwords:distinct words iasc count each words;
t:([]wsym:`u#`$hwords;w:hwords;c:count each hwords);
allDrops:{where each (til x) rotate\: 0b,(x-1)#1b};
drops:n!allDrops'[n:1+til max t[`c]];
sparseRes:ungroup `col xkey update col:i, row:t[`wsym]?`$(w@’drops c) from t;
sparseRes:delete from sparseRes where row=count t;
max {count select distinct c from x} each t allComponentsSparse sparseRes}

neighbors: exec col from m where row in j;
f:{n:exec col from y where row in .[_;x]; /new neighbors
x[0]:count x[1];x[1]:distinct x[1],n; /update the two pieces of x
x}[;m]; /project this function on the sparse matrix
last f over (0; neighbors)};

points:til max m[`row];
while[count points;
i:first points;
connected,:enlist n:`s#n:distinct asc i,findConnectedSparse[i;m]; /add point in case it is island
points:points except n];

Now for method 2, instead of relying on the previous algorithm to find all the connected components, we just want to keep track of how deep each chain is.

We can do this by creating a new table q which only has columns, row and col and is grouped by row. Since we created the original links by going from columns to rows, we know that the links travel up from shorter strings(row) to longer ones(col)

q)show q:select col by row from sparseRes
row| col ..
—| ————————————————————————..
0 | 28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 323 3..
1 | 62 64 77 79 98 115 126 131 144 154 162 166 173 189 191 208 214 258 267 2..
2 | 29 40 66 91 108 112 114 125 135 154 167 168 168 184 211 226 245 267 281 ..
3 | 69 69 77 94 102 109 110 119 129 159 183 192 196 212 214 216 220 224 247 ..
4 | 38 48 60 74 86 92 95 115 123 131 132 135 145 184 195 197 213 238 246 249..
5 | 43 49 119 124 137 142 143 180 219 226 231 239 246 277 289 303 308 319 32..
6 | 28 31 72 73 88 100 104 126 141 145 161 175 190 198 235 240 248 253 289 3..

Now since this table is keyed on the row column, we can index into the table using a index table that has the row numbers and keep doing this until there are no more rows to return. The number of times we can index before we return nothing, is the length of the chain from that row.

First lets demonstrate indexing:

q)indexTable:([]row:0 1)

q)q ([]row:0 1)
28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 323 342 36..
62 64 77 79 98 115 126 131 144 154 162 166 173 189 191 208 214 258 267 272 27..

We get the two rows that match from q.

If we flatten this list and rename this column row. We can reindex into q,

q)q select row:raze col from q ([]row:0 1)
452 549 634
391 391
587 786 803
395 493 759
524 629
704 834
613 823

We see the next level of neighbors, we can keep doing this. Again we can take advantage of over and just count how long before we return an empty table.

We will do this by creating a function that takes a dictionary and updates it, that way we can store both the intermediate results and the number of times we went down and all the nodes we visited.

the keys will be cur, which is the list of nodes we wish to visit next, depth, which is how deep we gone so far, and visited a list of all the nodes we saw. Then  we can write a function dive which will dive one level down into the table.

dive:{ $[count x[`cur]:select row:raze col from y x[`cur];
/index the table on the current neighbors we are exploring.
/We set cur to be the next level of neighbors
/if there are any new neighbors:
/we will add 1 to depth, and add the current new neighbors to visited and return x
[x[`depth]+:1;x[`visited],:exec row from x[`cur];
/otherwise we will return x

We then create a table of all the nodes we would like to visit with the initialized keys:

q)show n:update visited:count[n]#() from n:([]depth:0;cur:exec row from q);
depth cur visited
0 0
0 1
0 2
0 3
0 4

lets’ run dive on one the first of n:

q)first n
depth | 0
cur | 0
visited| ()
q)dive over first n
depth | 4
cur | +(,`row)!,()
visited| 28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 3..

We see that we get back a dictionary of all the nodes visited as well as, the maximum depth achieved. If we do this for all n, we can select the max depth and we are done.

q)exec max depth from (dive/)each n}

Notice, we are storing all the visited nodes, so we can reconstruct what the path actually was, but we are not actually using it. So the entire second version looks like this:

hwords:distinct words iasc count each words;
t:([]wsym:`u#`$hwords;w:hwords;c:count each hwords);
allDrops:{where each (til x) rotate\: 0b,(x-1)#1b};
drops:n!allDrops'[n:1+til max t[`c]];
sparseRes:ungroup `col xkey update col:i, row:t[`wsym]?`$(w@’drops c) from t;
sparseRes:delete from sparseRes where row=count t;
q:select col by row from sparseRes;
dive:{ $[count x[`cur]:select row:raze col from y x[`cur];
[x[`depth]+:1;x[`visited],:exec row from x[`cur];:x]
n:update visited:count[n]#() from n:([]depth:0;cur:exec row from q);
exec max depth from (dive/)each n}

On my computer, I was able to find the length of the longest chain for a list of million words in under 15 seconds, using this second version and in a minute using the first version. If I was using less than one hundred thousand words, then  the two versions were about the same.

The python code took about a minute for a million words, (without using anything fancy, I’m sure it can be improved by replacing pieces with numpy components or using the latest version of python 3.7. where dictionary look ups are faster).

The real benefit here from KDB, is that reconstructing the path is super straightforward since we have the path indexes and can use them to index into the word table.



Graph Algo, Finding Friends

This is a rather standard interview question:

Suppose that you have points and the points are connected by edges. Imagine that the points are numbered, then we can represent whether there exists a connection between two points by writing all the pairs of points.

We can also represent this graph, as matrix, where 1 means that there exists a connection between i and j and 0 otherwise. We call this matrix the adjacency matrix since it tells us which points are connected.adjacencymatrix_1002

Given such a matrix, we want to find all the connected components. A connected component is either just one node if it has no connections or all the nodes that can be reached by traveling along the edges.

In the picture above, each of these graphs has only one connected component, but together there are three distinct connected components, since there is no way to reach from the first graph to the second by traveling along an edge.

First I will show my original solution in KDB. Then a slight improvement.
First note that the matrix must be symmetric, since if point 1 is connected to point 2, then point 2 is connected to point 1.

I wanted to write a function that would get all the points that were connected to a single point.

If I think of a particular row in this matrix, then all the 1s represent the neighbors of this point. If I then find all of their neighbors, and so on, I will have all the points that can possibly be reached.

First imagine I have a matrix, a:
To create it, I first create 100 zeros (100#0)
Then I pick six places to add connections, (6?100)
I replace the zeros with 1s.
I add this matrix to itself flipped, so that it is symmetric (a+flip a)
Divide by two so that any place that 1s overlapped are either 1 or .5, (.5*)
Then I cast it to an integer so that they are all 1s or 0s (`long$)

q)a:`long$.5*a+flip a:10 10#@[100#0;6?100;:;1]
0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0
0 1 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 1 0

Okay given a, I can find the neighbors of any of the 10 points.

0 0 0 0 0 1 1 0 0 0
q)where a[1]
5 6

So we get a list of points that are neighbors.

We need to create a function that will keep asking for neighbors and discard any duplicates.

We can find all the neighbors of the neighbors by selecting the neighbor rows of matrix  and then asking where for each row:

q)where each a 5 6

We get a list of neighbors.

We can then raze this list to flatten it:

q)raze where each a 5 6
1 1

Then we can ask for the distinct items:

q)distinct raze where each a 5 6

Now we would like to add this list to our original neighbors and check if we can find more neighbors:

q)distinct raze where each a 5 6 1
1 5 6

Running this again we get:

q)distinct raze where each 1 5 6
5 6 1

So there are no more neighbors to find, just the order changes and we will flip back and forth.

Anytime, we need to apply a function like this we can make use of the adverb in q called over. Over will apply the function until the results are the same for two successive iterations.

So my function to find all connected with comments looks like:

neighbors: where m[i]; /find the initial list of neighbors
f:{distinct raze x,where each y x}[;m]; /create the function where second argument     is fixed to be the adjacency matrix.
f over neighbors} /run the function over the initial neighbors and keep adding neighbors until there are no more to add.

/testing on a:
5 6 1

As expected this gives us all points connected to point 1, including point 1 itself.
To list all of the connected components we need to run this on each of the points:

q)findAllConnected[;a] each til count a
3 0
5 6 1
8 2 9
0 3
7 4
1 5 6
1 5 6
4 7
2 9 8
8 2 9


We get 10 rows specifying the connections. Now we need find the unique set.

The easiest way to do this is to sort each one and ask for the distinct lists:

q)distinct asc each findAllConnected[;a] each til count a
`s#0 3
`s#1 5 6
`s#2 8 9
`s#4 7

Which gives us 4 connected components.

We can look for the islands, by checking if there are any elements not on any of these lists. If so, they must unconnected to anything else.

Items not in the list are islands (only one node no connections)

q)connected:distinct asc each findAllConnected[;a] each til count a
q)islands:(til count a) except raze connected

q)count each connected /size of the components

We get an empty list, which means there are no islands to report.

Problem solved.

For those who have followed, you can see that we are actually doing a lot of duplicate work, since we keep asking for neighbors for points that we have already seen. To solve this we simply keep track of how many neighbors we have found so far and only query for the neighbors we have yet to see.

Instead of x being a list of neighbors, it is now a list with two items. The first is the number of neighbors found so far and the second is the list of neighbors.

We can use dot(.) application of drop(_) to remove the neighbors we have seen so far from the neighbors list. (.[_;x]):


neighbors: where m[i];

f:{n:raze where each y .[_;x]; /new neighbors

x[0]:count x[1];x[1]:distinct x[1],n; /update the two pieces of x

x}[;m]; /project this function on the matrix

last f over (0; neighbors)} /apply the function on the neighbors, with 0 currently visited

/running on a:

5 6 1

/large sparse matrix to test with

q)c:7h$.5*c+flip c:1000 1000#@[1000000#0;1000?1000000;:;1]

/Time both on my 2012 macbook
q)\t distinct asc each findAllConnected[;c] each til count c
q)\t distinct asc each findAllConnectedFast[;c] each til count c

A 10x improvement, the keen observer will notice that we don’t have to run the function on all the points in the first place.  We can go down the points and skip all the points that we have already seen. Here is what that looks like:


points:til count m;


while[count points;

i:first points;

connected,:enlist n:`s#n:distinct asc i,findAllConnectedFast[i;m]; /add point in case it is island

points:points except n];


q)allComponents a
`s#0 3
`s#1 5 6
`s#2 8 9
`s#4 7

q)\t allComponents c

This results in a 1000x increase speed up. Because we are only calling the function the minimum number of times.

Stable Sorting

Stable Sorting preserves the order of a column even after sorting on a second column.

KDB provides stable sorting by default. This is a great thing, and it allows you to easily implement grouped attributes on a table.

A table that looks like:

q)t:([]time:09:30:00+`time$til 9;sym:9#`AAPL`MSFT`IBM;px:100*9?1.0)
time sym px
09:30:00.000 AAPL 81.12026
09:30:00.001 MSFT 20.86614
09:30:00.002 IBM 99.07116
09:30:00.003 AAPL 57.94801
09:30:00.004 MSFT 90.29713
09:30:00.005 IBM 20.11578
09:30:00.006 AAPL 6.832366
09:30:00.007 MSFT 59.89167
09:30:00.008 IBM 4.881728

which is sorted by time and sort it by symbol so that all the symbols that are the same are together, while still preserving the time attribute.

q)`sym xasc t
time sym px
09:30:00.000 AAPL 81.12026
09:30:00.003 AAPL 57.94801
09:30:00.006 AAPL 6.832366
09:30:00.002 IBM 99.07116
09:30:00.005 IBM 20.11578
09:30:00.008 IBM 4.881728
09:30:00.001 MSFT 20.86614
09:30:00.004 MSFT 90.29713
09:30:00.007 MSFT 59.89167

While this seems pretty cool, I want to argue that this behavior is not only good, it is also simpler than the alternative.

In many languages, perl,java,python there is something called a comparator interface. Which usually asks the user to define a function

F(a,b) => if a comes before b;
if a comes after b:
if they are equal

The search then uses some algo to swap the items.

However, we can simplify this interface. We can call it swapporator, it only allows two options, swap, True or False. The items are presented in the order of the original list to the swap function. This will automatically be stable.