Backtracking In Q/ word ladder

This is a variation on the word ladder puzzle. Here is the puzzle:

You are presented with a set of strings (assume they are unique). Each string can be of arbitrary length. Your goal is to find the length of the longest chain.
A chain can start from any string. Each following link in the chain must be exactly one letter shorter than the previous link, any character from any position from the previous link can be dropped, this new link must appear in the set of words given.

For example, if you are given the set  of strings (“a”,”ab”,”abc”,”abdc”, “babdc”,”dd”, “ded”)

Both
“ded” -> “dd”
“babdc” -> “abdc” -> “abc” -> “ab” -> “a”

are valid chains. The longest chain’s length is 5.

First I will present python code that my friend wrote, it is very concise and has a kdb flavor (at least in my mind).

 

Python Code:

def word_list(array):
  # Storing the solutions for each word in a dictionary
  dict_array = {word:0 for word in array}

  # Sorting the word list by lengths
  array.sort(key = lambda s: len(s))

  # Finding solutions to each word by using the previous solutions
  for word in array:
    options = [0]
    for i in range(len(word)):
      word_minus_i = "".join([word[k] for k in range(len(word)) if k != i])
      #check if the word appears in list of words
      if word_minus_i in dict_array:
        # If it does add a link to that word
        options.append(dict_array[word_minus_i]+1)
    dict_array[word] = max(options)
  return max(dict_array.values())+1

Now to a KDB version:

I had two versions of the code both rely on the same insight as the python code

The difference is in the post processing.

The key insight into this problem is that if you hash all the strings. Then if you create every possible one letter drop you can check if that string is in the legal set of strings, if it is you can record which longer string it maps to.

At that point you have a list of every connection between two strings. Since checking the hash takes constant time and generating every 1 letter missing string takes 1- number of letters in the string* number of unique strings. This processing takes linear time.

At that point, we have a graph, and at first I wanted to reuse the code from the previous post. I would generate all the connected components, the use those nodes to figure out how many different length strings there were. The connected component with the largest number of distinct length strings is going to be the longest chain and the number of distinct lengths is the total chain length. (this code is [3-6]x faster for small inputs (10,000, 100,000 words) than the python but is about the same speed once we get up to a set of 1 million words)

The second version, uses a quicker technique to just calculate the depth of the graph from every node.

Let’s start from the first version:
Let’s generate 10,000 strings (.Q.a is “abcdefghijklmnopqrstuvwxyz”)
We want strings of length between 1 and 20 (1+til 20)
We want ten thousand strings, so we take 10000 of the list 1 2 3 .. 20
We then sample the number from the alphabet (3?.Q.a) gives us a 3 letter string
We do this for each right (\:) number in the list
We only want the distinct words

q)words:distinct words:(10000#1+til 20)?\:.Q.a

Here we are sampling 10 of these words:

q)10?words
“uqoakdinpzdsw”
“vzahb”
“hhotqkijpkyso”
“adbtny”
“iotrggqgxvkqmnxqaxp”
“rcgnpgotlzastbv”
“idptfuivvtz”

So we need a function that will generate the indexes of all the possible 1 letter drops of a word.
The most intuitive way I know to do this, is to create equal length boolean string with one 0 and rotate that 0 around the whole string. So for example in the 3 letter word case we have:
011b
101b
110b
We then find the indexes where there is a 1 and that would gives all the 1 missing indexes.

/0b,(x-1)#1b creates a string of length x with 1 0 value
/rotate\: says to rotate this string for each right argument (which is just the range from 0 to x)
/where finds the indexes that are 1
allDrops:{where each (til x) rotate\: 0b,(x-1)#1b}

Lets do a little processing on the list of strings, and put them into a table:

/sort the words by count of the number of letters and only store the distinct words in hwords

q)hwords:distinct words iasc count each words

/create a table, with the distinct words, the number of letters in the word, and a symbolic version of the word for fast lookup, with a unique attribute so that KDB knows to hash this list

q)show t:([]wsym:`u#`$hwords;w:hwords;c:count each hwords);
wsym w c
———–
s ,”s” 1
t ,”t” 1
j ,”j” 1
n ,”n” 1

/wsym contains a symbolic version of the word
/w is just the original word
/c is the length of the word

We then want to populate a dictionary with all the indexes, so we select the distinct counts of the words and run them through the allDrops function

q)drops:n!allDrops each n:exec distinct c from t /we select the distinct counts run the function and key the distinct counts to the result from the function.
q)drops
1 | ,`long$()
2 | (,1;,0)
3 | (1 2;0 1;0 2)
4 | (1 2 3;0 1 2;0 1 3;0 2 3)
5 | (1 2 3 4;0 1 2 3;0 1 2 4;0 1 3 4;0 2 3 4)

We do this mostly because we know that we will need these indexes many times, but there are relatively few distinct lengths of words.

Now we calculate all the links:

/drops c will give us a list of lists of all the 1 letter missing possible indexes for each word.
/To show what this does let’s select by c so that we can see what it does for each c
select d:drops first c by c from t
c | d ..
–| ————————————————————————-..
1 | ,`long$() ..
2 | (,1;,0) ..
3 | (1 2;0 1;0 2) ..
4 | (1 2 3;0 1 2;0 1 3;0 2 3)

Because q will apply a particular list on indexes and preserve the shape, we just need to tell it, that each word should be projected onto all of these lists. This is done with the @’ this will pair off each list of lists with the words. Here is what that looks like , again grouping by c so that we can see what that looks like:

q)select by c from update d:(w)@’drops c from t
c | wsym w d ..
–| ————————————————————————-..
1 | x ,”x” ,”” ..
2 | cw “cw” (,”w”;,”c”) ..
3 | fvw “fvw” (“vw”;”fv”;”fw”) ..
4 | yoku “yoku” (“oku”;”yok”;”you”;”yku”)

We see that one letter words, become empty strings, two letter words become a list of words each 1 letter long. three letter strings become a list of 2 letter words.
Now we just need to find where these words are in our original table. If they are not there they will get the 1+last index, which is how q lets you know that a value is missing in a lookup. We will also cast all these strings to symbols so that we can do faster lookups (`$):

/col is the index of the original string and row are all the matches. We are modeling the graph using an adjacency matrix idea.
q)show sparseRes:update col:i, row:t[`wsym]?`$(w@’drops c) from t
wsym w c col row
———————-
h ,”h” 1 0 864593
r ,”r” 1 1 864593
f ,”f” 1 2 864593
w ,”w” 1 3 864593
/We see that no 1 letter string has any matches which is why all of the indexes are last in the table, this makes sense since one letter strings can’t match anything but the empty string and all of our strings have at least one character
/Again selecting by c so we can see a variety of lengths,
q)select by c from update col:i, row:t[`wsym]?`$(w@’drops c) from t
c | wsym w col row ..
–| ————————————————————————-..
1 | x ,”x” 25 ,864593 ..
2 | cw “cw” 701 3 11 ..
3 | fvw “fvw” 17269 541 275 357 ..
4 | yoku “yoku” 64705 864593 7279 2368 3524 ..
5 | vvbud “vvbud” 114598 864593 46501 864593

As we can see the row column is a list of matches and also contains false matches that go to the end of the table we would like to clean that up. We can do that using ungroup.
Ungroup will take a keyed table, and generate a row for each item in the list of records.
In our case it will create a col, row  record for each item in the row column.

q)show sparseRes:ungroup `col xkey sparseRes
col wsym w c row
——————-
0 h h 1 864593
1 r r 1 864593
2 f f 1 864593

Now we want to delete all of the fake links, so that will be wherever the row is the length of the original table:

q) show sparseRes:delete from sparseRes where row=count t
col wsym w c row
—————-
26 bj b 2 2
26 bj j 2 10
27 oj o 2 2
27 oj j 2 12

At this point we are done processing the data.  We have found all the connections between the words, all that is left is to find the connected components, which we know how to do from the previous post. I use a slightly different function that calculates the connected components on a sparse matrix. That is represented as a table with columns row and col.

findConnectedSparse:{[j;m]
neighbors: exec col from m where row in j; /now we are searching a table instead of a matrix
f:{n:exec col from y where row in .[_;x]; /new neighbors
x[0]:count x[1];x[1]:distinct x[1],n; /update the two pieces of x
x}[;m]; /project this function on the sparseMatrix
last f over (0; neighbors)};

allComponentsSparse:{[m]
points:til max m[`row]; /
connected:();
while[count points;
i:first points;
connected,:enlist n:`s#n:distinct asc i,findConnectedSparse[i;m]; /add point in case it is island
points:points except n];
connected}

Given these two functions that work on sparse adjacency matrixes. The final step is:

q)max {count select distinct c from x} each t allComponentsSparse sparseRes

Let’s unpack this a bit:

sparseRes is our table of links.

allComponentsSparse will return all of the connected components in this graph.

`s#0 28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 323 3..
`s#1 62 64 77 79 98 115 126 131 144 154 162 166 173 189 191 208 214 258 267 2..
`s#2 29 40 66 91 108 112 114 125 135 154 167 168 184 211 226 245 267 281 291 ..
`s#3 69 77 94 102 109 110 119 129 159 183 192 196 212 214 216 220 224 247 259..

apply the table t to this will give us a table that is indexed on the connected components

+`wsym`w`c!(`q`aq`oq`qw`wq`jq`tq`yq`qv`nq`xq`qa`zq`qn`qc`qj`qf`qd`bq`qb`lq`dq..
+`wsym`w`c!(`b`fb`bg`jb`zb`hb`cb`ba`bc`be`bm`vb`kb`xb`bp`bu`by`bj`ub`mb`pb`bn..
+`wsym`w`c!(`m`fm`mz`mh`mx`dm`xm`md`me`mc`bm`nm`mm`cm`pm`mi`ml`mb`ms`rm`jm`ym..
+`wsym`w`c!(`j`jj`jb`jq`jt`jg`jd`ji`ej`jy`uj`jz`gj`jv`bj`rj`qj`je`jn`xj`lj`hj..

So for each connected component we are getting a table that has only those strings that match.
let’s look at the first one:

q)first t allComponentsSparse sparseRes
wsym w c
———–
q ,”q” 1
aq “aq” 2
oq “oq” 2
qw “qw” 2
wq “wq” 2
jq “jq” 2

We can see immediately a problem, we only need one 2 letter word but we are getting every possible match. Since we are only interested in the length we can simply count the number of distinct c.

q)count select distinct c from first t allComponentsSparse sparseRes
4

So we see that the first connected component has 4 links in the chain.

Doing this for each connected component and then selecting the max ensures that we have found the length of the longest chain.

q)max {count select distinct c from x} each t allComponentsSparse sparseRes
5

That completes the first method.

Here is the whole function together

longestChain:{[words]
hwords:distinct words iasc count each words;
t:([]wsym:`u#`$hwords;w:hwords;c:count each hwords);
allDrops:{where each (til x) rotate\: 0b,(x-1)#1b};
drops:n!allDrops'[n:1+til max t[`c]];
sparseRes:ungroup `col xkey update col:i, row:t[`wsym]?`$(w@’drops c) from t;
sparseRes:delete from sparseRes where row=count t;
max {count select distinct c from x} each t allComponentsSparse sparseRes}

findConnectedSparse:{[j;m]
neighbors: exec col from m where row in j;
f:{n:exec col from y where row in .[_;x]; /new neighbors
x[0]:count x[1];x[1]:distinct x[1],n; /update the two pieces of x
x}[;m]; /project this function on the sparse matrix
last f over (0; neighbors)};

allComponentsSparse:{[m]
points:til max m[`row];
connected:();
while[count points;
i:first points;
connected,:enlist n:`s#n:distinct asc i,findConnectedSparse[i;m]; /add point in case it is island
points:points except n];
connected}

Now for method 2, instead of relying on the previous algorithm to find all the connected components, we just want to keep track of how deep each chain is.

We can do this by creating a new table q which only has columns, row and col and is grouped by row. Since we created the original links by going from columns to rows, we know that the links travel up from shorter strings(row) to longer ones(col)

q)show q:select col by row from sparseRes
row| col ..
—| ————————————————————————..
0 | 28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 323 3..
1 | 62 64 77 79 98 115 126 131 144 154 162 166 173 189 191 208 214 258 267 2..
2 | 29 40 66 91 108 112 114 125 135 154 167 168 168 184 211 226 245 267 281 ..
3 | 69 69 77 94 102 109 110 119 129 159 183 192 196 212 214 216 220 224 247 ..
4 | 38 48 60 74 86 92 95 115 123 131 132 135 145 184 195 197 213 238 246 249..
5 | 43 49 119 124 137 142 143 180 219 226 231 239 246 277 289 303 308 319 32..
6 | 28 31 72 73 88 100 104 126 141 145 161 175 190 198 235 240 248 253 289 3..

Now since this table is keyed on the row column, we can index into the table using a index table that has the row numbers and keep doing this until there are no more rows to return. The number of times we can index before we return nothing, is the length of the chain from that row.

First lets demonstrate indexing:

q)indexTable:([]row:0 1)
row

0
1
q)q ([]row:0 1)
col
——————————————————————————-
28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 323 342 36..
62 64 77 79 98 115 126 131 144 154 162 166 173 189 191 208 214 258 267 272 27..

We get the two rows that match from q.

If we flatten this list and rename this column row. We can reindex into q,

q)q select row:raze col from q ([]row:0 1)
col
——————-
452 549 634
,493
391 391
587 786 803
395 493 759
524 629
704 834
613 823

We see the next level of neighbors, we can keep doing this. Again we can take advantage of over and just count how long before we return an empty table.

We will do this by creating a function that takes a dictionary and updates it, that way we can store both the intermediate results and the number of times we went down and all the nodes we visited.

the keys will be cur, which is the list of nodes we wish to visit next, depth, which is how deep we gone so far, and visited a list of all the nodes we saw. Then  we can write a function dive which will dive one level down into the table.

dive:{ $[count x[`cur]:select row:raze col from y x[`cur];
/index the table on the current neighbors we are exploring.
/We set cur to be the next level of neighbors
/if there are any new neighbors:
/we will add 1 to depth, and add the current new neighbors to visited and return x
[x[`depth]+:1;x[`visited],:exec row from x[`cur];
:x]
/otherwise we will return x
;:x]}[;q]

We then create a table of all the nodes we would like to visit with the initialized keys:

q)show n:update visited:count[n]#() from n:([]depth:0;cur:exec row from q);
depth cur visited
—————–
0 0
0 1
0 2
0 3
0 4

lets’ run dive on one the first of n:

q)first n
depth | 0
cur | 0
visited| ()
q)dive over first n
depth | 4
cur | +(,`row)!,()
visited| 28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 3..

We see that we get back a dictionary of all the nodes visited as well as, the maximum depth achieved. If we do this for all n, we can select the max depth and we are done.

q)exec max depth from (dive/)each n}
5

Notice, we are storing all the visited nodes, so we can reconstruct what the path actually was, but we are not actually using it. So the entire second version looks like this:

longestChainFast:{[words]
hwords:distinct words iasc count each words;
t:([]wsym:`u#`$hwords;w:hwords;c:count each hwords);
allDrops:{where each (til x) rotate\: 0b,(x-1)#1b};
drops:n!allDrops'[n:1+til max t[`c]];
sparseRes:ungroup `col xkey update col:i, row:t[`wsym]?`$(w@’drops c) from t;
sparseRes:delete from sparseRes where row=count t;
q:select col by row from sparseRes;
dive:{ $[count x[`cur]:select row:raze col from y x[`cur];
[x[`depth]+:1;x[`visited],:exec row from x[`cur];:x]
;:x]}[;q];
n:update visited:count[n]#() from n:([]depth:0;cur:exec row from q);
exec max depth from (dive/)each n}

On my computer, I was able to find the length of the longest chain for a list of million words in under 15 seconds, using this second version and in a minute using the first version. If I was using less than one hundred thousand words, then  the two versions were about the same.

The python code took about a minute for a million words, (without using anything fancy, I’m sure it can be improved by replacing pieces with numpy components or using the latest version of python 3.7. where dictionary look ups are faster).

The real benefit here from KDB, is that reconstructing the path is super straightforward since we have the path indexes and can use them to index into the word table.

 

 

Advertisements

Graph Algo, Finding Friends

This is a rather standard interview question:

Suppose that you have points and the points are connected by edges. Imagine that the points are numbered, then we can represent whether there exists a connection between two points by writing all the pairs of points.

(1,2)
(2,3)
(5,1)
(4,6)
We can also represent this graph, as matrix, where 1 means that there exists a connection between i and j and 0 otherwise. We call this matrix the adjacency matrix since it tells us which points are connected.adjacencymatrix_1002

Given such a matrix, we want to find all the connected components. A connected component is either just one node if it has no connections or all the nodes that can be reached by traveling along the edges.

In the picture above, each of these graphs has only one connected component, but together there are three distinct connected components, since there is no way to reach from the first graph to the second by traveling along an edge.

First I will show my original solution in KDB. Then a slight improvement.
First note that the matrix must be symmetric, since if point 1 is connected to point 2, then point 2 is connected to point 1.

I wanted to write a function that would get all the points that were connected to a single point.

If I think of a particular row in this matrix, then all the 1s represent the neighbors of this point. If I then find all of their neighbors, and so on, I will have all the points that can possibly be reached.

First imagine I have a matrix, a:
To create it, I first create 100 zeros (100#0)
Then I pick six places to add connections, (6?100)
I replace the zeros with 1s.
I add this matrix to itself flipped, so that it is symmetric (a+flip a)
Divide by two so that any place that 1s overlapped are either 1 or .5, (.5*)
Then I cast it to an integer so that they are all 1s or 0s (`long$)

q)a:`long$.5*a+flip a:10 10#@[100#0;6?100;:;1]
q)a
0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0
0 1 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 1 0

Okay given a, I can find the neighbors of any of the 10 points.

q)a[1]
0 0 0 0 0 1 1 0 0 0
q)where a[1]
5 6

So we get a list of points that are neighbors.

We need to create a function that will keep asking for neighbors and discard any duplicates.

We can find all the neighbors of the neighbors by selecting the neighbor rows of matrix  and then asking where for each row:

q)where each a 5 6
1
1

We get a list of neighbors.

We can then raze this list to flatten it:

q)raze where each a 5 6
1 1

Then we can ask for the distinct items:

q)distinct raze where each a 5 6
,1

Now we would like to add this list to our original neighbors and check if we can find more neighbors:

q)distinct raze where each a 5 6 1
1 5 6

Running this again we get:

q)distinct raze where each 1 5 6
5 6 1

So there are no more neighbors to find, just the order changes and we will flip back and forth.

Anytime, we need to apply a function like this we can make use of the adverb in q called over. Over will apply the function until the results are the same for two successive iterations.

So my function to find all connected with comments looks like:

findAllConnected:{[i;m]
neighbors: where m[i]; /find the initial list of neighbors
f:{distinct raze x,where each y x}[;m]; /create the function where second argument     is fixed to be the adjacency matrix.
f over neighbors} /run the function over the initial neighbors and keep adding neighbors until there are no more to add.

/testing on a:
q)findAllConnected[1;a]
5 6 1

As expected this gives us all points connected to point 1, including point 1 itself.
To list all of the connected components we need to run this on each of the points:

q)findAllConnected[;a] each til count a
3 0
5 6 1
8 2 9
0 3
7 4
1 5 6
1 5 6
4 7
2 9 8
8 2 9

 

We get 10 rows specifying the connections. Now we need find the unique set.

The easiest way to do this is to sort each one and ask for the distinct lists:

q)distinct asc each findAllConnected[;a] each til count a
`s#0 3
`s#1 5 6
`s#2 8 9
`s#4 7

Which gives us 4 connected components.

We can look for the islands, by checking if there are any elements not on any of these lists. If so, they must unconnected to anything else.

Items not in the list are islands (only one node no connections)

q)connected:distinct asc each findAllConnected[;a] each til count a
q)islands:(til count a) except raze connected
`long$()

q)count each connected /size of the components

We get an empty list, which means there are no islands to report.

Problem solved.

For those who have followed, you can see that we are actually doing a lot of duplicate work, since we keep asking for neighbors for points that we have already seen. To solve this we simply keep track of how many neighbors we have found so far and only query for the neighbors we have yet to see.

Instead of x being a list of neighbors, it is now a list with two items. The first is the number of neighbors found so far and the second is the list of neighbors.

We can use dot(.) application of drop(_) to remove the neighbors we have seen so far from the neighbors list. (.[_;x]):

findAllConnectedFast:{[i;m]

neighbors: where m[i];

f:{n:raze where each y .[_;x]; /new neighbors

x[0]:count x[1];x[1]:distinct x[1],n; /update the two pieces of x

x}[;m]; /project this function on the matrix

last f over (0; neighbors)} /apply the function on the neighbors, with 0 currently visited

/running on a:

q)findAllConnectedFast[1;a]
5 6 1

/large sparse matrix to test with

q)c:7h$.5*c+flip c:1000 1000#@[1000000#0;1000?1000000;:;1]

/Time both on my 2012 macbook
q)\t distinct asc each findAllConnected[;c] each til count c
11960
q)\t distinct asc each findAllConnectedFast[;c] each til count c
1170

A 10x improvement, the keen observer will notice that we don’t have to run the function on all the points in the first place.  We can go down the points and skip all the points that we have already seen. Here is what that looks like:

allComponents:{[m]

points:til count m;

connected:();

while[count points;

i:first points;

connected,:enlist n:`s#n:distinct asc i,findAllConnectedFast[i;m]; /add point in case it is island

points:points except n];

connected}

q)allComponents a
`s#0 3
`s#1 5 6
`s#2 8 9
`s#4 7

q)\t allComponents c
6

This results in a 1000x increase speed up. Because we are only calling the function the minimum number of times.