This is a variation on the word ladder puzzle. Here is the puzzle:
You are presented with a set of strings (assume they are unique). Each string can be of arbitrary length. Your goal is to find the length of the longest chain.
A chain can start from any string. Each following link in the chain must be exactly one letter shorter than the previous link: any character at any position in the previous link can be dropped, and the resulting link must appear in the given set of words.
For example, if you are given the set of strings ("a","ab","abc","abdc","babdc","dd","ded")
Both
"ded" -> "dd"
"babdc" -> "abdc" -> "abc" -> "ab" -> "a"
are valid chains. The longest chain’s length is 5.
First, I will present Python code that my friend wrote; it is very concise and has a kdb flavor (at least in my mind).
Python Code:
def word_list(array):
    # Storing the solutions for each word in a dictionary
    dict_array = {word: 0 for word in array}
    # Sorting the word list by lengths
    array.sort(key=lambda s: len(s))
    # Finding solutions to each word by using the previous solutions
    for word in array:
        options = [0]
        for i in range(len(word)):
            word_minus_i = "".join([word[k] for k in range(len(word)) if k != i])
            # check if the word appears in the list of words
            if word_minus_i in dict_array:
                # If it does, add a link to that word
                options.append(dict_array[word_minus_i] + 1)
        dict_array[word] = max(options)
    return max(dict_array.values()) + 1
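As a quick sanity check, here is the function run on the example set from the top of the post (a minimal usage sketch):

words = ["a", "ab", "abc", "abdc", "babdc", "dd", "ded"]
print(word_list(words))  # 5, from "babdc" -> "abdc" -> "abc" -> "ab" -> "a"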
Now to a KDB version:
I had two versions of the code; both rely on the same insight as the Python code.
The difference is in the post-processing.
The key insight into this problem is that if you hash all the strings, then for every possible one-letter drop you can check whether that string is in the legal set of strings; if it is, you can record which longer string it maps to.
At that point you have a list of every connection between two strings. Since checking the hash takes constant time, and each string of n letters generates only n one-letter-drop candidates, this processing takes roughly linear time in the total size of the input.
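To make the cost concrete: since each word of n letters generates exactly n candidates, the total number of candidates is just the sum of the word lengths. A quick way to see that figure (using the words variable generated below):
q)sum count each words /total one-letter-drop candidates across all words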
At that point, we have a graph, and at first I wanted to reuse the code from the previous post: generate all the connected components, then use those nodes to figure out how many different length strings there were. The connected component with the largest number of distinct string lengths contains the longest chain, and the number of distinct lengths is the total chain length. (This code is 3-6x faster than the Python for small inputs (10,000 or 100,000 words), but is about the same speed once we get up to a set of 1 million words.)
The second version uses a quicker technique: it just calculates the depth of the graph from every node.
Let’s start from the first version:
Let's generate 10,000 strings (.Q.a is "abcdefghijklmnopqrstuvwxyz").
We want strings of length between 1 and 20 (1+til 20).
We want ten thousand strings, so we take 10000# of the list 1 2 3 .. 20.
We then sample that number of letters from the alphabet; (3?.Q.a) gives us a 3-letter string.
We do this for each right (\:) number in the list.
We only want the distinct words.
q)words:distinct words:(10000#1+til 20)?\:.Q.a
Here we are sampling 10 of these words:
q)10?words
"uqoakdinpzdsw"
"vzahb"
"hhotqkijpkyso"
"adbtny"
"iotrggqgxvkqmnxqaxp"
"rcgnpgotlzastbv"
"idptfuivvtz"
So we need a function that will generate the indexes of all the possible one-letter drops of a word.
The most intuitive way I know to do this is to create an equal-length boolean list with one 0 and rotate that 0 around the whole string. For example, in the three-letter word case we have:
011b
101b
110b
We then find the indexes where there is a 1, and that gives us all the one-letter-drop index lists.
/0b,(x-1)#1b creates a boolean list of length x with a single 0
/rotate\: says to rotate this list for each right argument (til x, the range 0 to x-1)
/where finds the indexes that are 1
allDrops:{where each (til x) rotate\: 0b,(x-1)#1b}
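A quick check of allDrops on the three-letter case reproduces the index lists described above:
q)allDrops 3
1 2
0 1
0 2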
Let's do a little processing on the list of strings and put them into a table:
/sort the words by length and keep only the distinct words in hwords
q)hwords:distinct words iasc count each words
/create a table with the distinct words, the length of each word, and a symbolic version of the word for fast lookup, with a unique attribute (`u#) so that KDB knows to hash this list
q)show t:([]wsym:`u#`$hwords;w:hwords;c:count each hwords);
wsym w    c
-----------
s    ,"s" 1
t    ,"t" 1
j    ,"j" 1
n    ,"n" 1
/wsym contains a symbolic version of the word
/w is just the original word
/c is the length of the word
We then want to populate a dictionary with all the index lists, so we select the distinct word lengths and run them through the allDrops function:
q)drops:n!allDrops each n:exec distinct c from t /select the distinct counts, run the function on each, and key the counts to the results
q)drops
1 | ,`long$()
2 | (,1;,0)
3 | (1 2;0 1;0 2)
4 | (1 2 3;0 1 2;0 1 3;0 2 3)
5 | (1 2 3 4;0 1 2 3;0 1 2 4;0 1 3 4;0 2 3 4)
…
We do this mostly because we know that we will need these indexes many times, but there are relatively few distinct lengths of words.
Now we calculate all the links:
/drops c will give us a list of lists of all the one-letter-drop index lists for each word.
/To show what this does, let's select by c so that we can see the result for each length
q)select d:drops first c by c from t
c | d ..
--| -------------------------------------------------------------------------..
1 | ,`long$() ..
2 | (,1;,0) ..
3 | (1 2;0 1;0 2) ..
4 | (1 2 3;0 1 2;0 1 3;0 2 3)
Because indexing in q preserves the shape of the index argument, we just need to tell it that each word should be indexed by all of its drop-index lists. This is done with @', which pairs off each word with its list of index lists. Here is what that looks like, again grouping by c so that we can see the result for each length:
q)select by c from update d:(w)@'drops c from t
c | wsym w      d ..
--| -------------------------------------------------------------------------..
1 | x    ,"x"   ,"" ..
2 | cw   "cw"   (,"w";,"c") ..
3 | fvw  "fvw"  ("vw";"fv";"fw") ..
4 | yoku "yoku" ("oku";"yok";"you";"yku")
We see that one-letter words become empty strings, two-letter words become a list of one-letter words, and three-letter strings become a list of two-letter words.
Now we just need to find where these words are in our original table. If they are not there, they will get the index one past the last (that is, count t), which is how q lets you know that a value is missing in a lookup. We will also cast all these strings to symbols (`$) so that we can do faster lookups:
/col is the index of the original string and row holds all the matches. We are modeling the graph using an adjacency matrix idea.
q)show sparseRes:update col:i, row:t[`wsym]?`$(w@'drops c) from t
wsym w    c col row
-------------------
h    ,"h" 1 0   864593
r    ,"r" 1 1   864593
f    ,"f" 1 2   864593
w    ,"w" 1 3   864593
/We see that no one-letter string has any matches, which is why all of their indexes point to the end of the table. This makes sense, since a one-letter string can match nothing but the empty string, and all of our strings have at least one character
/Again selecting by c so we can see a variety of lengths,
q)select by c from update col:i, row:t[`wsym]?`$(w@'drops c) from t
c | wsym  w       col    row ..
--| -------------------------------------------------------------------------..
1 | x     ,"x"    25     ,864593 ..
2 | cw    "cw"    701    3 11 ..
3 | fvw   "fvw"   17269  541 275 357 ..
4 | yoku  "yoku"  64705  864593 7279 2368 3524 ..
5 | vvbud "vvbud" 114598 864593 46501 864593
As we can see, the row column is a list of matches that also includes false matches pointing to the end of the table; we would like to clean that up. We can do that using ungroup.
Ungroup will take a keyed table and generate a row for each item in a nested column.
In our case it will create a (col;row) record for each item in the row column.
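Here is a minimal illustration of ungroup on a toy table (hypothetical data, just to show the mechanics):
q)ungroup ([]col:0 1;row:(1 2;enlist 3))
col row
-------
0   1
0   2
1   3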
q)show sparseRes:ungroup `col xkey sparseRes
col wsym w c row
----------------
0   h    h 1 864593
1   r    r 1 864593
2   f    f 1 864593
Now we want to delete all of the fake links, that is, every row equal to the count of the original table:
q)show sparseRes:delete from sparseRes where row=count t
col wsym w c row
----------------
26 bj b 2 2
26 bj j 2 10
27 oj o 2 2
27 oj j 2 12
At this point we are done processing the data. We have found all the connections between the words; all that is left is to find the connected components, which we know how to do from the previous post. I use a slightly different function that calculates the connected components on a sparse matrix, represented as a table with columns row and col.
findConnectedSparse:{[j;m]
 neighbors: exec col from m where row in j; /now we are searching a table instead of a matrix
 f:{n:exec col from y where row in .[_;x]; /.[_;x] drops the already-explored prefix, so we only look up new neighbors
 x[0]:count x[1];x[1]:distinct x[1],n; /update the two pieces of x: explored count and accumulated neighbors
 x}[;m]; /project this function on the sparse matrix
 last f over (0; neighbors)};
allComponentsSparse:{[m]
 points:til max m[`row]; /all candidate starting points
 connected:();
 while[count points;
 i:first points;
 connected,:enlist n:`s#n:distinct asc i,findConnectedSparse[i;m]; /add the point itself in case it is an island
 points:points except n];
 connected}
Given these two functions that work on sparse adjacency matrices, the final step is:
q)max {count select distinct c from x} each t allComponentsSparse sparseRes
Let’s unpack this a bit:
sparseRes is our table of links.
allComponentsSparse will return all of the connected components in this graph.
`s#0 28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 323 3..
`s#1 62 64 77 79 98 115 126 131 144 154 162 166 173 189 191 208 214 258 267 2..
`s#2 29 40 66 91 108 112 114 125 135 154 167 168 184 211 226 245 267 281 291 ..
`s#3 69 77 94 102 109 110 119 129 159 183 192 196 212 214 216 220 224 247 259..
…
Applying the table t to these gives us, for each connected component, a table indexed on that component's nodes:
+`wsym`w`c!(`q`aq`oq`qw`wq`jq`tq`yq`qv`nq`xq`qa`zq`qn`qc`qj`qf`qd`bq`qb`lq`dq..
+`wsym`w`c!(`b`fb`bg`jb`zb`hb`cb`ba`bc`be`bm`vb`kb`xb`bp`bu`by`bj`ub`mb`pb`bn..
+`wsym`w`c!(`m`fm`mz`mh`mx`dm`xm`md`me`mc`bm`nm`mm`cm`pm`mi`ml`mb`ms`rm`jm`ym..
+`wsym`w`c!(`j`jj`jb`jq`jt`jg`jd`ji`ej`jy`uj`jz`gj`jv`bj`rj`qj`je`jn`xj`lj`hj..
So for each connected component we get a table that has only the strings belonging to it.
Let's look at the first one:
q)first t allComponentsSparse sparseRes
wsym w    c
-----------
q    ,"q" 1
aq   "aq" 2
oq   "oq" 2
qw   "qw" 2
wq   "wq" 2
jq   "jq" 2
We can immediately see a problem: we only need one two-letter word, but we are getting every possible match. Since we are only interested in the length, we can simply count the number of distinct c.
q)count select distinct c from first t allComponentsSparse sparseRes
4
So we see that the first connected component contains a chain of length 4.
Doing this for each connected component and then selecting the max ensures that we have found the length of the longest chain.
q)max {count select distinct c from x} each t allComponentsSparse sparseRes
5
That completes the first method.
Here is the whole function together:
longestChain:{[words]
hwords:distinct words iasc count each words;
t:([]wsym:`u#`$hwords;w:hwords;c:count each hwords);
allDrops:{where each (til x) rotate\: 0b,(x-1)#1b};
drops:n!allDrops'[n:1+til max t[`c]];
sparseRes:ungroup `col xkey update col:i, row:t[`wsym]?`$(w@'drops c) from t;
sparseRes:delete from sparseRes where row=count t;
max {count select distinct c from x} each t allComponentsSparse sparseRes}
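As a sanity check, running it on this session's word list (with findConnectedSparse and allComponentsSparse, listed below, already defined) reproduces the answer we computed step by step:
q)longestChain words
5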
findConnectedSparse:{[j;m]
neighbors: exec col from m where row in j;
f:{n:exec col from y where row in .[_;x]; /new neighbors
x[0]:count x[1];x[1]:distinct x[1],n; /update the two pieces of x
x}[;m]; /project this function on the sparse matrix
last f over (0; neighbors)};
allComponentsSparse:{[m]
points:til max m[`row];
connected:();
while[count points;
i:first points;
connected,:enlist n:`s#n:distinct asc i,findConnectedSparse[i;m]; /add point in case it is island
points:points except n];
connected}
Now for method 2. Instead of relying on the previous algorithm to find all the connected components, we just keep track of how deep each chain goes.
We can do this by creating a new table q, which has only the columns row and col and is grouped by row. Since we created the original links by going from columns to rows, we know that the links travel up from shorter strings (row) to longer ones (col):
q)show q:select col by row from sparseRes
row| col ..
---| ------------------------------------------------------------------------..
0 | 28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 323 3..
1 | 62 64 77 79 98 115 126 131 144 154 162 166 173 189 191 208 214 258 267 2..
2 | 29 40 66 91 108 112 114 125 135 154 167 168 168 184 211 226 245 267 281 ..
3 | 69 69 77 94 102 109 110 119 129 159 183 192 196 212 214 216 220 224 247 ..
4 | 38 48 60 74 86 92 95 115 123 131 132 135 145 184 195 197 213 238 246 249..
5 | 43 49 119 124 137 142 143 180 219 226 231 239 246 277 289 303 308 319 32..
6 | 28 31 72 73 88 100 104 126 141 145 161 175 190 198 235 240 248 253 289 3..
Now, since this table is keyed on the row column, we can index into it using an index table that has the row numbers, and keep doing this until there are no more rows to return. The number of times we can index before we return nothing is the length of the chain from that row.
First, let's demonstrate indexing:
q)indexTable:([]row:0 1)
row
---
0
1
q)q ([]row:0 1)
col
-------------------------------------------------------------------------------
28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 323 342 36..
62 64 77 79 98 115 126 131 144 154 162 166 173 189 191 208 214 258 267 272 27..
We get the two rows that match from q.
If we flatten this list and rename the column row, we can reindex into q:
q)q select row:raze col from q ([]row:0 1)
col
-------------------
452 549 634
,493
391 391
587 786 803
395 493 759
524 629
704 834
613 823
We see the next level of neighbors, and we can keep doing this. Again, we can take advantage of over and just count how many times we can index before we return an empty table.
We will do this by creating a function that takes a dictionary and updates it; that way we can store both the intermediate results, the number of levels we went down, and all the nodes we visited.
The keys will be cur, the list of nodes we wish to visit next; depth, how deep we have gone so far; and visited, a list of all the nodes we have seen. Then we can write a function dive, which will dive one level down into the table.
dive:{ $[count x[`cur]:select row:raze col from y x[`cur];
 /index the table on the current neighbors we are exploring;
 /cur becomes the next level of neighbors.
 /If there are any new neighbors:
 /add 1 to depth, append the new neighbors to visited, and return x
 [x[`depth]+:1;x[`visited],:exec row from x[`cur];
 :x]
 /otherwise return x unchanged
 ;:x]}[;q]
We then create a table of all the nodes we would like to visit with the initialized keys:
q)show n:update visited:count[n]#() from n:([]depth:0;cur:exec row from q);
depth cur visited
-----------------
0 0
0 1
0 2
0 3
0 4
Let's run dive on the first row of n:
q)first n
depth | 0
cur | 0
visited| ()
q)dive over first n
depth | 4
cur | +(,`row)!,()
visited| 28 42 56 61 94 117 133 149 157 170 175 181 187 195 220 241 260 288 3..
We see that we get back a dictionary of all the nodes visited, as well as the maximum depth achieved. If we do this for all of n, we can select the max depth and we are done.
q)exec max depth from (dive/)each n
5
Notice that we are storing all the visited nodes, so we could reconstruct what the path actually was, though we are not actually using it here. So the entire second version looks like this:
longestChainFast:{[words]
hwords:distinct words iasc count each words;
t:([]wsym:`u#`$hwords;w:hwords;c:count each hwords);
allDrops:{where each (til x) rotate\: 0b,(x-1)#1b};
drops:n!allDrops'[n:1+til max t[`c]];
sparseRes:ungroup `col xkey update col:i, row:t[`wsym]?`$(w@'drops c) from t;
sparseRes:delete from sparseRes where row=count t;
q:select col by row from sparseRes;
dive:{ $[count x[`cur]:select row:raze col from y x[`cur];
[x[`depth]+:1;x[`visited],:exec row from x[`cur];:x]
;:x]}[;q];
n:update visited:count[n]#() from n:([]depth:0;cur:exec row from q);
exec max depth from (dive/)each n}
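Running it on the same session's word list agrees with the first version:
q)longestChainFast words
5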
On my computer, I was able to find the length of the longest chain for a list of a million words in under 15 seconds using this second version, and in about a minute using the first version. With fewer than one hundred thousand words, the two versions were about the same.
The Python code took about a minute for a million words (without using anything fancy; I'm sure it can be improved by replacing pieces with numpy components or by using Python 3.7, where dictionary lookups are faster).
The real benefit of KDB here is that reconstructing the path is straightforward, since we have the path indexes and can use them to index into the word table.