Tree Tables/Parent Vector representation of Dictionaries

Introduction:

http://archive.vector.org.uk/art10500340

Vector, the Journal of the British APL Association

TreeTables and the parent vector are magical things, and they are a great way to flatten deeply nested structures. In particular, I will use this representation of a dictionary in the next post to let you import q libraries under a different namespace, much like Python allows with “import x as y”.

In this post I will focus on how we can easily rewrite a dictionary into a flat table with 4 columns. Let me motivate this a bit first. Take a deeply nested dictionary:

a| `b`c!(2;`d`e!(6;`f`g!7 8))
b| `c`d!4 5
c| `d`e!(6;`f`g!7 8)

In JSON that might look like this:

{
  "a": {
    "b":2,
    "c": {
      "d":6,
      "e": {
        "f":7,
        "g":8
      }
    }
  },
  "b": {
    "c":4,
    "d":5
  },
  "c": {
    "d":6,
    "e": {
      "f":7,
      "g":8
    }
  }
}

As we can see, each key is a letter and each value is either another dictionary or a number. In q we can apply functions to these arbitrarily nested dictionaries, provided all the values conform. So, for example, if we call this dictionary d, we can compute 10+d:

a| `b`c!(12;`d`e!(16;`f`g!17 18))
b| `c`d!14 15
c| `d`e!(16;`f`g!17 18)

However, once your dictionaries stop conforming, things can get a bit hairy. As a simple example, let’s add a small dictionary that maps f to f under a new key f in d:

a| `b`c!(2;`d`e!(6;`f`g!7 8))
b| `c`d!4 5
c| `d`e!(6;`f`g!7 8)
f| (,`f)!,`f

Suddenly 10+d throws a type error, because we can’t add 10 to the non-numeric dictionary. We can always write logic to avoid those cases, but it would be simpler to pull out all the conforming values, add 10 to them, and then put them back into the original structure. A tabular tree representation helps with this. Here is the same dictionary in tabular form, which I will call treeTable (I have omitted some rows in the middle for ease of reading):

l p  c  d
---------
0 0  :: 1
1 0  `a 1
1 0  `b 1
1 0  `c 1
1 0  `f 1
2 1  `b 1
2 1  `c 1

4 15 8  0
5 21 7  0
5 22 8  0

The column c (child) contains every key and value, and column d tells you whether the row is a key with something below it (1) or a leaf node (0). Pulling out all the numeric children is as simple as:

numeric:6 7 8 9h
select c from treeTable where not d, (abs type each c) in numeric
c
-
2
4
5
6
6
7
8
7
8

This table can then be modified and, assuming we can convert it back into its dictionary form, we can then work with nested data easily.

How the Magic is done:

At this point you either believe this structure is useful or you believe it isn’t but you are still interested in understanding how this transformation happens.

Let’s look at the original treeTable and unpack the meaning of the columns.

The first column, l, is the level of the dictionary at which the key or value is located.
The second column, p, is the index of the parent row in this table. The root of the table is self-parenting, meaning that the 0th row is at the 0th level and its parent is 0 (keep this in mind, it will be useful in a second). Every element has a parent.
The third column, c, is the child: the item at this level of nesting. It is a key if there are more levels below it, otherwise it is the value at that level.
The fourth and final column, d, indicates whether or not this row is a dictionary. That is, whether the child should be treated as a key or a value.

To convert a dictionary into a treeTable we use breadth-first search, which is the purpose of the l column. We first define a primitive treeTable with only one row, the root.

l p  c  d
---------
0 0  :: 1

All treeTables have this row. If the thing we are trying to convert into a treeTable is not a dictionary but is instead SOMEKIND_OF_THING_THAT_IS_NOT_A_DICTIONARY, then there will be only one more row in this table.

l p  c  d
---------
0 0  :: 1
1 0 SOMEKIND_OF_THING_THAT_IS_NOT_A_DICTIONARY 0

We will then know we are done because there are no more dictionaries to unpack at the last level.

If instead we get a dictionary, we first record all the keys at that first level and return the table.

Our ability to find a value at a particular level is enabled through the use of the parent column in the treeTable. Suppose we have a simple dictionary:

{
  "d":6,
  "e": {
    "f":7,
    "g":8
  }
}

We can record this as the following two columns:

p c index
----------
0 :: 0 
0 `d 1 
0 `e 2 
2 `f 3 
2 `g 4 
1  6 5 
3  7 6 
4  8 7

The first row is the root, which is self-parenting. The next two rows are both top-level keys, so their parent is the root. f and g are both under e, so their parent is 2, but 6 is under d, so its parent is 1. The final two rows, 7 and 8, are under f and g respectively. To make this easier to see, I have added the virtual index column, which is always available.

If we want the parent of a particular row and we have the parent column, which I will call p:0 0 0 2 2 1 3 4

We can index into that row to get the parent. p 6 -> 3 which is f

We can repeat this until we get to the root. p 3 -> 2; p 2 -> 0; p 0 -> 0. To find the root in one step we can use kdb+’s built-in converge, which keeps applying until two consecutive results are the same. This is why it was so convenient for the root to be self-parenting. To see the path to the root, use scan instead of over.

p over 6 -> 0

p scan 6 -> 6 3 2 0. Scan keeps the starting row, so the ancestors of row 6 are 3 2 0. Now that we have a path, we just need the keys that correspond to it, which we get by indexing the child column with the path.

c 3 2 0 -> f e ::

We now can get the unique path to any element.
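
The walk up the parent vector can be tried directly in a q session. This is a minimal sketch that retypes the p and c columns of the small table above as plain lists; nothing here is part of toTreeTable itself.

```q
p:0 0 0 2 2 1 3 4            / parent row of every row, the root points at itself
c:(::;`d;`e;`f;`g;6;7;8)     / child column: the root, then keys, then leaf values

p 6                          / parent of row 6, which is 3
p over 6                     / keep stepping up until the result stops changing: 0
1_p scan 6                   / every step on the way up, minus the starting row: 3 2 0
c 1_p scan 6                 / the keys along that path: `f, `e and the root (::)
```

Because the root is its own parent, there is no need for a separate stopping condition.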

The next step is using the path to index into any level of the dictionary. This is accomplished with a helper function called getItems.

getItems is defined by composing the index-at-depth form of . (dot) with a function that reverses the path list and drops the root from it. If the path list happens to be only the root, it is passed through unchanged, so that we simply get back the original item.
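
To see what getItems does in isolation, here is a simplified sketch written as an ordinary lambda rather than as a composition. The name getItemsSketch and the dictionary small are illustrative only; the paths are given leaf-first, the way scan produces them.

```q
e:`f`g!7 8
small:`d`e!(6;e)

/ a path arrives leaf-first and ends in the root (::)
/ reverse it, drop the root, and index at depth with . (dot)
getItemsSketch:{[dict;path] $[path~enlist(::);dict;dict . 1_reverse path]}[small]

getItemsSketch (`f;`e;::)    / small[`e;`f], which is 7
getItemsSketch (`e;::)       / the inner dictionary `f`g!7 8
getItemsSketch enlist(::)    / only the root: the whole dictionary
```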

Using just those two ideas, we are able to construct the treeTable. The algorithm is to index one level at a time, each time recording whether the level contains dictionaries or not. If it contains no dictionaries we are done, and we return the same table twice in a row, which means that our function converges. Using breadth-first search we avoid any stack overflow issues that could happen with a recursive solution. Instead the function becomes tail recursive, meaning all the necessary ingredients to call the function again are returned as the output. That is why the first call returns the first row of a treeTable. Each call after that simply indexes deeper into the original dictionary to return more levels of the treeTable.
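
The convergence trick is easier to see on a toy function. In this sketch (the name step is made up for illustration) each call either makes progress or returns its input unchanged, and over stops as soon as the output repeats, in the same way that the treeTable builder stops once there are no more keys to expand.

```q
step:{$[x<5;x+1;x]}   / make progress until 5, then return the input unchanged
step over 0           / 5
step scan 0           / 0 1 2 3 4 5, every intermediate result
```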

The Code and An Example:

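toTreeTable:{[d]
 getItems:('[;] over (.[d;];{$[x~(enlist[::]);x;1_reverse x]}));   / reverse a leaf-first path, drop the root, index d at depth
 tTT:{[getItems;t]                                  / one breadth-first step, t is the table so far
 $[98h~type t;;t:([]l:(1#0);p:0;c:(::);d:1b)];      / first call: seed the self-parenting root row
 lev:last t[`l];                                    / deepest level recorded so far
 k:exec i from t where l=lev,d;                     / rows at that level that are keys
 $[count k;;:t];                                    / nothing left to expand: return t so that over converges
 paths:t[`c] (t[`p]\')k;                            / keys from each of those rows up to the root
 items:getItems each paths;                         / what each key points to
 id:where bd:99h=type each items;                   / which of the items are dictionaries
 p:raze (count each key each items[id])#'k[id];     / parent index, repeated once per child key
 c:raze key each items[id];                         / the child keys themselves
 df:count[p]#1b;                                    / flag them as keys
 lvl:count[p]#lev+1;
 id:where not bd;                                   / the remaining items are leaf values
 p:p,k[id];
 c:c,items[id];
 df:df,count[id]#0b;                                / flag them as values
 lvl:lvl,count[id]#lev+1;
 t upsert flip `l`p`c`d!(lvl;p;c;df)}[getItems];
 tTT over ()}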



/an example of a nested structure 


b:`c`d!4 5
e:`f`g!7 8
c:`d`e!(6;e)
a:`b`c!(2; c)
d:`a`b`c!(a;b;c)
q)toTreeTable d
l p c d
---------
0 0 :: 1
1 0 `a 1
1 0 `b 1
1 0 `c 1
2 1 `b 1
2 1 `c 1
2 2 `c 1
2 2 `d 1
2 3 `d 1
2 3 `e 1
3 5 `d 1
3 5 `e 1
3 9 `f 1
3 9 `g 1
3 4 2 0
3 6 4 0
3 7 5 0
3 8 6 0
4 11 `f 1
4 11 `g 1
4 10 6 0
4 12 7 0
4 13 8 0
5 18 7 0
5 19 8 0

 

And Back Again!

Now that we have covered how to get a treeTable, we can also understand how to go back to a dictionary.

We apply the opposite approach. The core function returns a dictionary. Each time we return a dictionary that is slightly deeper than the previous time. We put placeholder empty dictionaries until we build the final result. Since we know whether each row is a key or a value, we know whether the current item requires a placeholder.

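toDictFromTreeTable:{[tt]
 tD:{[tt;dSoFar;lev]                                                   / rebuild one level
 dS:exec {x!count[x]#enlist[()!()]}[c] by p from tt where l=lev, d;    / per parent row: its keys with empty placeholder dictionaries
 dS:dS,.[!; value exec p,c from tt where l=lev, not d];                / per parent row: its leaf value
 pR:tt[`c](-1_|:) each (tt[`p]\')[key dS];                             / keys of the ancestors of each parent row, root first
 pC:tt[`c]key dS;                                                      / the key held by each parent row itself
 paths:raze each {(1_x;enlist[y])}'[pR;pC];                            / full key path to each parent row, root dropped
 $[lev>1;.[;;:;]/[dSoFar;paths;value dS];first value dS]}[tt];         / assign at depth, level 1 is the top dictionary
 tD/[()!();1+til last tt[`l]]}                                         / fold over the levels from 1 to the deepest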
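
With both directions available, the motivating example from the start of the post might look like the sketch below. It assumes toTreeTable and toDictFromTreeTable are loaded exactly as defined above; the names tt, idx, tt2 and d2 are illustrative. Whether list types survive the round trip exactly is not shown here, so it is worth comparing d2 with 10+d yourself.

```q
b:`c`d!4 5
e:`f`g!7 8
c:`d`e!(6;e)
a:`b`c!(2;c)
d:`a`b`c!(a;b;c)

tt:toTreeTable d
numeric:6 7 8 9h
idx:exec i from tt where not d, (abs type each c) in numeric   / rows holding numeric leaves
tt2:update c:@[c;idx;10+] from tt                              / add 10 to those leaves only
d2:toDictFromTreeTable tt2                                     / rebuild the nested dictionary
```

Inside the queries c and d refer to the treeTable columns, not to the global dictionaries of the same name, and idx is deliberately not called i because i is the virtual row-index column.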

Wow This is Even More General Than We Thought:

When I first built this, I tried to make sure that I covered simple dictionaries and values. So I was curious what would happen to keyed tables. Keyed tables are special in that they are essentially dictionaries whose keys and values are themselves dictionaries. A dictionary is a pair of lists, and a list of dictionaries is a table, so a keyed table is simply a dictionary whose key is a table and whose value is a table. A trivial example to illustrate this point:

q)k:([]k:til 5)
k
-
0
1
2
3
4
q)v:([]v:10*til 5)
v 
--
0 
10
20
30
40
q)kv:k!v
k| v 
-| --
0| 0 
1| 10
2| 20
3| 30
4| 40
/Indexing against the key table returns the value table
q)kv[k]
v 
--
0 
10
20
30
40
/but we can also select only certain rows using the k column
/in this case I reverse the key table and take the first 2 rows.
q)kv[2#reverse k]
v 
--
40
30
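
The same claim can be checked with key and value, which split a keyed table into its two halves. A small sketch reusing k, v and kv from the session above:

```q
k:([]k:til 5)
v:([]v:10*til 5)
kv:k!v

type kv        / 99h, the same type as any other dictionary
key kv         / the key table k
value kv       / the value table v
first k        / one key record: the dictionary (enlist`k)!enlist 0
kv first k     / its value record: the dictionary (enlist`v)!enlist 0
```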


Now what happens if we turn a keyed table into a treeTable:

l p c d
---------------
0 0 :: 1
1 0 (,`k)!,0 1
1 0 (,`k)!,1 1
1 0 (,`k)!,2 1
1 0 (,`k)!,3 1
1 0 (,`k)!,4 1
2 1 `v 1
2 2 `v 1
2 3 `v 1
2 4 `v 1
2 5 `v 1
3 6 0 0
3 7 10 0
3 8 20 0
3 9 30 0
3 10 40 0

It converts the key part of the table into key dictionaries that are the parents of the value dictionaries in the table. And we can turn it back:

q)toDictFromTreeTable toTreeTable kv
k| v 
-| --
0| 0 
1| 10
2| 20
3| 30
4| 40
q)kv ~toDictFromTreeTable toTreeTable kv
1b

In other words, keyed tables are treated like dictionaries. This means that if you only want to look at values, you will only see values, simply by selecting from the treeTable where not d. The dictionaries inside a keyed table are broken apart into their components, and the values are stored independently.

Tables Get Treated as singletons.

Since tables are actually lists of dictionaries, and lists are treated as values, a table is also treated as a value and placed directly into the child column.

q)t:([] til 10)
x
-
0
1
2
3
4
5
6
7
8
9
q)toTreeTable t
l p c d
---------------------------------
0 0 :: 1
1 0 +(,`x)!,0 1 2 3 4 5 6 7 8 9 0
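
Why the table ends up as a single leaf can be seen from its type: toTreeTable only expands items whose type is 99h (dictionary), and a table is type 98h. A short sketch, with the name t3 chosen for illustration:

```q
t3:([] x:til 3)
type t3                              / 98h: a table, not a dictionary
type first t3                        / 99h: each row is a dictionary
t3~flip (enlist`x)!enlist til 3      / 1b: a table is a flipped column dictionary
99h=type each (t3;first t3)          / 01b: only the row would be expanded
```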

toDictFromTreeTable correctly undoes toTreeTable, but the treeTable form is actually more nested than the original table.

q)toDictFromTreeTable toTreeTable t
x
-
0
1
2
3
4
5
6
7
8
9

We can fix this by expanding our parent vector to notate whether a current element is a dictionary, list or an atom. That way we would create a node that is the head of every list and then iterate through the indexes in the list. This is left as an exercise, or until I need this functionality.
