K for Good Thought: [Part 3] k,q,kdb tutorial adverbs, operators

Now we come to one of the most innovative features in APL family of languages. That is the language feature of operators.

An operator is similar to a verb (function) but unlike a verb which produces a noun an operator always produces a verb. It’s functions is very similar to an adverb in English.

OED

Adverb
A word or phrase that modifies the meaning of an adjective, verb, or other adverb, expressing manner, place, time, or degree (e.g. gently, here, now, very). Some adverbs, for example sentence adverbs, can also be used to modify whole sentences.

I will from now on refer to operators as adverbs so that our understanding can be enhanced by our natural understanding of language. An adverb serves to help us understand how a verb will modify a noun.

For example, in the previous tutorial we saw that we can create a verb like blend which applies to some arguments. Let’s simply define blend as taking two nouns and squashing them together. For instance the blend of cat and dog is simply catdog. In fact, we have seen this type of operation earlier with the comma(,) which allowed us to append the right argument to the left argument.

q)5# reverse til 15
14 13 12 11 10

q)list1: til 10
q)list2: 5# reverse til 15
q)list3:list1,list2
q)list3
0 1 2 3 4 5 6 7 8 9 14 13 12 11 10

list3 is now the longer list that contains both list1 and list2, order is preserved. So list1 will come before list2 and all the elements in list2 will still be in their original order. So to make this a bit more interesting, let’s say that blend does something a bit different. It adds a hyphen between the two arguments.

So blend[“cat”;”dog”] will create cat-dog. Here is a simple definition for blend.

blend: {x,"-",y}

We see that blend takes two implicit arguments, x and y, adds a hyphen in between and returns one single list.

What if instead of blending two items, we wanted to blend more. We could of course create a new function called blend3 that blends three items, but K/Q is much too cool for that instead it provides an adverb over in Q and “/” in K.

So for instance in Q:

 q)blend over ("cat";"dog";"mouse") 
"cat-dog-mouse"

in K:

k)blend/("cat";"dog";"mouse")

So we have generalized our function from two arguments to a list. What is happening is that blend is used with the first item from the list as the first argument and the second item as the second argument. Then the result is used as the first argument and the next item on the list is used as the second. This continues until there are no more items.

To get the intermediate results we can use scan instead of over in Q and \ instead of / in K.

Here is what that looks like:

q)blend scan ("cat";"dog";"mouse"; "horse")
"cat"
"cat-dog"
"cat-dog-mouse"
"cat-dog-mouse-horse"

Now we can generalize this further. By creating a verb that is similar to what python calls join. In python the join verb takes a list of items and adds a string separator in between each item in the list. This is often needed to make comma separated lists.

Here is K/Q implementation:

join:{y,x,z}

This verb takes three arguments, the first arguments will be placed between  the next two arguments.

q)join["!!!! ";"hello";"world"]
"hello!!!! world"

But if we use the projection idea from before which creates another function by supplying some of the arguments. Then we can easily see how to make blend from join.

blend: join ["-";;]

so we can simply write:

q)join["-"] over ("cat";"dog";"horse")
"cat-dog-horse"

OR:

q)join[","] over ("cat";"dog";"horse")
"cat,dog,horse"

We are starting to see sentences forming in K/Q. We read sentences in English from left to right and we do the same in K/Q. The sentences are actually evaluated in right to left order. So it is correct to read it as ‘join a comma over the list’ even though K looks at it  the other way.

It’s sort of like if you read the sentence

Lily jumped high over the yellow hedge.

We don’t know what she jumped over until the end yet we understand the sentence.

If we were to stage the sentence we would first need a yellow hedge then we need Lily and ask her to jump high. The staging of the sentence is how the computer interprets things, but we humans like to read sentences the first way.

This gets to a fundamental point about K/Q there are no precedence rules. The only slight exception is that you can use parentheses to  force pieces of a sentence to be evaluated together, much like you would do in English with hyphenation.

For instance in this phrase:

an off-campus apartment

We could say:

an apartment that is off campus

Then we would have the same meaning without hyphens and usually if we have a choice at least in Q/K we try to rearrange things so that we don’t need to use parentheses as keeping track of how many were opened and closed is hard on the eyes and forces the reader to scan the line many times.  As Donald Knuth said “Programs are meant to be read by humans and only incidentally for computers to execute.”  I will have another post on K/Q style, so back to our topic.

There are only six adverbs in K/Q and a user can not make their own but they are pretty powerful and can be used in many combinations. (Technically if you add your own functions into the q.k script you can create verbs that behave very similarly to adverbs. It is not recommended unless you are an advanced user and have found some particularly useful adverbs to add to the language.)

Let’s go through them and how they work.adverbs

Continue reading

Advertisements

Tables full of Lists or Not

I am taking a brief break from the tutorial series (I’m down the rabbit hole on adverbs and I haven’t sneaked/snuck back out).

While on the Aquaq Group I got the following message: (check out the opensource torq)

I’m trying to build a table that looks like

now                   y2
———————————-
1 8                     1 9
2 5 6                  5 13 7

This is a table with list of float as a column type. I manage to get something by doing the following

table:([] now:enlist `float$(); y2:enlist `float$())
.[`table;();,;((1;20;1);(1;7;9))]
.[`table;();,;((1;8);(1;9))]

I get

now                   y2
—————————————–
`float$()             `float$()
1 20 1               1 7 9
1 8                    1 9

Is there a way to create the table without the `float$()? Thanks.
I came up with some solutions:

The first just load the data in from your set

q)now: 2.0+til 4

q)y2:1.0*til 4

flip `now`y2!(enlist now; enlist y2)

Or initialize it the way it was done in the question and then immediately delete everything.

q)delete from `table where i>0

But the question got me thinking, that there was something wrong with the data model.

It seems unlikely that the best way to handle data is with lists of lists especially in q.

The reason is simple. simple lists in q/k are really fast. they are contiguous blocks of memory. A list of lists are essentially a list of pointers to smaller contiguous blocks. So to test this I created the data-structure that  I think would accomplish everything the questioner wanted:

 

First let’s create some dummy data and put it into the form of the questioner

q) x: ?[;1.0] each 1+100?9
q)y: ?[;1.0] each 1+100?9

?[;1.0] each 1+100?9 this code is two blocks. The first  ?[;1.0]  is a function that will return a list of random floats between 0 and 1 based on the number.

For example (til 10)!(10?1.0) This is just a dictionary where the key is a number from 1 to 10 and the value is the list of random floats.

0| 0.50994288921356201
1| 0.90591901540756226
2| 0.033651269972324371
3| 0.35632792115211487
4| 0.93293070793151855
5| 0.40784171223640442
6| 0.84914475679397583
7| 0.95187389850616455
8| 0.30074122548103333
9| 0.42668652534484863

 

The second part 1+100?9 create a 100 element list of random numbers between 1-9.

The each adverb applies the first function (which is a projection) onto each of the elements in this list.

Which gives us a list of lists of floats.

We can see this easily by applying the verbs count and type.

q) count x
100
q)count each x
1 1 4 9 6 4 7 2 1 8 6 9 4 5 7 1 2 6 4 7 8 4 8 2 2 9 4 5 9 2 9 6 3 5 1 2 9 6 2.. /random numbers in fact the same ones generated by the original command 1+100?9
q)(count each) each x
,1
,1
1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1
..

Similarly we can do the same with type

q) type x
0h /a mixed list
q)type each x
9 9 9 9 9 9 9 9 9 9 9 .. /the type for a list of floats
q)(count each) each x
-9 -9 -9 -9 -9 -9 -9 -9h
-9 -9 -9h
-9 -9 -9h
-9 -9 -9 -9 -9 -9 -9h
-9 -9 -9 -9h
-9 -9 -9 -9 -9 -9 -9h
-9 -9 -9 -9 -9 -9 -9h
-9 -9 -9 -9 -9 -9h
-9 -9 -9h
-9 -9 -9 -9 -9h
.. /the type for an atom float. (remember negative means atom)

So to create the table that the question was asking about we simply use table notation:

q)table:([]x;y)

x                                                                               y                                                                              
---------------------------------------------------------------------------------------------------------------------------------------------------------------
,0.069488948909565806                                                           ,0.61275040498003364                                                           
,0.18058245000429451                                                            0.074964647181332111 0.21553308074362576 0.98548699356615543 0.09629874653182..
0.62432846496812999 0.34225802053697407 0.16392032452858984 0.7786109980661422  0.49375488748773932 0.48304210440255702 0.49691119533963501 0.259749417193233..
0.53403982776217163 0.88706594565883279 0.81696920236572623 0.530728749930858.. 0.04861030331812799 0.13365717371925712 0.99836521386168897 0.152357474202290..
0.36319364630617201 0.15776891494169831 0.45724554779008031 0.974702587584033.. 0.87366164615377784 0.31437360099516809 0.12580484454520047 0.635657876031473..
0.56634279619902372 0.87585472338832915 0.78419443848542869 0.83294638991355896 0.78629159089177847 0.26546700927428901 0.12905376916751266 0.761838810984045..
0.087665317580103874 0.23685944243334234 0.043390074744820595 0.8760851731058.. 0.10706076468341053 0.85431159776635468 0.012844447279348969                   
0.32782130455598235 0.72754358011297882                                         0.89185469760559499 0.651039635296911 0.48640229227021337 0.68811673740856349..
,0.5765261803753674                                                             ,0.28950302815064788

Done.

Now to make the structure that we want to make, we need to turn these lists into one list. That’s easily done with a verb raze. However, we will lose all information about the list of lists. It just flattens the whole thing.

q) raze x
0.69158544205129147 0.92801779927685857 0.95605599298141897 0.909228001954033..

So we need to zip up each element with it’s row number.

We can easily zip each row and row number like this:

q) (til count x),’x
(0;0.069488948909565806)
(1;0.18058245000429451)
(2;0.62432846496812999;0.34225802053697407;0.16392032452858984;0.778610998066..
(3;0.53403982776217163;0.88706594565883279;0.81696920236572623;0.530728749930..
(4;0.36319364630617201;0.15776891494169831;0.45724554779008031;0.974702587584..
(5;0.56634279619902372;0.87585472338832915;0.78419443848542869;0.832946389913..
(6;0.087665317580103874;0.23685944243334234;0.043390074744820595;0.8760851731..
(7;0.32782130455598235;0.72754358011297882)
(8;0.5765261803753674)
..
What we really want is to zip up each item with it’s row number.
q) (til count x),”x
,(0;0.069488948909565806)
,(1;0.18058245000429451)
((2;0.62432846496812999);(2;0.34225802053697407);(2;0.16392032452858984);(2;0..
((3;0.53403982776217163);(3;0.88706594565883279);(3;0.81696920236572623);(3;0..
((4;0.36319364630617201);(4;0.15776891494169831);(4;0.45724554779008031);(4;0..
((5;0.56634279619902372);(5;0.87585472338832915);(5;0.78419443848542869);(5;0..
((6;0.087665317580103874);(6;0.23685944243334234);(6;0.043390074744820595);(6..
((7;0.32782130455598235);(7;0.72754358011297882))
,(8;0.5765261803753674)
.. /notice that the row number is now next to each element

Now we can simply raze this and get

q) raze (til count x),”x
0 0.069488948909565806
1 0.18058245000429451
2 0.62432846496812999
2 0.34225802053697407
2 0.16392032452858984
2 0.7786109980661422
3 0.53403982776217163
3 0.88706594565883279
3 0.81696920236572623
..

Because we need to account for the fact that these list can have variable size, it’s best if we give them also an element number this will also be necessary for creating the table.

Remembering that we can find the element number by using the idiom til count we add the elements place in the list with the following

q) (til each count each x),”x
,(0;0.069488948909565806)
,(0;0.18058245000429451)
((0;0.62432846496812999);(1;0.34225802053697407);(2;0.16392032452858984);(3;0..
((0;0.53403982776217163);(1;0.88706594565883279);(2;0.81696920236572623);(3;0..
((0;0.36319364630617201);(1;0.15776891494169831);(2;0.45724554779008031);(3;0..
((0;0.56634279619902372);(1;0.87585472338832915);(2;0.78419443848542869);(3;0..
((0;0.087665317580103874);(1;0.23685944243334234);(2;0.043390074744820595);(3..
((0;0.32782130455598235);(1;0.72754358011297882))
,(0;0.5765261803753674)

If we add back our row numbers we get

q)(til count x),”(til each count each x),”x
,(0;0;0.069488948909565806)
,(1;0;0.18058245000429451)
((2;0;0.62432846496812999);(2;1;0.34225802053697407);(2;2;0.16392032452858984..
((3;0;0.53403982776217163);(3;1;0.88706594565883279);(3;2;0.81696920236572623..
((4;0;0.36319364630617201);(4;1;0.15776891494169831);(4;2;0.45724554779008031..
((5;0;0.56634279619902372);(5;1;0.87585472338832915);(5;2;0.78419443848542869..
((6;0;0.087665317580103874);(6;1;0.23685944243334234);(6;2;0.0433900747448205..
((7;0;0.32782130455598235);(7;1;0.72754358011297882))
,(8;0;0.5765261803753674)
.. /the first number is the row number, the second the element number and finally the element itself

If we now raze this entire mess. We get a list of lists of three elements.

q)raze (til count x),”(til each count each x),”x
0 0 0.069488948909565806
1 0 0.18058245000429451
2 0 0.62432846496812999
2 1 0.34225802053697407
2 2 0.16392032452858984
2 3 0.7786109980661422
3 0 0.53403982776217163
..

Now we simply create a table from this normalized list. let’s call it xn

q)xn:raze (til count x),”(til each count each x),”x
q)tx:([rid:xn[;0];eid:xn[;1]]x:xn[;2])
q) tx

rid eid| x                   
-------| --------------------
0   0  | 0.069488948909565806
1   0  | 0.18058245000429451 
2   0  | 0.62432846496812999 
2   1  | 0.34225802053697407 
2   2  | 0.16392032452858984 
2   3  | 0.7786109980661422  
3   0  | 0.53403982776217163 
3   1  | 0.88706594565883279 
3   2  | 0.81696920236572623 
3   3  | 0.53072874993085861 
3   4  | 0.49852972221560776 
3   5  | 0.081481670029461384

Let’s pretend this is a cooking show and we have done all the previous steps to y.

Here is our y table

rid eid| y                   
-------| --------------------
0   0  | 0.61275040498003364 
1   0  | 0.074964647181332111
1   1  | 0.21553308074362576 
1   2  | 0.98548699356615543 
1   3  | 0.096298746531829238
1   4  | 0.60473616444505751 
1   5  | 0.98754768050275743 
1   6  | 0.46964235557243228 
2   0  | 0.49375488748773932 
2   1  | 0.48304210440255702 
2   2  | 0.49691119533963501

 

Now we simply need to join these two tables. Q makes this really easy with union join (uj) this will automatically join on the key columns if they are the same and add nulls where the data is missing.

q)t:tx uj ty
Some pretty clever stuff happens underneath. First the row ids are matched up and then the element ids are matched the columns will be matched up. If the second table has missing data it will be filled in with missing values. If one of the keys is missing in the first table then the second table rows will be appended to the end.

t now looks like:

rid eid| x                    y                   
-------| -----------------------------------------
0   0  | 0.069488948909565806 0.61275040498003364 
1   0  | 0.18058245000429451  0.074964647181332111
2   0  | 0.62432846496812999  0.49375488748773932 
2   1  | 0.34225802053697407  0.48304210440255702 
2   2  | 0.16392032452858984  0.49691119533963501 
2   3  | 0.7786109980661422   0.25974941719323397 
3   0  | 0.53403982776217163  0.04861030331812799 
3   1  | 0.88706594565883279  0.13365717371925712 
3   2  | 0.81696920236572623  0.99836521386168897 
3   3  | 0.53072874993085861  0.15235747420229018 
3   4  | 0.49852972221560776  0.55927985697053373 
3   5  | 0.081481670029461384 0.5587915435899049  
3   6  | 0.022724361391738057 0.73704647365957499 
3   7  | 0.99410609365440905  0.14246862730942667 
3   8  | 0.70526488753966987

Because Q will append the missing values from the second table to the end we need to resort the rid column so that we can easily find the rows that go together as in the original structure. This also explains why the only missing rows appear in the y column. Lets sort the rid.

q) `rid xasc `t

rid eid| x                    y                   
-------| -----------------------------------------
0   0  | 0.069488948909565806 0.61275040498003364 
1   0  | 0.18058245000429451  0.074964647181332111
1   1  |                      0.21553308074362576 
1   2  |                      0.98548699356615543 
1   3  |                      0.096298746531829238
1   4  |                      0.60473616444505751 
1   5  |                      0.98754768050275743 
1   6  |                      0.46964235557243228 
2   0  | 0.62432846496812999  0.49375488748773932 
2   1  | 0.34225802053697407  0.48304210440255702 
2   2  | 0.16392032452858984  0.49691119533963501 
2   3  | 0.7786109980661422   0.25974941719323397 
2   4  |                      0.4275001569185406  
2   5  |                      0.032995089888572693
2   6  |                      0.83006249205209315 
3   0  | 0.53403982776217163  0.04861030331812799 
3   1  | 0.88706594565883279  0.13365717371925712 
3   2  | 0.81696920236572623  0.99836521386168897 
3   3  | 0.53072874993085861  0.15235747420229018 
3   4  | 0.49852972221560776  0.55927985697053373 
3   5  | 0.081481670029461384 0.5587915435899049  
3   6  | 0.022724361391738057 0.73704647365957499 
3   7  | 0.99410609365440905  0.14246862730942667 
3   8  | 0.70526488753966987

 

We are done. We have preserved all the information of the original and normalized the Xs and Ys into a single list.

The original table

here is row 6 two different ways:

q) table 6

x| 0.1797088 0.8024157 0.5423326
y| 0.1165928 0.3872393 0.850714 0.04278559 0.6582815

q) select from table where i=6

x                             y                                                
-------------------------------------------------------------------------------
0.1797088 0.8024157 0.5423326 0.1165928 0.3872393 0.850714 0.04278559 0.6582815

In the new table we can get it easily only through sql select

q)select from t where rid=6

rid eid| x         y         
-------| --------------------
6   0  | 0.1797088 0.1165928 
6   1  | 0.8024157 0.3872393 
6   2  | 0.5423326 0.850714  
6   3  |           0.04278559
6   4  |           0.6582815

Getting back to the original table from this new refactored table is much easier:

q)select x:enlist x, y:enlist y by rid from t

rid| x                                                   y                                                
---| -----------------------------------------------------------------------------------------------------
0  | 0.09087531 0.6808 0.03380431 0.8703531 0.8429108    0.4163949                                        
1  | 0.4949937                                           0.5467487 0.5798376                              
2  | 0.3858707 0.2949814 0.8767788                       0.1909843 0.6298909 0.2454277                    
3  | 0.4772341 0.3541479                                 0.3746654 0.1032089 0.2343539 0.5393872 0.5799004
4  | 0.5165465 0.1865312 0.8763466                       0.4432449 0.3670384                              
5  | 0.6212155 0.6579                                    0.163703 0.8951596 0.1463084                     
6  | 0.1797088 0.8024157 0.5423326                       0.1165928 0.3872393 0.850714 0.04278559 0.6582815
7  | 0.7720778 0.7476806 0.8778866 0.9120672             0.955969 0.3724551 0.7092033 0.5509018 0.180806  
8  | 0.5743992 0.01259725 0.7083142                      0.3574173 0.06719808 0.9434847 0.60344           

However, our new vectorized table allows us to find data about these individual values and we can use it to zip up x and y values that might belong together or do any other fun calculations.

 

 

 

 

K for Good Thought: [Part 2] k,q,kdb tutorial verbs, functions

Last time, we got to list things and reorder them a bit,  but that’s kind of boring on it’s own. This second tutorial is concerned with being able to play, transform and generally do things with nouns. In English, these types of words, action words, are normally called verbs and I’ll let the OED define it.

OED 

Verb:

o          NOUN- Grammar

A word used to describe an action, state, or occurrence, and forming the main part of the predicate of a sentence, such as hear, become, happen.

o          VERB- Grammar

Use (a word that is not conventionally used as a verb, typically a noun) as a verb.

Interestingly, though there are two definitions for verb, we will mostly concern ourselves with the first, but I shall leave a small footnote on how the second is related.

The first definition is fairly trivial, that is most speakers of a language understand that you need a category of words that do things, even if you don’t need the category itself. That is we tend to ‘place’, ‘buy’, ‘blend’ and generally do stuff to things. This doing stuff is what I mean by verb. In math, in computer science and in programming languages there is a concept of ‘function’ and this concept performs much of what a verb does in English. It allows you to manipulate the stuff. To take a crude example, verbs are like little machines that allow us to change something about the stuff. So for instance, I am running. Which is to say ‘I’ am now in transit. I think belaboring this point serves to confuse rather than enlighten. So I leave it here. Anyway, I will use the word verb instead of function or something else, as I find those other terms have been overloaded with meaning (people think of very specific examples) especially for programmers.

Verbs take an atom or a list and produce an atom or a list. This list can be a list of atoms or a list of lists or a list of lists of lists….

Whoa, almost got carried away there.

Anyway, the more important thing is to remember that a verb and noun produce a new noun. For those inclined to equations: verb (noun) => new noun

Which can be read as verb applied to noun yields new noun. (don’t worry there will be little else I borrow from chemistry.)

For example, the verb shake produces shaken noun.

Where shaken is an adjective.

Sometimes there is no adjective to describe the new noun.

For example, take the verb blend. If I blend a cat with a dog. I might get catdog or dogcat or a mush, in any case, we can talk of having blended the cat and dog. Often we don’t have a word that describes the new thing. The predicate phrase is the result of the verb. So for instance, Gena went to the zoo, ‘went to the zoo’ is the predicate phrase and is the description of ‘Gena’ since English lacks a special phrase that describes a person who has gone to the zoo. Other times, we do have a phrase “shake this drink” the drink is now mixed or shaken, but it is most definitely not stirred. The predicate phrase is always a description of a subject. Predicate phrases or verbs transform the subject and give us a new subject. One that is described by the predicate phrase.

Because K is a descendant of APL, there are certain conventions which K kept but simplified and enhanced. I will therefore first explain what APL did and then switch to K’s simplification and enhancement.

In APL, all verbs just as in English take a noun and make it do something, the result is always just a noun that has done that thing. So if I(noun) run (verb) then I am-running (new state of ‘I’). Just as in English some verbs need two nouns to make sense or have a different sense if used on their own. For instance, I winked, means something different though related to I winked at him. So for example in K the ‘%’ symbol is the verb divide. If we just say divide seven in K %7, that returns  0.1428571. Which is the reciprocal of 7, or one seventh. Where as if we 21%7 we get 3. This what APL calls monadic and dyadic verbs.

Verbs can be of two types: monadic and dydadic. Simply put, monadic verbs take a noun or a list of nouns on the right and produce a noun or a list (that list can contain lists, as always I will cease making this point and assume it from here on in unless it is not true. Then I will make special mention). Dyadic verbs take a noun or list on the left and the noun or list on the right and create a noun or list.

In the first case % behaves like a monadic verb reciprocate (take reciprocal or multiplicative inverse). That is the verb only referred to the noun following it. In the second case % behaved more like the verb divide, which takes a number on the left and divides it into the right (pun intended) number of equal pieces. The result is the size of each piece. Though this is intuitive for verbs that apply to one or two nouns at a time. It does not make much sense for verbs that apply to more nouns.

This is where K made a change from APL and created it’s own convention, from here on in it’s back to K and Q.

Going back our verb ‘blend’ suppose we wanted to blend a cat, dog, and a rabbit. Where do we put the noun rabbit. Each noun, is often called an argument.

Argument: The noun or list that a verb does its action to is often called an argument. The term is helpful, because we can refer to verbs as taking some number of arguments. For instance, Monadic take one argument, the one on the right of the verb. Dyadic take two arguments, the one on the left and then the one on the right. (Sometimes in some textbooks these are called parameters. I will try to stick with argument.)

K has adopted the following convention. If you have a verb that takes a specific number  of arguments, then you do so by putting the verb first and then putting a open square bracket([) the arguments separated by a semicolon(;) and a closing square bracket (]). This is very easy to see:

In K 10÷5 can be written at least these two ways:
10%5
%[10;5]

The nice thing about the second notation, is that it generalizes to verbs that apply to more than two arguments. You would simply write:
blend[dog;cat;rabbit]

Because this convention is simple, all verbs that a user defines automatically follow it and there is no way for a user to create dyadic verbs that take a left and right argument. Instead user defined verbs always apply to the arguments in the order notated in the brackets. That is instead of the noun being to the right and left of the verb. They are now right and left respectively inside the brackets. There is a small caveat: this convention does not allow you to define a verb that applies to more than eight arguments. The reason is that keeping track of eight arguments is difficult enough and the limit suggests to the writer that possibly the verb they are building should be composed of several verbs, there are ways to get around this limit, but think of it as a guideline.

q simplified this concept of verb even further. All functions in k that can either be monadic or dyadic are always dyadic in q and a word is chosen to represent the monadic case. If you want the reciprocal you are going to have to ask for it directly reciprocal[7].

Alright, we have to answer two basic questions: what are the verbs that we can use?, ie the vocabulary of verbs and two what do we do if a verb is not in the vocabulary? define new verbs.

For the first I will link to a site: http://code.kx.com/nwiki/JB:QforMortals2/built_in_functions

Suppose that a verb is not in the vocabulary. For instance, suppose we want to build into q the idea of cheering up. First we need to define some nouns. Let’s use the emoticons to signify a scale of happiness.

Very simply we might have:

Crying – “:’(”
Sad – “:(”
Annoyed – “:/”
Neutral – “:|”
happy – “:)”
very happy – “:))”
laughing -“:D”
tears of joy – “:’)”

Alright so each of these emoticons, will symbolize possible emotional states. Because list of characters are themselves lists. We need to create a list of lists, the easiest way is to use the enlist verb.

emotions: enlist(“:’(“)

This creates the first emoticon as a separate entry in the list of emotions, the rest can be added by appending. Of course you can do it in one shot but this way, you can use enlist and avoid all kinds of annoyances from KDB trying to interpret all of the symbolic characters incorrectly. Note the semicolons between elements to separate the list of characters. For clarification, I’m putting spaces between the semicolons, but that is not necessary.

emotions,: (“:(” ; “:/” ; “:|” ; “:)” ; “:))” ; “:D” ; “:’)” )

one shot:
emotions:( “:'(“; “:(” ; “:/” ; “:|” ; “:)” ; “:))” ; “:D” ; “:’)” )

So we have emotions. Let’s define a verb that cheers someone up by one level. Let’s call it lift.

Definitions in k and q use the colon notation. Just like lists. But we use curly braces to show that we are creating one unified action. We also can specify an argument name. I will use ‘ce’ which stands for current emotion.

lift:{[ce] emotions:( “:'(“; “:(” ; “:/” ; “:|” ; “:)” ; “:))” ; “:D” ; “:’)” ); emotions[(emotions?ce)+1]}

That is the whole definition. As you can tell most of it is filled with defining the emotions. That’s this part:

lift:{[ce] emotions:( “:'(“; “:(” ; “:/” ; “:|” ; “:)” ; “:))” ; “:D” ; “:’)” ); emotions[(emotions?ce)+1]}

The next bit uses the ? verb to search. It takes a list and returns the number of the first match. If there are no matches it returns the number of elements in the list. Which is one after the last element (remember everything is starts from zero so a list with ten items has an empty slot at 10).

lift:{[ce] emotions:( “:'(“; “:(” ; “:/” ; “:|” ; “:)” ; “:))” ; “:D” ; “:’)” );
emotions[(emotions?ce)+1]}

The last bit increments the index by one and returns the emotion at that index. k,q automatically return the last statement in a verb definition.

lift:{[ce] emotions:( “:'(“; “:(” ; “:/” ; “:|” ; “:)” ; “:))” ; “:D” ; “:’)” );
emotions[(emotions?ce)+1]}

For example lift “:(” returns

“:/”

What about if we reach the end or we search for an emotion not in the list? In both cases lift will simply return an empty string “”.

Well we all know you can lift someone’s spirits, but what about depressing them. It would seem we should be able to depress them and maybe by a certain quantity.

depress: {[ce;q] emotions:( “:'(“; “:(” ; “:/” ; “:|” ; “:)” ; “:))” ; “:D” ; “:’)” ); emotions[(emotions?ce)-q]}

depress[“:|”;2]
returns “:(”

What we have done is add the argument q standing for quantity and now we can depress by a certain factor. Suppose we wanted to make a really simple version of depress that can only move an emotion down by one level. Let’s call it unlift.

unlift:{depress[;1]x}

Really short right?!

But let’s see what we did. First we took advantage of the fact that k,q has built in argument names for the first three arguments, they are x,y,z respectively. So I didn’t have to name it in the beginning explicitly like this

unlift: {[x] depress[;1] x}

Implicit Arguments: In k,q any verb you create can will automatically use x,  y,  z as the first 3 parameters. There are two equivalent ways to write a verb that converts the dimension to volume:

vol_3d:{[x;y;z] x*y*z}
vol_3d:{ x*y*z}

You can’t mix these two. So the following doesn’t work

vol_3d:{[h] x*y*h}

Now, to the really cool trick.  If you don’t fill in all the arguments, k will automatically create a new verb that is only missing the arguments you left out. So since I filled in that I would like to depress the emotional state by 1, and k doesn’t know which emotion I want to depress. It creates a new verb that depresses by 1. We then give that new verb a name and we now have a new verb. Note that if you get rid of the depress verb; k won’t be able to unlift. So technically, unlift has a dependency on depress.

This is known as function currying or projection in math and in some fancier programming languages, but don’t get scared by the name. It’s actually really simple. In APL, K, Q family this is almost always referred to as projection. The name comes from an analogy that most people are all familiar with. On a sunny day at high noon your shadow can follow you but can only move in two dimensions. It is projected onto the surface of the ground. The difference between your location and the location of the shadow, is that you can move in three directions {north,south} {west,east} and {jump(up,down)}. While your shadow can only move in two {north,south} {west,east}. The same idea is happening to a verb in K like addition (+), normally + adds two different numbers. But if we wanted to create verb named increment.
increment:+[1;]
This new verb always adds one to a number. Some of the flexibility is gone from the addition verb.

Finally (Footnote as promised) on the second definition of verb.

The second definition is the idea of taking a word that is not normally a verb and making it a verb. Well, remember lists. Lists had a very similar notation to how we use verbs.

For instance: emotions 3 or emotions [3] both return

“:|” the fourth element in the list.

In other words, lists are actually a special kind of verb! Specifically, they are verbs that transform a particular index number into whatever is at that place in the list. This applies even more broadly to a type of list in k,q called a map or dictionary. Here is a really simple example: (we use ! to define this structure)

emotionsdict: `crying`sad`annoyed`neutral`happy`vhappy`laughing`tofjoy! ( “:'(“; “:(” ; “:/” ; “:|” ; “:)” ; “:))” ; “:D” ; “:’)” )

crying  | “:'(“
sad     | “:(“
annoyed | “:/”
neutral | “:|”
happy   | “:)”
vhappy  | “:))”
laughing| “:D”
tofjoy  | “:’)”

and emotionsdict `crying returns
“:'(”

and emotionsdict?”:'(” returns
`crying

The point is that instead of just thinking of this as an emotion to emoticon dictionary we can also think of it as a verb that takes an emotion and displays it as emoticon. More on this here. Also our addition and multiplication tables behave like those functions. For instance add[2;3] returns 5.

This idea that all lists are really special verbs or that verbs are special types of lists is very freeing. Because the next time you want to make a list, first think is there a way for me to just make a verb.

For example, are the even numbers a list or a verb? They are both:

(0, 2, 4,6…) and {2*x}.

Which is easier?

K for Good Thought: [Part 1] k,q,kdb tutorial atoms, nouns

I am a new convert to the K/Q/J/APL cult. This makes me suited to write a tutorial for those who find this whole family of languages quite arcane.

The tutorial is called K for Good Thought, an homage to Ken Iverson’s notation as a tool for thought and the tutorials ‘learn x for great good’ meme. Leslie Lamport claims that until you have expressed your thoughts in writing you haven’t really thought. So there is something about being able to express yourself in K that allows for another way of thinking. Often the K solutions tend to be simpler and more general. Also, because K is a computer language interpreted by a machine that does exactly what you told it to do, it’s a much stricter reader of your thinking.

I should start by saying that this is a language that values brevity over almost everything. As Arthur Whitney said in an interview. If you can do the same thing and it’s shorter and not much slower he wants to hear about it. If it’s faster you better tell him.

Arthur Whitney is a man of few words and when a subject is very difficult and the explanation is dense and technical, few will march on. So with that, I aim to make this tutorial approachable for those who have had high school math and no programming experience. If  I venture out into jargon or CS concepts or high level math, I will try to explain the concepts first before diving into the syntax, though of course sometimes one will motivate the other. I believe that the less programming background you have the more likely the concepts will become natural to you since you won’t need to unlearn many bad habits.

There are many tutorials that give the history of these languages.  So I am not going to bother, except to say that apl was developed to allow programmers to work on the problem at hand and not on the low level implementation, though of course inevitably you will need to get your hands dirty with the inner workings of K as well, but those tend to be advanced topics. –quick credit to all the abridged and short materials by Arthur Whitney [AW] (K, Q, KDB)

Let us start at the beginning:

In the beginning, AW created atoms and they were good.

  • Atoms are indivisible.
  •  Atoms can be of different types.

    I promised not to use jargon, so when I do. You will find an explanation in a quote box like this:
    Types are things that allow us to distinguish between different kinds of data. That is, an Atom can be a number or letter or a timestamp. Not everything that you can think of is a type. For instance, perhaps you want to have an emotion as a type and you would use various emojis to represent them.
    [AW] has not seen it fit to add them to the types that K understands. However, you can create an emotional type. We will look at that later.

Some example atoms:

  • 5
  • 1b
  •  ‘string of letters’
  • `symbol
  • 2009.11.05D15:21:10.040666000

We cannot do much with atoms, just like we cannot really say much if we just had nouns. I guess we could classify different Nouns: person, place or thing.

K uses @: to interrogate and find out what something is. In q we can use ‘type’

Because K is terse it uses numbers to signal what type something is.

A negative sign in front means that the type is an atom.

  • @:5 = -7                                              An integer between negative 2^63 and positive 2^ 63
  • @:1b = -1                                            Boolean: Can be (1b) True or (0b) False
  • @: ‘s’ = -10                                           A charecter or letter
  • @:`symbol   = -11                                    A token (advanced) used to refer to one copy
  • @:2009.11.05D15:21:10.040666000 = -12  A Timestamp

Besides classifying we can make lists:

In English:

Animals: cat, dog, mouse

And in K we can do something similar.

  • integers: 1 2 3 4 5
  • animals: `cat `dog `mouse
  • alpha: (“a”;”b”;”c”)

Here something unexpected happens. While ‘animals’ is a list of symbols, ‘alpha’ is actually a list of characters which is what programmers call a string.

So alpha can also be written as “abc”. In fact, if you ask K to display alpha it will automatically display it as “abc” instead of “a”,”b”,”c”.

If we check what “abc” is

@:”abc” =  10

It’s positive. That’s because if the noun is actually a list then kdb will tell you what it’s a list of.

Oh by the way, I snuck in some notation. In English we often have a colon ‘:’ that tells us that a definition follows. Well K lets you do the same thing.

In K/Q all lists are semicolon(;) separated enclosed in parentheses. As a matter of convenience K /Q lets you define some common lists just by using spaces. So the following are equivalent.

(1;2;3) and 1 2 3  

So alpha is defined as a list of characters specifically a, b, c.

Lists in K are always ordered, which means that we can talk about the first or last element in a list.

In fact K has a couple of things things that are designed specifically for lists.

First lets introduce the count (!) in K or the til in q

This allows us to create lists of integers until that number starting from 0.

For example !10

0 1 2 3 4 5 6 7 8 9

til10: !10

til10

0 1 2 3 4 5 6 7 8 9

til10[1] /the second item on list; list start with zeroth item

1

til10[9] /the tenth item on the list; remember to start counting from 0

9

Let’s say we want just first 3 numbers

3#til10 /the pound sign or hashtag means ‘take’

0 1 2

-3#til10 /this takes the last 3 slots. The negative sign signifies that we are taking from the end of the list.

7 8 9

3_til10 /The underscore(_) is used to drop. if we only want what’s not in the first 3 slots

4 5 6 7 8 9

(3_til10) = (-7#til10)

1111111b

There is an equivalence between take and drop. Dropping the first 3 is equivalent to taking the last 7, if your list has 10 things. The equals (=) sign is used to compare things each slot from each list will be compared and return a 0 or 1. 1 if it is equal and 0 if the two elements in the slots are not equal.

In general we can say given a list with n items dropping the first x items is equivalent to taking the last n-x items. In our example, we have 10 items so dropping 3 is the same as taking the last 7.

What about reversing the order of a list?

In K it’s ‘|:’ in Q it is just called ‘reverse’.

reverse 7_til10

9 8 7

Finally, we can make lists of lists. In English we call these tables. In math we call them matrices.

m: (1 2 3; 4 5 6)

1 2 3
4 5 6

In K ‘+:’ is flip and in Q it is ‘flip’

in K: “+:m” is the same as “flip m” in Q

+:m
(1 4; 2 5; 3 6)
1 4
2 5
3 6

m[0] Returns the first noun, which is a list of numbers.

 1 2 3

Remembering the game battleship helps to see what is happening.

m[0][2]

3

BONUS SECTION

What if we wanted to make a list of lists of lists. We can! In math these are called tensors. In English these don’t really have a name, I think most people don’t have a use for them.

However, we all encounter these structures in the form of pages. Let me explain, If a line in a book is a list of words (or as k would have characters), then a page in a book is a list of lines and a collection of pages, a book, is a list of lists of lists. One way to reference a word in a book would be to tell you the page number (dimension 1), the line number (dimension 2), and the word number (dimension 3). To make this clear, lets do a pretty simple example. Let’s say that we wanted to write a math textbook and the first couple of pages were reference tables. The first table might be the addition table. The second table could be multiplication.

Let’s first build the addition table.

in K:

a:!10;add:(a+)’a

in Q:

a: til 10; add:(a+) each a;

add is now equal to

0 1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9 10
2 3 4 5 6 7 8 9 10 11
3 4 5 6 7 8 9 10 11 12
4 5 6 7 8 9 10 11 12 13
5 6 7 8 9 10 11 12 13 14
6 7 8 9 10 11 12 13 14 15
7 8 9 10 11 12 13 14 15 16
8 9 10 11 12 13 14 15 16 17
9 10 11 12 13 14 15 16 17 18

We can see that indexing add[3;2] will give us the cell containing 5, which is what we would expect. Cell add[2;3] will also equal 5, but I highlighted add[3;2], notice row column order. A good way to remember this is to think of a regular list as a vertical structure. If the list has lists inside, then to get to a particular row and column, we need to first go down the first list and then enter the second level list and go to the right. So the top level is always a row number.  Basically, the first index, indexes the first level list. A deeper list will be a deeper index. Page number, line number, word number.

Let’s now create the mltp table.

in K:

mltp:(a*)’a

in Q:

mltp:(a*) each a

mltp

0 0 0 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7 8 9
0 2 4 6 8 10 12 14 16 18
0 3 6 9 12 15 18 21 24 27
0 4 8 12 16 20 24 28 32 36
0 5 10 15 20 25 30 35 40 45
0 6 12 18 24 30 36 42 48 54
0 7 14 21 28 35 42 49 56 63
0 8 16 24 32 40 48 56 64 72
0 9 18 27 36 45 54 63 72 81

So our reference tables can be a list of tables:

reftab:(add;mltp)

reftab[0] is the first table or the addition table.

reftab[1] is the second table or the multiplication table

I leave  creating a subtraction table in reftab[2] to the reader. (hint use the comma(,) to join a subtraction table to reftab. Like so reftab:reftab,subtr;)

You can make a division table as well, you will see how K/Q handles division by zero.

Now let’s say you wanted to play 3 dimensional battleship.

Each player could either attack the surface or the submarine  layer. Suppose in this really simple game you had two ships, each 2 units long. and there were only 6 squares. 2*3 board.

Here is a sample board:

Layer 0 or surface:

0 0 1

0 0 1

Layer 1 or submarine level

1 1 0

0 0 0

I have a submarine and a boat. Both are two units long. We need three coordinates to track down a single location.

First let’s make the surface layer

s: 0 0 1

sur: (s; s)

Next we can make the submarine layer

easy way

sub: (1 1 0; 0 0 0)

fancy way

sub: 2#|:+:sur,enlist 0 0 0 /Don’t worry I’ll explain.

sur, enlist 0 0 0 /gives us a three by three table. Note the comma used to append a list to a list.

(0 0 1;0 0 1;0 0 0)

+: sur, enlist 0 0 0 /flips the table on it’s side, sometimes called transpose

(0 0 0;0 0 0;1 1 0)

|: +: sur, enlist 0 0 0 /reverses the flipped table only reverses the top level list

(1 1 0;0 0 0;0 0 0)

2# |: +: sur, enlist 0 0 0 /takes the first 2 rows of the flipped table

(1 1 0;0 0 0)

Now we can make the whole board

b: (sur; sub)

((0 0 1;0 0 1);(1 1 0;0 0 0))

b[0]

(0 0 1;0 0 1)

b[0][1]

0 0 1

b[0][1][2]

1

And obviously we can make lists of lists of lists of lists…. and everything will work the same. But I’m stopping here.

A Better Way To Load Data into Microsoft SQL Server from Pandas

This post is not going to have much theory.

Here was my problem. Python and Pandas are excellent tools for munging data but if you want to store it long term a DataFrame is not the solution, especially if you need to do reporting. That’s why Edgar Codd discovered, and Michael Stonebreaker implemented, relational databases.

Obviously, if I had the choice I wouldn’t be using Microsoft SQL Server [MSS]. If I could, I would do this whole exercise in KDB, but KDB is expensive and not that good at storing long strings. Other relational databases might have better integration with Python, but at an enterprise MSS is the standard, and it supports all sorts of reporting.

So my task was to load a bunch of data about twenty thousand rows — in the long term we were going to load one hundred thousand rows an hour — into MSS.

Pandas is an amazing library built on top of numpy, a pretty fast C implementation of arrays.

Pandas has a built-in to_sql
method which allows anyone with a pyodbc engine to send their DataFrame into sql. Unfortunately, this method is really slow. It creates a transaction for every row. This means that every insert locks the table. This leads to poor performance (I got about 25 records a second.)

So I thought I would just use the pyodbc driver directly. After all, it has a special method for inserting many values called executemany. The MSS implementation of the pyodbc execute  many also creates a transaction per row. So does pymssql.
I looked on stack overflow, but they pretty much recommended using bulk insert.Which is still the fastest way to copy data into MSS. But it has some serious drawbacks.

For one, bulk insert needs to have a way to access the created flat file. It works best if that access path is actually a local disk and not a network drive. It also is a very primitive tool, so there is very little that you can do in terms of rollback and it’s not easily monitored. Lastly, transferring flat files, means that you are doing data munging writing to disk, then copying to another remote disk then putting the data back in memory. It might be the fastest method, but all those operations have overhead and it creates a fragile pipeline.

So if I can’t do bulk insert, and I can’t use a library. I have to roll my own.

It’s actually pretty simple:

 

 

You create your connection, I did this with sqlalchemy but you can use whatever:


import sqlalchemy
import urllib
cxn_str = "DRIVER={SQL Server Native Client 11.0};SERVER=server,port;DATABASE=mydb;UID=user;PWD=pwd"
params = urllib.quote_plus(cxn_str)
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
conn = engine.connect().connection
cursor = conn.cursor()

You take your df.
I’m going to assume that all of your values in the df are strings, if they are not.
And you have longs, you are going to need to do some munging to make sure that when we stringfy the values sql knows what’s up.
(so if you have a long, then you need to chop the L off because MSS doesn’t know that 12345678910111234L is a number. here is the regex to get rid of it: re.sub(r”(?<=\d)L”,”,s ) credit to my coworker who is a regex wiz.

We need to tuplify our records, Wes Mckinney says:

records = [tuple(x) for x in df.values]

Which is good enough for me. But we also need to stringfy them as well.

So that becomes:

records = [str(tuple(x)) for x in df.values]

MSS has a batch insert mode that supports up to 1000 rows at a time. Which means 1000 rows per transaction.

We need the sql script that does batch insertion: we’ll call that insert_

insert_ = &quot;&quot;&quot;

INSERT INTO mytable
(col1
,col2
,col3
...)
VALUES

&quot;&quot;&quot;

Since we can only insert 1000 rows. We need to break up the rows into 1000 row batches. So here is a way of batching the records.
My favorite solution comes from here.

def chunker(seq, size):
return (seq[pos:pos + size] for pos in xrange(0, len(seq), size))

Now all we need to do is have a way of creating the rows and batching.
Which becomes:

for batch in chunker(records, 1000):
rows = ','.join(batch)
insert_rows = insert_ + rows
cursor.execute(insert_rows)
conn.commit()

That’s it.
This solution got me inserting around 1300 rows a second. A 1.5 orders of magnitude increase.

There are further improvements for sure, but this will easily get me past a hundred thousand rows an hour.

Kafka On the Shore: My Experiences Benchmarking Apache Kafka Part II

This is part II of a series on Benchmarking Kafka Part I can be found here:

In the first part we used Spark to blast a 2gb csv file of 10 million rows into a three machine Kafka cluster. We got the speeds down to about 30 seconds. Which means it would take about 4 hours to blast a Terabyte. Which is fast, but not blazing fast.

Any number I put here  will become obsolete within a year, perhaps sooner. Nevertheless, I’ll put myself out there. If on modest hardware we could achieve 1 terabyte in 40 minutes that would be enough I think to impress some people. which is about 400mb/s

Now again, because of Kafka’s memory flush cycle. We can only get the speed we want up to 8gb per machine. Really less, because there is some Ram usage by the os itself and any other applications running on those machines, including in my case Spark usage.  So conservatively we can try and get 4gb per machine. At 400mb/s for two minutes straight.

Using some tricks, this kind of throughput can be accomplished on pretty modest hardware.

  1. no replication
  2. partitioning

Now the hard part is finding a machine gun that can fire those messages that fast. A distributed solution seems like the best move and replicates real world type of messaging many sources each blasting away messages.

So I fire up a spark instance and load in a large csv file of 10 million rows ~1.8gb. I re-partition the data set to take advantage of the number of cores available to me. And then I run the mapPartitions function, which allows each partition to independently of all others blast kafka with all of it’s messages, eliminating much of the overhead.

I then get a sustained message blast of about 200mb without the machine falling over.

[my interest in benchmarking kafka has been temporarily put to rest. I no longer have a kafka cluster at my disposal and looking at more local messaging solutions; both are zero broker:

Zero MQ

Aeron]

Dosage Search [Part 2]

So since publishing the first post on this topic. I have researched this question.

First I need to acknowledge, that my CS friends have all pointed out the similarity between this question and the two egg puzzle.

You are given two eggs and placed in a 100 floor building. You need to find the highest floor where the egg could survive a drop.

If you only have one egg, then you obviously just go up one floor at a time. Eventually the egg breaks or you reach floor 100. In first case, the floor below is the answer. In the second,  floor 100, since there are no higher floors.

If you have 10 eggs. You can just do binary search and find the right floor very quickly.

Since you have two eggs you can do better than one egg. The post I linked to earlier has a solution that tries to ensure the same number of drops as the worst case of binary search. Another approach is to drop the first egg on 2^n floor. So the first time would be from floor 2 then from floor 4, then from floor 8 and that way quickly climb up the building. This is similar to how network connections back off when a service is not responding. Retrying less frequently as the number of retries goes up.  (This allows the service to restart without falling over from the bombardment of ever more frequent requests)

As similar as this egg approach is, it fails to capture something that I think is essential to the dosage question. That is that, there is a difference between making a mistake of dropping the egg the fifty floors too high and making a mistake by one floor.  — Even if during the search you can’t tell, as that would be an obvious way to find the result in two tries.

So, I ran the code from the previous section and got some result for lists that had up to 1000 elements. However, python started to take too long for larger lists.

This prompted me to rewrite the functions into q. q and it’s sibling k, is an obscure language except to those in the financial services industry. It was created by a genius, Arthur Whitney, whose interview speaks for itself. The language itself is a derivative of APL and Lisp with a few more good ideas thrown in for good measure.  (More on this language in a different post)

Anyway the same algorithm as before which in python took 30+ hours ran in 20 minutes in q.

The relative speed of the k code. Allowed me to see a pattern for the divisor. As the list grows longer, the divisor that we use to split the list should get larger, albeit slowly.

Data (More at the bottom)
Number of elements in list divisor Average cost
50 5 7.78
100 6 11.48
500 13 27.128
1000 17 38.85
4000 31 78.87175
5000 35 88.3382
10000 49 125.4369

 

Since the divisor wasn’t growing for lists that were similar in size, I realized the mistake I made by keeping the divisor constant. Obviously as the list gets shorter we need to update the divisor that we are using to divide the list, since the problem has been reduced to searching a smaller list.

 

This led me to create another type of binary search with an updating divisor. The divisor grows proportional to log(n)^3.

This also increased the speed of the search since I was no longer doing linear search on any list smaller than the divisor I was testing. To explain: if you have a static divisor, then when you start you have a list of 1 million elements. So you divide by 200 and you know if the item you are searching for is in the first range [0:5,000) or in the second range (5000: 1,000,000]. However, gradually the list gets smaller, but you keep dividing the list in this very uneven way, so that when the number of elements is less than 200, you keep looking at the first element. This is equivalent to linear search.

If instead we start to divide the list by smaller divisors, then we can get closer to binary search and since much of the list is eliminated our chances of making a huge mistake are also smaller.

Here is the q code with the two versions of search: (this might be unreadable to most people but I plan on having a tutorial series on k, q, kdb soon)

dif_cost: {:x-y};
trivial_cost: {x+y; :1};
scalar_cost: {x+y; :100};
/binary search with cost function
st: {[l;item;cf;d] cost:0; while [((count l) > 1); m: (count l) div d; $[l[m]<item; [cost+:1; l: (m+1)_l]; l[m]>item; [cost +: (cf[l[m]; item]); l:m#l]; :cost+:1];]; cost};
/calculates average cost for a particular divisor and n, by searching for each element 
/then avg all those costs
st_avg: {[n; d] i: til n; res: @[i; i; st[i; ; dif_cost; d]]; avg res };
 
/iterates over divisors only going from 2 to 10 *( floor (log (n)))
/example for 100 it goes from 2 3 4 … 37 38 39 
/ this covers the divisors that are relevant and minimizes the cost and removes unneeded computations
st_div: {[n] d: $[n<50; 2_til n; 2_til 10 * floor log n]; res: st_avg[n] each d; d[first where res = min res],min res}

 

/Algorithm with updating divisor

s_auto: {[l;item;cf, f] cost:0; while [((count l) > 1); d: max (2, floor (log (count l) xexp f) ); m: (count l) div d; $[l[m]<item; [cost+:1; l: (m+1)_l]; l[m]>item; [cost +: (cf[l[m]; item]); l:m#l]; :cost+:1];]; cost};

/Then we can minimize over f, which is the factor that we exponentiate the log N.

 

 

Data continued:

num d cost
50 5 7.78
100 6 11.48
150 7 14.32667
200 8 16.755
250 9 18.768
300 10 20.72667
350 11 22.49143
400 11 24.15
450 12 25.66
500 13 27.128
550 13 28.51455
600 13 29.85833
650 14 31.12
700 14 32.34143
750 14 33.50267
800 15 34.62875
850 15 35.75529
900 15 36.84667
950 16 37.84526
1000 17 38.85
1050 17 39.83143
1100 17 40.78818
1150 17 41.75652
1200 17 42.7125
1250 19 43.6
1300 19 44.46538
1350 19 45.35333
1400 19 46.18
1450 19 47.02966
1500 19 47.84467
1550 19 48.68581
1600 20 49.46
1650 21 50.25697
1700 21 51.01176
1750 21 51.79314
1800 21 52.54167
1850 21 53.27514
1900 22 54.02211
1950 22 54.73641
2000 22 55.4305
2050 23 56.16927
2100 23 56.82714
2150 23 57.52884
2200 23 58.20273
2250 23 58.88222
2300 23 59.55609
2350 25 60.20553
2400 25 60.85167
2450 25 61.47714
2500 25 62.0944
2550 25 62.74235
2600 25 63.33962
2650 25 64.00113
2700 25 64.59259
2750 26 65.21273
2800 27 65.84
2850 27 66.39614
2900 27 67.0331
2950 27 67.58475
3000 27 68.20133
3050 27 68.73803
3100 27 69.29355
3150 28 69.88635
3200 27 70.44688
3250 28 71.00954
3300 28 71.56636
3350 29 72.11164
3400 29 72.64853
3450 29 73.16986
3500 29 73.70571
3550 30 74.30901
3600 29 74.79778
3650 29 75.31123
3700 31 75.84622
3750 31 76.37627
3800 31 76.87974
3850 31 77.36545
3900 31 77.85179
3950 31 78.39165
4000 31 78.87175
4050 31 79.37975
4100 31 79.88659
4150 31 80.37084
4200 31 80.87167
4250 32 81.35765
4300 31 81.8307
4350 33 82.3223
4400 33 82.81409
4450 32 83.28472
4500 33 83.76067
4550 33 84.21473
4600 33 84.69391
4650 33 85.15935
4700 33 85.60809
4750 33 86.07411
4800 33 86.555
4850 33 86.98887
4900 33 87.45633
4950 35 87.92222
5000 35 88.3382
10000 49 125.4369

 

 

Dosage Search [Part 1]

Sometimes, you come across problems that you think should be well studied but aren’t.

(please correct me if I’m wrong)

Medicine is famous for not having good cross pollination with the maths. See this famous paper which rediscovers the Riemann sum.

However, as bad as medicine is at noticing math, most disciplines are especially bad at adapting the study of algorithms, one of the most theoretical branches of Computer Science. This is true even for seasoned programmers.

There is a reason for this. Brute force works remarkably well. In fact, so well that Ken Thomspon, author of the Unix operating system, said:

“When in doubt, use brute force.” – Ken Thompson

There are some geniuses in the field, notably  Don Knuth. In Coders At Work some of  the best Computer Scientists admit to reading little to none of his series The Art of Computer Programming, saying the math is too difficult.

So, it’s not surprising that medical dosages are done with brute force. There is a method for picking the starting point, called median effective dosage which produces a result in 50% of the population.

However, doctors are averse to either overdosing or underdosing depending on the condition. They tend to start at some point and then proceed straight through the search space until they get to the right level. This algorithm is called linear search. It is brute force and it is guaranteed to find the right level should it exist.

There are much better ways to do a search assuming a sorted list. Let’s create a game to illustrate this:

Suppose I chose a number between 1 and 100 and You had to guess which number I chose. Every time you guessed, you pay me a dollar and I’ll tell you if you are above or below. If you guess correctly I pay you 10 dollars. Clearly, you are not going to start at 1 and just keep guessing up to 100.  Instead you can start at 50 and then eliminate half the range. If you keep halving what’s left you can find any number in 7 tries. Thus earning 3 dollars of profit.

The process of halving the interval is called binary search and it works great when the cost of guessing to high or to low is the same. In this case you paid a dollar for each guess. Suppose though instead, you paid 2 dollars if you guess above and only 1 dollar if you guess below. What strategy should you pursue?

What if you pay the difference  between the guess and the true answer and 1 dollar if it’s below. Obviously, you would pay this at the end of the game so as not to reveal the answer immediately. What does the reward have to be to make this game fair and what is the optimal strategy. In effect, this is the game doctors play when they are prescribing a dosage. Every guess, costs the patient time and health, but there is a higher cost to over prescribing or under prescribing. And this asymmetry can be captured in such a game.

A possible solution is to divide the interval into 4 and then guess that number. Keep doing that and you are guaranteed to get an answer as well and you’ll also guess under about 3/4 of the time. The right way to sub divide the interval and guess depends on the cost of each guess and the cost of guessing over. Is there an answer? Sure. But I haven’t seen a medical journal that has researched this question and so we do brute force, and that’s a problem.

Below is python code that simulates these questions:

def linear_search(alist, item, cost_func):
	first = 0
	last = len(alist)-1
	found = False
	cost = 0
	while first&lt;=last and not found:
		current = first
		if alist[current] == item:
			found = True
		else:
			if item &lt; alist[current]:
				cost = cost + cost_func(alist[midpoint],item)
			else:
				cost = cost + 1
				first = current+1
	return cost+1
def binary_search(alist, item, cost_func, cost=0, divisor=4):
	blist = list(alist)
	if len(blist) == 0:
		return cost
	else:
		midpoint = len(blist)//divisor
		if blist[midpoint]==item:
			return cost + 1
		else:
			if item&lt;blist[midpoint]:
				c = cost + cost_func(blist[midpoint],item)
				return binary_q_search(blist[:midpoint],item, cost_func, c)
			else:
				c = cost + 1
				return binary_q_search(blist[midpoint+1:],item, cost_func, c)
def binary_q_search(alist, item, cost_func, cost=0, divisor=4):
	blist = list(alist)
	if len(blist) == 0:
		return cost
	else:
		midpoint = len(blist)//divisor
		if blist[midpoint]==item:
			return cost + 1
		else:
			if item&lt;blist[midpoint]:
				c = cost + cost_func(blist[midpoint],item)
				return binary_q_search(blist[:midpoint],item, cost_func, c)
			else:
				c = cost + 1
				return binary_q_search(blist[midpoint+1:],item, cost_func, c)
def trivial_cost_func(guess, true):
	return 1
def scalar_cost_func(guess, true):
	return 100
def dif_cost_func(guess, true):
	return (guess-true)
lin = []
binar = []
quar = []
cost_func = scalar_cost_func
a = 1000
for i in xrange(0,a):
	lin.append(linear_search(xrange(0,a),i, cost_func))
	binar.append(binary_search(xrange(0,a),i, cost_func))
	quar.append(binary_q_search(xrange(0,a),i, cost_func,divisor=4))

	print &quot;The average cost for linear search: &quot; + str(sum(lin)/float(a))
	print &quot;The average cost for binary search: &quot; + str(sum(binar)/float(a))
	print &quot;The average cost for quarter search: &quot; + str(sum(quar)/float(a))

Continued in Part 2

 

Optimal Churn

There is a new approach that studies systems as organisms and classifies them as super or sub linear.

The idea is simply to plot the growth against the rate of growth. In other words at each point what will the next increase look like. For instance, cities are systems that seem to have superlinear growth. That is once they start growing they keep growing. Because more people make create more opportunities and so this positive feedback cycle makes the city grow without dying. Of course any system with uncontrollable positive feedback will go out of control (like a petri dish where eventually the bacteria grow faster than the available food source and they all die mass extinction) luckily we have some negative feedback mechanisms namely prices. So when cost of living goes up near by areas become part of the City. And thus the boundary of the City expands. Though at times, it should be noted cities experience shrinkage (which the authors of this research seem to ignore )

However companies are sublinear. That is why they die. The claim is they usually they die because as an organization gets bigger its tolerance for diversity decreases and the administrative tasks don’t scale. These combined forces cause the company to become less competitive in their industry, especially when combined with an ecological shift. That is all the advantage of being large is lost when the competition can be nimble and this is highlighted when the landscape changes.
This article talks about how scaling works.

Another system entirely ignored in these studies, which I believe is revealing are universities. They tend not to die and they tend to not grow all that much. One of the ways that they maintain a steady state is that they have a constant stream of new students who want to contribute to the institution this keeps the institution alive and relevant.

Companies are often wary of churn,  but it can be a good thing. Especially when combined with the right incentives. If young employees come in work really hard get promoted and then retire but pass on their experience to the next generation and leave. That is actually a good thing. The faster you can train employees, the less bloated your bureaucracy can become. Also an additional benefit is that you get highly motivated employees who stay highly motivated until they leave.  Furthermore, adding to the work force becomes easier since jobs are well defined and adding people happens often.

An example of this is: the classical trading job. Most people fall into tracks from this job. Rise and retire or burn out. Few would rise and become senior managers. The knowledge that spots on top are constantly being vacated motivates employees to work hard to get those spots. The commmon practice of retiring or leaving in general provides some incentive to pass along experience to the next generation. The constant flow through the system provides feedback and molds it to be efficient at training and self maintaining the flow.
In fact, one possible causal mechanism for failure to adapt to technology by the banks was that the average age of traders rose during the 80s because of several trends. There was economic uncertainty and low birth rate during the 1970s. That meant that there wasn’t a significant enough flow through the companies so instead of adapting these companies, became conservative and static.

St Petersburg Paradox:

I learned about this paradox recently through an interview question.

The paradox comes from the following game. more here at wiki:

The casino flips a coin until it gets heads. The number of tails is squared and payed to the player. How should they price this game?

 

Having never seen this question, I proceeded to take the expectation. That is the weighted average of the payouts. Well the answer is infinity, this can be seen intuitively from the fact that the payout grows exponentially at the same rate as the diminishing probability of the event happening.

The paradox is that no one will place huge sums of money to play this game, even though theoretically, they should – since it has an unlimited expected payoff.

The variance of this game is also infinite which stems from the fact that the payouts are finite but the expectation is infinite. E(X) = (X-E(X))^2

There are several resolutions:

A utility function approach which says we need to discount the payoffs by the utility function. In other words, the first million dollars brings me more utility then each next million. However, if this is not bounded, you can simply keep increasing the speed of payouts given any unbounded utility function.

Another approach suggests that people discount the probability of negligible events. However, lottery ticket sales undermine this argument, since those seem to be low probability events that people pay too much for. However, this counter argument neglects to mention that certain events are discounted completely if they are below a certain probability. As an example, the chance that all the air will decide to stay in one side of my tire, nobody will pay any amount of money for that event. Same goes for the law of gravity to stop applying, there is some negligible probability for that event, but no matter how large the payoff no one will by that ticket.

An experimental approach, suggests that you should pay around 10 dollars. Having played the game 2048 times.

Another approach suggests that you would need the casino to be good for an infinite sum of cash and since they aren’t no one would place that money.

A combination of utility and the previous reason, gives an answer that is close to the experimental result. Suppose you would be willing to play and suppose it takes a minute to flip a coin. You have a magic genie that has guaranteed that the coin will flip tails until you say heads after which it will be heads. How long will you play?

Most likely, you will stop after you are the richest person in the world, but that only will take an hour. After that you apparently have more money than the value of earth by three orders of magnitude. If you discount to that best case scenario, you get no paradox at all, in fact the most you would then pay is 1/4*60 = 15 dollars. If you understand, that casino can’t possibly guarantee more than a billion in winnings, that brings the value down to 29.8973529 ~ 30, which says it’s closer 7.5 dollars. If you are playing against another person you can’t expect more than a million dollar payout so you shouldn’t bet more than 1/4*20, unless you are playing against Harry Kakavas.

One last way to think about this problem. What is the expected value of the game itself. The answer to this is 1. That is the total number of flips will be 2 but you will only expect one tail. In which case, you expect 1 dollar. So perhaps you should simply pay 1 dollar because that’s how the game will play out.This is called the median value of the game and is in fact what many people bet.

The fact is, the more money you have the more valuable your time becomes, and it makes less and less sense to keep playing the game. So you will tell your genie to stop, because now that you have earned all this money you want to go out and spend it.