Vector thinking for Max Profit

Occasionally I am lucky enough to run into a problem that doesn’t already have many vector solutions, so I can demonstrate how vector thinking can lead to an elegant solution. To build up to the insight, let’s solve two easy variations and then get to the two more interesting puzzles.
The problem in question is called Best Time to Buy and Sell or more popularly simply Max Profit.
The theme is always the same: you are given an ordered list of positive numbers that represents the evolution of a security’s price. You must find the maximum profit you could have made given this price series. You can only hold one unit at a time and you cannot short, which means you must buy before you can sell (you can only profit from the price moving up).
The simplest and most well-known variation is the one-transaction problem (LeetCode 121): you can buy and sell at most once. What is the max profit that can be earned from the following series?
example series:
3 2 5 1 3 2 4 9 5 12 3 5
The key insight into this problem is that we are looking for the largest difference between the current element and the smallest element we have seen.
To motivate this insight, let’s look at ever larger versions of this problem starting with just one price to detect the pattern.
3 -> 0
Given just one price we can only buy and sell on the same day, so effectively max profit is 0
3 2 -> 0
Given these two points we can still only earn 0, since we must sell after we buy.
3 2 5 -> 3
Now that we have a higher number after a lower number we can plainly see that of the two options buy at 3 or buy at 2, buy at 2 and sell at 5 is better.
3 2 5 1 -> 3
The answer is still 3, since we can’t sell higher than 5, and even if we bought at 1 at the end there are no days left to sell
3 2 5 1 3 -> 3
We still do best by buying at 2 and selling at 5, buying at 1 and selling at 3 only earns 2.
3 2 5 1 3 2 -> 3
Plainly the 2 at the end doesn’t improve anything.
3 2 5 1 3 2 4 -> 3
We can now either buy at 2 and sell at 5 or buy at 1 and sell at 4, but our max profit is still 3.
3 2 5 1 3 2 4 9 -> 8
Finally, the max profit changes! We can now buy at 1 and sell at 9. As new elements are added the new reward will be a function of the lowest element seen thus far.
We then get our solution:
{max x-mins x} 3 2 5 1 3 2 4 9 5 12 3 5
11, where we buy at 1 and sell at 12.
Or in python we could say something like:

import itertools
from operator import sub
def max_profit(p):
    return max(map(sub,p,itertools.accumulate(p,min)))

To unpack this a bit, mins returns the min element up to that point:
mins[3 2 5 1 3 2 4 9 5 12 3 5]
3 2 2 1 1 1 1 1 1 1 1 1
we can then do element-wise subtraction to see what the profit is from selling at each point
0 0 3 0 2 1 3 8 4 11 2 4
taking the running max we get the max profit so far,
0 0 3 3 3 3 3 8 8 11 11 11
we can see that this series is exactly the results that we got when we were looking at the max profit for the prefixes of the series.
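For reference, here is a small Python sketch that reproduces those intermediate series; the variable names are just illustrative.

import itertools

p = [3, 2, 5, 1, 3, 2, 4, 9, 5, 12, 3, 5]
mins = list(itertools.accumulate(p, min))           # running minimum: 3 2 2 1 1 ...
profit = [x - m for x, m in zip(p, mins)]           # profit from selling at each point
running = list(itertools.accumulate(profit, max))   # best profit seen so far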

Easy enough, let’s look at version 2 of this problem (leetcode 122):
unlimited number of transactions: we are now allowed to buy and sell as often as we like, provided we never own more than 1 share of the stock and we still must sell after we buy. To make things easier we can assume that we can buy and sell on the same day; in practice this doesn’t matter, since any day we do this we can consider that we did nothing.
Keeping the same price series as before:
3 2 5 1 3 2 4 9 5 12 3 5
We now notice that we should take advantage of every single time the price goes up. Again let’s look at a smaller example to get some intuition.
3 2 5 -> 3
Nothing fancy here, we buy at 2 and sell at 5, purchasing at 3 doesn’t improve our profit since selling at 2 would incur a loss.
3 2 5 1 3 -> 5
buy at 2 sell at 5, buy at 1 sell at 3
3 2 5 1 3 2 4 -> 7
just add one more purchase at 2 and sell at 4,
3 2 5 1 3 2 4 9 -> 12
Here it becomes interesting, we can look at two interpretations
buy at 2 sell at 5 (3),buy at 1 sell at 3 (2), buy 2 sell at 9 (7) for a total of 12
or we can say:
buy at 2 sell at 5 (3),buy at 1 sell at 3 (2), buy at 2 sell at 4 (2),buy at 4 sell at 9 (5) for a total of 12.
The two approaches are identical since addition commutes: it doesn’t matter how you get from 2 to 9, you will always earn 7.
Which means that we can simply add up the positive steps in the price series. That will be the maximum profit for an unlimited number of transactions:
Our code is simply:
{(x>0) wsum x:1 _ deltas x} 3 2 5 1 3 2 4 9 5 12 3 5
21
deltas gives us the adjacent differences. We drop the first delta, since we cannot sell before we have bought. If a delta is positive (>0) we want to include it in our sum. Using wsum (weighted sum), the positive deltas are weighted with a 1 and anything less than or equal to 0 gets a weight of 0.
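For comparison, a rough Python equivalent of the same idea; the function name is just illustrative.

def max_profit_unlimited(p):
    # add up every positive day-over-day move
    return sum(max(b - a, 0) for a, b in zip(p, p[1:]))

max_profit_unlimited([3, 2, 5, 1, 3, 2, 4, 9, 5, 12, 3, 5])  # -> 21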

Now that we covered the case with 1 transaction and unlimited transactions, we should feel confident that we can tackle 2 transactions. Lucky for us, that’s exactly version III of the problem (leetcode 123):
Getting insight into how to solve this variation should start with thinking about how we could solve this naively. The correct heuristic to reach for is divide and conquer. Suppose that for every step along the price evolution we knew what the max profit was before and the max profit was after. Then we could sum those two max profits and take the largest combination.

assume max_profit function as before:

r=range(len(p)+1)
max([max_profit(p[0:i] or [0])+max_profit(p[i:] or [0]) for i in r])

For our example this gives us 15.

What we might notice is that we are recomputing the max profit over and over for larger and larger prefixes, and for smaller and smaller suffixes.
We already know how to get the max profit for every prefix: simply take the rolling max instead of the overall max in max_profit. The only thing we’re missing is how to calculate the max profit of the suffixes.
Now we notice a symmetry – while the prefixes are governed by the rolling minimum, the suffixes are governed by the rolling maximum from the right, i.e. the largest element still available to sell into.

To get a sense of this, look at the solutions to the suffixes
3 5 -> 2 (buy at 3 sell at 5)
12 3 5 -> 2 (buy at 3 sell at 5)
5 12 3 5 -> 7 (buy at 5 sell at 12)
9 5 12 3 5 -> 7 (buy at 5 sell at 12)
4 9 5 12 3 5 -> 8 (buy at 4 sell at 12)
Which leads to this solution in q:

{max reverse[maxs maxs[x]-x:reverse x]+maxs x-mins x}

maxs x-mins x / the running version of the expression from earlier: the max profit for each prefix
If we want to walk backwards through a list, we can reverse the list and apply the logic forwards then reverse the result, which is the same as applying the logic to the suffixes.
So all we need to do is to subtract each element from the running maximum then take the running max of this result to get the max profit for the suffix so far.
In Python this simply becomes:

from itertools import accumulate as acc
from operator import add,sub
def max_profit_2trans(p):
    before=list(acc(map(sub,p,acc(p,min)),max))
    p=p[::-1]
    after=list(acc(map(sub,acc(p,max),p),max))
    after.reverse()
    return max(map(add,before,after))

Alright, finally we should be able to tackle the most general version of this problem (leetcode 188). You are given both a series of prices and allowed to make up to k transactions. We have already solved k equal to 1 and k equal to 2. You can verify for yourself, but if k is larger than half the length of the price series, the answer is the same as in the second (unlimited transactions) problem we solved: since every pair of days can allow for at most one transaction, k is effectively unlimited at that point.
We need to solve the cases where 2 < k < n/2. The standard CS technique that comes to mind is some kind of dynamic programming approach. To quote MIT’s Erik Demaine (6.006), dynamic programming is recursion plus memoization.
Let’s set up the recursive solution and assume we are at a particular (i)ndex in our price series, with k transactions left.
Let’s start with the base cases:
If k equals 0 we return 0
If i is greater than the last index, i.e. there are no elements left in the list: we return 0.
Otherwise the solution to the problem is simply the maximum of 2 options:

do nothing at this step:

0 + the result of this function with i incremented

do something:

If we are (h)olding a share, we can sell at this step, which adds the current price plus the result of this function with one less k and i incremented by 1.
Otherwise we buy at this step, which is minus the current price (we spend money to buy) plus the result of this function with i incremented and we are now holding a share.

Here is the code for this in q:


p:3 2 5 1 3 2 4 9 5 12 3 5
cache:(0b,'0,'til[count p])!count[p]#0
f:{[h;k;i]$[i=count p;0;(h;k;i) in key cache;cache[(h;k;i)];
:cache[(h;k;i)]:.z.s[h;k;i+1]|$[h;p[i]+.z.s[0b;k-1;i+1];.z.s[1b;k;i+1]-p i]]}

We can test this and see that it results in the right answer for k=0,1,2
f[0b;0;0] -> 0 no surprise with 0 transaction no profit is possible
f[0b;1;0] -> 11 the original problem
f[0b;2;0] -> 15 buy at 1, sell at 9, buy at 5 sell at 12
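For readers who prefer Python, here is a rough translation of the same memoized recursion, using the same price list p; functools.lru_cache plays the role of the explicit cache dictionary.

from functools import lru_cache

p = [3, 2, 5, 1, 3, 2, 4, 9, 5, 12, 3, 5]

@lru_cache(maxsize=None)
def f(holding, k, i):
    if k == 0 or i == len(p):
        return 0
    do_nothing = f(holding, k, i + 1)
    if holding:
        do_something = p[i] + f(False, k - 1, i + 1)   # sell at this step
    else:
        do_something = -p[i] + f(True, k, i + 1)       # buy at this step
    return max(do_nothing, do_something)

f(False, 2, 0)  # -> 15, matching the q result above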

However, we know that a vector solution is possible for k=1,2,infinity, so we might hope there is a vector solution for when k is 3 and above.
We can analyze the intermediate results of the cache to get some sense of this.

code to look at the table

t:(flip `h`k`j!flip key cache)!([]v:value cache)
exec (`$string[j])!v by k from t where not h 

output table:

 | 11 10 9 8 7 6  5  4  3  2  1  0 
-| --------------------------------
0| 0  0  0 0 0 0  0  0  0  0  0  0 
1| 0  2  2 7 7 8  10 10 11 11 11 11
2| 0  2  2 9 9 12 14 14 15 15 15 15
3| 0  2  2 9 9 14 16 16 17 17 18 18
4| 0  2  2 9 9 14 16 16 18 18 20 20

What we might notice is that the first row (k=1) holds, for each starting index, the best single-transaction profit achievable from that point onward. Then each row is the combination of the previous row and 1 additional transaction. In other words, each row allows you to spend money from the previous row and sell at the current price.
Putting this into action we get the following:

{[k;p]last {maxs x+maxs y-x}[p]/[k;p*0]}

We can look at intermediate rows by writing the scan version of the code and we know it will converge to the unlimited transaction case:

{[p]{maxs x+maxs y-x}[p]\[p*0]} p
0 0 0 0 0 0 0 0  0  0  0  0 
0 0 3 3 3 3 3 8  8  11 11 11
0 0 3 3 5 5 6 11 11 15 15 15
0 0 3 3 5 5 7 12 12 18 18 18
0 0 3 3 5 5 7 12 12 19 19 20
0 0 3 3 5 5 7 12 12 19 19 21

What’s really nice about this code is that it actually describes the problem quite well to the computer and in effect finds exactly what we want.
p*0 / assume we have 0 transactions and therefore 0 dollars at each step.
y-x / subtract the current price(x) from the current profit(y) element wise
maxs y-x/ find the rolling max revenue
x+maxs y-x / add back the current selling price
maxs x + maxs y-x / find the rolling max profit
repeating this k times finds the max profit using at most k transactions.

Here is the python code that implements this logic:

import itertools
from operator import add,sub
from typing import List

def max_profit(k: int, prices: List[int]) -> int:
    if k==0 or len(prices)<2:
        return 0
    z=[0]*len(prices)
    maxs=lambda x: itertools.accumulate(x,max)
    addeach=lambda x,y: map(add,x,y)
    subeach=lambda x,y: map(sub,x,y)
    for i in range(k):
        z=maxs(addeach(prices,maxs(subeach(z,prices))))
    return list(z)[-1]

I think this post covers this puzzle, let me know if there are interesting variations I haven’t explored.

Stable Marriage in Q/KDB

A friend of mine is applying to Medical Residency programs, which motivated me to investigate the matching algorithm used to match residents with hospitals. NRMP the organization behind the matching process has a dedicated video explaining how the algorithm works. The simplest version of the problem can be thought to consist of n guys and n gals who are looking to pair off. Each gal has a ranked list of guys and each guy a ranked list of gals. We want to find a pairing such that no guy and gal can pair off and be better off. We call this stable, since any gal who wants to leave can’t find a guy who isn’t currently happier with someone else. It takes two to tango, or in this case disrupt a stable pairing.

Gale and Shapley published a result on this problem in 1962 that showed a few amazing properties:

  1. A stable pairing always exists
  2. There is an algorithm to get to this pairing in O(n^2)
  3. It is truthful from the proposing side, i.e. if men are proposing then they can’t misrepresent their preferences and achieve a better match
  4. The proposing side gets the best stable solution and reviewing side gets the worst stable solution
  5. The reviewing side could misrepresent their preferences and achieve a better outcome

Knuth wrote about this problem pretty extensively in a French manuscript. The University of Glasgow has become a hub for this problem, Professor Emeritus Robert Irving wrote the book on the topic and his student David Manlove has written another. New variations and approaches keep coming out, so in a way this problem is still an active area of research. For this post I want to look at only the simplest version and describe a vectorized implementation.

First let’s walk through an example by hand (note: lowercase letters are men)

Men ranking women

d| `D`B`C
b| `B`C`D
c| `C`D`B

Women ranking men

D| `b`c`d
B| `b`c`d
C| `d`b`c

First we pair off d,D
Then we pair off b,B
Finally we pair c,C

If we had started from the women:

First we pair D,b
Then we pair B,b and drop D to be unpaired, because b prefers B to D
Then we pair D,c
Finally we pair C,d

So we see there are two stable pairings. B always pairs with b, but either the men (d,c) or the women (D,C) will get their worst choice.

The algorithm:

Propose a pairing and engage the pair:

  • If the reviewing party (woman) prefers the current proposing party to her current engagement:
    1. Jilt the current party (A jilted party goes back into the pile of unengaged proposing parties, with all the parties he has proposed to so far crossed off)
    2. Become engaged to the proposing party (A reviewing party always prefers any partner to being alone)
  • If the current proposal did not work:
    1. Continue down the ranked list of the proposing party until they find a match, i.e. go back to the first step with his next choice (a short Python sketch of this loop follows)
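Here is that naive proposer-side loop as a minimal Python sketch; the dictionary names are illustrative and separate from the q code that follows.

def gale_shapley(men_prefs, women_prefs):
    # men_prefs / women_prefs: dict mapping each person to their ranked list of partners
    rank = {w: {m: i for i, m in enumerate(prefs)} for w, prefs in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}    # next woman each man will propose to
    engaged = {}                               # woman -> man
    free = list(men_prefs)
    while free:
        m = free.pop()
        w = men_prefs[m][next_choice[m]]       # his best woman not yet proposed to
        next_choice[m] += 1
        current = engaged.get(w)
        if current is None:
            engaged[w] = m                     # she prefers any partner to being alone
        elif rank[w][m] < rank[w][current]:
            engaged[w] = m                     # jilt the current fiance
            free.append(current)               # he goes back into the unengaged pile
        else:
            free.append(m)                     # rejected, he will try his next choice
    return engaged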

Here is how we can implement the naive version of this algorithm in Q/KDB:

i:3 /number of each party
c:0N /number of ranked individuals on each list 
     /(null takes a random permutation, convenient since we want everyone to rank everyone) we can also use neg[i]
m:neg[i]?`1 /choose i random symbols with 1 char to represent each possible man
w:upper m /assign same symbols but upper case as list of all possible women
dm:m!(#[i;c]?\:w) /create a dictionary of male preferences
       / each man chooses a random permutation of women
dw:w!#[i;c]?\:m /each women chooses a random permutation of men
t:1!([]w:w;m:`;s:0W) /our initial pairing table, each women is unpaired, 
	          /so her current score is infinity, 
	          /the table is keyed on w, so that a women can only have 1 match
ifin:{$[count[x]>c:x?y;c;0W]} 
 /ifin returns the position in the list like find(?), 
 /but returns infinity if not found instead of length of list
 /example 
 /ifin[`a`b`c;`b] --> 1 
 /ifin[`a`b`c;`d] --> 0W 
The algorithm
f:{[t]
 if[null cm:first key[dm] except (0!t)`m;:t];
 im:first where t[([]w:k);`s]>s:dw[k:dm cm]ifin\: cm;
 t upsert (k im;cm;s im)};

f takes a table of the current pairings. It then looks for the first unmatched man; we call this the (c)urrent (m)an. If the current man is null, we are done: all men are matched, so we return t, our pairings.

Otherwise, find the first woman on his ordered ranking whose happiness (s)core he can improve. We use ifin to see where he falls on each woman’s ranking, and by choosing the first such match we are preferring his choice. We then upsert this pair into the pairing (t)able, along with the happiness (s)core.
f over t /repeat the process of adding pairs into the table until all men are matched.

Before we look at the vectorized (batched) version, there are some simple optimizations we can make. For example, since we will repeatedly be looking up how much a woman likes a man when he chooses her, we can pre-tabulate all pairs in advance.

Here is the code that creates a table that will have every man and woman pairing with both his and her score next to it:

/first convert the dictionary to a table that has each man and his rankings
/along with a column which ranks the match, 
tm:{([]m:key x;w:value x;rm:(til count ::)each value x)} dm
/looks like 
m w     rm   
-------------
m E M P 0 1 2
e P M E 0 1 2
p M E P 0 1 2
/then ungroup looks like this:
tm:ungroup tm 

m w rm
------
m E 0 
m M 1 
m P 2 
e P 0 
e M 1 
e E 2 
p M 0 
p E 1 
p P 2 

repeat the procedure with women's table. 

tw:ungroup {([]w:key x; m:value x;rw:(til count ::) each value x)} dw

/now left join each match 
ps:tm lj `w`m xkey tw
m w rm rw
---------
m E 0  2 
m M 1  0 
m P 2  1 
e P 0  2 
e M 1  2 
e E 2  1 
p M 0  1 
p E 1  0 
p P 2  0 

Now that we have pre-computed all the scores we can look at the vectorized version:

First, we notice that we can technically make proposals from all the men in one batch. There is one caveat, if two or more men rank one woman as their top choice during a matching batch the tie is broken by the woman. So we can grab each man’s current first choice and break all the ties with that woman’s ranking.

Second, we notice that any woman who is currently engaged will stay engaged unless a man she desires more becomes available. This means we can eliminate all women who are already engaged to someone they prefer more from each man’s ranking. In other words, if Alice is currently engaged to Bob, and she prefers only Carl to Bob, Sam doesn’t stand a chance and we can eliminate Alice from Sam’s ranking completely.

These two tricks are all it takes to vectorize(batch) the solution.

Now let’s look at the code:

Here is the algorithm in one block:

imin:{x?min x} /get the index of min element
smVec:{[ps;t]ms:exec last s by w from t;
 update check:(check+1)|min rm by m from ps where not m in (0!t)`m,rw<ms[w];
 p:select m:m imin rw, s:min rw by w from ps where rm=check;t upsert p}

Let’s break it down statement by statement:

smVec:{[ps;t]
/
/ smVec takes a data table which are our (p)recomputed (s)tatistics 
/ it also takes the pairs table t, which is the same as above
/
 ms:exec last s by w from t;
/
/ms is our current best score for each woman who is currently engaged
/
 update check:(check+1)|min rm by m from ps where not m in (0!t)`m,rw<ms[w];
/ For each man we keep a counter of who we should check next, 
/ we increment this each time he is not yet engaged (not m in (0!t)`m) 
/ so we look at his next viable candidate, 
/ also we remove any candidates that he cannot satisfy (rw<ms[w]). 
/ He needs a woman who prefers him to her current match.
/ So we take the max of either the next candidate 
/ or the next one whose score would go up if he matched.
 p:select m:m imin rw, s:min rw by w from ps where rm=check;
/ now we simply select all the ones to check and break ties by min rw,
/ ie the woman's rank.
 t upsert p}
/lastly update/insert current matches into the pairings table

We need a stopping condition so that we can stop iterating:

Originally, I thought we need a condition that said something like “as long as there are unmatched men, keep iterating. {not all key[dm] in (0!x)`m}”

pairings:smVec[update check:-1 from ps]/[{not all key[dm] in (0!x)`m};t]

As it turns out, once we remove all the clutter and only ever append viable matches, this condition is unnecessary. We can simply iterate until convergence, which is the default behavior of over (/). So running the function looks like this:

pairings:smVec[update check:-1 from `ps] over t

We are done, but we can extend this one step further by allowing for incomplete preference rankings. The canonical way to treat incomplete ranking is to assume that a person would rather be a widower than be married to someone he didn’t rank.


To do this, we simply add the person matching with himself at the end of every ranking and add a shadow woman which is just the man ranking himself first. This means that once we get to the end of his ranked list he will automatically match with himself. With this small addition to the data, the complete setup function looks like this:

setup:{[i;c]c:neg c;m:neg[i]?`4;w:upper m;
 dm:m!(#[i;c]?\:w),'m;dw:w!#[i;c]?\:m;
 dw,:(m!0N 1#m);t:1!([]w:w,m;m:`;s:0W);
 tm:{([]m:key x;w:value x;rm:(til count ::)each value x)} dm;
 tw:{([]w:key x; m:value x;rw:(til count ::) each value x)} dw;
 ps:ej[`m`w;ungroup tm;ungroup tw];
 `dm`dw`t`ps set' (dm;dw;t;ps)}
setup[1000;0N] /0N means rank every member.
pairings:smVec[update check:-1 from `ps] over t

Now we can verify that we have the right solution by comparing to an existing Python implementation (the matching library):

/write out some test data for python
setup[1000;0N]
pairings:smVec[update check:-1 from `ps] over t
`:m.csv 0: "," 0: ungroup {([]m:key x;w:-1 _' value x)}dm
`:w.csv 0: "," 0: ungroup {([]w:key x;m:value x)}count[dm]#dw

/python solve: (we need to increase the recursion limit in python because the library can't solve large problems without it, also I wasn't able to run problems bigger than 2000 on my machine without a segfault)
/
import pandas as pd
from matching.games import StableMarriage
import sys
sys.setrecursionlimit(50000)  
m=pd.read_csv("m.csv") 
w=pd.read_csv("w.csv") 
md=m.groupby("m")['w'].apply(list).to_dict()  
wd=w.groupby("w")['m'].apply(list).to_dict()  
game=StableMarriage.create_from_dictionaries(md,wd)
s=game.solve()
pd.DataFrame([(k, v) for k, v in s.items()], columns=["m","w"]).to_csv("mwsol.csv", index=False)

\
pysol:("SS";enlist ",") 0: `:mwsol.csv
pysol:pysol lj update mm:m from pairings
0=count select from pysol where mm<>m

If we join on the women and our men match the python men, then we have the same answer.

Lastly, a note on performance: we get a 3x performance boost from vectorization
The naive version (n members per side -> milliseconds):
\t pairs:f over t
5000 -> 185771 (3 min)
2000 -> 15352 (15 sec)
1000 -> 2569 (2.5 sec)
\t smVec[update check:-1 from `ps] over t
5000 -> 59580 (1 min)
2000 -> 4748 (5 sec)
1000 -> 1033 (1 sec)

The python version takes 20 seconds when n is 1000, and I couldn’t test higher.
That is about 7x slower than the naive q version and about 20x slower than the vectorized one.

Further, the python library doesn’t allow for incomplete preference lists (we can create complete lists that have sentinel values where the list ends, but the memory usage is prohibitively expensive for large problems). Lastly, the python library runs into the recursion limit, which we avoided by iterating over a state data structure instead of using the stack.

Meanwhile, in practice there are about 40k med students a year and each ranks fewer than 100 choices, which means we can solve that problem really quickly.
setup[100000;100]
\t pairings:smVec[update check:-1 from `ps] over t
1465 (1.5 seconds)

In the next post, I will describe how to extend this solution to apply the stable marriage problem to resident hospital matching problem where each hospital may have more than 1 slot.

It’s Flooding on all sides!

This post is inspired by a happy coincidence: I was doing part 2 of day 14 of Advent of Code 2017 and I had never formally encountered the flood fill algorithm. This confluence of events inspired me to come up with a new way to solve this classic problem.

You are given a boolean matrix and told that points are neighbors if they are either next to each other horizontally or vertically, but not diagonally. If point (a) and point (b) are neighbors and point (c) is a neighbor of point (b) then (a),(b) and (c) are in the same neighborhood. A point with no neighbors is a neighborhood of 1.  You are asked how many such neighborhoods exist. 

Shamelessly borrowing the picture from adventofcode:

11.2.3..-->
.1.2.3.4   
....5.6.   
7.8.55.9   
.88.5...   
88..5...   
.8...8..   
8888888.-->
|      |   
V      V   

You can see that this piece of the matrix, has 9 neighborhoods. Points that are off in the boolean matrix are simply ignored. 

This problem is a classic because it is apparently how the bucket fill tool works in every paint program. Essentially you keep filling all the pixels that are neighbors until you are done.

Now the classical way to solve this problem is flood fill. Essentially the problem is broken into two pieces:

  • An outer loop that visits every point
  •  An inner loop that looks for neighbors using breadth first search or depth first search (sketched in Python below)
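To make the two loops concrete, here is a rough Python sketch; the names are illustrative and m is assumed to be a square list of 0/1 rows.

from collections import deque

def flood_fill_count(m):
    c = len(m)
    seen = set()
    hoods = 0
    for r in range(c):
        for col in range(c):
            if m[r][col] == 0 or (r, col) in seen:
                continue
            hoods += 1                          # found a new neighborhood
            q = deque([(r, col)])
            seen.add((r, col))
            while q:                            # breadth first search over its neighbors
                y, x = q.popleft()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < c and 0 <= nx < c and m[ny][nx] and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        q.append((ny, nx))
    return hoods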

Here I will annotate the Q code that achieves this result.

ff:{[m] /this is flood fill function
	/ returns a table with all the distinct neighborhoods
	/m is a matrix
	/c is the number of rows
	/columns this matrix has, we assume it's square
	c:count m;
	/we flatten m; m is now a list of length c*c	
	m:raze m; 
	/we create t, a table that is empty
          /and has columns id, and n 
	 /id will be the id of the neighborhood
	 /n will be a list of neighbors in that hood
	t:([id:()]n:());
	/gni is a helper function that get neighbor indexes
	/gni will accept the index as an offset 
	/ from the flattened matrix 
	/and will return coordinates as flat offsets.
	/internally it needs to represent 
             /the coordinates as a pair,
	/ we use standard div and mod to represent the pair 
	gni: { [x;y]
	  /x is the size of the matrix it will be fixed to c
	  /y is the point whose neighbors we want to find
	  /ny are the new row coordinates
	      / found by integer-dividing the point by c
	      / we add the offsets 
	      / but makes sure we are within boundaries of matrix
	      / min 0  and max c-1
    
    ny:0|(x-1)&0 -1 1 0+/:y div x;
	
		  /repeat this process with mod to calculate the column coordinates
    
    nx:0|(x-1)&-1 0 0 1+/:y mod x;
    
      /get back original flattened index
       /by multiplying ny by c and adding nx
      /flatten list and only return distinct points
	 distinct raze nx+x*ny
	 /project function onto c, since it will be constant 
	 / for the duration of the parent function 
	      }[c]; 

   /f is our breadth first search function
   /we start by adding i, which is our first point, 
   /we then find all the gni of i,
   /we remove any gni[i] that are already present in i
   /and intersect the new points that are on (=1)  
   /by intersecting with j
   /we project this function on gni, 
   / which is called gni inside as well.
   /we also project on the points that are on 
   /(are equal to 1) aka (where m)
   / and which are named j inside the function)
	 f:{[gni;i;j](),i,j inter gni[i] except i}[gni;;where m];
   /repeating f over and over will yield
   /  all the neighbors of a point that are on (=1)

   /now we get to the outer loop, 
    /which will check 
    /if we have seen the current point  (p)
   /if we have seen it, we ignore it,
    /and return our current table (t)
   /otherwise, we call f on that point and return 
    /the point and all of it's neighbors
   /this will end up with our new table t, 
   /we only call this function on the points 
   	/that are turned on, 
   /which means we can skip checking if the point is on  
  /we will repeat this for all the points equal to 1      
	t:{[f;m;t;p]$[not p in exec n from t;
						t upsert (p;f over p);
						t]}[f;m]/[t;where m];
  }

The entire code without comments looks like this, and is pretty short.

ff:{[m]
	c:count m;m:raze m; t:([id:()]n:());
	gni:{distinct raze (0|(x-1)&-1 0 0 1+/:y mod x)+x* 0|(x-1)&0 -1 1 0+/:y div x}[c];
	f:{[gni;i;j](),i,k where j k:gni[i] except i}[gni;;m];
	t:{[f;m;t;p]$[not p in raze exec n from t;t upsert (p;f over p);t]}[f;m]/[t;where m]
	}

With that out of the way, I now want to explain an alternate way of achieving something essentially identical. 

A common technique in math and engineering fields is if you can’t solve a problem, find a similar problem that is easier that you can solve and solve that first.

So I pretended that all I needed to do was to find runs, which is a set of consecutive 1s in a row. I figured if I do this for all the rows and all the columns, I’ll have a starter set of neighborhoods and I can figure how to merge them later. 

Finding a run is really simple:

  1. set the neighborhood id to 0
  2. start at the beginning of first row
  3. look at each element,
  4. if you see a 1 add element to the list that corresponds to your current neighborhood id
  5. if you see a zero increment the neighborhood id
  6. if you reach the end of a row, go to the next row and increment the neighborhood id; if there are no rows left, go to step 8
  7. go to step 3
  8. remove neighborhood ids with no elements in them. 

In q, you can take a bit of a shortcut by asking each row, for the indexes with a 1 in them. So for example:

q)where 1 0 1 1 0
0 2 3
q) deltas where 1 0 1 1 0
0 2 1

One thing we notice is that consecutive differences of 1 represent runs and those with any other number represent cut points, and if we cut at those points, we would get (,0; 2 3), which is a list that contains two lists; each inner list is a run in that row. We can repeat this for the columns and create all the runs that exist row wise or column wise.
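Here is a rough Python rendering of that run-splitting idea, assuming a row is given as a list of 0/1 ints; it returns the indexes that make up each run.

def runs(row):
    on = [i for i, v in enumerate(row) if v]      # indexes of the 1s
    groups, current = [], []
    for idx in on:
        if current and idx != current[-1] + 1:    # gap bigger than 1: start a new run
            groups.append(current)
            current = []
        current.append(idx)
    if current:
        groups.append(current)
    return groups

runs([1, 0, 1, 1, 0])  # -> [[0], [2, 3]]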

Now we just need some logic to merge these runs. Here is an annotated version of the code that implements this method of finding all the neighborhoods. 

rr:{[m] /This will return all the points in the matrix that are on and their assigned neighborhood id
	/m is the square matrix
	c:count m; /c is the length of a side of the matrix 
	m:raze m; /m is now the flattened list that corresponds to the matrix
	/runs is a function that will return a list of lists, 
	/each inner list corresponds to points that are adjacent
	/Though the mechanism to calculate adjacency is in terms of y (the positions),
	/it will return the Xs (the labels) that correspond to those Ys.
	/eg if we have a row 0 1 1 0 1 0 1 1, the on points sit at positions 1 2 4 6 7
	/and if we label them `b`c`e`g`h then runs[`b`c`e`g`h;1 2 4 6 7] => (`b`c;,`e;`g`h)
	runs:{(where 1<>-2-': y) _  x}; 
	/now we want to apply runs to every row and column
	/n will be our initial list of neighborhoods,
	/ each neighborhood is a list that will contain adjacent elements either by row or by column
	/ elements are given a unique id
	/i will be our unique id , it is the index of the point that is turned on (=1)
	i:where m;
	/j groups all 'i's that are in the same row, by using integer division
	/so j is a dictionary whose keys are the row index and whose values are the 'i's 
	/we need to do this grouping to ensure that only points in the same row are considered adjacent
	/ otherwise we might believe that a diagonal matrix has runs (because we only see the column indexes)
	j:i group i div c;
	/since runs returns x, but uses 'y's to do the splitting, we provide j mod c as y
	/j mod c corresponds to the column location of each i
	/we then  apply runs on (')each j and j mod c. 
	/finally we flatten the result with a raze so that we just a have the list of row neighborhoods
	n:raze runs'[j;j mod c]; 
	/now we will repeat with the columns
	/we want to permute the list i so that the elements appear in column order
	/we want to preserve the labels we have in i
	/so we generate a list that is the length of m (c*c),
	/we reshape (#) to be a square matrix, and transpose it along the diagonal (flip)
	/we flatten(raze) it back out and intersect it with i, 
	/this preserves the column order we achieved from doing the flip, 
	/but only keeps the elements that are in i
	/k are the elements of i permuted along columns
	/eg if we have a matrix 
	//           0 0 1 
  	//           1 0 0
  	//           0 1 0
  	// i would be 2 3 7
  	// and k would be 3 7 2
	k:(raze flip (c;c)#til c*c) inter i;
    /now we will create the j  again which is our grouping 
    /but this time according to the column index so we will group by the integer remainder (mod)
    /this will separate our indexes into columns
    j:k group (k mod c);
    /then we will apply our runs function on each column
    / we use the div to get the row index of each element, 
    / so that runs can tell if two points are vertically adjacent
    / we flatten the result and append it to n
    n,:raze runs'[j;j div c];
    / at this point we have a list of all possible neighborhoods, but there are overlaps
    / so we create a table t, which will assign an id for each neighborhood contender
    / ungroup flattens the table so that every element belongs to a neighborhood id
    // here is a sample of what the table looks like before and after ungroup
    / before ungroup:
	/				n         id
	/				------------
	/				0 1       0 
	/				,3        1 
	/				,5        2 
	/				8 9 10 11 3  
	/after ungroup
	/				id n 
	/				-----
	/				0  0 
	/				0  1 
	/				1  3 
	/				2  5 
	/				3  8 
	/				3  9 
	/				3  10
	/				3  11
    t:ungroup `id xkey update id:i from ([]n);
    / f is our merging function.
    / It will merge neighborhoods together, 
    / by giving them the same id if there are points in common.
    / First we will update/create a column p which is the minimum id for a given element (n)
    / Then we will update all of the ids the with minimum p and rename that the new id
    / If two neighborhoods share a point in common, that point has two neighborhood id's
    / we will assign it primary id (p) which is the min of those two ids
    / we then group by id and assign the id to be the min of the primary ids (p)
    f:{[t]update id:min p by id from update p:min id by n from t};
    /On each iteration of this function we will merge some neighborhoods, 
    / we repeat until we can't merge anymore
	f over t}

The uncommented code is much shorter and looks like this:

rr2:{[m] 
c:count m;m:raze m; runs:{(where 1<>-2-': y) _ x};
n:raze runs'[j;(j:i group (i:where m) div c) mod c];
n,:raze runs'[j;(j:k group (k:(raze flip (c;c)#til c*c) inter i) mod c) div c];
t:ungroup `id xkey update id:i from ([]n);
f:{[t]update id:min p by id from update p:min id by n from t};
f over t}

At this point we are done and we have solved the puzzle in two slightly different ways. Looking at the performance of both methods reveals that they are both essentially linear in the number of elements. However, the row method is actually faster, because everything is done as nice straight rows, so there is not much jumping around; in other words it is more cpu cache friendly. On my late 2011 Mac, for a 128 by 128 matrix:
\t rr m
176 milliseconds
\t ff m
261 milliseconds

So the row method is roughly 1.5x faster.

As a bonus, my father suggested that I expand this code into n dimensions. This is rather easy to do and I include the code here without too many comments. The trick is taking advantage of KDB’s vs (vector from scalar) function that can rewrite an integer as a number in another base:
so for example 3 vs 8 => (2;2) which are the coordinates of the point 8 in a 3 by 3 matrix. The point 25 is 2 2 1, in other words it’s in the last layer of a 3 dimensional cube of size 3, last column and first row. 
This means we can easily go back and forth between integer representation of the matrix and coordinate representation. Everything else is pretty much the same. 

ffN:{[m]
c:count m;r:1; while[(count m:raze m)<>count[m];r+:1]; t:([id:()]n:());
gniN:{[r;c;n]a:neg[a],a:a where 1=sum each a:(cross/)(r;2)#01b;
distinct raze -1 _ flip c sv flip (c-1)&0|a +\: c vs n,-1+prd r#c}[r;c];
f:{[gni;i;j](),i,j inter gni[i] except i}[gniN;;where m];
ps:where m;
while[count ps;t:t upsert (first ps;k:f over first ps); ps:ps except k];
/functional version is slower
/t:{[f;t;j]q+::1;$[not j in raze exec n from t;t upsert (j;f over j);t]}[f]/[t;where m]
t}

rrN:{[m]
c:count m;r:1; while[(count m:raze m)<>count[m];r+:1]; runs:{(where 1<>-2-': y) _ x};
j:(c sv (til r) rotate\: c vs til prd r#c) inter\: where m;
g:til[r] except/: d:reverse til r;
n:raze raze each runs''[j@'k;@'[v;d]@'k:('[group;flip]) each @'[v:flip c vs j;g]];
t:ungroup `id xkey update id:i from ([]n);
f:{[t]update id:min p by id from update p:min id by n from t};
f over t}


Controlling the Tower

I recently came across this problem and it looked interesting enough, so I solved it. It’s from Topcoder, which is nice because they provide test cases so you can be sure that you have solved it.

Unfortunately, Topcoder has draconian copyright laws, so I will paraphrase the problem in case the link to the description dies. You want to control some territory; the territory has t (an integer) towers and each tower is a certain number of stories s (an integer). To control the territory you must control the majority of the active towers. A tower is active if someone controls the majority of stories in a tower. If a tower has no majority controller, it is inactive and does not count. What is the minimum number of stories you must win to guarantee control of the territory?

The first thing that helped me was converting the test cases into the following two lists and created a table:

inputs:((47);(9;9);(1;1;1;1;1;1);(2;2;2;2;2;2);(1;2;3;4;5;6;7;8;9);(1;2;3;4;5;6;7;8;9;10);(1;1;100;100;100;100;200;200);(2);(999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999;999);(2;3;9);(1);(3);(4);(5);(6);(1;1);(1;2);(2;2);(1;3);(1;4);(2;3);(2;4);(3;3);(100;100);(1;1;1);(1;2;1);(2;2;1);(2;2;2);(10;100;900);(9;99;999);(2;8;4);(2;2;2;2);(2;2;2;2;4);(784;451;424;422;224;478;509;380;174;797;340;513;842;233;527;342;654;816;637);(342;574;20;606;413;479;51;630;802;257;16;925;938;951;764;393;766;414;764;208;959;222;85;526;104;848;53;638;969;214;806;318;461;950;768;200;972;746;171;463);(877;587;587;325;528;5;500;714;548;688;243;5;508;783;561;659;297;44;70;131;329;621;374;268;3;113;782;295;902);(159;797;323;629;647;941;75;226);(528;259;717;920;633;957;712;966;216;812;362;146;472;462;140;743;672;512;232;16;186;721;171;831;641;446;183;483;198;55;610;489;884;396;344;499;368;618;812;211;771;449;289);(608;405;630;751;288;836))
results:(24;14;4;7;36;45;699;2;37451;12;1;2;3;3;4;2;3;3;4;5;4;5;5;150;2;3;4;4;955;1053;11;5;8;7806;18330;10838;3404;18074;2866)
t:([]inputs;results)

q)select results, inputs from asc t  / just to give a visual sample

results inputs
———————————–
1      1
2      2
2      3
3      4
3      5
4      6
24     47
2      1 1
2      1 1 1
4      1 1 1 1 1 1
699  1 1 100 100 100 100 200 200

That way, I could write a function and start testing and seeing if I was making progress.

If there is only one tower, then I obviously need to take it, which means I need to take the majority.

So that is pretty easy:

f:{[x] if[count[x]=1; :sum 1+floor x%2]; :0} /one tower,
/ otherwise return 0; this is wrong but we want to return a number

I can run my function and see which cases I still haven’t solved:

q) select myResult, results, inputs from (update myResult:f each inputs from t ) where myResult<>results

This gave me a way to check if I was making progress.

Next, we know that we need to win the majority of the towers and we need to win them in the most expensive way possible in order to guarantee a victory, because we don’t get to choose how our stories are distributed. So we sort the towers descending by height and fill up half of them. Then comes the tricky part: the opposition can have a simple majority in all the other towers. In that case, there will be one final tower that decides everything.

For example, suppose there are 5 towers, 6 6 5 5 4. We need to win three towers. So we win the first 2 by getting 12 stories. Then we only need one more tower but the opposition can strategically win the shortest towers. It’s important to note that winning a 4 story tower and a 5 story tower takes 3 either way. So they can do better by winning the two 5 story towers and simply stop me from winning the 4 story tower by taking 2 stories.

This explains the following slightly convoluted algorithm:

f:{
if[count[x]=1;:sum 1+floor x%2]; /one tower
t:floor count[x]%2; / majority of towers, one less if there are an odd number
v:sum t#x:desc x; /most expensive way to win t towers
r:reverse (t-count[x])#x; /remaining towers
o:sum 1+floor %[;2] t#r; /opposition needs to take at least t towers
o+:sum floor 0.5+%[;2] t _ r; /opposition needs to prevent the rest
o-:(any(t _ r)mod 2)*(any not(t#r)mod 2) ;
/subtract if opposition wins during preventing
/ and you can strategically exchange an odd tower with an even one
v+1+sum[r]-o
};
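For comparison, here is a Python sketch of the same algorithm; it mirrors the q function above and matches the example cases, but treat it as a sketch rather than a definitive implementation.

import math

def min_stories(towers):
    if len(towers) == 1:
        return towers[0] // 2 + 1                       # simple majority of the only tower
    t = len(towers) // 2                                # number of towers we take outright
    x = sorted(towers, reverse=True)
    v = sum(x[:t])                                      # most expensive way to win t towers
    r = x[t:][::-1]                                     # remaining towers, shortest first
    o = sum(h // 2 + 1 for h in r[:t])                  # opposition wins t of the remaining
    o += sum(math.ceil(h / 2) for h in r[t:])           # and merely blocks the rest
    if any(h % 2 for h in r[t:]) and any(h % 2 == 0 for h in r[:t]):
        o -= 1                                          # exchange an odd tower for an even one
    return v + 1 + sum(r) - o

min_stories([2, 3, 9])  # -> 12, matching the test data above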

If anyone has a simpler, more straightforward approach, or wants to tell me how my approach is wrong, I would love to hear about it.

A Better Way To Load Data into Microsoft SQL Server from Pandas

This post is not going to have much theory.

Here was my problem. Python and Pandas are excellent tools for munging data, but if you want to store it long term a DataFrame is not the solution, especially if you need to do reporting. That’s why Edgar Codd discovered, and Michael Stonebraker implemented, relational databases.

Obviously, if I had the choice I wouldn’t be using Microsoft SQL Server [MSS]. If I could, I would do this whole exercise in KDB, but KDB is expensive and not that good at storing long strings. Other relational databases might have better integration with Python, but at an enterprise MSS is the standard, and it supports all sorts of reporting.

So my task was to load a bunch of data about twenty thousand rows — in the long term we were going to load one hundred thousand rows an hour — into MSS.

Pandas is an amazing library built on top of numpy, a pretty fast C implementation of arrays.

Pandas has a built-in to_sql method which allows anyone with a pyodbc engine to send their DataFrame into sql. Unfortunately, this method is really slow. It creates a transaction for every row. This means that every insert locks the table. This leads to poor performance (I got about 25 records a second).

So I thought I would just use the pyodbc driver directly. After all, it has a special method for inserting many values called executemany. The MSS implementation of pyodbc’s executemany also creates a transaction per row. So does pymssql.
I looked on stack overflow, but they pretty much recommended using bulk insert, which is still the fastest way to copy data into MSS. But it has some serious drawbacks.

For one, bulk insert needs to have a way to access the created flat file. It works best if that access path is actually a local disk and not a network drive. It is also a very primitive tool, so there is very little that you can do in terms of rollback and it’s not easily monitored. Lastly, transferring flat files means that you are doing data munging, writing to disk, copying to another remote disk and then putting the data back in memory. It might be the fastest method, but all those operations have overhead and it creates a fragile pipeline.

So if I can’t do bulk insert, and I can’t use a library, I have to roll my own.

It’s actually pretty simple:


You create your connection, I did this with sqlalchemy but you can use whatever:


import sqlalchemy
import urllib
cxn_str = "DRIVER={SQL Server Native Client 11.0};SERVER=server,port;DATABASE=mydb;UID=user;PWD=pwd"
params = urllib.quote_plus(cxn_str)
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
conn = engine.connect().connection
cursor = conn.cursor()

You take your df.
I’m going to assume that all of your values in the df are strings; if they are not, and you have longs, you are going to need to do some munging to make sure that when we stringify the values sql knows what’s up.
(So if you have a long, then you need to chop the L off, because MSS doesn’t know that 12345678910111234L is a number. Here is the regex to get rid of it: re.sub(r"(?<=\d)L","",s); credit to my coworker who is a regex wiz.)

We need to tuplify our records. Wes McKinney says:

records = [tuple(x) for x in df.values]

Which is good enough for me. But we also need to stringify them.

So that becomes:

records = [str(tuple(x)) for x in df.values]

MSS has a batch insert mode that supports up to 1000 rows at a time, which means 1000 rows per transaction.

We need the sql script that does batch insertion: we’ll call that insert_

insert_ = """

INSERT INTO mytable
(col1
,col2
,col3
...)
VALUES

"""

Since we can only insert 1000 rows at a time, we need to break up the rows into 1000 row batches. So here is a way of batching the records.
My favorite solution comes from here.

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in xrange(0, len(seq), size))

Now all we need to do is have a way of creating the rows and batching.
Which becomes:

for batch in chunker(records, 1000):
    rows = ','.join(batch)
    insert_rows = insert_ + rows
    cursor.execute(insert_rows)
    conn.commit()

That’s it.
This solution got me inserting around 1300 rows a second. A 1.5 orders of magnitude increase.

There are further improvements for sure, but this will easily get me past a hundred thousand rows an hour.

Dosage Search [Part 2]

Since publishing the first post on this topic, I have researched this question further.

First I need to acknowledge that my CS friends have all pointed out the similarity between this question and the two egg puzzle.

You are given two eggs and placed in a 100 floor building. You need to find the highest floor where the egg could survive a drop.

If you only have one egg, then you obviously just go up one floor at a time. Eventually the egg breaks or you reach floor 100. In the first case, the floor below is the answer. In the second, it is floor 100, since there are no higher floors.

If you have 10 eggs. You can just do binary search and find the right floor very quickly.

Since you have two eggs you can do better than one egg. The post I linked to earlier has a solution that tries to ensure the same number of drops as the worst case of binary search. Another approach is to drop the first egg on 2^n floor. So the first time would be from floor 2 then from floor 4, then from floor 8 and that way quickly climb up the building. This is similar to how network connections back off when a service is not responding. Retrying less frequently as the number of retries goes up.  (This allows the service to restart without falling over from the bombardment of ever more frequent requests)

As similar as this egg approach is, it fails to capture something that I think is essential to the dosage question: there is a difference between making a mistake by dropping the egg fifty floors too high and making a mistake by one floor, even if during the search you can’t tell (as that would be an obvious way to find the result in two tries).

So, I ran the code from the previous post and got some results for lists that had up to 1000 elements. However, python started to take too long for larger lists.

This prompted me to rewrite the functions into q. q and its sibling k are obscure languages outside the financial services industry. q was created by a genius, Arthur Whitney, whose interview speaks for itself. The language itself is a derivative of APL and Lisp with a few more good ideas thrown in for good measure. (More on this language in a different post.)

Anyway the same algorithm as before which in python took 30+ hours ran in 20 minutes in q.

The relative speed of the q code allowed me to see a pattern for the divisor: as the list grows longer, the divisor that we use to split the list should get larger, albeit slowly.

Data (More at the bottom)
Number of elements    Divisor    Average cost
50                    5          7.78
100                   6          11.48
500                   13         27.128
1000                  17         38.85
4000                  31         78.87175
5000                  35         88.3382
10000                 49         125.4369


Since the divisor wasn’t growing for lists that were similar in size, I realized the mistake I made by keeping the divisor constant. Obviously as the list gets shorter we need to update the divisor that we are using to divide the list, since the problem has been reduced to searching a smaller list.


This led me to create another type of binary search with an updating divisor. The divisor grows proportional to log(n)^3.

This also increased the speed of the search since I was no longer doing linear search on any list smaller than the divisor I was testing. To explain: if you have a static divisor, then when you start you have a list of 1 million elements. So you divide by 200 and you know if the item you are searching for is in the first range [0:5,000) or in the second range (5000: 1,000,000]. However, gradually the list gets smaller, but you keep dividing the list in this very uneven way, so that when the number of elements is less than 200, you keep looking at the first element. This is equivalent to linear search.

If instead we start to divide the list by smaller divisors, then we can get closer to binary search and since much of the list is eliminated our chances of making a huge mistake are also smaller.

Here is the q code with the two versions of search: (this might be unreadable to most people but I plan on having a tutorial series on k, q, kdb soon)

dif_cost: {:x-y};
trivial_cost: {x+y; :1};
scalar_cost: {x+y; :100};
/binary search with cost function
st: {[l;item;cf;d] cost:0; while [((count l) > 1); m: (count l) div d; $[l[m]<item; [cost+:1; l: (m+1)_l]; l[m]>item; [cost +: (cf[l[m]; item]); l:m#l]; :cost+:1];]; cost};
/calculates average cost for a particular divisor and n, by searching for each element 
/then avg all those costs
st_avg: {[n; d] i: til n; res: @[i; i; st[i; ; dif_cost; d]]; avg res };
 
/iterates over divisors only going from 2 to 10 *( floor (log (n)))
/example for 100 it goes from 2 3 4 … 37 38 39 
/ this covers the divisors that are relevant and minimizes the cost and removes unneeded computations
st_div: {[n] d: $[n<50; 2_til n; 2_til 10 * floor log n]; res: st_avg[n] each d; d[first where res = min res],min res}


/Algorithm with updating divisor

s_auto: {[l;item;cf;f] cost:0; while [((count l) > 1); d: max (2, floor (log (count l) xexp f) ); m: (count l) div d; $[l[m]<item; [cost+:1; l: (m+1)_l]; l[m]>item; [cost +: (cf[l[m]; item]); l:m#l]; :cost+:1];]; cost};

/Then we can minimize over f, which is the power to which we raise log N.
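As a rough cross-check, here is a Python sketch of an updating-divisor search in the same spirit; it follows the log(n)^f growth described above and the cost conventions from Part 1 (a flat dollar when guessing low, cost_func when guessing high), but it is illustrative rather than a faithful port of the q code.

import math

def adaptive_search(alist, item, cost_func, f=3):
    lo, hi = 0, len(alist)                         # current search window [lo, hi)
    cost = 0
    while hi - lo > 1:
        n = hi - lo
        d = max(2, int(math.log(n) ** f))          # divisor grows slowly with window size
        mid = lo + n // d
        if alist[mid] < item:
            cost += 1                              # guessed low: flat cost of 1
            lo = mid + 1
        elif alist[mid] > item:
            cost += cost_func(alist[mid], item)    # guessed high: pay the overshoot
            hi = mid
        else:
            return cost + 1
    return cost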


Data continued:

num d cost
50 5 7.78
100 6 11.48
150 7 14.32667
200 8 16.755
250 9 18.768
300 10 20.72667
350 11 22.49143
400 11 24.15
450 12 25.66
500 13 27.128
550 13 28.51455
600 13 29.85833
650 14 31.12
700 14 32.34143
750 14 33.50267
800 15 34.62875
850 15 35.75529
900 15 36.84667
950 16 37.84526
1000 17 38.85
1050 17 39.83143
1100 17 40.78818
1150 17 41.75652
1200 17 42.7125
1250 19 43.6
1300 19 44.46538
1350 19 45.35333
1400 19 46.18
1450 19 47.02966
1500 19 47.84467
1550 19 48.68581
1600 20 49.46
1650 21 50.25697
1700 21 51.01176
1750 21 51.79314
1800 21 52.54167
1850 21 53.27514
1900 22 54.02211
1950 22 54.73641
2000 22 55.4305
2050 23 56.16927
2100 23 56.82714
2150 23 57.52884
2200 23 58.20273
2250 23 58.88222
2300 23 59.55609
2350 25 60.20553
2400 25 60.85167
2450 25 61.47714
2500 25 62.0944
2550 25 62.74235
2600 25 63.33962
2650 25 64.00113
2700 25 64.59259
2750 26 65.21273
2800 27 65.84
2850 27 66.39614
2900 27 67.0331
2950 27 67.58475
3000 27 68.20133
3050 27 68.73803
3100 27 69.29355
3150 28 69.88635
3200 27 70.44688
3250 28 71.00954
3300 28 71.56636
3350 29 72.11164
3400 29 72.64853
3450 29 73.16986
3500 29 73.70571
3550 30 74.30901
3600 29 74.79778
3650 29 75.31123
3700 31 75.84622
3750 31 76.37627
3800 31 76.87974
3850 31 77.36545
3900 31 77.85179
3950 31 78.39165
4000 31 78.87175
4050 31 79.37975
4100 31 79.88659
4150 31 80.37084
4200 31 80.87167
4250 32 81.35765
4300 31 81.8307
4350 33 82.3223
4400 33 82.81409
4450 32 83.28472
4500 33 83.76067
4550 33 84.21473
4600 33 84.69391
4650 33 85.15935
4700 33 85.60809
4750 33 86.07411
4800 33 86.555
4850 33 86.98887
4900 33 87.45633
4950 35 87.92222
5000 35 88.3382
10000 49 125.4369


Dosage Search [Part 1]

Sometimes, you come across problems that you think should be well studied but aren’t.

(please correct me if I’m wrong)

Medicine is famous for not having good cross pollination with the maths. See this famous paper which rediscovers the Riemann sum.

However, as bad as medicine is at noticing math, most disciplines are especially bad at adapting the study of algorithms, one of the most theoretical branches of Computer Science. This is true even for seasoned programmers.

There is a reason for this. Brute force works remarkably well. In fact, so well that Ken Thompson, author of the Unix operating system, said:

“When in doubt, use brute force.” – Ken Thompson

There are some geniuses in the field, notably  Don Knuth. In Coders At Work some of  the best Computer Scientists admit to reading little to none of his series The Art of Computer Programming, saying the math is too difficult.

So, it’s not surprising that medical dosages are done with brute force. There is a method for picking the starting point, called median effective dosage which produces a result in 50% of the population.

However, doctors are averse to either overdosing or underdosing depending on the condition. They tend to start at some point and then proceed straight through the search space until they get to the right level. This algorithm is called linear search. It is brute force and it is guaranteed to find the right level should it exist.

There are much better ways to do a search assuming a sorted list. Let’s create a game to illustrate this:

Suppose I choose a number between 1 and 100 and you have to guess which number I chose. Every time you guess, you pay me a dollar and I’ll tell you if you are above or below. If you guess correctly I pay you 10 dollars. Clearly, you are not going to start at 1 and just keep guessing up to 100. Instead you can start at 50 and then eliminate half the range. If you keep halving what’s left you can find any number in 7 tries, thus earning 3 dollars of profit.

The process of halving the interval is called binary search and it works great when the cost of guessing too high or too low is the same. In this case you paid a dollar for each guess. Suppose instead that you paid 2 dollars if you guess above and only 1 dollar if you guess below. What strategy should you pursue?

What if you pay the difference between the guess and the true answer when you guess above, and 1 dollar if it’s below? Obviously, you would pay this at the end of the game so as not to reveal the answer immediately. What does the reward have to be to make this game fair, and what is the optimal strategy? In effect, this is the game doctors play when they are prescribing a dosage. Every guess costs the patient time and health, but there is a higher cost to over prescribing or under prescribing. And this asymmetry can be captured in such a game.

A possible solution is to divide the interval into 4 and then guess that number. Keep doing that and you are guaranteed to get an answer as well and you’ll also guess under about 3/4 of the time. The right way to sub divide the interval and guess depends on the cost of each guess and the cost of guessing over. Is there an answer? Sure. But I haven’t seen a medical journal that has researched this question and so we do brute force, and that’s a problem.

Below is python code that simulates these questions:

def linear_search(alist, item, cost_func):
	first = 0
	last = len(alist)-1
	found = False
	cost = 0
	while first<=last and not found:
		current = first
		if alist[current] == item:
			found = True
		else:
			if item < alist[current]:
				cost = cost + cost_func(alist[current],item)
			else:
				cost = cost + 1
				first = current+1
	return cost+1

def binary_search(alist, item, cost_func, cost=0, divisor=2):
	blist = list(alist)
	if len(blist) == 0:
		return cost
	else:
		midpoint = len(blist)//divisor
		if blist[midpoint]==item:
			return cost + 1
		else:
			if item<blist[midpoint]:
				c = cost + cost_func(blist[midpoint],item)
				return binary_search(blist[:midpoint],item, cost_func, c, divisor)
			else:
				c = cost + 1
				return binary_search(blist[midpoint+1:],item, cost_func, c, divisor)

def binary_q_search(alist, item, cost_func, cost=0, divisor=4):
	blist = list(alist)
	if len(blist) == 0:
		return cost
	else:
		midpoint = len(blist)//divisor
		if blist[midpoint]==item:
			return cost + 1
		else:
			if item<blist[midpoint]:
				c = cost + cost_func(blist[midpoint],item)
				return binary_q_search(blist[:midpoint],item, cost_func, c, divisor)
			else:
				c = cost + 1
				return binary_q_search(blist[midpoint+1:],item, cost_func, c, divisor)

def trivial_cost_func(guess, true):
	return 1

def scalar_cost_func(guess, true):
	return 100

def dif_cost_func(guess, true):
	return (guess-true)

lin = []
binar = []
quar = []
cost_func = scalar_cost_func
a = 1000
for i in xrange(0,a):
	lin.append(linear_search(xrange(0,a),i, cost_func))
	binar.append(binary_search(xrange(0,a),i, cost_func))
	quar.append(binary_q_search(xrange(0,a),i, cost_func, divisor=4))

print "The average cost for linear search: " + str(sum(lin)/float(a))
print "The average cost for binary search: " + str(sum(binar)/float(a))
print "The average cost for quarter search: " + str(sum(quar)/float(a))

Continued in Part 2


St Petersburg Paradox:

I learned about this paradox recently through an interview question.

The paradox comes from the following game (more at the wiki):

The casino flips a coin until it gets heads. The payout doubles with every tail flipped before that heads (n tails pays 2^n dollars). How should they price this game?


Having never seen this question, I proceeded to take the expectation, that is, the weighted average of the payouts. Well, the answer is infinity. This can be seen intuitively from the fact that the payout grows exponentially at the same rate as the probability of the event diminishes.

The paradox is that no one will place huge sums of money to play this game, even though theoretically, they should – since it has an unlimited expected payoff.

The variance of this game is also infinite, which stems from the fact that each individual payout is finite but the expectation is infinite: Var(X) = E[(X-E(X))^2].

There are several resolutions:

A utility function approach says we need to discount the payoffs by a utility function. In other words, the first million dollars brings me more utility than each subsequent million. However, if the utility function is not bounded, you can simply make the payouts grow faster and recover the paradox for any unbounded utility function.

Another approach suggests that people discount the probability of negligible events. However, lottery ticket sales undermine this argument, since those seem to be low probability events that people pay too much for. However, this counter argument neglects to mention that certain events are discounted completely if they are below a certain probability. As an example, take the chance that all the air will decide to stay in one side of my tire: nobody will pay any amount of money for that event. Same goes for the law of gravity ceasing to apply; there is some negligible probability for that event, but no matter how large the payoff no one will buy that ticket.

An experimental approach suggests that you should pay around 10 dollars; that is roughly what the game averages out to having played it 2048 times.
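A quick Monte Carlo sketch of that experiment, assuming the doubling payout described above (n tails pays 2^n dollars):

import random

def play_once():
    tails = 0
    while random.random() < 0.5:   # keep flipping until heads comes up
        tails += 1
    return 2 ** tails              # the payout doubles with every tail

n = 2048
print(sum(play_once() for _ in range(n)) / n)   # average payout over n games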

Another approach suggests that you would need the casino to be good for an infinite sum of cash and since they aren’t no one would place that money.

A combination of utility and the previous reason, gives an answer that is close to the experimental result. Suppose you would be willing to play and suppose it takes a minute to flip a coin. You have a magic genie that has guaranteed that the coin will flip tails until you say heads after which it will be heads. How long will you play?

Most likely, you will stop after you are the richest person in the world, and that will only take about an hour. After that you apparently have more money than the value of earth by three orders of magnitude. If you discount to that best case scenario, you get no paradox at all; in fact the most you would then pay is 1/4*60 = 15 dollars. If you understand that a casino can’t possibly guarantee more than a billion in winnings, that brings the number of flips down to 29.8973529 ~ 30, which says it’s closer to 7.5 dollars. If you are playing against another person you can’t expect more than a million dollar payout, so you shouldn’t bet more than 1/4*20, unless you are playing against Harry Kakavas.

One last way to think about this problem: what will the game itself typically look like? The expected total number of flips is 2, so you expect about one tail; half the time you get none at all and win just 1 dollar. So perhaps you should simply pay about 1 dollar, because that’s how the game will usually play out. This is the median value of the game and is in fact what many people bet.

The fact is, the more money you have the more valuable your time becomes, and it makes less and less sense to keep playing the game. So you will tell your genie to stop, because now that you have earned all this money you want to go out and spend it.