K for Good Thought: [Part 1] k,q,kdb tutorial atoms, nouns

July 29, 2016December 1, 2016 / pindash91 / 2 Comments

I am a new convert to the K/Q/J/APL cult. This makes me suited to write a tutorial for those who find this whole family of languages quite arcane.

The tutorial is called K for Good Thought, an homage to Ken Iverson’s notation as a tool for thought and the tutorials ‘learn x for great good’ meme. Leslie Lamport claims that until you have expressed your thoughts in writing you haven’t really thought. So there is something about being able to express yourself in K that allows for another way of thinking. Often the K solutions tend to be simpler and more general. Also, because K is a computer language interpreted by a machine that does exactly what you told it to do, it’s a much stricter reader of your thinking.

I should start by saying that this is a language that values brevity over almost everything. As Arthur Whitney said in an interview. If you can do the same thing and it’s shorter and not much slower he wants to hear about it. If it’s faster you better tell him.

Arthur Whitney is a man of few words and when a subject is very difficult and the explanation is dense and technical, few will march on. So with that, I aim to make this tutorial approachable for those who have had high school math and no programming experience. If I venture out into jargon or CS concepts or high level math, I will try to explain the concepts first before diving into the syntax, though of course sometimes one will motivate the other. I believe that the less programming background you have the more likely the concepts will become natural to you since you won’t need to unlearn many bad habits.

There are many tutorials that give the history of these languages. So I am not going to bother, except to say that apl was developed to allow programmers to work on the problem at hand and not on the low level implementation, though of course inevitably you will need to get your hands dirty with the inner workings of K as well, but those tend to be advanced topics. –quick credit to all the abridged and short materials by Arthur Whitney [AW] (K, Q, KDB)

Let us start at the beginning:

In the beginning, AW created atoms and they were good.

Atoms are indivisible.
Atoms can be of different types.

I promised not to use jargon, so when I do. You will find an explanation in a quote box like this:
Types are things that allow us to distinguish between different kinds of data. That is, an Atom can be a number or letter or a timestamp. Not everything that you can think of is a type. For instance, perhaps you want to have an emotion as a type and you would use various emojis to represent them.
[AW] has not seen it fit to add them to the types that K understands. However, you can create an emotional type. We will look at that later.

Some example atoms:

5
1b
‘string of letters’
`symbol
2009.11.05D15:21:10.040666000

We cannot do much with atoms, just like we cannot really say much if we just had nouns. I guess we could classify different Nouns: person, place or thing.

K uses @: to interrogate and find out what something is. In q we can use ‘type’

Because K is terse it uses numbers to signal what type something is.

A negative sign in front means that the type is an atom.

@:5 = -7 An integer between negative 2^63 and positive 2^ 63
@:1b = -1 Boolean: Can be (1b) True or (0b) False
@: ‘s’ = -10 A charecter or letter
@:`symbol = -11 A token (advanced) used to refer to one copy
@:2009.11.05D15:21:10.040666000 = -12 A Timestamp

Besides classifying we can make lists:

In English:

Animals: cat, dog, mouse

And in K we can do something similar.

integers: 1 2 3 4 5

animals: `cat `dog `mouse
alpha: (“a”;”b”;”c”)

Here something unexpected happens. While ‘animals’ is a list of symbols, ‘alpha’ is actually a list of characters which is what programmers call a string.

So alpha can also be written as “abc”. In fact, if you ask K to display alpha it will automatically display it as “abc” instead of “a”,”b”,”c”.

If we check what “abc” is

@:”abc” = 10

It’s positive. That’s because if the noun is actually a list then kdb will tell you what it’s a list of.

Oh by the way, I snuck in some notation. In English we often have a colon ‘:’ that tells us that a definition follows. Well K lets you do the same thing.

In K/Q all lists are semicolon(;) separated enclosed in parentheses. As a matter of convenience K /Q lets you define some common lists just by using spaces. So the following are equivalent.

(1;2;3) and 1 2 3

So alpha is defined as a list of characters specifically a, b, c.

Lists in K are always ordered, which means that we can talk about the first or last element in a list.

In fact K has a couple of things things that are designed specifically for lists.

First lets introduce the count (!) in K or the til in q

This allows us to create lists of integers until that number starting from 0.

For example !10

0 1 2 3 4 5 6 7 8 9

til10: !10

til10

0 1 2 3 4 5 6 7 8 9

til10[1] /the second item on list; list start with zeroth item

til10[9] /the tenth item on the list; remember to start counting from 0

Let’s say we want just first 3 numbers

3#til10 /the pound sign or hashtag means ‘take’

0 1 2

-3#til10 /this takes the last 3 slots. The negative sign signifies that we are taking from the end of the list.

7 8 9

3_til10 /The underscore(_) is used to drop. if we only want what’s not in the first 3 slots

4 5 6 7 8 9

(3_til10) = (-7#til10)

1111111b

There is an equivalence between take and drop. Dropping the first 3 is equivalent to taking the last 7, if your list has 10 things. The equals (=) sign is used to compare things each slot from each list will be compared and return a 0 or 1. 1 if it is equal and 0 if the two elements in the slots are not equal.

In general we can say given a list with n items dropping the first x items is equivalent to taking the last n-x items. In our example, we have 10 items so dropping 3 is the same as taking the last 7.

What about reversing the order of a list?

In K it’s ‘|:’ in Q it is just called ‘reverse’.

reverse 7_til10

9 8 7

Finally, we can make lists of lists. In English we call these tables. In math we call them matrices.

m: (1 2 3; 4 5 6)

1 2 3
4 5 6

In K ‘+:’ is flip and in Q it is ‘flip’

in K: “+:m” is the same as “flip m” in Q

+:m
(1 4; 2 5; 3 6)
1 4
2 5
3 6

m[0] Returns the first noun, which is a list of numbers.

1 2 3

Remembering the game battleship helps to see what is happening.

m[0][2]

BONUS SECTION

What if we wanted to make a list of lists of lists. We can! In math these are called tensors. In English these don’t really have a name, I think most people don’t have a use for them.

However, we all encounter these structures in the form of pages. Let me explain, If a line in a book is a list of words (or as k would have characters), then a page in a book is a list of lines and a collection of pages, a book, is a list of lists of lists. One way to reference a word in a book would be to tell you the page number (dimension 1), the line number (dimension 2), and the word number (dimension 3). To make this clear, lets do a pretty simple example. Let’s say that we wanted to write a math textbook and the first couple of pages were reference tables. The first table might be the addition table. The second table could be multiplication.

Let’s first build the addition table.

in K:

a:!10;add:(a+)’a

in Q:

a: til 10; add:(a+) each a;

add is now equal to

0	1	2	3	4	5	6	7	8	9
1	2	3	4	5	6	7	8	9	10
2	3	4	5	6	7	8	9	10	11
3	4	5	6	7	8	9	10	11	12
4	5	6	7	8	9	10	11	12	13
5	6	7	8	9	10	11	12	13	14
6	7	8	9	10	11	12	13	14	15
7	8	9	10	11	12	13	14	15	16
8	9	10	11	12	13	14	15	16	17
9	10	11	12	13	14	15	16	17	18

We can see that indexing add[3;2] will give us the cell containing 5, which is what we would expect. Cell add[2;3] will also equal 5, but I highlighted add[3;2], notice row column order. A good way to remember this is to think of a regular list as a vertical structure. If the list has lists inside, then to get to a particular row and column, we need to first go down the first list and then enter the second level list and go to the right. So the top level is always a row number. Basically, the first index, indexes the first level list. A deeper list will be a deeper index. Page number, line number, word number.

Let’s now create the mltp table.

in K:

mltp:(a*)’a

in Q:

mltp:(a*) each a

mltp

0	0	0	0	0	0	0	0	0
1	2	3	4	5	6	7	8	9
2	4	6	8	10	12	14	16	18
3	6	9	12	15	18	21	24	27
4	8	12	16	20	24	28	32	36
5	10	15	20	25	30	35	40	45
6	12	18	24	30	36	42	48	54
7	14	21	28	35	42	49	56	63
8	16	24	32	40	48	56	64	72
9	18	27	36	45	54	63	72	81

So our reference tables can be a list of tables:

reftab:(add;mltp)

reftab[0] is the first table or the addition table.

reftab[1] is the second table or the multiplication table

I leave creating a subtraction table in reftab[2] to the reader. (hint use the comma(,) to join a subtraction table to reftab. Like so reftab:reftab,subtr;)

You can make a division table as well, you will see how K/Q handles division by zero.

Now let’s say you wanted to play 3 dimensional battleship.

Each player could either attack the surface or the submarine layer. Suppose in this really simple game you had two ships, each 2 units long. and there were only 6 squares. 2*3 board.

Here is a sample board:

Layer 0 or surface:

0 0 1

Layer 1 or submarine level

1 1 0

0 0 0

I have a submarine and a boat. Both are two units long. We need three coordinates to track down a single location.

First let’s make the surface layer

s: 0 0 1

sur: (s; s)

Next we can make the submarine layer

easy way

sub: (1 1 0; 0 0 0)

fancy way

sub: 2#|:+:sur,enlist 0 0 0 /Don’t worry I’ll explain.

sur, enlist 0 0 0 /gives us a three by three table. Note the comma used to append a list to a list.

(0 0 1;0 0 1;0 0 0)

+: sur, enlist 0 0 0 /flips the table on it’s side, sometimes called transpose

(0 0 0;0 0 0;1 1 0)

|: +: sur, enlist 0 0 0 /reverses the flipped table only reverses the top level list

(1 1 0;0 0 0;0 0 0)

2# |: +: sur, enlist 0 0 0 /takes the first 2 rows of the flipped table

(1 1 0;0 0 0)

Now we can make the whole board

b: (sur; sub)

((0 0 1;0 0 1);(1 1 0;0 0 0))

b[0]

(0 0 1;0 0 1)

b[0][1]

0 0 1

b[0][1][2]

And obviously we can make lists of lists of lists of lists…. and everything will work the same. But I’m stopping here.

A Better Way To Load Data into Microsoft SQL Server from Pandas

July 13, 2016July 29, 2016 / pindash91 / 18 Comments

This post is not going to have much theory.

Here was my problem. Python and Pandas are excellent tools for munging data but if you want to store it long term a DataFrame is not the solution, especially if you need to do reporting. That’s why Edgar Codd discovered, and Michael Stonebreaker implemented, relational databases.

Obviously, if I had the choice I wouldn’t be using Microsoft SQL Server [MSS]. If I could, I would do this whole exercise in KDB, but KDB is expensive and not that good at storing long strings. Other relational databases might have better integration with Python, but at an enterprise MSS is the standard, and it supports all sorts of reporting.

So my task was to load a bunch of data about twenty thousand rows — in the long term we were going to load one hundred thousand rows an hour — into MSS.

Pandas is an amazing library built on top of numpy, a pretty fast C implementation of arrays.

Pandas has a built-in to_sql
method which allows anyone with a pyodbc engine to send their DataFrame into sql. Unfortunately, this method is really slow. It creates a transaction for every row. This means that every insert locks the table. This leads to poor performance (I got about 25 records a second.)

So I thought I would just use the pyodbc driver directly. After all, it has a special method for inserting many values called executemany. The MSS implementation of the pyodbc execute many also creates a transaction per row. So does pymssql.
I looked on stack overflow, but they pretty much recommended using bulk insert.Which is still the fastest way to copy data into MSS. But it has some serious drawbacks.

For one, bulk insert needs to have a way to access the created flat file. It works best if that access path is actually a local disk and not a network drive. It also is a very primitive tool, so there is very little that you can do in terms of rollback and it’s not easily monitored. Lastly, transferring flat files, means that you are doing data munging writing to disk, then copying to another remote disk then putting the data back in memory. It might be the fastest method, but all those operations have overhead and it creates a fragile pipeline.

So if I can’t do bulk insert, and I can’t use a library. I have to roll my own.

It’s actually pretty simple:

You create your connection, I did this with sqlalchemy but you can use whatever:


import sqlalchemy
import urllib
cxn_str = &quot;DRIVER={SQL Server Native Client 11.0};SERVER=server,port;DATABASE=mydb;UID=user;PWD=pwd&quot;
params = urllib.quote_plus(cxn_str)
engine = sqlalchemy.create_engine(&quot;mssql+pyodbc:///?odbc_connect=%s&quot; % params)
conn = engine.connect().connection
cursor = conn.cursor()

You take your df.
I’m going to assume that all of your values in the df are strings, if they are not.
And you have longs, you are going to need to do some munging to make sure that when we stringfy the values sql knows what’s up.
(so if you have a long, then you need to chop the L off because MSS doesn’t know that 12345678910111234L is a number. here is the regex to get rid of it: re.sub(r”(?<=\d)L”,”,s ) credit to my coworker who is a regex wiz.

We need to tuplify our records, Wes Mckinney says:

records = [tuple(x) for x in df.values]

Which is good enough for me. But we also need to stringfy them as well.

So that becomes:

records = [str(tuple(x)) for x in df.values]

MSS has a batch insert mode that supports up to 1000 rows at a time. Which means 1000 rows per transaction.

We need the sql script that does batch insertion: we’ll call that insert_

insert_ = &quot;&quot;&quot;

INSERT INTO mytable
(col1
,col2
,col3
...)
VALUES

&quot;&quot;&quot;

Since we can only insert 1000 rows. We need to break up the rows into 1000 row batches. So here is a way of batching the records.
My favorite solution comes from here.

def chunker(seq, size):
return (seq[pos:pos + size] for pos in xrange(0, len(seq), size))

Now all we need to do is have a way of creating the rows and batching.
Which becomes:

for batch in chunker(records, 1000):
rows = ','.join(batch)
insert_rows = insert_ + rows
cursor.execute(insert_rows)
conn.commit()

That’s it.
This solution got me inserting around 1300 rows a second. A 1.5 orders of magnitude increase.

There are further improvements for sure, but this will easily get me past a hundred thousand rows an hour.

Kafka On the Shore: My Experiences Benchmarking Apache Kafka Part II

July 13, 2016July 13, 2016 / pindash91 / Leave a comment

This is part II of a series on Benchmarking Kafka Part I can be found here:

In the first part we used Spark to blast a 2gb csv file of 10 million rows into a three machine Kafka cluster. We got the speeds down to about 30 seconds. Which means it would take about 4 hours to blast a Terabyte. Which is fast, but not blazing fast.

Any number I put here will become obsolete within a year, perhaps sooner. Nevertheless, I’ll put myself out there. If on modest hardware we could achieve 1 terabyte in 40 minutes that would be enough I think to impress some people. which is about 400mb/s

Now again, because of Kafka’s memory flush cycle. We can only get the speed we want up to 8gb per machine. Really less, because there is some Ram usage by the os itself and any other applications running on those machines, including in my case Spark usage. So conservatively we can try and get 4gb per machine. At 400mb/s for two minutes straight.

Using some tricks, this kind of throughput can be accomplished on pretty modest hardware.

no replication
partitioning

Now the hard part is finding a machine gun that can fire those messages that fast. A distributed solution seems like the best move and replicates real world type of messaging many sources each blasting away messages.

So I fire up a spark instance and load in a large csv file of 10 million rows ~1.8gb. I re-partition the data set to take advantage of the number of cores available to me. And then I run the mapPartitions function, which allows each partition to independently of all others blast kafka with all of it’s messages, eliminating much of the overhead.

I then get a sustained message blast of about 200mb without the machine falling over.

[my interest in benchmarking kafka has been temporarily put to rest. I no longer have a kafka cluster at my disposal and looking at more local messaging solutions; both are zero broker:

Zero MQ

Aeron]

Dosage Search [Part 2]

July 13, 2016July 13, 2016 / pindash91 / 1 Comment

So since publishing the first post on this topic. I have researched this question.

First I need to acknowledge, that my CS friends have all pointed out the similarity between this question and the two egg puzzle.

You are given two eggs and placed in a 100 floor building. You need to find the highest floor where the egg could survive a drop.

If you only have one egg, then you obviously just go up one floor at a time. Eventually the egg breaks or you reach floor 100. In first case, the floor below is the answer. In the second, floor 100, since there are no higher floors.

If you have 10 eggs. You can just do binary search and find the right floor very quickly.

Since you have two eggs you can do better than one egg. The post I linked to earlier has a solution that tries to ensure the same number of drops as the worst case of binary search. Another approach is to drop the first egg on 2^n floor. So the first time would be from floor 2 then from floor 4, then from floor 8 and that way quickly climb up the building. This is similar to how network connections back off when a service is not responding. Retrying less frequently as the number of retries goes up. (This allows the service to restart without falling over from the bombardment of ever more frequent requests)

As similar as this egg approach is, it fails to capture something that I think is essential to the dosage question. That is that, there is a difference between making a mistake of dropping the egg the fifty floors too high and making a mistake by one floor. — Even if during the search you can’t tell, as that would be an obvious way to find the result in two tries.

So, I ran the code from the previous section and got some result for lists that had up to 1000 elements. However, python started to take too long for larger lists.

This prompted me to rewrite the functions into q. q and it’s sibling k, is an obscure language except to those in the financial services industry. It was created by a genius, Arthur Whitney, whose interview speaks for itself. The language itself is a derivative of APL and Lisp with a few more good ideas thrown in for good measure. (More on this language in a different post)

Anyway the same algorithm as before which in python took 30+ hours ran in 20 minutes in q.

The relative speed of the k code. Allowed me to see a pattern for the divisor. As the list grows longer, the divisor that we use to split the list should get larger, albeit slowly.

Data (More at the bottom)

Number of elements in list	divisor	Average cost
50	5	7.78
100	6	11.48
500	13	27.128
1000	17	38.85
4000	31	78.87175
5000	35	88.3382
10000	49	125.4369

Since the divisor wasn’t growing for lists that were similar in size, I realized the mistake I made by keeping the divisor constant. Obviously as the list gets shorter we need to update the divisor that we are using to divide the list, since the problem has been reduced to searching a smaller list.

This led me to create another type of binary search with an updating divisor. The divisor grows proportional to log(n)^3.

This also increased the speed of the search since I was no longer doing linear search on any list smaller than the divisor I was testing. To explain: if you have a static divisor, then when you start you have a list of 1 million elements. So you divide by 200 and you know if the item you are searching for is in the first range [0:5,000) or in the second range (5000: 1,000,000]. However, gradually the list gets smaller, but you keep dividing the list in this very uneven way, so that when the number of elements is less than 200, you keep looking at the first element. This is equivalent to linear search.

If instead we start to divide the list by smaller divisors, then we can get closer to binary search and since much of the list is eliminated our chances of making a huge mistake are also smaller.

Here is the q code with the two versions of search: (this might be unreadable to most people but I plan on having a tutorial series on k, q, kdb soon)

dif_cost: {:x-y};

trivial_cost: {x+y; :1};

scalar_cost: {x+y; :100};

/binary search with cost function

st: {[l;item;cf;d] cost:0; while [((count l) > 1); m: (count l) div d; $[l[m]<item; [cost+:1; l: (m+1)_l]; l[m]>item; [cost +: (cf[l[m]; item]); l:m#l]; :cost+:1];]; cost};

/calculates average cost for a particular divisor and n, by searching for each element

/then avg all those costs

st_avg: {[n; d] i: til n; res: @[i; i; st[i; ; dif_cost; d]]; avg res };

/iterates over divisors only going from 2 to 10 *( floor (log (n)))

/example for 100 it goes from 2 3 4 … 37 38 39

/ this covers the divisors that are relevant and minimizes the cost and removes unneeded computations

st_div: {[n] d: $[n<50; 2_til n; 2_til 10 * floor log n]; res: st_avg[n] each d; d[first where res = min res],min res}

/Algorithm with updating divisor

s_auto: {[l;item;cf, f] cost:0; while [((count l) > 1); d: max (2, floor (log (count l) xexp f) ); m: (count l) div d; $[l[m]<item; [cost+:1; l: (m+1)_l]; l[m]>item; [cost +: (cf[l[m]; item]); l:m#l]; :cost+:1];]; cost};

/Then we can minimize over f, which is the factor that we exponentiate the log N.

Data continued:

num	d	cost
50	5	7.78
100	6	11.48
150	7	14.32667
200	8	16.755
250	9	18.768
300	10	20.72667
350	11	22.49143
400	11	24.15
450	12	25.66
500	13	27.128
550	13	28.51455
600	13	29.85833
650	14	31.12
700	14	32.34143
750	14	33.50267
800	15	34.62875
850	15	35.75529
900	15	36.84667
950	16	37.84526
1000	17	38.85
1050	17	39.83143
1100	17	40.78818
1150	17	41.75652
1200	17	42.7125
1250	19	43.6
1300	19	44.46538
1350	19	45.35333
1400	19	46.18
1450	19	47.02966
1500	19	47.84467
1550	19	48.68581
1600	20	49.46
1650	21	50.25697
1700	21	51.01176
1750	21	51.79314
1800	21	52.54167
1850	21	53.27514
1900	22	54.02211
1950	22	54.73641
2000	22	55.4305
2050	23	56.16927
2100	23	56.82714
2150	23	57.52884
2200	23	58.20273
2250	23	58.88222
2300	23	59.55609
2350	25	60.20553
2400	25	60.85167
2450	25	61.47714
2500	25	62.0944
2550	25	62.74235
2600	25	63.33962
2650	25	64.00113
2700	25	64.59259
2750	26	65.21273
2800	27	65.84
2850	27	66.39614
2900	27	67.0331
2950	27	67.58475
3000	27	68.20133
3050	27	68.73803
3100	27	69.29355
3150	28	69.88635
3200	27	70.44688
3250	28	71.00954
3300	28	71.56636
3350	29	72.11164
3400	29	72.64853
3450	29	73.16986
3500	29	73.70571
3550	30	74.30901
3600	29	74.79778
3650	29	75.31123
3700	31	75.84622
3750	31	76.37627
3800	31	76.87974
3850	31	77.36545
3900	31	77.85179
3950	31	78.39165
4000	31	78.87175
4050	31	79.37975
4100	31	79.88659
4150	31	80.37084
4200	31	80.87167
4250	32	81.35765
4300	31	81.8307
4350	33	82.3223
4400	33	82.81409
4450	32	83.28472
4500	33	83.76067
4550	33	84.21473
4600	33	84.69391
4650	33	85.15935
4700	33	85.60809
4750	33	86.07411
4800	33	86.555
4850	33	86.98887
4900	33	87.45633
4950	35	87.92222
5000	35	88.3382
10000	49	125.4369

Dosage Search [Part 1]

July 13, 2016July 13, 2016 / pindash91 / 2 Comments

Sometimes, you come across problems that you think should be well studied but aren’t.

(please correct me if I’m wrong)

Medicine is famous for not having good cross pollination with the maths. See this famous paper which rediscovers the Riemann sum.

However, as bad as medicine is at noticing math, most disciplines are especially bad at adapting the study of algorithms, one of the most theoretical branches of Computer Science. This is true even for seasoned programmers.

There is a reason for this. Brute force works remarkably well. In fact, so well that Ken Thomspon, author of the Unix operating system, said:

“When in doubt, use brute force.” – Ken Thompson

There are some geniuses in the field, notably Don Knuth. In Coders At Work some of the best Computer Scientists admit to reading little to none of his series The Art of Computer Programming, saying the math is too difficult.

So, it’s not surprising that medical dosages are done with brute force. There is a method for picking the starting point, called median effective dosage which produces a result in 50% of the population.

However, doctors are averse to either overdosing or underdosing depending on the condition. They tend to start at some point and then proceed straight through the search space until they get to the right level. This algorithm is called linear search. It is brute force and it is guaranteed to find the right level should it exist.

There are much better ways to do a search assuming a sorted list. Let’s create a game to illustrate this:

Suppose I chose a number between 1 and 100 and You had to guess which number I chose. Every time you guessed, you pay me a dollar and I’ll tell you if you are above or below. If you guess correctly I pay you 10 dollars. Clearly, you are not going to start at 1 and just keep guessing up to 100. Instead you can start at 50 and then eliminate half the range. If you keep halving what’s left you can find any number in 7 tries. Thus earning 3 dollars of profit.

The process of halving the interval is called binary search and it works great when the cost of guessing to high or to low is the same. In this case you paid a dollar for each guess. Suppose though instead, you paid 2 dollars if you guess above and only 1 dollar if you guess below. What strategy should you pursue?

What if you pay the difference between the guess and the true answer and 1 dollar if it’s below. Obviously, you would pay this at the end of the game so as not to reveal the answer immediately. What does the reward have to be to make this game fair and what is the optimal strategy. In effect, this is the game doctors play when they are prescribing a dosage. Every guess, costs the patient time and health, but there is a higher cost to over prescribing or under prescribing. And this asymmetry can be captured in such a game.

A possible solution is to divide the interval into 4 and then guess that number. Keep doing that and you are guaranteed to get an answer as well and you’ll also guess under about 3/4 of the time. The right way to sub divide the interval and guess depends on the cost of each guess and the cost of guessing over. Is there an answer? Sure. But I haven’t seen a medical journal that has researched this question and so we do brute force, and that’s a problem.

Below is python code that simulates these questions:

def linear_search(alist, item, cost_func):
	first = 0
	last = len(alist)-1
	found = False
	cost = 0
	while first&lt;=last and not found:
		current = first
		if alist[current] == item:
			found = True
		else:
			if item &lt; alist[current]:
				cost = cost + cost_func(alist[midpoint],item)
			else:
				cost = cost + 1
				first = current+1
	return cost+1
def binary_search(alist, item, cost_func, cost=0, divisor=4):
	blist = list(alist)
	if len(blist) == 0:
		return cost
	else:
		midpoint = len(blist)//divisor
		if blist[midpoint]==item:
			return cost + 1
		else:
			if item&lt;blist[midpoint]:
				c = cost + cost_func(blist[midpoint],item)
				return binary_q_search(blist[:midpoint],item, cost_func, c)
			else:
				c = cost + 1
				return binary_q_search(blist[midpoint+1:],item, cost_func, c)
def binary_q_search(alist, item, cost_func, cost=0, divisor=4):
	blist = list(alist)
	if len(blist) == 0:
		return cost
	else:
		midpoint = len(blist)//divisor
		if blist[midpoint]==item:
			return cost + 1
		else:
			if item&lt;blist[midpoint]:
				c = cost + cost_func(blist[midpoint],item)
				return binary_q_search(blist[:midpoint],item, cost_func, c)
			else:
				c = cost + 1
				return binary_q_search(blist[midpoint+1:],item, cost_func, c)
def trivial_cost_func(guess, true):
	return 1
def scalar_cost_func(guess, true):
	return 100
def dif_cost_func(guess, true):
	return (guess-true)
lin = []
binar = []
quar = []
cost_func = scalar_cost_func
a = 1000
for i in xrange(0,a):
	lin.append(linear_search(xrange(0,a),i, cost_func))
	binar.append(binary_search(xrange(0,a),i, cost_func))
	quar.append(binary_q_search(xrange(0,a),i, cost_func,divisor=4))

	print &quot;The average cost for linear search: &quot; + str(sum(lin)/float(a))
	print &quot;The average cost for binary search: &quot; + str(sum(binar)/float(a))
	print &quot;The average cost for quarter search: &quot; + str(sum(quar)/float(a))

Continued in Part 2

	pt.instafollowfast.c… on Backtracking In Q/ word l…
	Alex on The Is-Ought Distinction
	Fred Mleczko on How the Mighty have Fallen, or…
	pindash91 on A Better Way To Load Data into…
	Andrew Stanton (@and… on A Better Way To Load Data into…

iabdb

It's All Been Done Before.

Month: July 2016