**Note:** This post was written over a two-week period during which I kept learning more about this problem. Many of the notes refer to later parts of the post so that caveats can be explored. Also, some parenthetical notes are almost incomprehensible without reading many of the linked articles on the page. There is little I can do to summarize all of those articles; this post is already way too long.
This post is inspired by a friend of mine who was solving a particularly common “Big Data” problem. (I enjoyed the last two weeks immensely, and have a much greater appreciation for Ken Thompson, Andrew Gallant, Richard Stallman, and of course Arthur)
[I use Big Data loosely; it usually means a problem that is too large for the machine you have access to. A great essay can be found in Programming Pearls, Column 1, “Cracking the Oyster,” where a big data problem is one that uses more than a megabyte of memory.]
Essentially, my friend had a data set that contained roughly 100 million rows, and each row had a column with a free-text description. A human could easily read several records and identify which records belonged together in one category, but the sheer number of records prevented any such attempt from being made. On the other hand, a computer using only a small set of keywords would easily misclassify many of the records. It would be easier to create more categories than actually existed and then combine categories based on statistical analysis. My friend tried fuzzy matching using something I covered in an earlier post; however, that algorithm slows down drastically with larger datasets.
By the time he told me his problem, he had already invented an on-the-fly hashing and categorization algorithm that I have yet to fully grok (his algorithm might be faster, but it is optimized for his specific problem).
Problem Statement: Assuming you have a set of rows with descriptions and a set of keywords/tokens, assign each row to one or more tokens.
In particular every token will have a set of rows that belong to it. Rows can be assigned to multiple tokens and tokens have many rows. The total number of tokens was 1 million. The sample dataset would have 10 million rows [limited by the memory on one machine].
The first approach was to use pandas and its groupby methods. However, that was disastrously slow, because pandas strings are actually Python objects rather than Numpy objects and are simply not optimized for speed.
This led me to move to pure Numpy. I started by constructing a toy problem. The toy problem would have a million rows. Each row would consist of only 5 randomly selected lowercase English letters separated by spaces. The goal would be to identify all the rows that contained each letter.
Here is that experiment:
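Roughly, the code looked like the following sketch (the exact construction, the variable names, and the use of np.char.find are assumptions on the author's part of the setup; the DataFrame head, the byte array, the per-letter index arrays, and the timing that follow are from the original run):

import string
import numpy as np
import pandas as pd

letters = list(string.ascii_lowercase)

# a million rows, each made of 5 random lowercase letters separated by spaces
df = pd.DataFrame({"d": [" ".join(np.random.choice(letters, 5)) for _ in range(1000000)]})
df["x"] = df.index

# convert to fixed-width byte strings so Numpy can work on them directly
d = df["d"].values.astype("S1000")

# for each letter, the indexes of the rows that contain it (this is the step timed with %timeit below)
matches = [np.where(np.char.find(d, l.encode()) >= 0) for l in letters]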
Out[4]:
           d  x
0  b c v l y  0
1  m l q u t  1
2  b f s i n  2
3  g p e k h  3
4  j k a b l  4
Out[6]:
array([b'b c v l y', b'm l q u t', b'b f s i n', ..., b'p f k n h',
b'y p t o i', b'e g o a i'],
dtype='|S1000')
Out[7]:
[(array([ 4, 24, 39, ..., 999967, 999990, 999999]),),
(array([ 0, 2, 4, ..., 999990, 999992, 999993]),),
(array([ 0, 9, 11, ..., 999978, 999980, 999991]),),
(array([ 12, 15, 18, ..., 999979, 999983, 999985]),),
(array([ 3, 13, 14, ..., 999991, 999995, 999999]),),
(array([ 2, 7, 13, ..., 999980, 999989, 999997]),),
(array([ 3, 6, 7, ..., 999992, 999994, 999999]),),
(array([ 3, 6, 10, ..., 999987, 999993, 999997]),),
(array([ 2, 8, 9, ..., 999993, 999998, 999999]),),
(array([ 4, 5, 10, ..., 999990, 999993, 999994]),),
(array([ 3, 4, 5, ..., 999988, 999996, 999997]),),
(array([ 0, 1, 4, ..., 999984, 999987, 999996]),),
(array([ 1, 12, 15, ..., 999964, 999975, 999979]),),
(array([ 2, 9, 10, ..., 999985, 999989, 999997]),),
(array([ 16, 21, 25, ..., 999996, 999998, 999999]),),
(array([ 3, 6, 12, ..., 999992, 999997, 999998]),),
(array([ 1, 15, 20, ..., 999986, 999988, 999989]),),
(array([ 5, 13, 17, ..., 999974, 999984, 999989]),),
(array([ 2, 14, 18, ..., 999990, 999994, 999996]),),
(array([ 1, 7, 19, ..., 999985, 999994, 999998]),),
(array([ 1, 7, 8, ..., 999993, 999994, 999995]),),
(array([ 0, 10, 11, ..., 999974, 999983, 999988]),),
(array([ 6, 9, 18, ..., 999987, 999990, 999991]),),
(array([ 17, 25, 28, ..., 999988, 999992, 999995]),),
(array([ 0, 5, 7, ..., 999991, 999995, 999998]),),
(array([ 8, 17, 18, ..., 999976, 999995, 999996]),)]
1 loop, best of 3: 22.5 s per loop
The toy problem took 22.5 seconds on my core i7 desktop from 2015. All the benchmarks were done on the same machine to prevent skewing of results.
I decided to abandon Numpy. 22 seconds for 26 tokens would not scale. At this rate, finding the matches for a million tokens would take (1 million / 26) * 22 seconds = 9.79344729 days. Multiplying by another 10, because my problem was only on 1 million rows, meant this problem was going to run for at least 90 days; you can’t even return most clothes after that much time.
So I turned to my favorite language optimized for big data: KDBQ.
Here is basically the same code as Numpy with comments inline:
.Q.a is just all the lowercase ascii letters
/create a function g that will generate the random string
g:{raze " ",'5?.Q.a}
/create a table with a million rows using g and column d
t:([]d:g@'1000000#0)
/demonstrate the function on one letter "a" for the first 10 rows.
/This creates a boolean mask where 1 means that the row contains "a"
{first `boolean$ss[x;"a"]} each 10#t.d
0000000010b
/use where to save only the rows that have the letter
where {first `boolean$ss[x;"a"]} each 10#t.d
,9
/combine this function and apply it for each letter returning 26 rows with the indexes
/for the first 10 rows
{where {first `boolean$ss[y;x]}[x] each 10#t.d} each .Q.a
,9
2 4
1 3
`long$()
1 5 9
3 7 8
`long$()
,4
5 8 9
,7
1 7 9
,1
`long$()
5 8
,0
6 9
4 5 6
0 2 3
,2
`long$()
`long$()
0 2
..
/Time this function for a million rows
q)\t {where {first `boolean$ss[y;x]}[x] each t.d} each .Q.a
5335
/Try to parallelize the problem with 6 slave threads using peach
q)\t {where {first `boolean$ss[y;x]}[x] each t.d} peach .Q.a
3104
/Try to further parallelize the problem, no luck
q)\t {where {first `boolean$ss[y;x]}[x] peach t.d} peach .Q.a
3087
KDB has done a lot better than Numpy on the same problem, BUT this problem will still take 1.33547009 days * 10 (because this was only a million rows), giving roughly 13 days, which is more reasonable. But we can do better.
I decided to take advantage of the KDBQ enumerated string AKA symbol type.
This led to the following code and timings:
q)g:{raze `$'5?.Q.a}
q)t:([]d:g@'1000000#0)
q)first t[`d]
`f`j`r`p`j
q)tokens:`$'.Q.a
q)\t {where {x in y}[x] each t.d} each tokens
3796
q)\t {where {x in y}[x] each t.d} peach tokens
2218
q)\t {where {x in y}[x] peach t.d} peach tokens
2288
Definitely an improvement, but nothing to write home about.
At this point, it was time to start looking to improve the algorithm. A quick StackOverflow search led me to the name of the problem: multi-string/multi-pattern search. There are two dominant algorithms: Aho-Corasick and Rabin-Karp. Aho-Corasick was developed by Aho of AWK fame and Margaret Corasick (not much is written on the internet about Margaret Corasick, except that she was clearly one of the leading female pioneers in computer science). The algorithm runs in time linear in the length of the input plus the length of the patterns plus the number of matches. The Rabin-Karp algorithm, in true Rabin style, is technically not linear but average-case linear, assuming a good hashing function is chosen; it uses a rolling hash over a prime modulus so that you can roll through the string without looping back, more here.
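To make the Aho-Corasick idea concrete, here is a toy sketch in Python (purely illustrative, not grep’s implementation): build a trie of the tokens, add failure links pointing each state at the longest suffix that is also in the trie, then make a single pass over the input.

from collections import deque

def build_automaton(tokens):
    # trie[s] maps a character to the next state; out[s] lists tokens that end at state s
    trie, out = [{}], [[]]
    for tok in tokens:
        s = 0
        for ch in tok:
            if ch not in trie[s]:
                trie[s][ch] = len(trie)
                trie.append({})
                out.append([])
            s = trie[s][ch]
        out[s].append(tok)
    # breadth-first pass to compute failure links
    fail = [0] * len(trie)
    queue = deque(trie[0].values())
    while queue:
        s = queue.popleft()
        for ch, nxt in trie[s].items():
            queue.append(nxt)
            f = fail[s]
            while f and ch not in trie[f]:
                f = fail[f]
            fail[nxt] = trie[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]      # inherit matches that end at the failure state
    return trie, fail, out

def find_all(text, trie, fail, out):
    # single pass over the text, yielding (start_position, token) for every match
    s = 0
    for i, ch in enumerate(text):
        while s and ch not in trie[s]:
            s = fail[s]
        s = trie[s].get(ch, 0)
        for tok in out[s]:
            yield i - len(tok) + 1, tok

trie, fail, out = build_automaton(["he", "she", "his", "hers"])
print(list(find_all("ushers", trie, fail, out)))   # [(1, 'she'), (2, 'he'), (2, 'hers')]

The real implementations are far more elaborate, but the single pass over the input is the property that matters: the work does not multiply by the number of tokens.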
I was going to start implementing one of these algorithms, but as I was searching for a good implementation, I found that the reference implementation lives in GNU grep. So out of curiosity, I tried running my toy problem using grep.
I used KDBQ to write out my table to t.txt and deleted the first line so that I just had the rows of strings, one per line. I then saved another file called tokens.txt where each line was an ASCII letter. I then constructed the necessary grep to get the line numbers where a token appears.
grep -n token filename
…
999918: x b w a b
999923: a v u p u
999924: y v i a p
999927: c k j a g
999930: a l n i p
999935: p a b e k
999944: f a a n t
999949: b u a h p
999951: n j x u a
….
This will give you the line numbers along with the matches; using the cut command you can return only the line numbers.
Here is what that looks like:
$ grep -n "a" t.txt | cut -d: -f1
…
999918
999923
999924
….
I then read the tokens.txt file line by line and grepped each token against the table of a million rows. Each result of line numbers is written to a file in the results folder with that token name. So if we want all the rows that match a token, we simply look at the row numbers in that file.
time while read LINE; do grep -n $LINE t.txt| cut -d: -f1 > results/$LINE; done <tokens.txt
real 0m0.749s
user 0m0.852s
sys 0m0.044s
Less than a second, and we haven’t even parallelized the operation. Adding & so that things run in the background, and waiting for the processes to finish, we see the toy problem for what it is: A TOY.
$ time $(while read LINE; do grep -n $LINE t.txt| cut -d: -f1 > results/$LINE & done <tokens.txt; wait)
real 0m0.216s
user 0m1.424s
sys 0m0.112s
With parallelization the toy problem takes less than a fifth of a second. Estimating the run time for the original problem, we see that it will now take (1 million / 26) * 0.216 seconds = 2.3 hours; multiplying by 10, this is now in the less-than-a-day category.
Curious and surprised that grep was reading off the disk faster than KDBQ was reading from memory, I decided to find out why grep was so FAST.
Summarizing one of the original authors of GNU grep:
- Like KDBQ, grep memory-maps files
- Like KDBQ, grep will often use the Boyer-Moore search algorithm, which basically allows quick skipping of bytes where no match is possible.
- Grep avoids looking at the lines and only reconstructs line breaks if there is a match
Mike Haertel’s famous quote from that article is very enlightening.
The key to making programs fast is to make them do practically nothing. ;-)
So essentially KDBQ is slower than grep because it still thinks of the string in each row as an independent row, and it isn’t designed solely for ripping through strings. Grep was originally the brainchild of Ken Thompson, co-creator of Unix. So is it surprising that after 40 years the tool that was optimized for string searching is still the king? Not really, but we thought Arthur could do better; we eagerly await K6, which promises to be an entire OS and will probably include something like grep. (I later realized that I could use Q in a different way to achieve the same result much faster, keep reading…)
A quick aside for those unfamiliar with grep. Grep is a Unix command-line utility that searches files for strings that match on a line. Many options were added as grep developed. Grep gave you a way to search for regular expressions, that is, for things that are similar to a certain string. The extended regular expression engine was originally Aho’s invention and was a separate tool called egrep; this merged into grep as the -E option. Fgrep gave you a way to search for exact (literal) matches much faster, which became -F. Similarly, -n prints line numbers and -o prints only the matching pattern.
Anyway, knowing something about the algorithms that allow grep to run so quickly, I decided it was time to try a more realistic example. Firstly, I thought that longer words might further advantage grep, since it could skip more quickly through the file; secondly, I wondered how Linux would handle opening many background processes at once.
So to construct the more realistic dataset I went and found a file that contained the 20,000 most common English words.
I read the file into KDBQ and created my tokens and my table:
q)tokens:read0 `:20k.txt
/now instead of always taking 5 random tokens we take between 1 and 10
/random words without replacement, so that we get a unique set for each row
q)g:{" " sv (neg first 1+1?10)?tokens}
q)t:([]d:g@'1000000#0)
q)\t {where {first `boolean$ss[y;x]}[x] each t.d} peach 10#tokens
2453
q)\t {where {first `boolean$ss[y;x]}[x] peach t.d} peach 10#tokens
2498
q)\t {where {first `boolean$ss[y;x]}[x] each t.d} peach 100#tokens
12188
q)\t {where {first `boolean$ss[y;x]}[x] peach t.d} peach 100#tokens
12329
q)\t {where {first `boolean$ss[y;x]}[x] peach t.d} peach 1000#tokens
184171
What we see is that getting the list of indexes for 10 tokens takes about 2.5 seconds, with no significant difference from adding another peach. Getting the indexes for 100 tokens takes 12 seconds, which looks like an improvement, to about 1.2 seconds per 10 tokens. However, getting the indexes for 1000 tokens takes about 3 minutes for an average of about 2 seconds per 10 tokens. So KDBQ is definitely performing better on longer tokens, and tokens get longer deeper into the list so that is contributing to the speedup. However, we really are just using KDBQ to place the files on the filesystem for grep to pick them up.
q)save`:t.txt
/save only the top 100 tokens in a separate file
q)tokens:([]100#tokens)
q)save `:tokens.txt
Now we can see how grep does on this much more realistic problem.
Running grep on 100 tokens; we get the following timing result:
time $(while read LINE; do grep -n " $LINE " t.txt | cut -d: -f1 > results/$LINE & done <tokens.txt; wait)
real 0m4.460s
user 0m6.684s
sys 0m0.884s
So for 100 words it took 4.46 seconds, less than 5 seconds. I decided to run it for the whole set of 20,000 words.
Here are the timings, experimenting with fgrep and with and without background processes.
It took 5 minutes to run 20k tokens against the million-line table:
time $(while read LINE; do grep -n " $LINE " t.txt | cut -d: -f1 > results/$LINE & done <20k.txt; wait)
real 4m54.079s
user 17m13.756s
sys 2m9.064s
Not running in parallel is a bit faster in terms of total CPU time, but is almost 5 times slower in wall-clock time, which means you will be waiting a while.
time while read LINE; do grep -n " $LINE " t.txt | cut -d: -f1 > results/$LINE; done <20k.txt
real 23m41.263s
user 14m14.508s
sys 1m57.508s
Using fgrep was marginally faster because it searches for a literal string without any pattern-matching support. If you need pattern matching then you can’t use it, and the performance improvement was not all that significant anyway. (This should have been a clue that something was wrong, see later in this post.)
time $(while read LINE; do fgrep -n " $LINE " t.txt | cut -d: -f1 > results/$LINE & done <20k.txt; wait)
real 4m29.731s
user 17m13.004s
sys 2m2.336s
Now, multiplying through for the more realistic example, we find that (1 million / 20k) * 5 minutes = 4.1 hours; multiplying by 10 we get about 40 hours. So it seems that our toy problem was underestimating the difficulty of the problem.
40 hours seemed like too much time to wait. So off I went in search of something better. Luckily, I came across ripgrep. Ripgrep aims to be a better version of grep, written in Rust instead of C. [Rust is a language that aims to be a better version of C, so this is fitting. The name ripgrep, I’m sure, has something to do with R.I.P. grep.] The post on ripgrep taught me a lot about the underlying implementations of different search tools. It made me realize that some of the information I was reading in older posts was outdated. In particular, grep no longer uses memory maps (--mmap), even if you ask for them explicitly. So grep was fast, but not as fast as it could be. Also, the newest grep had somehow increased its speed on non-ASCII characters, so some of the traditional speedups, like setting LC_ALL=C to force grep to use ASCII, don’t actually increase the speed anymore, even though many StackOverflow answers recommend them here, here, and here.
Here are some timings that contradict those recommendations. Searching for aardvark in a 10 million row file shows no speed improvement when switching between UTF-8 and ASCII:
$ time LC_ALL=en_US.UTF-8 grep -F aardvark -w -o -n t.txt > results/aardvark
real 0m0.339s
user 0m0.260s
sys 0m0.076s
$ time LC_ALL=C grep -F aardvark -w -o -n t.txt > results/aardvark
real 0m0.340s
user 0m0.276s
sys 0m0.064s
Also, many posts recommended using GNU parallel, but I found it did a poor job of actually distributing work onto all the cores of my CPU and that it was probably much better if you were going to send work to multiple machines. Instead simply using background processes was going to be fine.
Once I read the article about ripgrep, it was time to test its bold claim that it was going to be faster than all the other tools. My more realistic example with 20k words became really easy for ripgrep once I figured out that the 20k tokens can be split into smaller 200-line files with split. Split is a tool written by Richard Stallman that splits the original file into files with the specified prefix. Here I use ‘S’.
$ split -l200 20k.txt S
The small-token-files could be used to match against the large input and output a line number and a match on each line.
For example, in our toy problem, if we want to match on the letters a-z, we can break them up into chunks of 2 letters, giving 13 files.
file 1 --> corresponds to letters a,b
1:a
3:b
4:a
4:b
…
file 3 --> corresponds to letters e,f
1:e
15:f
74:e
80:e
…
So for each token file we get a corresponding file with matches. Here is that command and timing:
$ time $(for i in S*; do rg -F -f $i -n -o -w t.txt>results/groupby$i & done; wait)
real 0m36.187s
user 3m23.104s
sys 1m15.308s
Now my more realistic problem seemed like a joke. It was taking less than a minute to go through a 10 million row file with 20 thousand tokens. I was starting to become very impressed with ripgrep. However, I decided that I first needed to try and find the older versions of grep that supported mmap to make the comparison fair.
After building from source no less than 8 versions of grep, I discovered some very curious things.
- grep lost mmap support as of 2010, so any version after 2.6 essentially was not going to work, and my current Ubuntu only came with 2.25 (25 is larger than 6).
- Old versions of grep were in general faster on large input patterns than new versions; they built the Aho-Corasick tree faster.
- grep versions before 2.5 did not support printing only the match (-o), which was necessary for us to collect all the indexes back together at the end.
- So grep versions 2.5.x were the only ones to support both the -o and --mmap options.
I ended up settling on grep 2.5.1 for a very strange reason. I was excited that grep 2.5.1 was going really fast on some test inputs when I discovered that, when matching on multiple patterns at once and printing the line number, I would get the following output:
1:tenor
bleeding
fertilizer
colchester
2:mining
swap
everyone
miriam
persuasion
….
It seemed that grep had decided that it would be easier to just print the line number once for a whole run of matches. So I went to the next version, grep 2.5.4, which printed the matches as expected.
1:tenor
1:bleeding
1:fertilizer
1:colchester
2:mining
2:swap
2:everyone
2:miriam
2:persuasion
….
However, it was an order of magnitude slower. So I ended up attaching a small AWK program that would add the line numbers for the lines that had them implicitly. Here is that AWK program:
awk -F: 'NF==2{line=$1} NF==1{$0=line FS $0}1'
This AWK program had almost no overhead when I tested it so it was much better than relying on grep to print all the line numbers for each match.
Once I had the right version of grep, things really started moving. I discovered that grep could match many more tokens at once than ripgrep.
In fact, my more realistic example with 20 thousand tokens ran in the same amount of time whether or not I split the token files; if anything, the overhead of starting many processes was slowing down grep 2.5.1.
MOST REALISTIC PROBLEM (yet):
This led me to go and create the most realistic example I could without generating nonsense words myself. (I might do this later, but I was feeling a bit lazy and I think this problem is close enough.) I went here and got 355k words and recreated my dataset. I had a tokens.txt file with 354,950 words, to be exact. I deleted about 40 words that started with a ' (single quote) because ripgrep had a bug (fixed within 14 hours of my posting it, amazing!) that prevented it from properly printing only the matches that match exactly that word [-o -w] (a very low priority bug, but I figured it was worth filing and skipping those words). I created my 10 million row dataset, in which each row had a random number of words (between 1 and 10) from the tokens file, and saved it into t.txt.
I then created two very similar scripts that allowed ripgrep and grep2.5.1 to finally compete.
grepper.sh:
#!/bin/bash
FILE=$1
export LC_ALL=C
pid_count=0
for F in S*
do
~/Downloads/grep-2.5.1/src/grep --mmap -F -f $F -n -w -o $FILE | awk -F: 'NF==2{line=$1} NF==1{$0=line FS $0}1' > results/group$F &
if (( ++pid_count > 4 )); then
wait -n
((pid_count--))
fi
done
wait
rger.sh:
#!/bin/bash
FILE=$1
pid_count=0
for F in S*
do
/home/pinhus/Downloads/rg --mmap -F -f $F -n -w -o $FILE > results/group$F &
if (( ++pid_count > 4 )); then
wait -n
((pid_count--))
fi
done
wait
The two programs behave very similarly and produce identical output. So I will describe them together.
The script takes a filename argument $1 and keeps a count of how many background processes are running. For each pattern file S* it runs a literal match of that pattern file against the input and prints the line numbers along with each match into a corresponding file inside the results folder called group{patternfile}. It runs this in the background and increments the count of running processes. Once more than 4 processes are running, it waits for one of them to finish (wait -n), decrements the count, and launches the next pattern file. When all the pattern files have been launched, it waits for the remaining processes to finish.
The main difference between them is that the grep version forces strictly ASCII characters (LC_ALL=C), whereas the ripgrep version uses UTF-8 by default.
There is also a difference in how to optimally split the pattern file. After much experimentation, I found that ripgrep could comfortably handle up to about 400 lines at a time, whereas grep2.5.1 was optimal with 20k lines at a time.
Here are the respective split commands:
#grep split, assuming 355k lines token file
# -n will create N chunks and will not break lines if you use l/N
split -n l/20 tokens.txt S
#ripgrep split assuming 355k lines in token file
split -n l/888 tokens.txt S
As you can see there are only 20 chunks created for the grep version and there are 888 files for the ripgrep version.
However, the real kicker is that the grep version now runs faster, because we have taken care of the overhead of having too many process handles at once.
Here are the timings:
time bash grepper.sh t.txt
real 1m27.812s
user 2m55.840s
sys 0m2.100s
time bash rger.sh t.txt
real 3m29.019s
user 25m19.516s
sys 1m55.148s
Long Live grep!
Scaling up to the original problem of 1 million tokens, we now see that grep will take about 4.5 minutes; let’s say 5 minutes to be safe. And ripgrep will now take about 10.5 minutes.
The reason that grep now beats ripgrep is that grep uses Aho-Corasick to do multiple matching, whereas ripgrep actually uses a clever version of regex. Instead of implementing Aho-Corasick, ripgrep joins the entire pattern file with | which makes the regex engine match any of those patterns. Ripgrep then goes one step further and tries to optimize this by searching for a common literal among all those patterns so that it can skip using the regex engine, which is much slower. If it finds a common literal string among the patterns it will submit it to its implementation of the Teddy algorithm, which takes advantage of Intel’s more recent SIMD instructions (single instruction multiple data, AKA vectorized processing). The Teddy algorithm checks the literal against 16 bytes at a time and, if it finds a candidate match, falls back into the slower regex engine to verify. This implementation forces our pattern files to be smaller, since a pattern file that is too big will not have a common literal or will match too often. In fact, the pattern files have a limit and will throw an error if they are too big, because they overwhelm the regex engine. Because of this limitation, ripgrep has to search the entire file many more times than grep does.
Additionally, ripgrep will perform terribly if the original pattern file is not sorted. I suspected that ripgrep was benefiting from our sorted tokens file and was able to find common literals, which was allowing it to be much faster than with a random tokens file. To test this I used a round-robin split:
#ripgrep split assuming 355k lines in token file
# replace the l with r, which causes the lines to be round-robined across 888 files
split -n r/888 tokens.txt S
The difference was staggering, using a 100 thousand line input file:
#pattern file is sorted and broken into sorted chunks
time bash rger.sh t100k.txt
real 0m8.205s
user 0m59.716s
sys 0m1.400s
#pattern file is deliberately round robined across chunks
time bash rger.sh t100k.txt
real 1m11.450s
user 9m8.252s
sys 0m2.092s
The difference is between 8 seconds and 1.2 minutes for a 100 thousand row file, almost an order of magnitude. Sort your pattern file before using ripgrep. Also, because ripgrep uses a non-naive version of Boyer-Moore, it is able to skip many more bytes. (We eagerly await this optimization from Arthur; after all, SIMD instructions are a natural part of vectorized languages.)
On the other hand, because grep uses Aho-Corasick it can handle much larger pattern files; the chunks merely facilitate creation of a smaller tree to traverse, and you don’t need to sort your pattern files. This also allows grep to read the input file many fewer times. Once again, algorithms beat brute force (kind of; see the last note, where I discover that many matches were missed because of a bug in 2.5.1. Most of the points stand, but I think you might be better off using ripgrep).
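To make the alternation idea described above concrete, here is a conceptual sketch using Python’s re module (only an illustration of joining the patterns with |; it is nowhere near ripgrep’s Rust regex engine or its Teddy SIMD literal matcher, and it will be far slower):

import re

# assumed sample tokens; in the real runs these come from a chunk of tokens.txt
tokens = ["tenor", "bleeding", "fertilizer", "colchester"]

# join the patterns into one alternation; \b mimics the -w (word boundary) flag
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, tokens)) + r")\b")

# emulate the "line_number:match" output of grep/rg -n -o -w
with open("t.txt") as f:          # t.txt is the row file used throughout the post
    for lineno, line in enumerate(f, start=1):
        for m in pattern.finditer(line):
            print("%d:%s" % (lineno, m.group(0)))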
Now, to collect the outputs and create the final result, which is a token and all of its associated row numbers, we go back to KDBQ.
All of our output files live in the results directory and look something like this:
402:bicultural
630:bicapitate
697:bibby
861:biblicist
945:bicylindrical
951:bibliopolist
1200:bicalcarate
So we need a function that gets all the files from a directory:
tree:{ $[ x ~ k:key x; x; 11h = type k; raze (.z.s ` sv x,) each k;()]}
This is one of the utilities provided in the KDB Cookbooks/tips. It will recursively go down a directory and give you all the files. We just need 1 level, but it’s fast enough, so there is no need to write our own.
We create a table that will have two columns, an index column and a token column.
t:([]index:();token:())
We then create a small function that will read in each file split on the colon and parse the line and upsert it into the table.
{`t upsert flip 0:[("IS";":");x]}
0: will read a text file; it takes arguments that specify the types it expects, as well as a delimiter, in this case a colon.
flip arranges the tuple to be aligned with the table.
`t upsert will insert the tuple.
By default 0: will read the two columns into rows, so we need to flip it.
q)("IS";":") 0: `:results/groupSaaa
72 165 567 698 811 838 1003 1136 12..
abiliment abience a1 4th aberuncator abbey abductors abdominohysterectomy 10..
q)flip ("IS";":") 0: `:results/groupSaaa
72i `abiliment
165i `abience
567i `a1
698i `4th
811i `aberuncator
838i `abbey
Running this on each file in the results directory like so:
q)\t {`t upsert flip 0:[("IS";":");x]} each tree `:results
4361
q)\t select index by token from t
567
q)select index by token from t
token| index ..
—–| ———————————————————————-..
1080 | 1283 127164 391580 443156 543775 561432 608795 862810 889780 1029079 1..
10th | 27747 57768 238851 293795 332990 487295 564854 732020 755630 834528 83..
1st | 37915 113165 175561 203936 213171 318590 382626 411531 473224 653353 6..
2 | 26642 67272 108161 122598 192496 211019 250924 281576 300950 344445 45..
2nd | 3275 50750 93226 99089 107105 208955 212762 234366 238134 403142 48216
Reading in all the output files and aggregating by token takes less than 5 seconds.
However, once I did this aggregation, I thought there must be a better way to take advantage of a KDB only solution.
PURE KDBQ:
If you knew that the only tokens in the description you were interested in were words, and that the only type of word break was a space, then you could do the following:
/read in the raw 10 million row file
q)read0 `:t.txt
“cellulipetal outname kymograms splendiferousness preneural raploch noncontin..
“copernicus”
“overglanced”
“consists”
“epicure casern mythologization”
“angiosymphysis”
“shellfishes renniogen lactonic”
..
/define raw as this file
q)raw:read0 `:t.txt
/split the first row on spaces
q)" " vs first raw
“cellulipetal”
“outname”
“kymograms”
“splendiferousness”
“preneural”
“raploch”
“noncontingency”
..
/cast each word to a symbol
q)`$" " vs first raw
`cellulipetal`outname`kymograms`splendiferousness`preneural`raploch…
/apply this function to each row in raw and call it sraw
q)sraw:{`$" " vs x} each raw
/this takes about 19 seconds
q)\t sraw:{`$" " vs x} each raw
18744
/sraw looks like this
q)sraw
`cellulipetal`outname`kymograms`splendiferousness`preneural`raploch..
,`copernicus
,`overglanced
,`consists
`epicure`casern`mythologization
,`angiosymphysis
`shellfishes`renniogen`lactonic
..
/create an index column that tracks each row (add 1 so that it will agree with grep)
q)index:1+til count sraw
q)index
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29..
/join together each row number with every symbol in that row
q)flatrows:raze index {x,'raze y}' sraw
q)\t flatrows:raze index {x,'raze y}' sraw
7634
/flatrows looks like this:
q)flatrows
1 `cellulipetal
1 `outname
1 `kymograms
1 `splendiferousness
1 `preneural
1 `raploch
1 `noncontingency
..
/create a new table that has a rid and a token
q)t2:([]index:flatrows[;0]; token:flatrows[;1])
q)\t t2:([]index:flatrows[;0]; token:flatrows[;1])
4631
/t2 looks like this:
q)t2
index token
———————–
1 cellulipetal
1 outname
1 kymograms
1 splendiferousness
1 preneural
..
/aggregate and get all indexes for each token
/this takes less than 8 seconds
q)\t res2:select index by token from t2
7741
/res2 looks like
q)res2
token| index ..
—–| ———————————————————————-..
1080 | 1283 127164 391580 443156 543775 561432 608795 862810 889780 1029079 1..
10th | 27747 57768 238851 293795 332990 487295 564854 732020 755630 834528 83..
1st | 37915 113165 175561 203936 213171 318590 382626 411531 473224 653353 6..
2 | 26642 67272 108161 122598 192496 211019 250924 281576 300950 344445 45..
2nd | 3275 50750 93226 99089 107105 208955 212762 234366 238134 403142 48216
/total time without ever leaving kdb: 19+8+5+8=40 seconds!!
/If you are not interested in all the tokens you can easily subselect
/as an example, ask for a random subsample of 1000 tokens from res2
q)sampleTokens:neg[1000]?(0!res2)[`token]
/sampleTokens looks like this
q)sampleTokens
`bimillionaire`abarthrosis`krone`volleyers`correct`pitarah`aeroelastic`zilch`..
/I can ask for res2 where the token is in sampleTokens
q)select from res2 where token in sampleTokens
token | index ..
————| —————————————————————..
abaciscus | 19644 81999 125546 145360 159922 213205 223650 227560 353332 41..
abactinally | 92704 265473 480112 499419 510784 539605 617031 668847 752500 8..
abarthrosis | 25638 61103 215367 281385 305352 385693 553059 585997 598293 66..
abir | 96166 214361 258595 259398 309117 387755 419601 720156 746810 8..
abstemiously| 4611 27043 61983 72530 140138 152631 227466 272363 278554 31059..
..
/this takes a millisecond for 1000 tokens
q)\t select from res2 where token in sampleTokens
1
So if you know that everything is in ASCII and you know that the words in the description are space-delimited, nothing beats KDBQ. Arthur, you win. I did think to tokenize my inputs, but I didn’t think to wrap them with an index number until I was collecting the results.
A disclaimer: I suggest using ripgrep until grep reintroduces mmap for large files. Old versions of grep have bugs that are not documented all that well.
Also, if you need UTF-8 use ripgrep, and if you need regular expressions use ripgrep. The lack of --mmap support in recent grep means that doing better with grep requires building an old version yourself and dealing with its bugs.
(It turns out that grep 2.5.1 has a bug where -w will not match word boundaries correctly; 2.5.4 fixes that bug but introduces a performance penalty by printing the line number for every match, a task much better suited to AWK. To work around this and still get the same performance, you either need to add spaces around your patterns or turn off -w and deal with the overlapping matches that will inevitably occur.) I ended up using the command below, which mostly did away with the performance penalty but was still slower than ripgrep.
On the other hand, the bug I found in ripgrep while writing this post was fixed within 14 hours of me reporting it. A serious benefit of a tool that is actively supported.
time cat tokens.txt | parallel --pipe --block 1000k --round-robin grep2.5.1 -Ff - --mmap -on t.txt | awk -F: 'NF==2{line=$1} NF==1{$0=line FS $0}1' > results/test
real 4m41.146s
user 15m35.820s
sys 0m8.096s
I didn’t get a chance to fully optimize the block size, but it’s within reach of ripgrep.
GNU Parallel was used:
O. Tange (2011): GNU Parallel – The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.