This is a sketch of how you can take free-text user input fields from a survey and try to group the data. I will describe my strategy, but first a warning: this code should not really be published, but the idea could be interesting.
WARNING: The following code
- is not performance optimized, AT ALL
- makes use of global variables as a hack
- produces groups whose quality depends on the (arbitrary) order of the input list
Alright, you’re still reading.
I came across this problem when classifying service tickets. Each service ticket represents a unique real-world event, and it would be nice to classify the events. In particular, we are interested in finding groups of incidents that have something in common, so that we can avoid them in the future (optimistic little creatures we are). Each ticket comes with a free-text field that describes the resolution in a few sentences. Humans are pretty lazy, so they often copy from previous resolution descriptions, which at least makes it easier to group similar tickets (of course, it leaves open the possibility that the resolution description is irrelevant to the incident). However, humans are also careless, so when they copy and paste they may leave off some parts, or simply type something similar.
My strategy was to use the fuzzywuzzy library, kindly developed and open-sourced by SeatGeek. The library has several useful features.
Firstly, it can compute a similarity ratio between two strings, and it has several methods for doing so. You have to know a little about your domain to decide which one to use. For example:
fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") ⇒ 96
fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 100
fuzz.token_sort_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") ⇒ 100
fuzz.token_set_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners") ⇒ 90
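To make those scores a little less of a black box, here is a rough standard-library sketch of what the plain ratio and the token-sort ratio compute. This is not fuzzywuzzy's code: `simple_ratio` and `simple_token_sort_ratio` are hypothetical names, and the normalization (lowercase, split on whitespace, sort) is a simplified assumption about what the library does before comparing.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings as an integer from 0 to 100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def simple_token_sort_ratio(a, b):
    """Compare two strings after lowercasing and sorting their words."""
    def normalize(s):
        return " ".join(sorted(s.lower().split()))
    return simple_ratio(normalize(a), normalize(b))


# one inserted letter barely lowers the score
mets = simple_ratio("NEW YORK METS", "NEW YORK MEATS")

# word order does not matter once the tokens are sorted
swapped = simple_token_sort_ratio("New York Mets vs Atlanta Braves",
                                  "Atlanta Braves vs New York Mets")
print(mets, swapped)
```

Sorting the tokens is what makes the second comparison insensitive to word order, which suits copied-and-rearranged ticket text.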
Secondly, given a particular scorer and a list of strings, fuzzywuzzy can extract the best inexact matches; it accepts both a hard limit on the number of items to return and a minimum ratio for a result to count as a match.
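The extraction step can be sketched the same way. The function below only imitates the behaviour described above (score every choice, drop anything under the cutoff, return the best few); `extract_bests` and `score` are hypothetical stand-ins written with difflib, not the library's own functions. The detail worth noticing is the result shape: a list of (choice, score) pairs, which is also what fuzzywuzzy's extraction functions return.

```python
from difflib import SequenceMatcher


def score(a, b):
    # stand-in scorer (not fuzzywuzzy): word-order-insensitive ratio, 0 to 100
    def norm(s):
        return " ".join(sorted(s.lower().split()))
    return int(round(100 * SequenceMatcher(None, norm(a), norm(b)).ratio()))


def extract_bests(query, choices, scorer=score, score_cutoff=0, limit=5):
    """Return up to `limit` (choice, score) pairs scoring at least `score_cutoff`."""
    scored = [(c, scorer(query, c)) for c in choices]
    kept = [pair for pair in scored if pair[1] >= score_cutoff]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # best matches first
    return kept[:limit]


tickets = ["reset user password", "user password reset", "replaced failed disk"]
matches = extract_bests("reset user password", tickets, score_cutoff=80, limit=10)

# the pairs have to be unpacked to get back plain strings
matched_strings = [choice for choice, _ in matches]
print(matched_strings)
```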
Given a dataset, pandas and fuzzywuzzy we can group some data.
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz, process
First let’s clean the data a bit:
df = pd.read_csv("dataset.csv")
df.userinput = df.userinput.apply(fuzz.utils.asciidammit)
We need to set up the list of items to group:
choices = list(df.userinput)
Unfortunately, this list is a global variable, so that as we progress, choices that have already been added to a group don’t get added again. In theory, you can get a serious performance increase by also skipping the query for rows that are already grouped, but that requires adding an extra bit of branching logic to the extract function (maybe later).
Next, we create an extract function that takes a single string and returns a Boolean list that masks out the elements not in its group. This way we can take advantage of pandas’ apply functionality. We also remove from the choices variable the entries we have already grouped.
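def extract(s):
    global choices
    res = process.extractBests(s, choices, scorer=fuzz.token_sort_ratio,
                               score_cutoff=80, limit=100)
    # extractBests returns (choice, score) tuples; keep only the matched strings
    r = [x[0] for x in res]
    # drop the matched strings so later rows cannot claim them again
    choices = [x for x in choices if x not in r]
    return list(df.userinput.isin(r))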
Now we create our masks:
masks = df.userinput.apply(extract)
We can then use our mask to filter and see all the elements that are similar to a particular item. For example, this gives us what matched the first row:
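A minimal sketch of that filter, on toy data. The three ticket texts and the hand-written masks are made up for illustration; the masks only stand in for what `df.userinput.apply(extract)` would plausibly produce, assuming each mask is a list of Booleans with one entry per row.

```python
import pandas as pd

# toy data; the column name userinput follows the text above
df = pd.DataFrame({"userinput": ["reset user password",
                                 "user password reset",
                                 "replaced failed disk"]})

# hand-written masks standing in for the output of df.userinput.apply(extract)
masks = pd.Series([[True, True, False],
                   [False, False, False],
                   [False, False, True]])

# rows that were grouped with the first row
first_group = df[masks.iloc[0]]
print(first_group)
```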
Here is how we can see how many are in each group:
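A sketch of the count, using the same `np.count_nonzero` call that appears in the complete code further down. The masks are again hand-written toy values, not real output.

```python
import numpy as np
import pandas as pd

# hand-written masks standing in for the output of df.userinput.apply(extract)
masks = pd.Series([[True, True, False],
                   [False, False, False],
                   [False, False, True]])

# number of True entries per mask, i.e. the size of the group seeded by each row
group_sizes = masks.apply(np.count_nonzero)
print(group_sizes.tolist())
```

A size of zero means the row found nothing left to claim, typically because an earlier row had already absorbed it.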
Finally, I present the complete code that includes a method for running on small subsets of the original set until some portion of choices is covered.
choices = list(df.userinput)

def extract(s):
    global choices
    res = process.extractBests(s, choices, scorer=fuzz.token_sort_ratio,
                               score_cutoff=80, limit=10000)
    # extractBests returns (choice, score) tuples; keep only the matched strings
    r = [x[0] for x in res]
    choices = [x for x in choices if x not in r]
    return list(df.userinput.isin(r))

i = 1
ratio = 1 / 4.0
# double the sample size until at most a quarter of the rows remain ungrouped
while len(choices) > ratio * len(df):
    i = i * 2
    choices = list(df.userinput)
    masks = df.userinput[:i].apply(extract)

masks.apply(np.count_nonzero)
Here is a possible branching version as well. It is not as clean as I would like: ideally, a row that has already been grouped would return the group it’s a member of rather than an empty list.
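def extract(s):
    global choices
    if s in choices:
        res = process.extractBests(s, choices, scorer=fuzz.token_sort_ratio,
                                   score_cutoff=80, limit=10000)
        # extractBests returns (choice, score) tuples; keep only the matched strings
        r = [x[0] for x in res]
        choices = [x for x in choices if x not in r]
        return list(df.userinput.isin(r))
    else:
        # s was already claimed by an earlier group, so skip the query
        return list()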
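One way to get "the group it's a member of" without the global variable is to hand out a group label per row instead of a mask. The sketch below is only an idea, not the code I ran: `group_labels` and `score` are hypothetical helpers, and the scorer is a difflib stand-in for fuzz.token_sort_ratio so that the example runs without extra dependencies. Like everything above, it is greedy, so the groups still depend on the input order.

```python
from difflib import SequenceMatcher


def score(a, b):
    # stand-in scorer (not fuzzywuzzy): word-order-insensitive ratio, 0 to 100
    def norm(s):
        return " ".join(sorted(s.lower().split()))
    return int(round(100 * SequenceMatcher(None, norm(a), norm(b)).ratio()))


def group_labels(texts, cutoff=80):
    """Greedy grouping: each ungrouped text seeds a group and claims similar ones.

    Returns one label per text; the label is the index of the group's seed.
    """
    labels = [None] * len(texts)
    for i, seed in enumerate(texts):
        if labels[i] is not None:
            continue  # already claimed by an earlier seed
        labels[i] = i
        for j in range(i + 1, len(texts)):
            if labels[j] is None and score(seed, texts[j]) >= cutoff:
                labels[j] = i
    return labels


texts = ["reset user password", "replaced failed disk",
         "user password reset", "replaced failed disks"]
labels = group_labels(texts)
print(labels)
```

With a DataFrame, the labels could be stored as a new column (for instance `df["group"] = group_labels(list(df.userinput))`) and counted with `groupby`.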