Searching One Python Dataframe / Dictionary For Fuzzy Matches In Another Dataframe
I have the following pandas dataframe with 50,000 unique rows and 20 columns (included is a snippet of the relevant columns): df1: PRODUCT_ID PRODUCT_DESCRIPT
Solution 1:
Using fuzz.ratio as my distance metric (it returns a similarity score from 0 to 100, where higher means a closer match), calculate the distance matrix like this:
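import pandas as pd
from fuzzywuzzy import fuzz

# One row per df entry, one column per df2 entry.
# Note: get_value/set_value were removed in pandas 1.0 (see Solution 3).
df3 = pd.DataFrame(index=df.index, columns=df2.index)
for i in df3.index:
    for j in df3.columns:
        vi = df.get_value(i, 'PRODUCT_DESCRIPTION')
        vj = df2.get_value(j, 'PROD_DESCRIPTION')
        df3.set_value(i, j, fuzz.ratio(vi, vj))
print(df3)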
    0   1   2   3   4   5
0  63  15  24  23  34  27
1  26  84  19  21  52  32
2  18  31  33  12  35  34
3  10  31  35  10  41  42
4  29  52  32  10  42  12
5  15  28  21  49   8  55
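On recent pandas releases, where get_value and set_value are no longer available, the same matrix could be built with the .at accessor. This is only a minimal sketch, not the code from the answer above: the two small frames are made-up sample data, and score is a hypothetical helper that uses fuzz.ratio when fuzzywuzzy is installed and otherwise falls back to a difflib-based ratio as a stand-in.

```python
import difflib

import pandas as pd

try:
    from fuzzywuzzy import fuzz
    score = fuzz.ratio
except ImportError:
    # Stand-in (assumption): a 0-100 similarity built on the standard library.
    def score(a, b):
        return round(100 * difflib.SequenceMatcher(None, a, b).ratio())

# Made-up sample data; the column names follow the question.
df = pd.DataFrame({'PRODUCT_DESCRIPTION': ['red apple 1kg', 'green pear 500g']})
df2 = pd.DataFrame({'PROD_DESCRIPTION': ['apple red 1 kg', 'pear green 500 g', 'banana']})

# Ask for a float dtype up front so later reductions such as idxmax see numbers.
df3 = pd.DataFrame(index=df.index, columns=df2.index, dtype=float)
for i in df3.index:
    for j in df3.columns:
        df3.at[i, j] = score(df.at[i, 'PRODUCT_DESCRIPTION'],
                             df2.at[j, 'PROD_DESCRIPTION'])

print(df3)
```

Creating df3 with a numeric dtype is a design choice of this sketch; it sidesteps the object-dtype problem that Solution 3 describes.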
Set a threshold for an acceptable score; I used 50. Then find the index value (in df2) that has the maximum score for every row.
threshold = df3.max(1) > 50
idxmax = df3.idxmax(1)
Make assignments
df['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
df['PROD_DESCRIPTION'] = np.where(threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
df
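To see the threshold and assignment steps in isolation, here is a small sketch on hand-written scores. Everything in it is made up for illustration (the frames, the IDs, the numbers in scores), and it swaps np.where for Series.where, which blanks out the rows below the threshold in the same way.

```python
import pandas as pd

# Made-up frames; the column names follow the question.
df = pd.DataFrame({'PRODUCT_DESCRIPTION': ['red apple', 'green pear', 'blue widget']})
df2 = pd.DataFrame({'PROD_ID': [101, 102],
                    'PROD_DESCRIPTION': ['apple red', 'pear green']})

# Hand-written scores standing in for fuzz.ratio output:
# rows follow df.index, columns follow df2.index.
scores = pd.DataFrame([[80, 20],
                       [30, 90],
                       [10, 40]], index=df.index, columns=df2.index)

best_score = scores.max(axis=1)      # highest score in each row
best_match = scores.idxmax(axis=1)   # df2 index label that holds that score
is_match = best_score > 50           # rows that clear the threshold

# Look up the winning df2 row for every df row, then blank out weak matches.
picked = df2.loc[best_match].reset_index(drop=True)
picked.index = df.index
df['PROD_ID'] = picked['PROD_ID'].where(is_match)
df['PROD_DESCRIPTION'] = picked['PROD_DESCRIPTION'].where(is_match)

print(df)
```

The third row scores at most 40, so it stays unmatched and both new columns hold NaN for it.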
Solution 2:
You should be able to iterate over both dataframes and populate either a dict or a third dataframe with your desired information:
d = {
    'df1_id': [],
    'df1_prod_desc': [],
    'df2_id': [],
    'df2_prod_desc': [],
    'fuzzywuzzy_sim': []
}
for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        # Append to each list instead of overwriting it
        d['df1_id'].append(df1_row['PRODUCT_ID'])
        ...
df3 = pd.DataFrame.from_dict(d)
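The loop body is elided in the answer above, so the following is only one guess at a shape that would work, not the author's code. The sample frames are made up, the df2 column names are taken from Solution 1, and score is a hypothetical helper that uses fuzz.ratio when fuzzywuzzy is installed and a difflib-based stand-in otherwise.

```python
import difflib

import pandas as pd

try:
    from fuzzywuzzy import fuzz
    score = fuzz.ratio
except ImportError:
    # Stand-in (assumption): a 0-100 similarity built on the standard library.
    def score(a, b):
        return round(100 * difflib.SequenceMatcher(None, a, b).ratio())

# Made-up sample data.
df1 = pd.DataFrame({'PRODUCT_ID': [1, 2],
                    'PRODUCT_DESCRIPTION': ['red apple', 'green pear']})
df2 = pd.DataFrame({'PROD_ID': [101, 102, 103],
                    'PROD_DESCRIPTION': ['apple red', 'pear green', 'banana']})

d = {'df1_id': [], 'df1_prod_desc': [],
     'df2_id': [], 'df2_prod_desc': [], 'fuzzywuzzy_sim': []}

# Visit every (df1 row, df2 row) pair and record one output row per pair.
for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        d['df1_id'].append(df1_row['PRODUCT_ID'])
        d['df1_prod_desc'].append(df1_row['PRODUCT_DESCRIPTION'])
        d['df2_id'].append(df2_row['PROD_ID'])
        d['df2_prod_desc'].append(df2_row['PROD_DESCRIPTION'])
        d['fuzzywuzzy_sim'].append(
            score(df1_row['PRODUCT_DESCRIPTION'], df2_row['PROD_DESCRIPTION']))

df3 = pd.DataFrame.from_dict(d)
print(df3)
```

Because every pair is visited, the result has len(df1) * len(df2) rows, which is worth keeping in mind for a frame with 50,000 rows.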
Solution 3:
I don't have enough reputation to comment on the answer from @piRSquared, hence this answer.
- The definitions of 'vi' and 'vj' failed with an error (AttributeError: 'DataFrame' object has no attribute 'get_value'). It worked when I added a leading underscore, e.g. vi = df._get_value(i, 'PRODUCT_DESCRIPTION')
- The same issue occurred for 'set_value', and the same fix worked there too, e.g. df3._set_value(i, j, fuzz.ratio(vi, vj))
- Generating idxmax raised another error (TypeError: reduction operation 'argmax' not allowed for this dtype) because the contents of df3 (the fuzzy ratios) were of type 'object'. I converted all of them to numeric just before defining threshold and it worked, e.g. df3 = df3.apply(pd.to_numeric)
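The dtype fix from the last point can be sketched on its own. The scores below are made up, and the object dtype is requested explicitly here so the starting situation is the one described above.

```python
import pandas as pd

# Object dtype is set explicitly to reproduce the situation described above.
df3 = pd.DataFrame(index=[0, 1], columns=[0, 1, 2], dtype=object)

# Made-up scores standing in for fuzz.ratio output.
made_up_scores = [[63, 15, 24],
                  [26, 84, 19]]
for i in df3.index:
    for j in df3.columns:
        df3.at[i, j] = made_up_scores[i][j]

dtypes_before = df3.dtypes
print(dtypes_before)          # every column is still object

# Convert each column to a numeric dtype before any max/idxmax step.
df3 = df3.apply(pd.to_numeric)
print(df3.dtypes)

idxmax = df3.idxmax(axis=1)   # column label of the best score per row
print(idxmax)
```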
A million thanks to @piRSquared for the solution. For a Python novice like me, it worked like a charm. I am posting this answer to make it easy for other newbies like me.