
Searching One Python Dataframe / Dictionary For Fuzzy Matches In Another Dataframe

I have the following pandas dataframe with 50,000 unique rows and 20 columns (included is a snippet of the relevant columns): df1: PRODUCT_ID PRODUCT_DESCRIPT

Solution 1:

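Using fuzz.ratio as the similarity score (0 to 100, higher means more similar), build a score matrix with one row per row of df (df1 in the question) and one column per row of df2: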

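import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

# Rows are labelled by df's index, columns by df2's index
df3 = pd.DataFrame(index=df.index, columns=df2.index)

for i in df3.index:
    for j in df3.columns:
        # .at is the public scalar accessor (get_value/set_value were removed in pandas 1.0)
        vi = df.at[i, 'PRODUCT_DESCRIPTION']
        vj = df2.at[j, 'PROD_DESCRIPTION']
        df3.at[i, j] = fuzz.ratio(vi, vj)

print(df3)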

    0   1   2   3   4   5
0  63  15  24  23  34  27
1  26  84  19  21  52  32
2  18  31  33  12  35  34
3  10  31  35  10  41  42
4  29  52  32  10  42  12
5  15  28  21  49   8  55

Set a threshold for an acceptable score; I used 50.
Then, for every row, find the df2 index label with the highest score.

df3 = df3.apply(pd.to_numeric)  # cells are stored as object dtype; idxmax needs numbers
threshold = df3.max(axis=1) > 50
idxmax = df3.idxmax(axis=1)

Make assignments

df['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
df['PROD_DESCRIPTION'] = np.where(threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
df
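For anyone who wants to try the whole recipe end to end, here is a minimal sketch on made-up data. The toy frames are hypothetical, and the `ratio` helper is a stand-in built on the standard library's difflib so the snippet runs without fuzzywuzzy installed; its scores may differ slightly from fuzz.ratio.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def ratio(a, b):
    # Stand-in scorer (assumption): 0-100 similarity, used here in place of fuzz.ratio
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Hypothetical toy data, not taken from the question
df = pd.DataFrame({
    'PRODUCT_ID': [1, 2, 3],
    'PRODUCT_DESCRIPTION': ['red apple 1kg', 'green pear 500g', 'usb cable'],
})
df2 = pd.DataFrame({
    'PROD_ID': ['A', 'B'],
    'PROD_DESCRIPTION': ['green pears 500g', 'red apples 1kg'],
})

# Score every df row against every df2 row; building from a list of lists
# gives an integer matrix directly, so no dtype conversion is needed
scores = pd.DataFrame(
    [[ratio(a, b) for b in df2['PROD_DESCRIPTION']]
     for a in df['PRODUCT_DESCRIPTION']],
    index=df.index, columns=df2.index,
)

threshold = scores.max(axis=1) > 50   # is the best score good enough?
idxmax = scores.idxmax(axis=1)        # label of the best df2 row for each df row

# Rows whose best score is at or below the threshold get NaN
df['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
df['PROD_DESCRIPTION'] = np.where(
    threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
print(df)
```

With 50,000 rows the nested comparison is still quadratic, so it may be worth testing on a sample first.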



Solution 2:

You should be able to iterate over both dataframes and populate either a dict or a third dataframe with the information you want:

d = {
    'df1_id': [],
    'df1_prod_desc': [],
    'df2_id': [],
    'df2_prod_desc': [],
    'fuzzywuzzy_sim': []
}
for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        # append to the list; plain assignment would overwrite it on every pass
        d['df1_id'].append(df1_row['PRODUCT_ID'])
        ...
df3 = pd.DataFrame.from_dict(d)
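One possible way to flesh out this pattern is sketched below. It collects one record per pair in a list of dicts rather than a dict of lists, which is a small variation and not necessarily what the answer intended. The toy frames and the difflib-based `ratio` helper are assumptions for illustration; swap in fuzz.ratio if fuzzywuzzy is installed.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    # Stand-in scorer (assumption), used in place of fuzzywuzzy's fuzz.ratio
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Hypothetical toy frames; the column names follow the question
df1 = pd.DataFrame({'PRODUCT_ID': [1, 2],
                    'PRODUCT_DESCRIPTION': ['red apple 1kg', 'usb cable']})
df2 = pd.DataFrame({'PROD_ID': ['A', 'B'],
                    'PROD_DESCRIPTION': ['red apples 1kg', 'usb-c cable']})

# One record per (df1 row, df2 row) pair
records = []
for _, r1 in df1.iterrows():
    for _, r2 in df2.iterrows():
        records.append({
            'df1_id': r1['PRODUCT_ID'],
            'df1_prod_desc': r1['PRODUCT_DESCRIPTION'],
            'df2_id': r2['PROD_ID'],
            'df2_prod_desc': r2['PROD_DESCRIPTION'],
            # key name kept from the answer; the value comes from the stand-in scorer
            'fuzzywuzzy_sim': ratio(r1['PRODUCT_DESCRIPTION'],
                                    r2['PROD_DESCRIPTION']),
        })
pairs = pd.DataFrame(records)

# Keep the best-scoring df2 candidate for each df1 row
best = pairs.loc[pairs.groupby('df1_id')['fuzzywuzzy_sim'].idxmax()]
print(best)
```

The long format keeps every candidate pair, so you can inspect near misses or apply a threshold afterwards instead of only seeing the single best match.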

Solution 3:

I don't have enough reputation to comment on @piRSquared's answer, so I am posting these notes as a separate answer.

  • On recent pandas versions, df.get_value(...) fails with AttributeError: 'DataFrame' object has no attribute 'get_value' (the method was deprecated in pandas 0.21 and removed in 1.0). Adding a leading underscore worked for me, e.g. vi = df._get_value(i, 'PRODUCT_DESCRIPTION'). The public equivalent is vi = df.at[i, 'PRODUCT_DESCRIPTION'].
  • 'set_value' fails the same way, and the same fix works there too, e.g. df3._set_value(i, j, fuzz.ratio(vi, vj)), or with the public accessor, df3.at[i, j] = fuzz.ratio(vi, vj).
  • Computing idxmax raised another error (TypeError: reduction operation 'argmax' not allowed for this dtype) because the contents of df3 (the fuzzy ratios) had dtype 'object'. Converting them to numeric just before defining threshold fixed it, e.g. df3 = df3.apply(pd.to_numeric)
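The dtype issue in the last point can be reproduced in a few lines. This is a small sketch using four scores borrowed from the printed matrix in Solution 1; the variable name `all_numeric` is just for illustration.

```python
import pandas as pd

# A frame built from labels alone starts out with object dtype
df3 = pd.DataFrame(index=[0, 1], columns=[0, 1])
for i, row in enumerate([[63, 15], [26, 84]]):
    for j, score in enumerate(row):
        df3.at[i, j] = score      # cell-by-cell assignment keeps the object dtype

print(df3.dtypes)                 # dtypes before conversion

df3 = df3.apply(pd.to_numeric)    # convert column by column
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df3.dtypes)

print(df3.dtypes)                 # dtypes after conversion
print(df3.idxmax(axis=1))         # row-wise argmax on the numeric frame
```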

Many thanks to @piRSquared for the solution; it worked like a charm for a Python novice like me. I am posting this answer to make things easier for other newcomers.

