
Update Null Values In A Country Code Column Of A Data Frame By Matching Substrings Of Country Names Using Python

I have two data frames, Disaster and CountryInfo. Disaster has a Country_code column that contains some null values, for example: Disaster: 1.**Country** - **Country_code**

Solution 1:

This should do it. First, rename the columns so that both dataframes use the same column names. Then use the get_close_matches function from the difflib module to fuzzy-match the country names and replace each one with its closest counterpart. After that, it is a simple matter of merging the dataframes.

import pandas as pd
import numpy as np
import difflib

df1 = pd.DataFrame({'Country' : ['India', 'Afghanistan', 'India', 'United States of America'],
                    'Country_code' : ['Null', 'AFD', 'IND', 'Null']})
df1
                    Country Country_code
0                     India         Null
1               Afghanistan          AFD
2                     India          IND
3  United States of America         Null

df2 = pd.DataFrame({'Country' : ['India', 'Afghanistan', 'India', 'United States'],
                    'ISO' : ['IND', 'AFD', 'IND', 'USA']})
df2
          Country ISO
0          India  IND
1    Afghanistan  AFD
2          India  IND
3  United States  USA

df2.rename(columns={'ISO' : 'Country_code'}, inplace=True)
df2
         Country Country_code
0          India          IND
1    Afghanistan          AFD
2          India          IND
3  United States          USA

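The following code replaces each name in the Country column of df1 with the closest matching name from the Country column of df2, so that 'United States of America' becomes 'United States'. This is a way of performing a kind of "fuzzy join" on the substrings. Note that the [0] index raises an IndexError if a name has no sufficiently close match in df2.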

df1['Country'] = df1.Country.map(lambda x: difflib.get_close_matches(x, df2.Country)[0])
df1
         Country Country_code
0          India         Null
1    Afghanistan          AFD
2          India          IND
3  United States         Null
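
If some names might have no close match, the lookup can be wrapped in a small helper. The sketch below is only an illustration: closest_name and the known list are hypothetical names, not part of the original answer, and the 0.6 cutoff is simply the default used by difflib.get_close_matches.

```python
import difflib

def closest_name(name, candidates, cutoff=0.6):
    """Return the closest candidate to `name`, or `name` unchanged if none is close enough."""
    # n=1 asks for the single best match; an empty list means nothing reached the cutoff.
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else name

# Hypothetical reference names, mirroring df2.Country in the example above.
known = ['India', 'Afghanistan', 'United States']

print(closest_name('United States of America', known))
print(closest_name('Peru', known))
```

With this helper, the mapping step could be written as df1.Country.map(lambda x: closest_name(x, df2.Country)), which leaves unmatched names as they are instead of raising an error.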

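Now you can merge the dataframes. Because this is a right merge keyed on both Country and Country_code, the result keeps every row of df2 (which always has a code) and drops the df1 rows whose Country_code is still 'Null', so no missing codes remain in the output.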

df1.merge(df2, how='right', on=['Country', 'Country_code'])

         Country Country_code
0    Afghanistan          AFD
1          India          IND
2          India          IND
3  United States          USA
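
The merge above returns the rows of df2 rather than filling df1 in place. If df1 has other columns or a row order that must be kept, one alternative is to look the codes up by country name and fill only the missing ones. This is a minimal sketch, assuming the country names have already been normalised as shown earlier and that missing codes are stored as the string 'Null' (as in the sample data); the names lookup and missing are introduced here for illustration.

```python
import pandas as pd

# df1 after the fuzzy-matching step; df2 after the rename step.
df1 = pd.DataFrame({'Country': ['India', 'Afghanistan', 'India', 'United States'],
                    'Country_code': ['Null', 'AFD', 'IND', 'Null']})
df2 = pd.DataFrame({'Country': ['India', 'Afghanistan', 'India', 'United States'],
                    'Country_code': ['IND', 'AFD', 'IND', 'USA']})

# Build a country -> code lookup; duplicate countries in df2 are dropped first
# so that the index is unique.
lookup = df2.drop_duplicates('Country').set_index('Country')['Country_code']

# Treat both real NaN values and the placeholder string 'Null' as missing.
missing = df1['Country_code'].isna() | df1['Country_code'].eq('Null')

# Fill only the missing rows, leaving existing codes and the row order untouched.
df1.loc[missing, 'Country_code'] = df1.loc[missing, 'Country'].map(lookup)

print(df1)
```

Unlike the right merge, this keeps exactly the rows of df1, so it also behaves sensibly when df1 contains countries or columns that df2 does not.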
