Update Null Values In Countrycode Column In A Data Frame By Matching Substring Of Country Name Using Python
I have two data frames: Disaster and CountryInfo. Disaster has a country code column that contains some null values, for example: Disaster: 1.**Country** - **Country_code**
Solution 1:
This should do it. First, change the column names with rename so that both dataframes have the same column names. Then the get_close_matches function from the difflib module can be used to fuzzy-match and replace the Country names. After that, it is a simple matter of merging the dataframes.
import pandas as pd
import numpy as np
import difflib
df1 = pd.DataFrame({'Country' : ['India', 'Afghanistan', 'India', 'United States of America'],
'Country_code' : ['Null', 'AFD', 'IND', 'Null']})
df1
Country Country_code
0 India Null
1 Afghanistan AFD
2 India IND
3 United States of America Null
df2 = pd.DataFrame({'Country' : ['India', 'Afghanistan', 'India', 'United States'],
'ISO' : ['IND', 'AFD', 'IND', 'USA']})
df2
Country ISO
0 India IND
1 Afghanistan AFD
2 India IND
3 United States USA
df2.rename(columns={'ISO' : 'Country_code'}, inplace=True)
df2
Country Country_code
0 India IND
1 Afghanistan AFD
2 India IND
3 United States USA
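The following code replaces each name in the Country column of df1 with its closest match from the Country column of df2. This is a way of performing a kind of "fuzzy join" on the country names. Note that get_close_matches returns an empty list when no candidate reaches its similarity cutoff (0.6 by default), in which case the [0] index raises an IndexError.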
df1['Country'] = df1.Country.map(lambda x: difflib.get_close_matches(x, df2.Country)[0])
df1
Country Country_code
0 India Null
1 Afghanistan AFD
2 India IND
3 United States Null
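To see why "United States of America" is matched to "United States", it can help to look at the similarity score directly. The snippet below is a small illustrative sketch using the example names from this answer; the candidates list and the cutoff of 0.9 are chosen here for demonstration only.

```python
import difflib

# Candidate names, taken from the Country column of df2 in the example.
candidates = ['India', 'Afghanistan', 'United States']

# ratio() is 2*M/T, where M is the number of matched characters and
# T is the combined length of both strings.
score = difflib.SequenceMatcher(
    None, 'United States of America', 'United States'
).ratio()
print(round(score, 2))

# get_close_matches returns a list with the best match first.
default_match = difflib.get_close_matches(
    'United States of America', candidates, n=1
)
print(default_match)

# With a stricter cutoff, nothing qualifies and the list is empty,
# so indexing it with [0] would raise an IndexError.
strict_match = difflib.get_close_matches(
    'United States of America', candidates, n=1, cutoff=0.9
)
print(strict_match)
```

The score sits above the default cutoff of 0.6 but below 0.9, which is why the first call finds a match and the second does not.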
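Now you can merge the dataframes. Because both Country and Country_code are join keys and how='right' keeps every row of df2, the result consists of the rows of df2, which carry the complete codes. It matches the desired table here only because df2 lists the same countries as df1; the rows of df1 themselves are not updated in place.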
df1.merge(df2, how='right', on=['Country', 'Country_code'])
Country Country_code
0 Afghanistan AFD
1 India IND
2 India IND
3 United States USA
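If df1 has to keep its own rows and its original country names, one alternative is to fill only the missing codes through a lookup. The following is a minimal sketch rather than the method used in this answer: closest_name and code_by_name are hypothetical names introduced here, and it assumes the missing values are stored as the literal string 'Null', as in the example data.

```python
import difflib

import pandas as pd

df1 = pd.DataFrame({
    'Country': ['India', 'Afghanistan', 'India', 'United States of America'],
    'Country_code': ['Null', 'AFD', 'IND', 'Null'],
})
df2 = pd.DataFrame({
    'Country': ['India', 'Afghanistan', 'India', 'United States'],
    'ISO': ['IND', 'AFD', 'IND', 'USA'],
})

# One code per country name; duplicate names in df2 are dropped first.
code_by_name = df2.drop_duplicates('Country').set_index('Country')['ISO']
names = code_by_name.index.tolist()


def closest_name(name):
    """Return the closest name in df2, or None if nothing reaches the cutoff."""
    matches = difflib.get_close_matches(name, names, n=1, cutoff=0.6)
    return matches[0] if matches else None


# Assumption: the placeholder string 'Null' marks a missing code.
codes = df1['Country_code'].mask(df1['Country_code'] == 'Null')

# Look up a code for every row via its closest name, then fill only the gaps.
looked_up = df1['Country'].map(closest_name).map(code_by_name)
df1['Country_code'] = codes.fillna(looked_up)
print(df1)
```

Codes that were already present in df1 are left untouched, and a country with no sufficiently close match keeps a missing value instead of raising an error.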