Extract Numbers, Letters, Or Punctuation From Left Side Of String Column In Python
Say I have the following data frame, which comes from OCR. Its company_info column contains numbers, letters, or punctuation on the left, followed by Chinese characters, and I want to extract the left-hand part:
import pandas as pd
data = '''\
Solution 1:
Use Series.str.extract together with DataFrame.pop to extract the column:
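# Group 1: leading run of ASCII characters (digits, letters, punctuation, spaces).
# Group 2: everything from the first Chinese character to the end of the string.
pat = r'([\x00-\x7F]+)([\u4e00-\u9fff]+.*$)'
# pop() removes the original column; the two captured groups become the new columns.
df[['office_name','company_info']] = df.pop('company_info').str.extract(pat)
print(df)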
   id   office_name         company_info
0   1         05B01  北京企商联登记注册代理事务所(通合伙)
1   2    Unit-D 608     华夏启商(北京企业管理有限公司)
2   3     1004-1005       北京中睿智诚商业管理有限公司
3   4    17/F(1706)        北京美泰德商务咨询有限公司
4   5   A2006~A2007        北京新曙光会计服务有限公司
5   6       2906-10          中国建筑与室内设计师网
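Because the original data string is cut off above, here is a minimal self-contained sketch of the same idea on two made-up rows. The sample values and the `parts` variable are assumptions for illustration only, not the original data. It also strips the trailing space that the ASCII group captures before the first Chinese character.

```python
import pandas as pd

# Hypothetical sample rows (NOT the original OCR data, which is truncated above).
df = pd.DataFrame({
    "id": [1, 2],
    "company_info": [
        "05B01 北京美泰德商务咨询有限公司",
        "Unit-D 608 北京新曙光会计服务有限公司",
    ],
})

# Same pattern as in Solution 1: ASCII prefix, then the Chinese part.
pat = r'([\x00-\x7F]+)([\u4e00-\u9fff]+.*$)'

# With two unnamed groups, str.extract returns a DataFrame with columns 0 and 1.
parts = df["company_info"].str.extract(pat)
parts.columns = ["office_name", "company_info"]

# The ASCII class also matches the space before the Chinese text, so strip it.
parts["office_name"] = parts["office_name"].str.strip()

print(parts)
```

Rows that do not match the pattern come back as NaN in both columns, so it may be worth checking for missing values before overwriting the original column.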
Solution 2:
You can use this regular expression:
^(\d+),\s+([^\u4e00-\u9fff]+).*$
^ - Start of string
(\d+) - Matches one or more digits (capture group 1)
, - Matches a literal comma
\s+ - Matches one or more whitespace characters
([^\u4e00-\u9fff]+) - Matches one or more characters that are not Chinese characters (capture group 2)
.* - Matches any character except a newline, zero or more times
$ - End of string
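As a rough sketch of how this pattern could be applied, the snippet below runs it with Python's re module on a single raw text line. It assumes each line has the form "id, company_info", which is what the leading ^(\d+),\s+ suggests. The helper name split_line and the sample line are hypothetical.

```python
import re

# Pattern from Solution 2: id, comma, whitespace, then the non-Chinese prefix.
PATTERN = re.compile(r'^(\d+),\s+([^\u4e00-\u9fff]+).*$')

def split_line(line):
    """Return (id, office_name) for a line like 'id, company_info', or None."""
    m = PATTERN.match(line)
    if m is None:
        return None
    # Group 2 includes the space before the first Chinese character; strip it.
    return m.group(1), m.group(2).strip()

# Hypothetical input line (assumed format, not taken from the original data).
result = split_line("1, 05B01 北京美泰德商务咨询有限公司")
print(result)
```

Unlike Solution 1, this pattern does not capture the Chinese part, so it only fits when the left-hand prefix is all that is needed.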