Python: How To Remove Range Of Characters \x91\x87\xf0\x9f\x91\x87 From File
Solution 1:
Your string is valid utf-8
. Therefore it can be directly converted to a python string.
You can then encode it to ascii
with str.encode(). It can ignore non-ascii characters with 'ignore'
.
Also possible: 'replace'
line_raw = b'Who\xe2\x80\x99s he?'
line = line_raw.decode('utf-8')
print(repr(line))
print(line.encode('ascii', 'ignore'))
print(line.encode('ascii', 'replace'))
'Who’s he?'
b'Whos he?'
b'Who?s he?'
To come back to your original question, your 3rd method was correct. It was just in the wrong order.
code3 = line.decode("utf-8").encode('ascii', 'ignore')
print(code3)
To finally provide a working pandas example, here you go:
import pandas
df = pandas.read_csv('test.csv', encoding="utf-8")
for index, row in df.iterrows():
print(row['text'].encode('ascii', 'ignore'))
There is no need to do decode('utf-8')
, because pandas does that for you.
Finally, if you have a python string that contains non-ascii characters, you can just strip them by doing
text = row['text'].encode('ascii', 'ignore').decode('ascii')
This converts the text to ascii bytes, strips all the characters that cannot be represented as ascii, and then converts back to text.
You should look up the difference between python3 strings and bytes, that should clear things up for you, I hope.
Post a Comment for "Python: How To Remove Range Of Characters \x91\x87\xf0\x9f\x91\x87 From File"