Python: How To Remove Range Of Characters \x91\x87\xf0\x9f\x91\x87 From File

June 28, 2023

I have this file with some lines that contain some unicode literals like: 'b'Who\xe2\x80\x99s he?\n\nA fan rushed the field to join the Cubs\xe2\x80\x99 celebration after Jake Arr

Solution 1:

Your bytes are valid UTF-8, so they can be decoded directly into a Python string.

You can then encode that string to ASCII with str.encode(). Passing 'ignore' as the error handler drops every non-ASCII character.

'replace' is also possible: it substitutes a question mark for each character that cannot be encoded.

line_raw = b'Who\xe2\x80\x99s he?'

# bytes -> str
line = line_raw.decode('utf-8')
print(repr(line))

# str -> ASCII bytes, dropping or replacing what ASCII cannot represent
print(line.encode('ascii', 'ignore'))
print(line.encode('ascii', 'replace'))

Output:

'Who’s he?'
b'Whos he?'
b'Who?s he?'
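For comparison, here is a minimal sketch that runs the same string through four of the standard error handlers accepted by str.encode(). The two extra handlers ('backslashreplace' and 'xmlcharrefreplace') are not part of the answer above; they are shown only as alternatives for cases where you would rather keep a visible trace of the removed character. The `results` name is just for illustration.

```python
# Compare several built-in error handlers of str.encode().
line = b'Who\xe2\x80\x99s he?'.decode('utf-8')

handlers = ('ignore', 'replace', 'backslashreplace', 'xmlcharrefreplace')

# Map each handler name to the ASCII bytes it produces.
results = {handler: line.encode('ascii', handler) for handler in handlers}

for handler, encoded in results.items():
    print(f'{handler:>18}: {encoded!r}')
```

'ignore' is the one to use when the goal is simply to remove the characters. The others are useful mainly for debugging or for output that has to stay reversible.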

To come back to your original question: your third method had the right idea. The calls were just in the wrong order.

# line_raw is the bytes object from above: decode first, then encode
code3 = line_raw.decode("utf-8").encode('ascii', 'ignore')
print(code3)
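Why the order matters can be sketched as follows. The original attempt is not shown in full here, so this assumes it called encode() before decode(). In Python 3, a bytes object has no encode() method, so that order fails immediately. The `wrong_order_failed` flag is a hypothetical name used only for this demonstration.

```python
line_raw = b'Who\xe2\x80\x99s he?'  # bytes, e.g. read from a file opened in binary mode

# Assumed wrong order: bytes objects have no .encode() method in Python 3.
try:
    line_raw.encode('ascii', 'ignore')
    wrong_order_failed = False
except AttributeError:
    wrong_order_failed = True

# Right order: bytes -> str (decode), then str -> ASCII bytes (encode).
code3 = line_raw.decode('utf-8').encode('ascii', 'ignore')

print(wrong_order_failed)
print(code3)
```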

Here is a working pandas example:

import pandas

df = pandas.read_csv('test.csv', encoding="utf-8")
for index, row in df.iterrows():
    print(row['text'].encode('ascii', 'ignore'))

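There is no need to call decode('utf-8') yourself, because pandas already decodes the file when it reads it with encoding="utf-8". The values in the 'text' column are Python strings, not bytes.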

Finally, if you have a Python string that contains non-ASCII characters, you can strip them like this:

text = row['text'].encode('ascii', 'ignore').decode('ascii')

This encodes the text to ASCII bytes, dropping every character that cannot be represented in ASCII, and then decodes those bytes back to text.
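Applied to a plain text file rather than a DataFrame, the same round trip might look like the sketch below. The `strip_non_ascii` helper, the temporary file, and its sample content are all made up for illustration. A real script would open its own input file instead.

```python
import os
import tempfile


def strip_non_ascii(text: str) -> str:
    """Drop every character that has no ASCII representation."""
    return text.encode('ascii', 'ignore').decode('ascii')


# Build a small sample file (hypothetical content) so the sketch is runnable.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as src:
    src.write('Who\u2019s he?\nLook here \U0001F447\U0001F447\n')
    src_path = src.name

# Open in text mode so Python decodes the UTF-8 bytes, then clean each line.
with open(src_path, encoding='utf-8') as fh:
    cleaned = [strip_non_ascii(line) for line in fh]

os.remove(src_path)
print(cleaned)
```

Opening the file with encoding='utf-8' plays the same role as the read_csv encoding argument: the lines arrive as str, so only the encode/decode round trip is needed.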

It is worth reading up on the difference between strings (str) and bytes in Python 3. That should clear things up.
