How To Replace Invalid Unicode Characters In A String In Python?

As far as I know, Python's design is that a string contains only valid characters, but in my case the OS delivers path names with invalid encodings, which I have to de

Solution 1:

If you have a bytestring (undecoded data), use the 'replace' error handler. For example, if your data is (mostly) UTF-8 encoded, then you could use:

decoded_unicode = bytestring.decode('utf-8', 'replace')

and U+FFFD � REPLACEMENT CHARACTER characters will be inserted for any bytes that can't be decoded.

If you wanted to use a different replacement character, it is easy enough to replace these afterwards:

decoded_unicode = decoded_unicode.replace(u'\ufffd', '#')

Demo:

>>> bytestring = 'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'
>>> bytestring.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mjpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 5: invalid start byte
>>> bytestring.decode('utf8', 'replace')
u'F\xf8\xf6\ufffdB\xe5r'
>>> print bytestring.decode('utf8', 'replace')
Føö�Bår
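
On Python 3 the same idea can be sketched as follows. This is a minimal illustration rather than part of the original answer: it assumes the data arrives as a bytes object, and the names raw, decoded and cleaned are made up for the example.

```python
# The same bytes as in the demo above, written as a Python 3 bytes literal.
# 0xbb is not a valid UTF-8 start byte, so strict decoding would fail.
raw = b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'

# 'replace' substitutes U+FFFD for the undecodable byte.
decoded = raw.decode('utf-8', errors='replace')

# Optionally swap U+FFFD for a replacement character of your choice.
cleaned = decoded.replace('\ufffd', '#')

print(repr(decoded))
print(cleaned)
```

Passing errors='ignore' instead of 'replace' would drop the bad byte silently rather than marking where it was.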

Solution 2:

Thanks for your comments. With them I was able to implement a better solution:

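import codecs

def check_utf8(s):  # hypothetical wrapper name; the original showed only the body
    # Return (ok, text, error). On failure, unencodable characters are replaced.
    try:
        codecs.encode(s, "utf-8")
        return (True, s, None)
    except UnicodeError as e:
        ret = codecs.decode(codecs.encode(s, "utf-8", "replace"), "utf-8")
        return (False, ret, e)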

Please share any improvements on that solution. Thank you!
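
To show how this encode-and-replace check behaves on Python 3, here is a small sketch. It assumes the path name was decoded with the surrogateescape error handler, which is how Python 3 decodes file names on POSIX systems; the decoding is simulated here with an explicit call, and the names name, is_valid and safe are made up for the example.

```python
# Simulate a file name containing a byte (0xbb) that is not valid UTF-8.
# Assumption: the name was decoded with 'surrogateescape', as Python 3
# does for file names on POSIX systems.
name = b'F\xc3\xb8\xbbB'.decode('utf-8', errors='surrogateescape')

try:
    name.encode('utf-8')
    is_valid = True
except UnicodeEncodeError:
    # The undecodable byte became a lone surrogate, which strict UTF-8 rejects.
    is_valid = False

# Encoding with 'replace' substitutes '?' for the unencodable character.
safe = name.encode('utf-8', errors='replace').decode('utf-8')

print(is_valid, safe)
```

Note that on the encoding side the 'replace' handler inserts '?' rather than U+FFFD, so the result differs from what decoding raw bytes with 'replace' would give.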

Solution 3:

You have not given an example. Therefore, I have considered one example to answer your question.

x = 'This is a cat which looks good 😊'
print x
x.replace('😊', '')

The output is:

This is a cat which looks good 😊
'This is a cat which looks good '

Solution 4:

The right way to do it (at least in Python 2) is to use unicodedata.normalize:

unicodedata.normalize('NFKD', text).encode('utf-8', 'ignore')

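Calling decode('utf-8', 'ignore') on a unicode object that contains non-ASCII characters will just raise an exception, because Python 2 first implicitly encodes the text as ASCII.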
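
To make the effect of the normalization step concrete, here is a small Python 3 sketch. It is an illustration only: the sample text is made up, and the final ASCII encoding is an assumption added for comparison, since the answer itself encodes to UTF-8.

```python
import unicodedata

text = 'B\xe5r'  # 'Bår', sample input chosen for this example

# NFKD splits 'å' into 'a' followed by U+030A COMBINING RING ABOVE.
decomposed = unicodedata.normalize('NFKD', text)

# UTF-8 can represent the combining mark, so nothing is dropped here.
utf8_bytes = decomposed.encode('utf-8', 'ignore')

# Assumption (not in the original answer): encoding to ASCII instead
# discards the combining mark and keeps only the base letter.
ascii_bytes = decomposed.encode('ascii', 'ignore')

print(repr(decomposed), utf8_bytes, ascii_bytes)
```

Characters without a decomposition, such as 'ø', are left unchanged by NFKD, so they would be dropped entirely by the ASCII variant.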
