Running Python 2.7 Code With Unicode Characters In Source
Solution 1:
Your idea is generally sound, but it will break in Python 3 and will cause headaches when you manipulate and write your strings in Python 2.
It's a good idea to use Unicode strings, not regular byte strings, when dealing with non-ASCII text.
Instead, you can encode your characters as Unicode (not UTF-8) escape sequences in Unicode strings.
u'na\xefve'
u'\u7537\u5b69'
Note the u prefix.
Your code is now encoding agnostic.
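A minimal Python 2 sketch of this approach (assuming a terminal that can display the characters); because the escapes keep the source pure ASCII, no coding declaration is needed:

# Pure-ASCII source: no coding declaration required.
print u'na\xefve'        # the word "naive" with a diaeresis on the i
print u'\u7537\u5b69'    # two CJK characters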
Solution 2:
If you only use byte strings, and save your source file encoded as UTF-8, your byte strings will contain UTF-8-encoded data. Note that Python 2.7 still insists on the coding statement as soon as the file contains any non-ASCII bytes (and it is REALLY strange that you don't want to use it... it's just a comment). The coding statement lets Python know the encoding of the source file, so it can decode Unicode string literals (u'xxxxx') correctly. If you have no Unicode strings, it doesn't change the bytes you get.
For your questions, no need to convert to escape codes. If you encode the file as UTF-8, you can use the more readable characters in your byte strings.
FYI, that won't work for Python 3, because byte strings cannot contain non-ASCII in that version.
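A minimal Python 2 sketch of this setup (assuming the file itself is saved as UTF-8):

# -*- coding: utf-8 -*-
s = 'naïve'      # a byte string; saved as UTF-8, it holds the bytes 'na\xc3\xafve'
print s          # a UTF-8 terminal displays the accented word
print repr(s)    # 'na\xc3\xafve'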
That said, here's some code that will convert your example as requested. It reads the source assuming it is encoded in UTF-8, then uses a regular expression to locate all non-ASCII characters and passes them through a conversion function to generate the replacement. This should be safe, since in Python 2 non-ASCII can only appear in comments and string literals. Python 3, however, allows non-ASCII in identifiers, so this wouldn't work there.
import io
import re

def escape(m):
    # encode the matched character to UTF-8 and emit one \xNN escape per byte
    char = m.group(0).encode('utf8')
    return ''.join(r'\x{:02x}'.format(ord(b)) for b in char)

with io.open('sample.py', encoding='utf8') as f:
    content = f.read()

new_content = re.sub(r'[^\x00-\x7f]', escape, content)

with io.open('sample_new.py', 'w', encoding='utf8') as f:
    f.write(new_content)
Result:
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9

def fxn():
    print 'na\xc3\xafve'
    print '\xe7\x94\xb7\xe5\xad\xa9'
fxn()
Solution 3:
Question 1:
Try using:
print u'naïve'
print u'长者'
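Note, as a sketch: in Python 2 this only compiles if the source declares its encoding, since the file now contains non-ASCII characters:

# -*- coding: utf-8 -*-
# The coding line tells Python how to decode these literals.
print u'naïve'
print u'长者'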
Question 2:
If you type the sentences with your keyboard and a Chinese input method, everything should be OK. But if you copy and paste sentences from some web pages, you should consider other encodings such as GBK, GB2312, and GB18030.
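For instance, a Python 2 sketch (the file name is hypothetical) of decoding pasted GBK data before treating it as Unicode:

# A sketch: bytes pasted from a GBK page must be decoded with that codec.
with open('pasted_from_web.txt', 'rb') as f:
    data = f.read()
text = data.decode('gbk')      # raises UnicodeDecodeError if the codec is wrong
print text.encode('utf-8')     # re-encode for a UTF-8 terminal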
Solution 4:
This Python 3 snippet should convert your program to work correctly in Python 2.
def convertchar(char):  # converts individual characters
    if 32 <= ord(char) <= 126 or char == "\n":
        return char  # if normal character, return it
    h = hex(ord(char))[2:]
    if ord(char) < 256:  # if unprintable ASCII
        h = "0" * (2 - len(h)) + h  # zero-pad: the escape needs exactly two hex digits
        return "\\x" + h
    elif ord(char) < 65536:  # if short unicode
        h = "0" * (4 - len(h)) + h
        return "\\u" + h
    else:  # if long unicode
        h = "0" * (8 - len(h)) + h
        return "\\U" + h

def converttext(text):  # converts a chunk of text
    newtext = ""
    for char in text:
        newtext += convertchar(char)
    return newtext

def convertfile(oldfilename, newfilename):  # converts a file
    oldfile = open(oldfilename, "r", encoding="utf-8")  # read the UTF-8 source explicitly
    oldtext = oldfile.read()
    oldfile.close()
    newtext = converttext(oldtext)
    newfile = open(newfilename, "w", encoding="utf-8")
    newfile.write(newtext)
    newfile.close()

convertfile("FILE_TO_BE_CONVERTED", "FILE_TO_STORE_OUTPUT")
Solution 5:
First, a simple remark: as you are using byte strings in a Python 2 script, the # -*- coding: utf-8 -*- has simply no effect. It only helps to convert a source byte string to a Unicode string, as if you had written:
# -*- coding: utf-8 -*-
...
utxt = u'naïve'  # the source bytes are 'na\xc3\xafve', but utxt must become the unicode string u'na\xefve'
At most, it might be interpreted by clever editors to automatically use a UTF-8 charset.
Now for the actual question. Unfortunately, what you are asking for is not really trivial: identifying what is in a comment and what is in a string in a source file requires a Python parser... And AFAIK, if you use the parser from the ast module you will lose your comments, except for docstrings.
But in Python 2, non-ASCII characters are only allowed in comments and in literal strings! So you can safely assume that if the source file is a correct Python 2 script containing no literal unicode strings(*), you can safely transform any non-ASCII character into its Python escape representation.
A possible Python function that reads a raw source file from a file object and writes it, after escaping, to another file object could be:
def src_encode(infile, outfile):
    while True:
        c = infile.read(1)
        if len(c) < 1:
            break  # stop on end of file
        if ord(c) > 127:  # transform high characters
            c = "\\x{:02x}".format(ord(c))
        outfile.write(c)
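Example usage in Python 2 (a sketch; the file names are hypothetical):

with open('sample.py', 'rb') as fin, open('sample_new.py', 'wb') as fout:
    src_encode(fin, fout)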
A nice property is that it works whatever encoding you use, provided the source file is acceptable to a Python interpreter and contains no high characters in unicode literals(*), and the converted file will behave exactly the same as the original one...
(*) A problem will arise if you use unicode literals in an encoding other than Latin-1, because the above function behaves as if the file contained the declaration # -*- coding: Latin1 -*-: u'é' will be translated correctly as u'\xe9' if the original encoding is Latin-1, but as u'\xc3\xa9' (not what is expected...) if the original encoding is UTF-8, and I cannot imagine a way to process both literal byte strings and unicode literals correctly without fully parsing the source file...
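To see why, a short Python 2 sketch of the two decodings of the same two bytes:

print repr('\xc3\xa9'.decode('latin1'))  # u'\xc3\xa9' -- what src_encode effectively assumes
print repr('\xc3\xa9'.decode('utf8'))    # u'\xe9'     -- the intended character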