python - Converting double slash utf-8 encoding


I cannot get this to work! I have a text file, produced by a save game file parser, with a bunch of UTF-8 Chinese names stored in byte form in source.txt:

\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89

But no matter how I import it into Python (3 or 2), at best I get the string:

\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89

I have tried, as other threads have suggested, to re-encode the string as UTF-8 and then decode it with unicode_escape, like so:

stringname.encode("utf-8").decode("unicode_escape") 

But that messes up the original encoding and gives me the string:

'æ\x89\x8eå\x8a\xa0æ\x8b\x89' (printing the string results in: æå æ )
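As an aside, that mojibake is still recoverable: decode('unicode_escape') turned each \xNN escape into the code point U+00NN, so encoding those code points back to raw bytes with latin-1 and then decoding as UTF-8 fixes it. This is a sketch of that chain, not something from the original thread:

```python
# The string as read from source.txt: literal backslashes, not real bytes.
stringname = '\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89'

# unicode_escape turns each \xNN escape into the code point U+00NN,
# which is where the mojibake comes from.
mojibake = stringname.encode('utf-8').decode('unicode_escape')

# latin-1 maps code points U+0000..U+00FF straight back to single bytes,
# so the result decodes correctly as UTF-8.
fixed = mojibake.encode('latin-1').decode('utf-8')
print(fixed)  # 扎加拉
```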

Now, if I manually copy and paste b plus the original string and decode that, I get the correct result. For example:

b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'.decode("utf-8") 

results in: '扎加拉'

But I can't do that programmatically; I can't get rid of the double slashes.

To be clear, source.txt contains single backslashes. I have tried importing it in many ways; the most common is:

with open('source.txt', 'r', encoding='utf-8') as f_open:
    source = f_open.read()

Okay, I accepted the answer below (I think), and here is what works:

from ast import literal_eval

decodedstring = literal_eval("b'{}'".format(stringvariable)).decode('utf-8')

I can't use this on the whole file because of other encoding issues, but extracting each name string (stringvariable) and doing this works! Thank you!
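That per-field approach can be sketched roughly like this; the regex, the sample line, and the variable names are illustrative, not from the question:

```python
import re
from ast import literal_eval

# One line of the parsed file containing an escaped UTF-8 field (illustrative).
line = r"'m_hero': '\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89',"

# Grab just the run of \xNN escapes between the quotes.
match = re.search(r"'m_hero': '((?:\\x[0-9a-f]{2})+)'", line)
if match:
    stringvariable = match.group(1)
    # Evaluate the escapes as a bytes literal, then decode as UTF-8.
    name = literal_eval("b'{}'".format(stringvariable)).decode('utf-8')
    print(name)  # 扎加拉
```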

To be even more clear, the original file is not made up entirely of these messed-up UTF encodings; it only uses them for certain fields. For example, here is the beginning of the file:

{'m_cachehandles': ['s2ma\x00\x00cn\x1f\x1b"\x8d\xdb\x1fr \\\xbf\xd4d\x05r\x87\x10\x0b\x0f9\x95\x9b\xe8\x16t\x81b\xe4\x08\x1e\xa8u\x11',
                    's2ma\x00\x00cn\x1a\xd9l\x12n\xb9\x8al\x1d\xe7\xb8\xe6\xf8\xaa\xa1s\xdb\xa5+\t\xd3\x82^\x0c\x89\xdb\xc5\x82\x8d\xb7\x0fv',
                    's2ma\x00\x00cn\x92\xd8\x17d\xc1d\x1b\xf6(\xedj\xb7\xe9\xd1\x94\x85\xc8`\x91m\x8btz\x91\xf65\x1f\xf9\xdc\xd4\xe6\xbb',
                    's2ma\x00\x00cn\xa1\xe9\xab\xcd?\xd2ps\xc9\x03\xab\x13r\xa6\x85u7(k2\x9d\x08\xb8k+\xe2\xdei\xc3\xab\x7fc',
                    's2ma\x00\x00cnn\xa5\xe7\xaf\xa0\x84\xe5\xbc\xe9hx\xb93s*sj\xe3\xf8\xe7\x84`\xf1ye\x15~\xb93\x1f\xc90',
                    's2ma\x00\x00cn8\xc6\x13f\x19\x1f\x97ah\xfa\x81m\xac\xc9\xa6\xa8\x90s\xfdd\x06\rl]z\xbb\x15\xdci\x93\xd3v'],
 'm_campaignindex': 0,
 'm_defaultdifficulty': 7,
 'm_description': '',
 'm_difficulty': '',
 'm_gamespeed': 4,
 'm_imagefilepath': '',
 'm_isblizzardmap': True,
 'm_mapfilename': '',
 'm_minisave': False,
 'm_modpaths': None,
 'm_playerlist': [{'m_color': {'m_a': 255, 'm_b': 255, 'm_g': 92, 'm_r': 36},
                   'm_control': 2,
                   'm_handicap': 0,
                   'm_hero': '\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89',

All of the information before the 'm_hero': field is not UTF-8. Using ShadowRanger's solution works if the file is made up entirely of these fake UTF encodings, but it doesn't work once I have parsed out the m_hero string and try to convert just that. Karin's solution does work for that.

I'm assuming you're using Python 3. In Python 2, strings are bytes by default, so this would just work for you. In Python 3, strings are Unicode and are interpreted as Unicode, which makes this problem harder if you have a byte string being read as Unicode.
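For orientation, here is a minimal illustration of that Python 3 bytes/str split (this snippet is mine, not from the answer):

```python
# bytes: what Python 2 called str; holds the raw UTF-8 encoding.
raw = b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'

# str: Unicode text in Python 3; produced by decoding the bytes.
text = raw.decode('utf-8')

assert isinstance(raw, bytes) and isinstance(text, str)
print(text)  # 扎加拉
```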

This solution was inspired by mgilson's answer. You can literally evaluate the Unicode string as a byte string by using literal_eval:

from ast import literal_eval

with open('source.txt', 'r', encoding='utf-8') as f_open:
    source = f_open.read()
    string = literal_eval("b'{}'".format(source)).decode('utf-8')
    print(string)  # 扎加拉
