I cannot get this to work! I have a text file from a save-game file parser with a bunch of UTF-8 Chinese names in it in byte form, like this in source.txt:
\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89
But no matter how I import it into Python (3 or 2), I get this string at best:
\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89
I have tried, as other threads have suggested, re-encoding the string as UTF-8 and then decoding it with unicode_escape, like so:
stringname.encode("utf-8").decode("unicode_escape")
But that messes up the original encoding and gives this as the string:
'æ\x89\x8eå\x8a\xa0æ\x8b\x89' (printing string results in: æå æ )
Now, if I manually copy and paste b plus the original string into the code and decode it, I get the correct text. For example:
b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'.decode("utf-8")
This results in: '扎加拉'
But I can't do this programmatically. I can't even get rid of the double slashes.
To be clear, source.txt contains single backslashes. I have tried importing it in many ways, but this is the most common:
with open('source.txt', 'r', encoding='utf-8') as f_open:
    source = f_open.read()
Okay, so I clicked on the answer below (I think), but here is what works:
from ast import literal_eval
decodedstring = literal_eval("b'{}'".format(stringvariable)).decode('utf-8')
I can't use it on the whole file because of other encoding issues, but extracting each name as a string (stringvariable) and then doing that works! Thank you!
To be more clear, the original file does not consist only of these messed-up UTF encodings. It only uses them for certain fields. For example, here is the beginning of the file:
{'m_cachehandles': ['s2ma\x00\x00cn\x1f\x1b"\x8d\xdb\x1fr \\\xbf\xd4d\x05r\x87\x10\x0b\x0f9\x95\x9b\xe8\x16t\x81b\xe4\x08\x1e\xa8u\x11', 's2ma\x00\x00cn\x1a\xd9l\x12n\xb9\x8al\x1d\xe7\xb8\xe6\xf8\xaa\xa1s\xdb\xa5+\t\xd3\x82^\x0c\x89\xdb\xc5\x82\x8d\xb7\x0fv', 's2ma\x00\x00cn\x92\xd8\x17d\xc1d\x1b\xf6(\xedj\xb7\xe9\xd1\x94\x85\xc8`\x91m\x8btz\x91\xf65\x1f\xf9\xdc\xd4\xe6\xbb', 's2ma\x00\x00cn\xa1\xe9\xab\xcd?\xd2ps\xc9\x03\xab\x13r\xa6\x85u7(k2\x9d\x08\xb8k+\xe2\xdei\xc3\xab\x7fc', 's2ma\x00\x00cnn\xa5\xe7\xaf\xa0\x84\xe5\xbc\xe9hx\xb93s*sj\xe3\xf8\xe7\x84`\xf1ye\x15~\xb93\x1f\xc90', 's2ma\x00\x00cn8\xc6\x13f\x19\x1f\x97ah\xfa\x81m\xac\xc9\xa6\xa8\x90s\xfdd\x06\rl]z\xbb\x15\xdci\x93\xd3v'], 'm_campaignindex': 0, 'm_defaultdifficulty': 7, 'm_description': '', 'm_difficulty': '', 'm_gamespeed': 4, 'm_imagefilepath': '', 'm_isblizzardmap': true, 'm_mapfilename': '', 'm_minisave': false, 'm_modpaths': none, 'm_playerlist': [{'m_color': {'m_a': 255, 'm_b': 255, 'm_g': 92, 'm_r': 36}, 'm_control': 2, 'm_handicap': 0, 'm_hero': '\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89',
All of the information before the 'm_hero': field is not UTF-8. So using ShadowRanger's solution works if the file is made up only of these fake UTF encodings, but it doesn't work when I have already parsed m_hero as a string and try to convert that. Karin's solution does work for that.
I'm assuming you're using Python 3. In Python 2, strings are bytes by default, so it would work for you. But in Python 3, strings are Unicode and are interpreted as Unicode, which is what makes this problem harder if you have a byte string being read as Unicode.
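To make that distinction concrete, here is a small sketch, assuming Python 3 (the variable names are only illustrative). It compares the text as it would be read from the file with the real byte string, and shows that the doubled backslashes exist only in the repr, not in the string itself.

```python
# Text as it would arrive from source.txt: a literal backslash, 'x', and two hex digits per byte.
escaped_text = r'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'
# The actual UTF-8 bytes that those escapes describe.
real_bytes = b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'

text_length = len(escaped_text)  # 9 escapes of 4 characters each
byte_length = len(real_bytes)    # three 3-byte UTF-8 sequences
shown = repr(escaped_text)       # repr doubles each backslash for display only
decoded = real_bytes.decode('utf-8')

print(text_length, byte_length)  # 36 9
print(shown)
print(decoded)                   # 扎加拉
```

So the task is not really to remove double slashes, but to turn the 36 characters of escape text back into the 9 bytes they spell out.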
This solution is inspired by mgilson's answer. We can literally evaluate your Unicode string as a byte string using literal_eval:
from ast import literal_eval

with open('source.txt', 'r', encoding='utf-8') as f_open:
    source = f_open.read()

string = literal_eval("b'{}'".format(source)).decode('utf-8')
print(string)  # 扎加拉
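If only some fields hold these escaped names, the same idea can be wrapped in a helper and applied per field. This is just a sketch: the function name decode_escaped_utf8 and the player record are made up for illustration, and it assumes the text contains no unescaped single quote. It also strips surrounding whitespace, since a trailing newline left over from read() would land inside the bytes literal and make it invalid.

```python
from ast import literal_eval


def decode_escaped_utf8(escaped):
    """Decode text that spells out UTF-8 bytes as hex escapes (hypothetical helper)."""
    # Drop surrounding whitespace such as a trailing newline from the file.
    cleaned = escaped.strip()
    # Build the source text of a bytes literal and evaluate it safely.
    # Assumption: cleaned contains no unescaped single quote.
    raw = literal_eval("b'{}'".format(cleaned))
    return raw.decode('utf-8')


# Made-up record mimicking one parsed player entry; the value holds
# literal backslashes followed by a newline.
player = {'m_control': 2, 'm_hero': '\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89\n'}
hero = decode_escaped_utf8(player['m_hero'])
print(hero)  # 扎加拉
```

Fields that hold raw binary data rather than UTF-8, like the cache handles, would raise UnicodeDecodeError here, so the helper should only be called on the fields known to contain names.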