[python] tuple to list, and handling unicode

November 12, 2009

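1) While using *args I needed to add another argument, and found that args is a tuple.
A tuple is immutable and has no append() method, so it has to be converted to a list and then back to a tuple: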

a = ('abc', 334, 21.21)
b = list(a)
b.append('glob')
a2=tuple(b)
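
A tuple can also be extended by concatenation, which skips the round trip through a list. The sketch below compares the two routes; the helper name forward is made up for illustration and is not from the original post.

```python
a = ('abc', 334, 21.21)

# Route 1 (as above): go through a list.
b = list(a)
b.append('glob')
a2 = tuple(b)

# Route 2: concatenate with a one-element tuple (note the trailing comma).
a3 = a + ('glob',)

def forward(*args):
    # args arrives as a tuple; return a new, longer tuple.
    return args + ('glob',)
```

Both routes build a new tuple; the original a is never modified.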

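2) Python has a built-in unicode type for text (it has been there since Python 2.0; the interpreter's default encoding for plain str is still ASCII).
Unicode can be treated as the intermediate form when converting between other encodings.
Its escape format is described in the excerpt below: a code point is written with either 4 or 8 hex digits, padded with leading zeros.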

The \u escape sequence is used to denote Unicode codes.
This is somewhat like the traditional C-style \xNN to insert binary values. However, a glance at the Unicode table shows values with up to 6 digits. These cannot be represented conveniently by \xNN, so \u was invented.
For Unicode values up to (and including) 4 digits, use the 4-digit version:
\uNNNN
Note that you must include all 4 digits, using leading 0's as needed.
For Unicode values longer than 4 digits, use the 8-digit version:
\UNNNNNNNN
Note that you must include all 8 digits, using leading 0's as needed.
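
A small sketch of the two escape forms described above, written for Python 3, where every string literal is unicode (in Python 2 the literals would need a u prefix). The variable names are only illustrative.

```python
# 4-digit form: code point U+6C5F
four_digit = '\u6c5f'

# 8-digit form of the same code point, padded with leading zeros
eight_digit = '\U00006c5f'

# A code point that needs more than 4 hex digits must use the 8-digit form
beyond_4_digits = '\U0001F600'

same = (four_digit == eight_digit)
code_point = ord(beyond_4_digits)
```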

The usual way to convert a string in some other encoding to unicode:
afterencoded_str = unicode(preencoded_str, format)
If you do not know which encoding the string is in, the Universal Encoding Detector can be used to check.
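
For comparison, a minimal sketch of the same conversion in Python 3, where unicode() no longer exists and the decoding is spelled bytes.decode() or str(bytes, encoding). It assumes the byte string is encoded as cp950 (Big5), which is how the sample bytes further down appear to be encoded; the variable names are illustrative.

```python
# Bytes for the folder name used in the example further down
raw = b'\xa7\xda\xaa\xba\xa4\xe5\xa5\xf3'

# Decode into a unicode string, stating the source encoding explicitly
text = raw.decode('cp950')

# Equivalent spelling, closest to Python 2's unicode(raw, 'cp950')
text_alt = str(raw, 'cp950')

# Unicode as the intermediate form: re-encode into a different encoding
utf8_bytes = text.encode('utf-8')
```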

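Recently, though, I ran into a problem with a string sent from the web (e.g. c:\\我的文件\song_江蕙.mp3). It had been through JavaScript-style escape(),
and Python has no corresponding built-in to unescape it, so I had to handle it myself...
For now the best approach is to parse it by hand = =
Take c:\\我的文件\song_江蕙.mp3 as the example.
Its byte-string form (these bytes are Big5/cp950, not UTF-8, which would use three bytes per character here) is 'c:\\\xa7\xda\xaa\xba\xa4\xe5\xa5\xf3\\song_\xa6\xbf\xbf\xb7.mp3'
Its unicode form is u'c:\\\u6211\u7684\u6587\u4ef6\\song_\u6c5f\u8559.mp3'
After escape():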
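# Python 2. Avoid naming the variable str, which shadows the built-in type.
escaped = 'c:\\%u6211%u7684%u6587%u4ef6\\song_%u6c5f%u8559.mp3'
items = escaped.split('%u')

# This must be a list, not a tuple: a tuple has no append().
unicoded = [unicode(items[0])]
for item in items[1:]:
    # The first 4 characters are the hex code point; the rest is literal text.
    s = unichr(int(item[:4], 16)) + item[4:]
    # %U sequences would still need to be handled here ....
    unicoded.append(s)
correct_str = u''.join(unicoded)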

*a) escape() turns \u into %u
*b) The %U case is not handled (no idea when 8 digits would ever show up... but better to be careful = =)
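
The hand parsing above can also be sketched with a regular expression. This is a Python 3 version (chr replaces unichr) and only one possible approach; the function name unescape_percent_u is made up for illustration. Like the loop above, it only handles %uXXXX and leaves everything else, including any %XX sequences, untouched.

```python
import re

# Matches %u followed by exactly four hex digits
_PERCENT_U = re.compile(r'%u([0-9a-fA-F]{4})')

def unescape_percent_u(text):
    """Replace every %uXXXX sequence with the character it encodes."""
    return _PERCENT_U.sub(lambda m: chr(int(m.group(1), 16)), text)

escaped = 'c:\\%u6211%u7684%u6587%u4ef6\\song_%u6c5f%u8559.mp3'
restored = unescape_percent_u(escaped)
```

Matching exactly four hex digits means a stray %u that is not followed by a code point is left as-is rather than raising an error in int().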
