Lebowtech Blog

Transliterate With Python


I bought my kids a new CD player that also takes a USB drive with MP3 files.

The problem is that many of the file names are in Hebrew, which the MP3 player can’t read. Each one just shows up as XXXXXXXXXXXXXX.mp3, and that’s not very helpful.

I found a nice Python module, Unidecode, that transliterates Unicode to ASCII (I believe using iconv).

The installation was straightforward. The tests failed, but the module seemed to work anyway.

Convert all filenames in the current directory to ASCII from Unicode
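from __future__ import print_function  # lets print() behave the same on Python 2 and 3
import os
from unidecode import unidecode

# Pass a unicode path (u".") so listdir returns unicode file names on Python 2
for fn in os.listdir(u"."):
    new_fn = unidecode(fn)
    # Skip names that are already plain ASCII
    if new_fn != fn:
        print(fn, "->", new_fn)
        os.rename(fn, new_fn)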

One catch was how to get the file names in Unicode. On Python 2, passing the path to listdir as a Unicode string makes it return the file names in Unicode as well.
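One thing the loop does not guard against is two names transliterating to the same ASCII string; on Unix-like systems os.rename silently replaces an existing file in that case. Here is a minimal sketch of a more cautious version. The names plan_renames and rename_all and the dry_run flag are hypothetical helpers made up for this illustration, not part of Unidecode, and the transliterate argument is assumed to be any string-to-string function such as unidecode.

```python
import os


def plan_renames(names, transliterate):
    """Return (old, new) pairs, adding a numeric suffix when names collide."""
    taken = set(names)  # names that already exist in the directory
    plan = []
    for old in sorted(names):
        new = transliterate(old)
        if new == old:
            continue  # already ASCII, nothing to do
        stem, ext = os.path.splitext(new)
        candidate, counter = new, 1
        # Try stem_1.ext, stem_2.ext, ... until the name is free
        while candidate in taken:
            candidate = "%s_%d%s" % (stem, counter, ext)
            counter += 1
        taken.add(candidate)
        plan.append((old, candidate))
    return plan


def rename_all(directory, transliterate, dry_run=True):
    """Print the planned renames; perform them only when dry_run is False."""
    for old, new in plan_renames(os.listdir(directory), transliterate):
        print("%s -> %s" % (old, new))
        if not dry_run:
            os.rename(os.path.join(directory, old),
                      os.path.join(directory, new))
```

With Unidecode installed, rename_all(u".", unidecode) would only print the plan; passing dry_run=False performs the renames. The sketch is deliberately conservative: a target name counts as taken even if the file currently holding it is itself about to be renamed.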

It works decently, but not 100%. I’m not sure who set up the transliteration rules, but we get some funny results, for Hebrew at least.

Converting Unicode strings
>>> print unidecode(u'בוקר טוב ילדים. מה נישמע מה שלומכם? אנחנו תמיד אמרים ברוך ה')
bvqr tvb yldym. mh nyshm` mh shlvmkm? nkhnv tmyd mrym brvk h
>>> print unidecode(u"щи да каша – пища наша")
shchi da kasha - pishcha nasha
>>> print unidecode(u"יִשרָאֵל")
yishrael
>>> print unidecode(u"יִשׂרָאֵל")
yishrael
>>> print unidecode(u"נוּ")
nvv'
>>> print unidecode(u"לוֹ")
lvo
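The last few examples contain Hebrew vowel points (niqqud), and those seem to be where the stranger output comes from. One possible workaround, offered only as a sketch, is to drop the points before transliterating. The helper name strip_marks is hypothetical; it relies only on the standard unicodedata module and assumes that every nonspacing mark (Unicode category Mn) can be discarded, which is reasonable for Hebrew file names but may lose information in other scripts.

```python
import unicodedata


def strip_marks(text):
    """Drop nonspacing combining marks (category Mn), e.g. Hebrew niqqud."""
    # No normalization is applied, so precomposed letters are left untouched.
    return u"".join(ch for ch in text if unicodedata.category(ch) != "Mn")
```

The file name would then be converted with unidecode(strip_marks(fn)) instead of unidecode(fn). I have not checked how much this improves the results.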

Still more informative than XXXXXXXXXXXXXX.mp3.
