Author Topic: Unicode using urxvt and Python (Japanese) (Read 6239 times)

KHarvey · « **on:** June 23, 2012, 05:43:55 PM »

I have determined that I have no clue what I am doing.

Currently I can't tell if I have Unicode support or not in urxvt and I do not know how to test it. My Python scripts fail (see below) and if I try to copy and paste Japanese into the console I just get question marks.

I will be running TC 4.5.5, Xorg, Fluxbox.

What I am attempting to do is to write a simple little Python script that translates English into Japanese characters. It will all run in a terminal.

In Python when I try to do a print of:

Code: [Select]

print chr(0x3040)

I receive an error:
ValueError: chr() arg not in range(256)

When I try to do a print of

Code: [Select]

print u"\u4e2d"

I receive an error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u4e2d' in position 0: ordinal not in range(128)

I am pretty sure both of these errors mean that the console I am using does not support Unicode, and I haven't figured out how to set up a console that does.
I would like to keep my keyboard (keymap) and locale set to US English, but I want the ability to display Unicode characters. Possibly in the future I may want to be able to change my keymap, but that will be a future thing that I am not worried about now.

The first thing that I tried was adding an option to my Xorg:

Code: [Select]

Section "InputDevice"
Identifier "Keyboard0"
Driver "kbd"
Option "XkbLayout" "us,jp"
EndSection

But that does not appear to have done anything. So far everything that I have read online is geared towards changing the keymap and the locale to other languages. I just need to be able to display the characters in the console. At this point I do not know where I am stuck, or where to go from here, so any advice would be greatly appreciated.
README.xorg.conf says that Xorg-7.5.tcz has all keyboard data included, which means I should only need to change the XkbLayout. So in theory I have the capability to do what I need to do, I just don't know how to do it.

Does anyone have any suggestions, or can anyone point me in a direction so that I can continue my research? I really thought that this would be a fairly simple project.

Rich · « **Reply #1 on:** June 23, 2012, 10:03:10 PM »

Hi KHarvey

Quote

ValueError: chr() arg not in range(256)

That's probably because you are passing a 16-bit value to a function that only handles an 8-bit character set.
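A minimal sketch of that range limit, written for Python 3 as an assumption (the snippets in this thread are Python 2). In Python 2, chr() builds a one-byte string and so only accepts 0 to 255, while unichr() takes larger code points. In Python 3, chr() accepts any code point and the byte-sized limit shows up in bytes() instead:

```python
# Python 3 sketch (assumption: the thread itself uses Python 2).
code_point = 0x3057  # HIRAGANA LETTER SI, the character used later in the thread

# A text character can hold any Unicode code point.
text_char = chr(code_point)

# A single byte cannot: each value must fit in range(256), which is the
# same limit that Python 2's chr() enforces.
try:
    bytes([code_point])
    byte_error = None
except ValueError as exc:
    byte_error = str(exc)

print(repr(text_char), "->", byte_error)
```

The point is only that "one byte" and "one character" are different things; which function enforces the limit depends on the Python version.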

Quote

UnicodeEncodeError: 'ascii' codec can't encode character u'\u4e2d' in position 0: ordinal not in range(128)

Do a Google of UnicodeEncodeError: 'ascii' codec can't encode character for that one.
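For reference, a small sketch that reproduces the same class of error on purpose, assuming Python 3. In Python 2 the error most likely appears because print implicitly encodes the unicode string with the 'ascii' codec when no UTF-8 output encoding is detected; here the encoding step is made explicit:

```python
text = "\u4e2d"  # the character from the original post

# Encoding with the ASCII codec fails: ASCII only covers code points 0-127.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encoding with UTF-8 succeeds and yields three bytes for this character.
utf8_bytes = text.encode("utf-8")

print(ascii_error)
print(utf8_bytes)
```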

solorin · « **Reply #2 on:** June 23, 2012, 10:18:35 PM »

I, too, tried and failed.

http://forum.tinycorelinux.net/index.php/topic,12999.0.html

KHarvey · « **Reply #3 on:** June 23, 2012, 10:43:27 PM »

Okay, I think that I have made it a little farther, or at least I understand a bit more.

Rich, the character set that is available in urxvt (using Python) ends at 00FF. When I try to print character 0100 (decimal 256) it fails, stating that it is out of range, but I am able to print all of the others. So it appears to support everything through Latin-1 Supplement, but nothing past that. Is there a way to extend the character set? Is this a font limitation or an encoding limitation?

solorin, yeah I read your post several times, and then a couple more times after that. It appears that you were trying to solve the problem using fonts, and I haven't figured out whether that will work or not. I have several Japanese fonts installed, and they work in Chrome and Firefox, but they don't appear to work anywhere else. Also I haven't found a way to get those fonts added to the xlsfonts list.

Well at least my original frustration has worn off. I'll keep doing more reading and research. I still think that I don't quite understand what I am attempting to do, which is probably why I am unable to find a solution.

KHarvey · « **Reply #4 on:** June 23, 2012, 10:54:16 PM »

After reading some more it appears that my Python errors are being generated because my output (urxvt) only supports ASCII rather than unicode.

So I think I am doing something wrong when I start urxvt. To start urxvt I am just running urxvt from aterm without any switches.

I knew I forgot something. I am able to print exactly the same characters in aterm as in urxvt (up to 0x00FF).

Rich · « **Reply #5 on:** June 24, 2012, 01:41:01 AM »

Hi KHarvey
Have you checked the man files that come with urxvt? Make sure you have man.tcz installed and enter man urxvt for the urxvt manual and man 7 urxvt for the FAQ manual.

bmarkus · « **Reply #6 on:** June 24, 2012, 01:50:37 AM »

It is not a terminal issue but a Python issue. While I like Python and use it a lot, Unicode and national characters always cause a headache. Read the Python docs on handling Unicode and on the .decode() and .encode() string methods.
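A short sketch of what .encode() and .decode() do, assuming Python 3, where text is str and encoded data is bytes (in Python 2 the corresponding types are unicode and str). The sample string is only an illustration:

```python
original = "\u65e5\u672c\u8a9e"  # the word "Japanese" written in kanji

encoded = original.encode("utf-8")  # text -> bytes, 3 bytes per character here
decoded = encoded.decode("utf-8")   # bytes -> text, using the same codec

# Decoding with the wrong codec does not fail for Latin-1; it silently
# produces the wrong characters, one per byte.
misread = encoded.decode("latin-1")

print(len(original), len(encoded), decoded == original, misread == original)
```

The round trip only works when both sides agree on the codec, which is why the terminal's locale and Python's output encoding both matter.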

solorin · « **Reply #7 on:** June 24, 2012, 02:34:29 AM »

@kharvey

i think there are a lot of pieces to the solution.
i feel setting the locale and fonts are definitely parts that do have to be in place.

I believe launching urxvt or urxvtc (w/ urxvtd running of course) without specifying a font will definitely preclude it from displaying characters correctly.
I believe urxvt can work with both TrueType fonts and bitmap fonts (the kind that xlsfonts will display). To work with TrueType fonts you should call urxvt with a switch like

Code: [Select]

urxvtc -fn "xft:mingliu"

after that a first step might be to see if you can get ls (from coreutils) to display filenames with unicode characters in them correctly, like in the thread referenced in my thread above.
(reproduced here for convenience).
http://forum.tinycorelinux.net/index.php/topic,8506.msg45652.html#msg45652

@ bmarkus,
how do you know it's not also a terminal issue?
does displaying unicode characters in a terminal work for you?

cheerio,
solorin

bmarkus · « **Reply #8 on:** June 24, 2012, 03:48:22 AM »

Quote from: solorin on June 24, 2012, 02:34:29 AM

@ bmarkus,
how do you know it's not also a terminal issue?
does displaying unicode characters in a terminal work for you?

cheerio,
solorin

I have been using urxvt for some time with a UTF-8 locale to get my national character set, and it has been working fine. So I expect urxvt itself supports UTF-8; however, I have no experience with Asian character sets.

curaga · « **Reply #9 on:** June 24, 2012, 05:19:51 AM »

@KHarvey

Indeed it looks like Python error and not the terminal's: http://docs.python.org/howto/unicode.html
http://www.saltycrane.com/blog/2008/11/python-unicodeencodeerror-ascii-codec-cant-encode-character/

As a simpler test of the terminal, I'd try copy-pasting a Japanese word from a browser, etc.

curaga · « **Reply #10 on:** June 24, 2012, 05:52:56 AM »

So I tried it myself. Seems urxvt is limited when compared to "normal" GUI apps, it needs all symbols in one font (no fallback), and also needs the host locale to be UTF-8.

1. Install a japanese font (I used IPA gothic)
2. tce-load -wi getlocale
3. Pick your local utf-8 locale
4. LANG=my_MY.UTF-8 urxvt -fn xft:ipagothic-10

Doesn't seem to need Xorg, though it complains that the locale is not supported by Xlib. Didn't affect displaying the characters.
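To check from inside Python whether the locale set in step 4 was picked up, something like the following sketch may help. It assumes Python 3; the helper names describe_text_environment and looks_like_utf8 are made up for this example:

```python
import locale
import os
import sys


def describe_text_environment():
    """Collect the settings that decide how Python encodes terminal output."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "preferred_encoding": locale.getpreferredencoding(False),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


def looks_like_utf8(encoding_name):
    """Return True if a codec name refers to UTF-8 (for example 'utf8' or 'UTF-8')."""
    if not encoding_name:
        return False
    return encoding_name.lower().replace("-", "").replace("_", "") == "utf8"


info = describe_text_environment()
for key, value in info.items():
    print(key, "=", value)
print("UTF-8 output:", looks_like_utf8(info["stdout_encoding"]))
```

If the last line prints False inside the new urxvt window, the LANG value probably does not name a locale that is actually installed.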

Rich · « **Reply #11 on:** June 24, 2012, 09:43:20 AM »

Hi curaga

Quote

it needs all symbols in one font (no fallback),

From man urxvt:

Quote

font: fontlist
Select the fonts to be used. This is a comma separated list of font names that are checked
in order when trying to find glyphs for characters. The first font defines the cell size
for characters; other fonts might be smaller, but not (in general) larger. A (hopefully)
reasonable default font list is always appended to it; option -fn.

Each font can either be a standard X11 core font (XLFD) name, with optional prefix "x:" or
a Xft font (Compile xft), prefixed with "xft:".

In addition, each font can be prefixed with additional hints and specifications enclosed
in square brackets ("[]"). The only available hint currently is "codeset=codeset-name",
and this is only used for Xft fonts.

For example, this font resource

URxvt.font: 9x15bold,\
-misc-fixed-bold-r-normal--15-140-75-75-c-90-iso10646-1,\
-misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1, \
[codeset=JISX0208]xft:Kochi Gothic:antialias=false, \
xft:Code2000:antialias=false

specifies five fonts to be used. The first one is "9x15bold" (actually the iso8859-1
version of the second font), which is the base font (because it is named first) and thus
defines the character cell grid to be 9 pixels wide and 15 pixels high.

The second font is just used to add additional unicode characters not in the base font,
likewise the third, which is unfortunately non-bold, but the bold version of the font does
contain less characters, so this is a useful supplement.

The third font is an Xft font with aliasing turned off, and the characters are limited to
the JIS 0208 codeset (i.e. japanese kanji). The font contains other characters, but we are
not interested in them.

The last font is a useful catch-all font that supplies most of the remaining unicode
characters.

From man 7 urxvt:

Quote

If rxvt-unicode first sees a japanese/chinese character, it might choose a japanese font for
display. Subsequent japanese characters will use that font. Now, many chinese characters
aren't represented in japanese fonts, so when the first non-japanese character comes up, rxvt-
unicode will look for a chinese font -- unfortunately at this point, it will still use the
japanese font for chinese characters that are also in the japanese font.

The workaround is easy: just tag a chinese font at the end of your font list (see the previous
question). The key is to view the font list as a preference list: If you expect more japanese,
list a japanese font first. If you expect more chinese, put a chinese font first.
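The "preference list" idea in the quotes above can be modelled in a few lines. This is a toy sketch in Python 3, not how urxvt is implemented: the font names and glyph coverage are invented, and it ignores the per-script caching that the FAQ describes:

```python
# Toy model only: the font names and glyph sets below are made up.
FONT_LIST = [
    ("latin-base", set("abc")),
    ("japanese-font", set("\u3057\u4e2d")),  # a hiragana letter and a shared kanji
    ("chinese-font", set("\u4e2d\u4eec")),   # the shared kanji and a Chinese-only character
]


def pick_font(char, font_list=FONT_LIST):
    """Return the name of the first font in the list that covers char."""
    for name, glyphs in font_list:
        if char in glyphs:
            return name
    return None  # no listed font has a glyph for this character


for ch in "a\u3057\u4e2d\u4eec":
    print(repr(ch), "->", pick_font(ch))
```

Reversing the order of the last two entries makes the shared character U+4E2D come from the Chinese font instead, which is the ordering effect the FAQ describes.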

curaga · « **Reply #12 on:** June 24, 2012, 10:26:25 AM »

Compared to any gtk+, qt, or other Xft-using GUI app, which falls back on all installed system fonts automatically.

KHarvey · « **Reply #13 on:** June 24, 2012, 12:17:54 PM »

Quote from: curaga on June 24, 2012, 05:52:56 AM

1. Install a japanese font (I used IPA gothic)
2. tce-load -wi getlocale
3. Pick your local utf-8 locale
4. LANG=my_MY.UTF-8 urxvt -fn xft:ipagothic-10

Brilliant, this worked perfectly when copying and pasting Japanese text, which was one of the tests that I was using besides Python. Before, I just kept getting question marks when pasting the text into the terminal. The two main things (which appear to be the only two things) were to have the locales installed and to use xft when starting urxvt. I knew that I needed to have something installed, but I didn't know what. It was getlocale that solved this issue and gave me Unicode support. Also I only knew how to use -fn with xlsfonts names; I didn't know about xft.

Everyone is also correct in the fact that Python is doing something weird with the Unicode. For some reason Python is only running in ASCII mode. When I do an enumeration of the Unicode characters to get their category I receive an error with anything above 0x00FF or 0x0080.
Since this is now a Python coding problem I should be able to figure the rest out. Thank you all for your help. This appears to be exactly what I needed, you're awesome.
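For anyone who wants to try the same kind of enumeration, here is a minimal sketch assuming Python 3, where chr() covers all code points (in Python 2 the equivalent call is unichr()). The helper name categories is made up for this example:

```python
import unicodedata


def categories(start, stop):
    """Map each code point in [start, stop) to its Unicode general category."""
    return {cp: unicodedata.category(chr(cp)) for cp in range(start, stop)}


# First few code points of the Hiragana block, which starts at U+3040.
for cp, cat in categories(0x3040, 0x3045).items():
    print("U+%04X %s" % (cp, cat))
```

A category of Cn means the code point is not assigned to any character, and Lo means "letter, other", which is the category of the hiragana letters.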

solorin, in theory this should also work for your Chinese fonts as well.

Once again, thank you everyone for your help with this.

KHarvey · « **Reply #14 on:** June 24, 2012, 12:39:09 PM »

If anyone is curious I found the problem with my Python script.

I needed to set my default encoding at the beginning:

Code: [Select]

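import sys

# site.py removes sys.setdefaultencoding at startup; reload(sys) brings it back (Python 2 only)
reload(sys)
# make implicit unicode-to-str conversions use UTF-8 instead of ASCII
sys.setdefaultencoding('utf-8')

print unichr(0x3057)  # U+3057, hiragana shi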

And it appears to be working. One issue that I had before was that I was trying to print unichr(0x3040), which is the first code point of the Hiragana block but is not assigned to an actual character. unichr(0x3057) is し (shi) in Japanese.
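A rough Python 3 sketch of the same result, offered as an alternative under stated assumptions: sys.setdefaultencoding does not exist in Python 3, so the text is encoded explicitly where it is written out. An io.BytesIO object stands in for the terminal's byte stream so the output can be inspected:

```python
import codecs
import io
import unicodedata

shi = chr(0x3057)

# U+3057 has a name in the Unicode database; U+3040 has none, so the
# default value passed as the second argument is returned for it.
shi_name = unicodedata.name(shi)
placeholder_name = unicodedata.name(chr(0x3040), "<no name>")

# Encode explicitly at the output boundary instead of changing a
# process-wide default. BytesIO stands in for a terminal's byte stream.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)
writer.write(shi + "\n")

print(shi_name, "|", placeholder_name, "|", raw.getvalue())
```

The Unicode name spells the letter "SI" rather than "shi" because the character names use a different romanization.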

You guys are awesome, thank you so much.
