
Author Topic: Puzzled by cifs mount, character encoding and locale (resolved)  (Read 12565 times)

Offline gavs

  • Jr. Member
  • **
  • Posts: 74
Puzzled by cifs mount, character encoding and locale (resolved)
« on: January 14, 2014, 04:24:17 PM »
I had set up a Tiny Core audio player appliance using mpd and GMPC which reads music files from an external usb disk. The set up is described here:
http://personal.nbnet.nb.ca/gavaris/apa.html
The web site indicates an NTFS formatted usb disk, but I had subsequently changed that to a FAT32 formatted disk. Everything worked fine.

Recently, I decided to experiment connecting the usb disk to an Asus router that incorporated a samba server. I loaded the cifs-utils extension and mounted the disk as a network drive. Following instructions from the TC FAQ, I did not specify any iocharset in the mount -t cifs command (I guess it uses some default?). I got a message about mismatched character encoding (my recollection of the message is a bit fuzzy) and mpd/GMPC did not generate a working music database file. I tried again and this time, being naive about TC locale, specified iocharset=utf8 in the mount command. This time the mount completed and mpd/GMPC generated a working music database file.

After experimenting, I decided to have the disk attached directly (initial set up) rather than as a network drive. When I reconnected the disk directly, I noticed that songs with non-ASCII characters in the filename were no longer visible. I suspected that the mpd/database file generated by mpd/GMPC was at fault. I deleted the database file and allowed mpd/GMPC to generate a new one. Those songs were still not visible to mpd/GMPC, though present, meaning I could not play those songs.

It should not be possible for the audio player appliance to have been corrupted since 'no backup' is specified. Indeed, the media containing Tiny Core and the audio extensions is unmounted after loading to RAM. So, I concluded that it must have been the usb drive that somehow got corrupted. To test this, I created a FAT32 formatted usb memory stick with some music that had file names with non-ASCII characters. I was expecting that mpd/GMPC would be able to read these properly and allow me to play them, just like the initial set up. I was wrong. These files were not visible and could not be played. As a further test, I created an ext formatted usb memory stick with the same music files. Now mpd/GMPC could see the files and I could play them. Aterm displays these files with two question marks, ??, in place of the non-ASCII characters. Within GMPC, the non-ASCII characters are displayed properly. For the FAT32 formatted stick or disk, aterm displays only one question mark, ?, in place of the non-ASCII characters. These files are not listed by GMPC.

Through all this, I have not altered the default Tiny Core locale, which is set to language C. I have read that linux is 'unicode aware', even when language is set to C and it is not necessary to set up as utf8.

A solution is to format the external usb drive as ext and copy the music to it (I will probably do this).

However, computers follow logic and I would like to understand a bit about character encoding and locales. My questions:

1. Why was I able to play music with non-ASCII file names from a FAT32 disk until I mounted that disk as a cifs network drive?

2. Why are non-ASCII file names on the 'new' FAT32 memory stick not handled in a way that lets mpd/GMPC use them?

3. What is the appropriate way to do a cifs mount with TC when file names contain non-ASCII characters?
« Last Edit: February 25, 2014, 10:44:46 AM by gavs »

Offline gavs

Re: Puzzled by cifs mount, character encoding and locale
« Reply #1 on: January 16, 2014, 09:52:39 PM »
I did some more experimenting based on information from this link:
http://www.nslu2-linux.org/wiki/HowTo/MountFATFileSystems

Checking the output of mount, it appears that the FAT32 disks are mounted by default with codepage=437 and iocharset=iso8859-1, which should be OK. I decided to try the utf8 option using
   mount -o utf8 /dev/sdb1 /mnt/sdb1
When I restarted mpd, it gave the message that the database used the utf-8 characterset rather than iso8859-1 and it was discarding the file. It gave a permission error when it tried to recreate the database file. The mount command showed that dmask and fmask had been changed to 0022 even though the fstab file showed umask=000. I tried adding the umask option on the mount using
   mount -o utf8,umask=000 /dev/sdb1 /mnt/sdb1
Now mpd was able to recreate the database file, I presume using the iso8859-1 characterset. Now aterm displays filenames with two question marks, '??' in place of non-ASCII characters and mpd/GMPC displays the non-ASCII characters properly and can play those songs.
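
The check of the mount output described above can be scripted. This is only a sketch: the function name vfat_charsets is made up for illustration, and it assumes the options appear in the fourth field in the same form that /proc/mounts uses. It reports the iocharset and whether the utf8 flag is set for each vfat entry.

```shell
# Print mount point, iocharset and utf8 flag for every vfat entry
# in /proc/mounts-style input read from stdin.
# vfat_charsets is a hypothetical helper name, not a standard tool.
vfat_charsets() {
  awk '$3 == "vfat" {
    charset = "(none listed)"; utf8 = "no"
    n = split($4, opt, ",")
    for (i = 1; i <= n; i++) {
      # "iocharset=" is 10 characters, so the value starts at position 11
      if (opt[i] ~ /^iocharset=/) charset = substr(opt[i], 11)
      if (opt[i] == "utf8") utf8 = "yes"
    }
    print $2, "iocharset=" charset, "utf8=" utf8
  }'
}
# On a live system: vfat_charsets < /proc/mounts
```

Running it before and after remounting with -o utf8 should show whether the flag actually took effect.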

Unfortunately, when I shut down and reboot the computer (with a default mount which does not specify any options), mpd discards the database file and recreates it using the utf-8 characterset. Back to not being able to play those songs.

To resolve this, I could modify the remaster and put the utf8 and umask options in the mount command or in fstab. Questions remain though.

1. Why does setting the utf8 option require that the umask option be specified in the mount command?

2. Why was it not necessary to specify the utf8 option when mounting FAT32 before I mounted it with cifs?

Offline gavs

Re: Puzzled by cifs mount, character encoding and locale (partly understood)
« Reply #2 on: January 20, 2014, 09:01:15 PM »
After further investigation, I think I know why the non-ASCII filenames are not being processed correctly. It turns out that the mpd database file is readable in a text editor. At the beginning of an entry for an information source, mpd records the file system character encoding. Even though the mount command shows that the FAT32 drive is iso8859-1, mpd records utf-8 when a default mount is done. When a mount with the utf8 flag is done, mpd records the encoding as iso8859-1. Mpd uses Glib to determine the character encoding.

I did a clean install of TC and mpd on another computer. On that computer, mpd records the character encoding as iso8859-1 with both a default mount and with a mount specifying the utf8 flag. The file system is treated properly either way, as might be expected. The locale on the two computers is identical, all entries set to C, except LC_ALL which is empty.

The production computer worked like the clean install until after I experimented with the samba network drive. I do not understand how the change (corruption) in behaviour could have come about (I believe sda1 where the remastered core resides was unmounted while I experimented, but maybe my memory is faulty). Any suggestions about where to look welcome.

Offline gavs

Re: Puzzled by cifs mount, character encoding and locale (partly understood)
« Reply #3 on: February 11, 2014, 04:25:03 PM »
An unexpected discovery. While compiling mpd and gmpc for TC5.x, I was having difficulties with the gmpc-tagedit plugin and I wondered if it was related to permissions. So, I executed sudo mpd instead of just mpd. To my surprise, sudo mpd created a database file using UTF-8 character encoding, which is not correct for the FAT32 disk. When I deleted that database and executed plain mpd, it created a database file using iso8859-1 character encoding, which is correct.

Is there an explanation for such behaviour?

P.S. To Moderator: So far, this topic only contains my musings. Perhaps it is an inappropriate topic for this forum? Should it be removed? By me? By you?

Offline curaga

  • Administrator
  • Hero Member
  • *****
  • Posts: 11048
Re: Puzzled by cifs mount, character encoding and locale (partly understood)
« Reply #4 on: February 12, 2014, 08:36:20 AM »
No need to remove, it's on topic. Sudo clears the environment, so you probably have a GLIB variable set somewhere.
The only barriers that can stop you are the ones you create yourself.

Offline gavs

Thank you curaga for that hint. The output of printenv shows G_FILENAME_ENCODING=iso8859-1. The output of sudo printenv does not contain this variable. This appears to be the case also on a 'clean' TinyCore boot.
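
The difference between printenv and sudo printenv can be imitated without sudo. The sketch below uses env -i, which starts a command with an empty environment. That is only a stand-in: which variables sudo really drops depends on the sudoers configuration, so treat this as an illustration of the effect and not of sudo itself.

```shell
# Set the variable to the value that printenv reported in the thread.
export G_FILENAME_ENCODING=iso8859-1
# A child process started normally inherits the variable...
inherited=$(/bin/sh -c 'echo "${G_FILENAME_ENCODING:-unset}"')
# ...but a child started with an emptied environment does not see it.
cleared=$(env -i /bin/sh -c 'echo "${G_FILENAME_ENCODING:-unset}"')
echo "normal child: $inherited, cleared environment: $cleared"
```

A GLib program started the second way has no G_FILENAME_ENCODING to consult and falls back to its own default, which would fit mpd and sudo mpd producing different databases.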

Question 1: Is this GLIB environment variable set by default in TC? If so, what is the reason for this? If this is not a TC install default, how might it have been set?

Question 2: This site https://www.kernel.org/doc/Documentation/filesystems/vfat.txt states that
Quote
By default, FAT_DEFAULT_IOCHARSET setting is used.
for vfat mounts. I do not see FAT_DEFAULT_IOCHARSET listed as an environment variable. Is this an environment variable and can it be set?

Question 3: The entry in /etc/fstab does not include an iocharset option for the vfat mount. Does this mean it will use FAT_DEFAULT_IOCHARSET and/or G_FILENAME_ENCODING to determine how to mount a vfat drive unless that option is specified in the mount command?

The hint from curaga about the GLIB environment variable helped resolve the original issue. In the TinyCore audio player remaster I used sudo mpd. Prior to January 2013 the external disk holding the music was formatted as NTFS, and was mounted with ntfs-3g, which handled the encoding properly. In January 2013 due to hardware problems with the disk (clicking noise and interrupted data transfer) the disk was reformatted first as ext and then as vfat. The clicking/data transfer problem persisted. By chance, I discovered that the clicking/data transfer problem went away if the external disk was on its edge (disk is vertical) rather than flat (disk is horizontal). I guess some interaction between gravity and the mechanical apparatus in the disk. I left the disk formatted as FAT. The character encoding problem was related to how mpd, which uses GLib, handled the vfat mount depending on use of sudo or not. It was not related to the cifs mount (though the cifs mount is what revealed the different encoding options). File names were properly resolved before that because it was an ntfs-3g mount.

New issue: Now using TC5.x. 'Clean' boot, install alsa, mpd-minimal, gmpc and gmpc-tagedit. Music is still on an external vfat drive. Start mpd and it creates its database using iso8859-1 encoding. The gmpc client displays and plays music with non-ASCII file names. All good so far. When I select a song with non-ASCII file name to edit music tags, gmpc reports that the file does not exist. Songs with ASCII file names could be queued for tag editing. I was starting gmpc from the icon in wbar. I was not sure if that environment was the same as a shell, where the mpd command is issued. So I started gmpc from the shell. No difference. I also tried a vfat mount with the utf8 flag. Again no difference. I guess gmpc does not use the GLIB variable. Gmpc-tagedit has a locale extension. I have not installed or used this. Would that have any effect on how file names are interpreted? Any suggestions on how to fix this?

Offline curaga

Quote
Question 1: Is this GLIB environment variable set by default in TC? If so, what is the reason for this? If this is not a TC install default, how might it have been set?

It is set in /etc/profile. You can override it in your user files (~/.profile). I can't recall the specifics, but IIRC something had to be set for glib apps to work properly.
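
A minimal sketch of such an override, assuming the file names on the music disk really are UTF-8 (the value is an example only and must match how the names are actually encoded):

```shell
# Candidate line for ~/.profile: replaces the value set in /etc/profile
# for shells that read this file.
# UTF-8 is an assumption here; use the encoding the file names really have.
export G_FILENAME_ENCODING=UTF-8
```

Programs started through sudo, or from a launcher that does not read ~/.profile, may still not see this value.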

Quote
Question 2: This site https://www.kernel.org/doc/Documentation/filesystems/vfat.txt states that
Quote
By default, FAT_DEFAULT_IOCHARSET setting is used.
for vfat mounts. I do not see FAT_DEFAULT_IOCHARSET listed as an environment variable. Is this an environment variable and can it be set?

It is a kernel config item, CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1". You can override with mount options.

Quote
Question 3: The entry in /etc/fstab does not include an iocharset option for the vfat mount. Does this mean it will use FAT_DEFAULT_IOCHARSET and/or G_FILENAME_ENCODING to determine how to mount a vfat drive unless that option is specified in the mount command?

It uses FAT_DEFAULT_IOCHARSET if you don't specify one. The glib variable is used by glib apps when they handle files, ie how they interpret your filenames.

Offline gavs

Thank you for those clarifications curaga.

So, the FAT disk is being mounted as iso8859-1 using the default FAT_DEFAULT_IOCHARSET and mpd is creating the database as iso8859-1 based on querying the GLIB variable G_FILENAME_ENCODING. Gmpc, the mpd client, uses the mpd database to display and play songs. That seems to work fine and non-ASCII characters are displayed properly. However, the tagedit feature needs to access songs by filename in order to edit the tags. When songs that have non-ASCII filenames are selected, it reports that the file does not exist. Based on this error message, I assume that gmpc does not use the GLIB variable. I am guessing then that it uses locale. I used the getlocale extension and edited the extlinux.conf file to boot TC in en_US.iso8859-1. Gmpc still reports that the file does not exist. The only difference that changing the locale made was that the error message displayed the non-ASCII characters correctly. I also tried loading the gmpc-tagedit-locale extension, but that made no difference.

Question. Is there any other way that gmpc-tagedit is determining filename encoding, and how can I set it?

Offline gavs

Having no further ideas about how to get gmpc-tagedit to recognize the non-ASCII filenames on the FAT32 formatted disk, I naively thought everything would be resolved by simply copying the files to an ext formatted disk. I thought that all that would be needed would be to set the GLIB variable G_FILENAME_ENCODING to something consistent with ext. I tried "export G_FILENAME_ENCODING=@locale" and "export G_FILENAME_ENCODING=utf-8". In both cases, the mpd database now does not list the songs with non-ASCII file names.

Question 1. Does this approach fail because copying the files to an ext disk does not change the encoding of the file names, the encoding having been set when each file was created?

Question 2. Since the files were created using Exact Audio Copy in WindowsXP, is the encoding iso-8859-1, which is then not properly decoded if G_FILENAME_ENCODING is set to either @locale or utf-8?

Question 3. Setting G_FILENAME_ENCODING=@locale results in an encoding of ANSI_X3.4-1968. I understand this is ASCII. Is this the default encoding used by TinyCore and does this mean that file names created in TinyCore (when locale is not modified) are limited to ASCII characters?

I am guessing that gmpc-tagedit may be set up by default for unicode and utf-8, since this seems to be the prevailing direction in Linux. To test this I need to convert these non-ASCII file names from iso-8859-1 to utf-8. I gather that a utility convmv can do this.
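
What such a conversion does to a single name can be sketched with iconv. This assumes iconv is installed; the helper name latin1_to_utf8 and the file name are invented for illustration.

```shell
# Re-encode one name from ISO-8859-1 bytes to UTF-8 bytes.
# latin1_to_utf8 is a hypothetical helper, not part of any package.
latin1_to_utf8() {
  printf '%s' "$1" | iconv -f ISO-8859-1 -t UTF-8
}

old=$(printf 'caf\351.flac')   # made-up name, with é stored as the byte 0xE9
new=$(latin1_to_utf8 "$old")   # same name, with é stored as 0xC3 0xA9
# A rename would then be: mv -- "$old" "$new"
```

convmv does the same job for a whole tree, for example convmv -f iso-8859-1 -t utf-8 -r --notest DIR. Without --notest it only reports what it would rename.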

Question 4. Is the utility convmv available in the TC4.x or TC5.x repos? I could not find it.

Offline gavs

First, I believe that I was wrong to state
Quote
Since the files were created using Exact Audio Copy in WindowsXP, the encoding is iso-8859-1
. The files were created on an NTFS drive, so I guess the file names were encoded as Unicode UTF-16. I am guessing that when the files get copied to a FAT32 disk the file names are converted to iso8859-1.

I obtained convmv and used it to convert the song file names from iso-8859-1 to utf-8. The utility reported that it converted the songs with non-ASCII file names. I suppose the others did not need converting because the first 128 utf-8 codes (0 to 127) match the ASCII codes. I then used getlocale to make a utf-8 mylocale extension and edited the boot code for lang=en_US.UTF-8. I rebooted and exported G_FILENAME_ENCODING=UTF-8, installed alsa, mpd-minimal and gmpc. When I start mpd it creates a database, identifying file encoding properly as utf-8. I start gmpc (from the command line of the shell, to ensure that it had the same environment variables) and it displays and plays all the song files properly, including the non-ASCII ones. For this function, it appears that gmpc uses the mpd database. However, for editing tags, gmpc needs to access the files directly from the file system. Well, I guess I was wrong again when I stated
Quote
I am guessing that gmpc-tagedit may be set up by default for unicode and utf-8
. Not only does gmpc report that the non-ASCII files do not exist, it reports that for all files. It seems to not be processing any file names properly when the locale is en_US.UTF-8. Loading gmpc-locale and gmpc-tagedit-locale do not seem to have any effect. A side advantage of starting gmpc from the shell is that messages are displayed. I noticed that upon starting gmpc the following messages were listed:
Code:
Gdk - WARNING **: locale not supported by C library
Gtk - WARNING **: locale not supported by C library
             Using the fallback 'C' locale

Question: Is the absence of support for en_US.UTF-8 due to how gmpc and/or gmpc-tagedit were compiled or is it related to the source code?

Offline gavs

I tried to see if I could find out more about the locale warning message (I was doubting that it was related to compilation or source code). I saw one message that said if executing locale shows problems, then locale is not properly set. Perhaps I should have mentioned in my previous post that executing locale returns these messages
Code:
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
The locale entries then all show en_US.UTF-8.

Searching for the locale warning in the Tiny Core Forum, I found this topic, http://forum.tinycorelinux.net/index.php/topic,1800.0.html which I guess is what the mylocale extension does for users. A more recent TC topic had this message, http://forum.tinycorelinux.net/index.php/topic,15517.msg90327.html#msg90327 which the author attributed to not having loaded mylocale. I think I have used getlocale properly. I specified only one selection (used the space bar and that put an asterisk next to en_US.UTF-8). A mylocale extension was generated. I added the lang=en_US.UTF-8 boot parameter. But I must be doing something wrong and the en_US.UTF-8 locale is not being properly installed? The output of locale -a gives the same 3 lines shown above and then only lists C and POSIX.
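
Checking the locale -a listing for one particular locale can be scripted. This is a sketch with made-up helper names. It assumes the listing may spell the codeset differently from the boot code (for example utf8 instead of UTF-8), so both sides are lowercased and stripped of hyphens before comparing.

```shell
# Lowercase a locale name and drop hyphens, so that "en_US.UTF-8"
# and "en_US.utf8" compare equal. (hypothetical helper)
normalize_locale() {
  printf '%s\n' "$1" | tr 'A-Z' 'a-z' | sed 's/-//g'
}

# Succeed if the locale named in $1 appears in a "locale -a" style
# listing read from stdin. (hypothetical helper)
locale_listed() {
  want=$(normalize_locale "$1")
  while IFS= read -r name; do
    [ "$(normalize_locale "$name")" = "$want" ] && return 0
  done
  return 1
}
# On a live system:
#   locale -a 2>/dev/null | locale_listed en_US.UTF-8 && echo "installed"
```

If the check fails even though the lang= boot code is set, the locale data itself is missing, which would match a listing that shows only C and POSIX.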

Offline gavs

I must have done something wrong when mylocale was created. The created extension had no files and only one link to locale-archive. I removed the lang boot code, removed the old mylocale extension files, rebooted, loaded getlocale and created mylocale again. This time everything worked fine. When I reboot with the lang boot code and execute locale -a it does not report any missing files and it displays all the locales that I included as being available.

I booted with locale en_US.UTF-8, entered G_FILENAME_ENCODING=utf-8, installed alsa, mpd, gmpc and gmpc-tagedit. Music file names were encoded as utf-8. The mpd database was created properly using utf-8 encoding. Starting gmpc from the shell did not show any errors. GMPC displayed and played songs, including songs with non-ASCII file names. Now, GMPC was able to queue both ASCII and non-ASCII file names for tag editing.

I went back to try iso8859-1 locale (when I tested this earlier, it is possible that mylocale extensions were not working). I booted with locale en_US.iso88591, entered G_FILENAME_ENCODING=iso8859-1, installed alsa, mpd, gmpc and gmpc-tagedit. Music file names were encoded as iso8859-1. The mpd database was created properly using iso8859-1 encoding. Starting gmpc from the shell did not show any errors. GMPC displayed and played songs, including songs with non-ASCII file names. BUT, GMPC reported file not found when I tried to queue non-ASCII file names for tag editing.

OK, one more trial to try to understand this. I booted with locale en_US.iso88591, entered G_FILENAME_ENCODING=utf-8, installed alsa, mpd, gmpc and gmpc-tagedit. Music file names were encoded as utf-8. The mpd database was created properly using utf-8 encoding. Starting gmpc from the shell did not show any errors. GMPC displayed and played songs, including songs with non-ASCII file names. Now, GMPC was able again to queue both ASCII and non-ASCII file names for tag editing.

My conclusions: Mpd uses the G_FILENAME_ENCODING variable to interpret file names and create the database. Gmpc uses the mpd database to display and play songs. If the mpd database was created using the correct encoding, gmpc can display and play songs with non-ASCII file names. Gmpc accesses the file system directly to retrieve files for tag editing (also for album art I believe). It appears that, regardless of locale, gmpc can retrieve non-ASCII file names only if they are encoded as utf-8.
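
Given that conclusion, it helps to know which names on a disk are not valid utf-8. The sketch below assumes iconv is available and that it exits with an error on byte sequences that are not valid utf-8; the helper names are invented for illustration.

```shell
# Succeed if the bytes of $1 form valid UTF-8. (hypothetical helper)
is_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Print every path under directory $1 whose bytes are not valid UTF-8.
# Paths containing newlines are not handled by this simple loop.
list_non_utf8() {
  find "$1" | while IFS= read -r path; do
    is_utf8 "$path" || printf '%s\n' "$path"
  done
}
# Example: list_non_utf8 /mnt/sdb1
```

Pure ASCII names always pass, so only the names with non-ASCII characters in another encoding get listed.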

Offline curaga

Re: Puzzled by cifs mount, character encoding and locale (resolved)
« Reply #12 on: February 25, 2014, 02:24:14 PM »
Nice to see you got it resolved. Many components involved, and all had to be correct.

Offline gavs

Re: Puzzled by cifs mount, character encoding and locale (resolved)
« Reply #13 on: February 28, 2014, 09:59:23 AM »
Thank you for the support and for developing the getlocale utility, a very useful extension.

I want to correct a misconception when I stated:
Quote
The files were created on an NTFS drive, so I guess the file names were encoded as Unicode UTF-16. I am guessing that when the files get copied to a FAT32 disk the file names are converted to iso8859-1.
I was wrong to guess that copy does any kind of encoding conversion. It does not. I confirmed this by copying a file I had converted to utf-8 using convmv back on to a FAT32 usb drive. The file name remained as utf-8. It had 2 ? representing the 2 bytes in utf-8 for non-ASCII characters compared to the 1 ? for the iso8859-1 file name which uses only 1 byte for that non-ASCII character. I am not knowledgeable in this area, but it seems that the WindowsXP console uses code page 437 but the OS and applications generally use Unicode and code page 1252. It happens that 1252 encoding matches iso8859-1 for almost all characters. So the files I created with Exact Audio Copy in WindowsXP have file names encoded as iso8859-1. I can use convmv to convert to utf-8 so that gmpc tagedit can handle them.
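
The one-versus-two question marks can be checked by counting bytes. This is only a sketch, using é as an example character (the thread does not say which characters the file names contained).

```shell
# é is the single byte 0xE9 (octal 351) in iso8859-1,
# and the two bytes 0xC3 0xA9 (octal 303 251) in utf-8.
# $(( ... )) strips any padding that wc puts around the count.
latin1_len=$(( $(printf '\351' | wc -c) ))
utf8_len=$(( $(printf '\303\251' | wc -c) ))
echo "iso8859-1: $latin1_len byte(s), utf-8: $utf8_len byte(s)"
```

A terminal that cannot display those bytes and shows one ? per byte would then print ? for the iso8859-1 name and ?? for the utf-8 name, as described above.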

I am stumbling along in this confusing topic of character encoding, hoping that my experiences are useful to others. If I have made any incorrect statements, for example that gmpc-tagedit can only handle utf-8 file names, I hope that someone more knowledgeable will point out the errors.

Offline gavs

Re: Puzzled by cifs mount, character encoding and locale (resolved)
« Reply #14 on: March 19, 2014, 06:34:08 PM »
Perhaps this is common knowledge, but I thought I should note it here in case it is useful for others. Convmv is a useful utility for converting file names to utf-8 in place on a drive. My plan was to reformat the FAT32 drive as ext. This meant that I would be copying the files off the FAT32 disk to an ext formatted disk. In such a situation, if the FAT32 disk is mounted with the utf8 flag, the system sees the file names as utf-8 encoded and the copies put on the ext disk are already utf-8 encoded file names.

With the increasing prominence of utf-8, would it be useful to revisit some aspects of the default TC setup? Two things in particular:
1. to set G_FILENAME_ENCODING=UTF-8
2. to set the utf8 flag by default for vfat mounts
Perhaps this is not the right place to raise this matter?