From: Michael W. <wes...@ja...> - 2020-11-27 02:02:15
In the 1990s, I made it a habit to use ONLY 7-bit ASCII characters in file names, avoiding anything outside of [0-9A-Za-z_-]. The reason was that I moved between MS-DOS/Windows (using the Shift_JIS character encoding) and FreeBSD (using the euc-jp encoding) a great deal and needed everything to work well from the command line in both.

The late 1990s saw UTF-8 begin to take over as a unified encoding, and once I could set it as my default in FreeBSD (and by extension OS X in the early 2000s) as well as in MySQL and other databases, I started feeling better about allowing Japanese file names into my systems. But I still avoided special characters that had meaning to bash and other shells. Again, I wanted names that could be typed on the command line without having to escape a bunch of characters. I wouldn't (and still don't) even put spaces into file names. !#$%&'"*+,;{}()[]/\ in file names are just asking for trouble. (The person who decided to use \ as a directory separator for MS-DOS has caused more damage to the PC industry than can be calculated.)

When I've needed to automate something based on a file name, I would use a series like _xyz_ to denote it. I could use regular expressions to extract the name from between underscore pairs (I use hyphens as word separators, so underscores are ONLY used for variable substitutions).

The point is, the conventions you use for file names should (1) avoid special characters that may get you in trouble with the command line and (2) allow you the flexibility you need within the constraints of #1.

Unfortunately, that doesn't help with the Cyrillic ё problem.

My eXist procedures were all started around version 1.2, and most of what I do with eXist is done by running REST-based XQueries. The above naming conventions work fine internally. XML files are mainly modified via an XQuery process. I may sometimes touch up a file directly with oXygen or Nova via WebDAV (only occasionally using eXide).
But nothing else touches the XML files, so there are no problems.

My processes were developed over the past decade, which is probably why they evolved this way. These were the tools with the least resistance at the time. You have so many more choices in workflows now.

GUI operating systems have made it much easier to go ahead and use special characters in file names. A file name that would have caused a corrupt hard disk in the past is now taken care of in the background so users don't have to worry about it. So long as you're staying within that system, you should be okay. Try to integrate with external systems and all bets are off.

With the main bridge between eXist and other systems being RESTful XQueries, I know that I can keep everything consistent up to the output. Some consumers of that output, such as old versions of the Japanese edition of Excel, can only handle Shift_JIS characters. I can output in Shift_JIS, but my data has many names of people with characters outside of that character set. So clients who insist on using old software get a ? in place of those characters (吉國 becomes 吉?). It's a known limitation of the client-side software. The only workaround is for the customer to upgrade to modern software.

Well, I hope this short history helps in some way.

Take care.

On Fri, Nov 27, 2020 at 1:50 David Birnbaum <dj...@gm...> wrote:

> Dear exist-open (cc Christian, Peter),
>
> Thank you, Christian, for this information, which I had not known about
> previously. If eXist-db installs apps by first unzipping them onto the
> server filesystem in an expathrepo directory (giving the OS a chance to
> normalize the filenames), does this mean that whether those who install my
> app get a composed or decomposed representation of a composite character in
> a filename may depend on their OS?
> If that is the case, it may not matter
> where the confusion of the two representations happens, since I need it to
> be possible for users on MacOS and other OSs to install the app, and to be
> able to address the files by name over REST using the same HTTP URIs.
>
> In response to Peter's observation, although I can set the locale on my
> own machine, those who install my app may not know how to set a locale, or
> be (reasonably) unable or unwilling to change their setting. My locale is
> set to en_US.UTF-8, and response headers on a REST return show
> "application/xml; charset=UTF-8" as the content type, so I thought it
> should have been able to handle either the composed or decomposed
> representation of Cyrillic ё. "Should have been able" has been a leitmotif
> in this thread, though, and the additional observations from Christian and
> Peter seem to suggest that even should the eXist-db resource management
> interfaces be updated to handle filenames with non-ASCII characters
> robustly (as the WebDAV interface already seems to do), interacting with
> the files by filename using the REST interface may face challenges that
> originate outside eXist-db.
>
> Best,
>
> David
>
> On Wed, Nov 25, 2020 at 11:39 PM Christian Wittern <cwi...@gm...>
> wrote:
>
>> Dear David,
>>
>> Just chiming in on this very specific point:
>>
>> On 26/11/2020 03.05, David Birnbaum wrote:
>> > But for those who are curious: on my Mac, with the data directory left
>> > with the default value /Users/djb/Library/Application
>> > Support/org.exist, packages are unpacked
>> > under /Users/djb/Library/Application Support/org.exist/expathrepo/, so
>> > trying to install the xar file let me see whether ё was represented by
>> > a single composite character or by a base followed by a combining
>> > diacritic.
>>
>> The macOS file system will normalize characters to use decomposed forms
>> of characters that are encoded as pre-composed characters in Unicode as
>> a rule, so in this case the observed fact does not allow any conclusions
>> regarding the handling within eXist or its I/O pipelines. For that
>> purpose, you would need to use a different OS.
>>
>> All the best,
>>
>> Christian
>>
>> _______________________________________________
>> Exist-open mailing list
>> Exi...@li...
>> https://lists.sourceforge.net/lists/listinfo/exist-open
>

-- 
Michael Westbay
Writer/System Administrator
http://www.japanesebaseball.com/