From: Jeff D. <da...@da...> - 2003-02-25 17:00:41
On Tue, 25 Feb 2003 10:49:35 +0100
"Leiss, Klaus-Guenter 3188 S-PP-RD-E2" <Kla...@he...> wrote:

> > Klaus suggested to me that simply removing the % sign from
> > PhpWiki dump / pgsrc filenames might eliminate such esoteric
> > ftp problems. It means a bit more work to write code for
> > PhpWiki to decode these filenames but I believe it can be done.
> > Any comments / suggestions?
>
> It was not only because of my ftp problem, but also because
> somebody ( I think jeff ? ) mentioned something about a
> slash problem.

All right, here are some more ramblings and background on the
problem as I see it.

The slash problem is ancillary. The real problem I see is as follows.

There are (at least) two common uses for static (X)HTML dumps:

 1. Dump to the local file system for off-line browsing.
 2. A static wiki image served up by a web server, for a
    high-traffic but infrequently edited wiki.

Currently a page name like "A Page" gets dumped to a file named
"A%20Page.html". The problem is that in case 1 one must link to
that page like:

    <a href="A%20Page.html">A Page</a>

while in case 2 one must double-escape the %:

    <a href="A%2520Page.html">A Page</a>

The difference arises because in the second case there's a web
server involved, and the web server urldecodes the URL before
interpreting it: "%2520" decodes to "%20", which then matches the
file name on disk.

So the problem is specifically one of using urlencoding to generate
file names.

I'm not sure what the best solution is. I think (though I haven't
thought too hard about it) that using something like MIME
quoted-printable encoding instead of urlencode() would work fine.
(The two encodings are basically the same, except that
quoted-printable uses an equals sign ('=') instead of a percent
sign as the escape character.) That would give "A=20Page.html".

Perhaps we should think ahead and pick an encoding which is more
Unicode friendly?

We also want to think about ways to encode page actions within
XHTML dump filenames.
For example, we might want to dump all the backlinks for each page
too (or dump all the versions of each page, or...)

As for the slash problem: changing encodings could be used to fix
it, but I'm not sure that's the best solution. It may be better to
actually reproduce the directory structure implied by the slashes.
(URLs in links from a subpage would then have to be prefixed with
the appropriate number of ../s.)

Some random points:

Encoding to completely alphanumeric [a-zA-Z0-9] filenames is, I
think, impractical, if only because it means we'd have to encode
nearly every page name. We're going to have to use at least one
funny (non-alphanumeric) character in the encoding.

I don't really see this ftp-server problem as significant. (Until
someone convinces me it's a really common problem. It seems
appalling to me that a web host would not admit that an ftp server
which can't be used to upload a file with a perfectly legal name is
broken.) (BTW, Klaus, did you have any luck using zipped pgsrc?)

Now might be a good time to think hard, and define more precisely
what the space of allowed wiki page names is.

E.g., since the introduction of subpages, page names beginning with
'/' have implicitly become illegal, since there's no way to link to
them. ('[/Page]' is always interpreted as a link to a subpage of the
page on which the link appears.) So if anyone had a wiki with a
page name which began with a '/', and then upgraded to a PhpWiki
which supports subpages, they can't access that page anymore.

Another example is pages containing the '#' character. Introduction
of the named anchor syntax broke those pages, since now '[Page#Two]'
is a link to the anchor named "Two" within the page named "Page".
(This is now "fixed" by supporting the escape character within
bracket links, so you can use '[Page~#Two]' to link to the page
named "Page#Two". This is all very confusing though...)

It might be worth excluding certain characters from being allowed
in page names.

 0. Control characters should explicitly be illegal within page
    names.

 1. '/' only allowed as subpage separator.

 2. Disallow '#' in page names? That might be too painful, since
    "Bulletin #23" seems like a perfectly reasonable page name.

 3. Disallow ':'? Again, one can envision reasonable page names
    containing a colon, but making it illegal would make parsing of
    InterWiki:Links much easier. (E.g. one could recognize
    un-registered interwiki links as being such.)

 4. Disallow ';'? If this were done, one could come up with a much
    cleaner MagicPhpWikiURL syntax:

        [SomePage;version=3]       (or [SomePage;3])
        [AnotherPage;action=edit]  (or [AnotherPage;edit])

 5. It is currently hard but possible to create pages with leading
    and/or trailing space in the name. This should probably be
    outlawed. In fact, we should probably canonicalize the spacing
    in page names: strip leading and trailing whitespace, and
    convert each occurrence of (possibly repeated) internal
    whitespace to a single space character.

Enough ramblings. Comments welcome.
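
To make the encoding question concrete, here is a minimal sketch in
Python (not PhpWiki code). It shows why urlencoded file names need
double escaping behind a web server, and how a quoted-printable-style
'=' escape avoids that. Everything named here is hypothetical:
SAFE_CHARS, encode_filename(), decode_filename(), current_filename()
and served_path() are illustration only, and escaping the UTF-8 bytes
of the name is just one possible answer to the Unicode question.

```python
import string
from urllib.parse import quote, unquote

# Characters written to the file name unescaped. This set is an
# assumption made for the sketch, not a rule taken from PhpWiki.
SAFE_CHARS = string.ascii_letters + string.digits + "-_"


def encode_filename(page_name, escape="="):
    """Escape every UTF-8 byte outside SAFE_CHARS as <escape>XX."""
    parts = []
    for byte in page_name.encode("utf-8"):
        ch = chr(byte)
        parts.append(ch if ch in SAFE_CHARS else f"{escape}{byte:02X}")
    return "".join(parts)


def decode_filename(filename, escape="="):
    """Inverse of encode_filename(): rebuild the bytes, then decode UTF-8."""
    raw = bytearray()
    i = 0
    while i < len(filename):
        if filename[i] == escape:
            raw.append(int(filename[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(filename[i]))
            i += 1
    return raw.decode("utf-8")


def current_filename(page_name):
    """Stand-in for the urlencoded dump file name described above."""
    return quote(page_name, safe="") + ".html"


def served_path(href):
    """What a web server looks up after urldecoding the requested href."""
    return unquote(href)
```

With the urlencoded name, served_path("A%20Page.html") is
"A Page.html", which is not the file on disk; only the doubly
escaped href quote(current_filename("A Page")) decodes back to
"A%20Page.html". With the '=' escape, served_path() leaves
"A=20Page.html" untouched, so one href can serve both uses. Note
that this sketch also escapes '/', i.e. it flattens subpages rather
than reproducing them as directories.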
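
The page-name rules that are not in question (points 0 and 5, plus
the leading '/' case) and the "../" prefix for subpage links can be
sketched as below. Again this is hypothetical Python, not PhpWiki
code: canonicalize_page_name() and link_prefix() are made-up names,
and the sketch assumes whitespace is collapsed before control
characters are checked, so tabs and newlines are folded into spaces
rather than rejected. The open questions about '#', ':' and ';' are
deliberately left out.

```python
def canonicalize_page_name(name):
    """Normalize spacing in a page name and reject clearly illegal names."""
    # Strip leading/trailing whitespace and collapse internal runs
    # (including tabs and newlines) to single spaces.
    name = " ".join(name.split())
    if not name:
        raise ValueError("empty page name")
    # Any control character still present is not whitespace: reject it.
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in name):
        raise ValueError("control character in page name")
    # '[/Page]' always means a subpage link, so such a name is unreachable.
    if name.startswith("/"):
        raise ValueError("page name may not begin with '/'")
    return name


def link_prefix(page_name):
    """Prefix for links written from this page, assuming each '/' in a
    page name becomes a directory level in the dump."""
    return "../" * page_name.count("/")
```

For instance, canonicalize_page_name("  A   Page \t") yields
"A Page", and a page named "Page/Sub/Deep" would write its links
with the prefix "../../".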