From: Leiss, Klaus-G. 3. S-PP-RD-E. <Kla...@he...> - 2003-02-21 07:56:22
|
Hello,

I have a question regarding international characters that are not in the wiki character set. Since the ampersand gets escaped, I am not able to use character entities. What was the reason for this design? Is there a security reason for it? If not, I would like to implement something that enables me to use some.

Klaus Leiss |
From: Jeff D. <da...@da...> - 2003-02-21 17:50:24
|
> I have a question regarding international characters not in the wiki
> character set. Since the ampersand gets escaped I am not able to
> use character entities. What was the reason for this design.. ?

This has been discussed before, but I don't remember the answer. Offhand, I can't think of a reason why entities shouldn't be allowed. Can anyone else?

Note that currently the page data is stored as ISO-8859-1, so you can use any ISO-8859-1 character, as long as you can figure out how to type it in directly. This handles all the western European accented and umlauted characters (ùúûü) as well as a few funky things like §¼½¾×©¥®°±µ¶.

If we allowed entities, we'd want to convert ones representable in ISO-8859-1 to ISO-8859-1 upon page saving, since otherwise it would hinder searching. (Searches for "Häschen" won't find the text "H&auml;schen".)

Allowing entities would be a simple enough fix. I'll do it if no one comes up with a reason not to. |
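The save-time conversion Jeff describes (decode any entity whose character fits in ISO-8859-1, leave the rest alone) could be sketched like this. This is a hypothetical helper, not PhpWiki code: the function name and the skip-list of markup-significant entities are my own assumptions.

```python
import re
from html import unescape

# Entities that carry markup meaning must stay escaped (assumption).
MARKUP_ENTITIES = {"&amp;", "&lt;", "&gt;", "&quot;"}

ENTITY_RE = re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def canonicalize_entities(text):
    """Replace entities representable in ISO-8859-1 with the literal
    character, so a search for the character finds the saved text."""
    def repl(match):
        entity = match.group(0)
        if entity in MARKUP_ENTITIES:
            return entity
        char = unescape(entity)
        if char != entity:                  # it was a recognized entity
            try:
                char.encode("latin-1")      # does it fit in ISO-8859-1?
                return char
            except UnicodeEncodeError:
                pass
        return entity                       # e.g. &alpha; stays as-is
    return ENTITY_RE.sub(repl, text)
```

With this, saving "H&auml;schen" stores the literal "Häschen", while entities outside ISO-8859-1 survive untouched.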
From: Klaus - G. L. <Le...@we...> - 2003-02-21 20:34:20
|
Hello,

> > I have a question regarding international characters not in the wiki
> > character set. Since the ampersand gets escaped I am not able to
> > use character entities. What was the reason for this design.. ?
>
> Offhand, I can't think of a reason why entities shouldn't be allowed.
> Can anyone else?
>
> Note that currently the page data is stored as ISO-8859-1, so you can
> use any ISO-8859-1 character, as long as you can figure out how to
> type it in directly. This handles all the western European accented
> and umlauted characters (ùúûü) as well as a few funky things like
> §¼½¾×©¥®°±µ¶

My question provides part of the answer: I asked because of characters "not" in the wiki character set. But this is for future use. My real reason for the question was that in a webboard that I use we had difficulties with this. There had been users with different computer platforms and different local charsets, and the characters above 127 were different. I assume that is a problem of the browser. To get this correct, the browser would have to translate the character set during input. This seemed not to work. I have no experience with this situation in phpwiki, since at the moment all users of my wiki use the same platform and the same character set. I asked now because I wanted to be sure that future users could get the correct characters into the wiki. I think if you wanted to discuss music or language you would need many characters that are not in ISO-8859-1.

> If we allowed entities, we'd want to convert ones representable
> in ISO-8859-1 to ISO-8859-1 upon page saving, since otherwise it
> would hinder searching. (Searches for "Häschen" won't find
> the text "H&auml;schen".)

This could mean that entities have to be allowed in searches as well.

> Allowing entities would be a simple enough fix. I'll do it
> if no one comes up with a reason not to.
Another unrelated question: if I try to answer a message from the list, the address is that of the sender and not the list. Is it recommended that I answer only to the person, or should I answer to the list?

Klaus Leiss |
From: Jeff D. <da...@da...> - 2003-02-21 21:09:17
|
On Fri, 21 Feb 2003 21:33:51 +0100
"Klaus - Guenter Leiss" <Le...@we...> wrote:

> There had been users with different computer platforms and different
> local charsets and the characters above 127 were different. I assume
> that is a problem of the browser.

Yes, that's probably right.

For phpwiki output, the character set is specified in the Content-Type: HTTP header. Browsers are supposed to respect that.

For form input (like when editing a page), phpwiki sets the accept-charset attribute of the <form> tag. This should make browsers submit the form input in the proper character set.

(Probably there are some forms that phpwiki generates which don't have the proper accept-charset attribute. That's a phpwiki bug... Report those if you find them.)

> I think if you wanted to discuss music or language you would need
> many characters that are not in ISO-8859-1.

... or math... But then that requires things not even in Unicode.

> > (Searches for "Häschen" won't find the text "H&auml;schen".)
> This could mean that entities have to be allowed also in searches.

The real solution to that is to switch to using UTF-8 internally, so that we can store all those nice characters in a uniform manner. The problem is that MySQL and PHP support for this is not (last I checked) good (or universal) enough to do this.

Or, as a hack, I guess we could store everything in US-ASCII, with all characters above 127 converted to some canonical entity. (I.e. Ü always needs to be stored the same way: either always &Uuml; or always &#220;.)

Another option would be to use UTF-8 internally in PHP (assuming enough people have UTF-8 enabled PHPs), and convert to US-ASCII as described above in the backend, for those databases that don't support Unicode.

Just thinking aloud... or visibly, rather...

As for replying to the list or to the sender (or both): I say do whatever you think appropriate for the message. My MUA (sylpheed) seems to reply to the list by default.
It must be picking the address from the List-Post: header, since there doesn't seem to be a Reply-To:, and From: lists the sender, not the list. |
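Jeff's US-ASCII hack (one canonical spelling per character above 127) round-trips cleanly if numeric character references are chosen as the canonical form. A minimal sketch, assuming that choice; the helper names are mine, not PhpWiki's:

```python
from html import unescape

def to_ascii_storage(text):
    """Canonicalize for a US-ASCII backend: every character above 127
    becomes one fixed numeric reference (e.g. 'Ü' is always '&#220;')."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

def from_ascii_storage(stored):
    """Undo the canonical encoding when rendering or editing a page."""
    return unescape(stored)
```

Because the encoding is deterministic, a search string canonicalized the same way matches the stored text byte-for-byte, which is the whole point of the hack.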
From: Carsten K. <car...@us...> - 2003-02-22 04:40:23
|
Using UTF-8 would be the ideal solution. PostgreSQL support for UTF-8 worked well last time I tried it. Only MySQL newer than 4.1.x (alpha) has UTF-8, but I think most people still use 3.23, including myself.

I just read about some new improvements to PHP. Apparently as of PHP 4.3.0, the option --enable-mbstring is the default, which means the mbstring functions should be present unless PHP is explicitly compiled otherwise.

The good news here is that various PHP string/regex functions (if I read this correctly) can be automatically overloaded with multi-byte string functions: http://www.php.net/manual/en/ref.mbstring.php

> Multibyte extension (mbstring) also supports 'function overloading' to
> add multibyte string functionality without code modification. Using
> function overloading, some PHP string functions will be overloaded with
> multibyte string functions. For example, mb_substr() is called instead
> of substr() if function overloading is enabled. Function overloading
> makes it easy to port applications supporting only single-byte
> encodings to multibyte applications. mbstring.func_overload in php.ini
> should be set to some positive value to use function overloading.

Carsten

On Friday, February 21, 2003, at 04:09 pm, Jeff Dairiki wrote:

> The real solution to that is to switch to using UTF-8 internally,
> so that we can store all those nice characters in a uniform manner.
> The problem is that MySQL and PHP support for this is not
> (last I checked) good (or universal) enough to do this.
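The problem func_overload works around is that PHP's plain string functions count bytes, not characters. The difference is easiest to see in any language; here is a Python illustration, where slicing `bytes` behaves like byte-oriented substr() and slicing `str` like mb_substr():

```python
s = "Häschen"
raw = s.encode("utf-8")

# Byte-oriented slicing (what substr() does without mbstring):
# 'ä' occupies two bytes in UTF-8, so cutting after two bytes
# splits the character in half.
broken = raw[:2]          # b'H\xc3' -- an invalid UTF-8 fragment

# Character-oriented slicing (what mb_substr() provides):
ok = s[:2]                # 'Hä'

print(len(s), len(raw))   # 7 characters, but 8 bytes
```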
From: Aredridel <are...@nb...> - 2003-02-22 17:29:06
|
On Fri, Feb 21, 2003 at 11:40:25PM -0500, Carsten Klapp wrote:
> I just read about some new improvements to PHP. Apparently as of PHP
> 4.3.0, the option --enable-mbstring is the default which means mbstring
> functions should be present unless PHP is explicitly compiled otherwise.
>
> The good news here is that various PHP string/regex functions (if I
> read this correctly) can be automatically overloaded with multi-byte
> string functions:

The standard POSIX functions are, but PCRE UTF-8 support is still really dubious. I've enabled it in the wiki I'm maintaining, which is fully UTF-8, and it doesn't work perfectly yet.

All in all, UTF-8 is the best internal encoding -- most search algorithms don't have to be modified, and false positives with a non-UTF-aware search are very infrequent. Any database that can handle ISO-8859-1 can handle UTF-8 for storage, so that's not really an issue.

Browsers don't always send UTF-8 /back/ when you request it; that's one major issue. You do have to validate form submissions.

Ari |
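Ari's point that a non-UTF-aware search stays reliable rests on UTF-8 being self-synchronizing: lead bytes and continuation bytes come from disjoint ranges, so the byte sequence of one character cannot appear spanning the boundary between two other characters. A small demonstration, with a plain byte-substring search standing in for the database's matcher:

```python
# UTF-8 continuation bytes are 0x80-0xBF; lead bytes never are.
# A multi-byte character therefore only matches where it really occurs.
needle = "ä".encode("utf-8")                 # b'\xc3\xa4'
haystack = "Häschen und Bär".encode("utf-8")

# The byte-level search finds exactly the real occurrences:
count = haystack.count(needle)
print(count)   # one 'ä' in Häschen, one in Bär
```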
From: Klaus - G. L. <Le...@we...> - 2003-02-21 22:04:19
|
> On Fri, 21 Feb 2003 21:33:51 +0100
> "Klaus - Guenter Leiss" <Le...@we...> wrote:
>
> For form input, (like when editing a page) phpwiki sets
> the accept-charset attribute of the <form> tag. This should
> make browsers submit the form input in the proper character set.
>
> (Probably there are some forms that phpwiki generates which
> don't have the proper accept-charset attribute. That's a phpwiki
> bug... Report those if you find them.)

As I mentioned in my post, I have no experience with this in phpwiki; another system had this error. If I ever get this problem, I will report it as a bug. But if most modern browsers support this, then it should be no problem for me.

> Or, as a hack, I guess we could store everything in US-ASCII,
> with all characters above 127 converted to some canonical entity.
> (i.e. Ü always needs to be stored the same way: either &Uuml;
> or &#220;)
>
> Another option would be to internally use UTF-8 (in PHP), but
> (assuming enough people have UTF-8 enabled PHP's) and convert
> to US-ASCII as described above in the backend for those databases
> which don't support unicode.

Could you store the records as binary data? I have only worked with files as a database, and have seen that you store serialized data. But I don't know which databases would let you do this.

Klaus Leiss

P.S. Is the 1.2 series actively maintained? At the moment I'm running 1.2.2, since I can't get 1.3.4 running because my webhoster runs PHP in safe mode and until now we didn't get the access rights right. I'm asking this because I added a RemovePage function to db_filesystem.php. |
From: Jeff D. <da...@da...> - 2003-02-21 23:11:46
|
> Could you store the records as binary data?

Yes, but then you can't use the database's text-search functionality (which would be a big performance penalty). (E.g. you can't do case-insensitive searches on binary data.)

This is {only,primarily} an issue with the SQL databases, of course, since the file and dbm databases don't have text-search ability (so PhpWiki has to iterate over each page itself to do the search.) |
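The case-insensitivity problem Jeff mentions is easy to see: byte-level lowercasing only knows about ASCII, so anything above 127 stored as raw bytes won't case-fold. A sketch, with Python's `bytes.lower` standing in for a byte-oriented database collation:

```python
text = "HÄSCHEN"

# Character-aware lowercasing handles the umlaut:
assert text.lower() == "häschen"

# Byte-level lowercasing only folds A-Z; the UTF-8 bytes of 'Ä'
# (0xC3 0x84) pass through untouched, so a case-insensitive
# comparison against 'häschen' fails on binary data.
raw = text.encode("utf-8")
assert raw.lower() == b"h\xc3\x84schen"
assert raw.lower() != "häschen".encode("utf-8")
```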
From: Martin G. <gim...@gi...> - 2003-02-22 10:12:15
|
Jeff Dairiki <da...@da...> writes:

>> Could you store the records as binary data?
>
> Yes, but then you can't use the database text-search functionality
> (which would be a big performance penalty.) (E.g. you can't do
> case-insensitive searches on binary data.) This is {only,primarily}
> an issue with the SQL databases, of course, since file and dbm
> database don't have text search ability (so PhpWiki has to iterate
> over each page itself to do the search.)

Isn't this almost what the database has to do today? A FullTextSearch for 'Hello World' uses this SQL as the $search_clause:

  (LOWER(pagename) LIKE '%hello%' OR content LIKE '%hello%')
  AND
  (LOWER(pagename) LIKE '%world%' OR content LIKE '%world%')

I don't think any database will be able to optimize such a query very much, because it contains the computed value LOWER(pagename), which means that an index on pagename probably cannot be used.

Also, the LIKE search is expensive if it cannot use an index. The manual for MySQL actually says that it can't:

> The following SELECT statements will not use indexes:
>
>   mysql> SELECT * FROM tbl_name WHERE key_col LIKE "%Patrick%";
>   mysql> SELECT * FROM tbl_name WHERE key_col LIKE other_col;
>
> In the first statement, the LIKE value begins with a wildcard
> character. In the second statement, the LIKE value is not a constant.

That was from http://www.mysql.com/doc/en/MySQL_indexes.html#IDX879.

When the database is done searching through all the text in the wiki, it is processed further with regular expressions to do the highlighting... So perhaps it would be almost as fast if we retrieved all the text from the database, and then searched and highlighted in one step? Then we could store the data in UTF-8 in the database (which would be extremely cool!), because it no longer has to deal with the data, just store it for us.

I agree that this sounds a little ugly --- databases are meant to be used to speed up such searches through large masses of data. But since the current code already forces the database to do a slow search, and we then also search (highlight) ourselves afterwards, perhaps it isn't that much uglier than the current scheme...

-- 
Martin Geisler

My GnuPG Key: 0xF7F6B57B

See http://gimpster.com/ and http://phpweather.net/ for:
PHP Weather => Shows the current weather on your webpage and
PHP Shell => A telnet-connection (almost :-) in a PHP page. |
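Martin's "retrieve everything, then search and highlight in one step" idea could look roughly like this. This is a sketch, not PhpWiki code: the page store is stubbed as a dict, and the <b> highlight markup is illustrative.

```python
import re

def search_and_highlight(pages, terms):
    """One pass over all page text: AND-match every term
    (case-insensitively), then wrap the hits in a highlight tag.
    The database only has to store opaque (e.g. UTF-8) text."""
    patterns = [re.compile(re.escape(t), re.IGNORECASE) for t in terms]
    highlight = re.compile("|".join(re.escape(t) for t in terms),
                           re.IGNORECASE)
    results = {}
    for name, content in pages.items():
        haystack = name + "\n" + content
        if all(p.search(haystack) for p in patterns):
            results[name] = highlight.sub(
                lambda m: "<b>" + m.group(0) + "</b>", content)
    return results
```

Searching {"FrontPage": "Hello world!"} for ["hello", "world"] returns the page with both words wrapped in <b> tags.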
From: Jeff D. <da...@da...> - 2003-02-23 18:46:08
|
On Sun, 23 Feb 2003 01:06:41 +0100
Martin Geisler <gim...@gi...> wrote:

> Jeff Dairiki <da...@da...> writes:
> > One of these days, (after user-auth and other things have
> > stabilized) it might be good to refactor the backends and SQL schema
> > a bit.
>
> I think that would be nice. I noticed how the database is locked and
> unlocked with each operation, even on backends like PostgreSQL (and
> now also MySQL with InnoDB) that support transactions. This seems to
> be something that could benefit from a cleanup.

(Of course, one of the big headaches is to optimize the schema while keeping the backend API general enough that we can write and maintain the non-SQL backends (dba, flat-file) as well. Also, it's good to share code as much as possible between the different flavors of SQL, otherwise we get all kinds of funny bugs creeping into the lesser-used backends...)

> A similar thing would be a plugin that generates an index like the
> ones you find in the back of most books. I'm not sure if this can be
> done automatically, but the plugin could skip words that appear on
> more than perhaps 10% of the pages or something like that. And it
> should also skip words on a stoplist.

That's an interesting idea. I'm beginning to think about a more general API to allow caching of plugin output (this would be integrated with the caching of marked-up page content). Once that's in place, a plugin like that would be viable. (Until then... it's worth playing with, but it's going to be slow.)

A related idea would be a way to manually enter search terms on pages, e.g. something like "<?plugin Keywords platypus, funny animals ?>". These could be used to form a real book-style index, and to generate a meta keywords tag for search engines...

Basically the same as Category pages, I guess... (Maybe we should generate a keywords meta tag from the Category links on each page?)

Okay, so now I'm just rambling.... |
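Generating a keywords meta tag from a page's Category links might look like this. The CamelCase "Category..." link convention is PhpWiki's, but the regex, function name, and output format here are my own guesses, not the project's code:

```python
import re

# Category pages are WikiWords beginning with "Category" (assumed form).
CATEGORY_RE = re.compile(r"\bCategory((?:[A-Z][a-z0-9]+)+)\b")

def keywords_meta_tag(page_text):
    """Collect CategoryFoo links and emit a <meta> keywords tag."""
    names = sorted(set(CATEGORY_RE.findall(page_text)))
    # Split the CamelCase remainder: "FunnyAnimals" -> "funny animals"
    keywords = [" ".join(w.lower() for w in re.findall(r"[A-Z][a-z0-9]*", n))
                for n in names]
    if not keywords:
        return ""
    return '<meta name="keywords" content="%s" />' % ", ".join(keywords)
```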
From: Martin G. <gim...@gi...> - 2003-02-25 18:48:40
|
Jeff Dairiki <da...@da...> writes:

> On Sun, 23 Feb 2003 01:06:41 +0100
> Martin Geisler <gim...@gi...> wrote:
>
>> I noticed how the database is locked and unlocked with each
>> operation, even on backends like PostgreSQL (and now also MySQL
>> with InnoDB) that support transactions. This seems to be something
>> that could benefit from a cleanup.
>
> (Of course one of the big headaches is to optimize the schema while
> keeping the backend API general enough that we can write and
> maintain the non-SQL backends (dba, flat-file) as well. Also, it's
> good to share code as much as possible between the different flavors
> of SQL, otherwise we get all kinds of funny bugs creeping into the
> lesser-used backends...)

Yes, it's not easy to have an efficient backend that works with several different databases, some of which cannot use SQL. Or rather, it's somewhat easy if you just reimplement the entire backend for each database, but that's an awful waste of code...

>> A similar thing would be a plugin that generates an index like the
>> ones you find in the back of most books.
>
> That's an interesting idea. I'm beginning to think about a more
> general API to allow caching of plugin output (this would be
> integrated with the caching of marked-up page content). Once that's
> in place a plugin like that would be viable. (Until then... it's
> worth playing with but is going to be slow.)

It doesn't matter much if it's slow; it should only be used on static WikiWikiWebs where you cannot search.

> A related idea would be a way to manually enter search terms on
> pages. E.g. something like "<?plugin Keywords platypus, funny
> animals ?>" These could be used to form a real book-style index, and
> to generate a meta keywords tag for search engines...

That sounds like a really good idea! It would give much better results than a raw index of all the words, because the keywords would be selected with care...

Is there a way for a plugin to save some data (in this case the keywords for a page) in a central place, so that a MakeIndex plugin could get hold of the data to print an index?

> Basically the same as Category pages, I guess...

Yes, but much more fine-grained.

> (Maybe we should generate a keywords meta tag from Category links on
> each page?)
>
> Okay, so now I'm just rambling....

No :-) I like the idea of using links to Category pages as meta information!

-- 
Martin Geisler

My GnuPG Key: 0xF7F6B57B

See http://gimpster.com/ and http://phpweather.net/ for:
PHP Weather => Shows the current weather on your webpage and
PHP Shell => A telnet-connection (almost :-) in a PHP page. |
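The book-style index plugin discussed in this thread (skip stoplist words and words on more than ~10% of pages) could be prototyped along these lines. A sketch only: the stoplist, threshold handling, and names are illustrative, not an existing plugin.

```python
import re
from collections import defaultdict

STOPLIST = {"the", "a", "an", "and", "of", "to", "in", "is"}  # illustrative

def build_index(pages, max_fraction=0.10):
    """Map each index-worthy word to the sorted list of pages it appears
    on, skipping stoplist words and words that occur on more than
    max_fraction of all pages (too common to be useful in an index)."""
    occurs = defaultdict(set)
    for name, text in pages.items():
        for word in set(re.findall(r"[a-z]+", text.lower())):
            if word not in STOPLIST:
                occurs[word].add(name)
    cutoff = max(1, int(len(pages) * max_fraction))
    return {w: sorted(ps) for w, ps in sorted(occurs.items())
            if len(ps) <= cutoff}
```

Run once at export time, this gives a static site the back-of-the-book index Martin describes without needing a dynamic search.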
From: Martin G. <gim...@gi...> - 2003-02-23 02:54:26
|
Jeff Dairiki <da...@da...> writes:

(Shouldn't that reply go to the list? If so, then you can just post your reply to this mail there...)

>> Also, the LIKE search is expensive if it cannot use an index. The
>> manual for MySQL actually says that it can't:
>
> Interesting (& good) points. I still suspect it's a fairly large
> performance win to have MySQL do the iterative search rather than
> doing it in PHP. (You avoid the mysql->PHP communication, and the
> same basic search algorithm is written in C rather than PHP.)

Yes, that should give MySQL better performance... It also depends on where the MySQL and web server are located --- on the same machine, so that they can communicate over a local Unix socket, or on different machines, which means real network traffic...

> On the other hand, maybe PhpWiki over-all is slow enough that the
> difference is unimportant. I guess it would take some experiments to
> find out. (The answer, I suspect, depends heavily on the size of
> the wiki, too...)

Yes, I'm not sure either that it would make much of a difference with all those other lines of PHP code that's executed all the time :-)

I've played a bit with the new full-text index in MySQL and it works OK. It was just a quick hack where I stored the original search query in the TextSearchQuery class and then added a text_search() to the WikiDB_backend_mysql class. So the highlighting is wrong.

> Anyhow, none of that is currently high on my list of priorities...
>
> One of these days, (after user-auth and other things have
> stabilized) it might be good to refactor the backends and SQL schema
> a bit.

I think that would be nice. I noticed how the database is locked and unlocked with each operation, even on backends like PostgreSQL (and now also MySQL with InnoDB) that support transactions. This seems to be something that could benefit from a cleanup.

> (Pagetype and the cached markup should each be in their own column,
> rather than stored in the general meta-data hash.) But that's for
> some other time...

Yes, there's plenty of other things to hack on :-)

> (Another project would be to implement a real word index, so that
> the SQL searches would be indexed.)

That sounds like a good idea --- the more work we can do when saving the pages, the better.

A similar thing would be a plugin that generates an index like the ones you find in the back of most books. I'm not sure if this can be done automatically, but the plugin could skip words that appear on more than perhaps 10% of the pages or something like that. And it should also skip words on a stoplist. Such a plugin would be cool for fast, exported sites where you cannot do a dynamic search.

I'm beginning to think of PhpWiki as a tool that can be used to quickly build static sites: it's quick and easy to update the contents, and the linking capabilities are great. And the look of everything is controlled by the template system.

-- 
Martin Geisler

My GnuPG Key: 0xF7F6B57B

See http://gimpster.com/ and http://phpweather.net/ for:
PHP Weather => Shows the current weather on your webpage and
PHP Shell => A telnet-connection (almost :-) in a PHP page. |
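The "real word index" Jeff mentions would turn the unindexable `LIKE '%word%'` scans into indexed equality lookups by paying the cost at page-save time. A sketch of the idea using SQLite as a stand-in; the schema and helper names are my own assumptions, not PhpWiki's:

```python
import re
import sqlite3

# Populate (word, pagename) rows when a page is saved, so a search
# becomes an indexed equality lookup instead of a full LIKE scan.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE word_index (word TEXT, pagename TEXT)")
db.execute("CREATE INDEX idx_word ON word_index (word)")

def index_page(name, text):
    """Rebuild this page's rows in the word index."""
    db.execute("DELETE FROM word_index WHERE pagename = ?", (name,))
    db.executemany(
        "INSERT INTO word_index VALUES (?, ?)",
        [(w, name) for w in set(re.findall(r"[a-z]+", text.lower()))])

def search(word):
    """Indexed lookup: which pages contain this word?"""
    rows = db.execute("SELECT pagename FROM word_index WHERE word = ?",
                      (word.lower(),))
    return sorted(r[0] for r in rows)
```

The trade-off discussed in the thread applies: more work on save, much cheaper searches, and the page content itself can stay opaque UTF-8.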