From: Geoff H. <ghu...@ws...> - 2002-01-29 20:00:15
On Tue, 29 Jan 2002, Neal Richter wrote:

> defined.. virtual void functions, etc. The current Retriever is heavily
> built around the idea of webpages and HTTP (of course)...

The Retriever class isn't really built around much of anything IMHO. It requires that documents have a URL and that the URLs can be grouped into Server objects. In the 3.1 code, the Document class was bound to webpages and HTTP. In the 3.2 code, the Document class is more of a "branch point," picking Transport and Parser objects as needed.

> you could write a retriever to get docs directly out of a database,
> from a file, scp, POP, via parameters, etc.

The distinction I drew when working on indexing towards the beginning of 3.2 code was between the Retriever (the spider itself) and the Transport. The latter basically handles a specific URL scheme. So there's now HtHTTP, HtFile, HtNNTP and External transport classes. It sounds like you're talking more about Transport-type concepts.

Again, I think it's the URL that's the critical point. Otherwise how are the search results useful? How do you "jump to" a particular result from the output? The databases tie the URL to the DocID which is used internally, but this doesn't seem particularly useful to the outside world. Maybe I'm misunderstanding you.

> unix-style mail files,
> XML files (given a spec.. see XSLT),
> ...
> other document formats..

Certainly all of these would work fine within the current Parser class with perhaps some additional revision. The current trend has been to cut out Parser classes in favor of the external parsers and external converters. Either approach can work *if* the parsers are maintainable. (The previous PostScript and PDF parsers weren't.) If it seems that the HTML parser is somehow "special" it's simply that the other remaining parser classes are much simpler. There's very little left in the 3.2 code that cares whether it's HTML or XML or whatever.

> Again, as I write and test this stuff I'll forward .tgz files with
> a script to do the setup and diff-ing. Feel free to use it or pipe it to
> /dev/null if all you want is a web-crawling search engine. ;-)

Your work is appreciated. I'm just trying to point out a few things as someone who's been around for a while.

1) We've been moving in this direction with 3.2 and for most purposes, IMHO it's already there. Certainly if you have other suggestions, feel free to contribute.

2) It's better not to reinvent the wheel. The less code that needs to be maintained, generally the better. Do we really need new Retriever classes, or do we need to refactor what we have?

3) There are differing philosophies on the Parser class and what should be internal ht://Dig code and what should be plugged through the external parsers and converters.

(As far as #3, I'm personally all for new Parser subclasses if they won't become headaches like the old PDF.cc became.)

-Geoff
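The "branch point" idea described above, where a transport is chosen from the URL scheme, might be sketched roughly as follows. This is only an illustration: the helper names `url_scheme` and `pick_transport` are hypothetical, and the scheme-to-class mapping (in particular "news" for HtNNTP, and everything else falling through to External) is an assumption rather than something taken from the 3.2 sources.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: return the lower-cased scheme ("http", "file", ...)
// of a URL, or an empty string when there is no "scheme:" prefix.
std::string url_scheme(const std::string& url) {
    const std::string::size_type colon = url.find(':');
    if (colon == std::string::npos || colon == 0) return "";
    std::string scheme = url.substr(0, colon);
    std::transform(scheme.begin(), scheme.end(), scheme.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return scheme;
}

// Return the name of the transport class that would handle the URL.
// The mapping below is assumed for illustration only.
std::string pick_transport(const std::string& url) {
    const std::string scheme = url_scheme(url);
    if (scheme == "http") return "HtHTTP";
    if (scheme == "file") return "HtFile";
    if (scheme == "news") return "HtNNTP";
    return "External";  // anything else is left to an external handler
}
```

The point of the sketch is that the spider only needs a URL; which class fetches the bytes is decided per scheme, so a new source type means a new transport rather than a new Retriever.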
From: Neal R. <ne...@ri...> - 2002-01-29 19:29:33
On Tue, 29 Jan 2002, Geoff Hutchison wrote:

> I'm not sure why you'd want htmerge and not htpurge in this case (not
> to mention htdump and htload). I would think that in a library,
> you're less likely to need to merge whole databases together and more
> likely to want to delete URLs, etc.

This is the first pass, thought I'd throw it out for comment. The merging of databases was useful to me, before I realized that htdig could do the incremental adding/merging on its own. ;-) Definitely purge, dump, load and the other utilities are useful and I'll add them ASAP.

> > sprintf(htdig_params.configFile, "/etc/htdig/htdig.conf");
> > strcpy(htdig_params.credentials,"");
> > strcpy(htdig_params.max_hops, ""); //9 digit limit
> > strcpy(htdig_params.minimalFile, "");
> > strcpy(htdig_params.URL, ""); //stdin HTTP addrs
>
> Maybe this is just me, but I don't see how this is any more useful
> than using brackets (e.g. in the defaults.cc code) or simply:
> config["authorization"] = "";
> config["max_hops"] = "-1";
> etc.

Nope, I just wrote something very quick that had a direct mapping to the command line parameters available to htmerge & htdig. More than one way to skin a cat. I like libraries that have one main API header & documentation... lower initial startup time.

> >to mix and match them. Currently the parser classes receive a Retriever
> >object as a parameter and issue callback-style calls to the Retriever
> >object.
>
> This seems a little strange to me. Clearly you could do this, but
> since the Parser and Retriever classes are tied tightly in any model,
> you would have to write completely new Parsers for any new
> Retriever you wrote.

Point taken. The virtual base class would be more of a skeleton for future developers.. a kind of spec for what methods need to be defined.. virtual void functions, etc. The current Retriever is heavily built around the idea of webpages and HTTP (of course)... you could write a retriever to get docs directly out of a database, from a file, scp, POP, via parameters, etc.

Here are some new parser ideas:

unix-style mail files,
XML files (given a spec.. see XSLT),
flat-file databases,
usenet files,
netscape/IE/Other bookmark files,
other document formats.. (via a separate library with StarOffice 6.0 importing code)

For now, I was thinking of writing an additional method for the Parsers that will accept as a parameter the alternative Retriever. Basically this retriever will be given a document ID, a title, a string of 'meta data' and a string representing the document. Very general purpose & very similar, it's just not an HTML document, but does have some structure.

    htdig_index_open(&htdig_params);
    //start loop
    htdig_index_document(.....);
    //end loop
    htdig_index_close();

> It would be more elegant to have htsearch code that supported such an
> API. At the moment, Torsten has done what he can do--massage the
> output from htsearch. This is one reason for a new query parser. (And
> yes, the new query parser that Quim donated could be styled to have
> different query syntax if you want.)

Great. I'll forward the wrappers as I get them written and tested. Torsten's page/code is good, no knock on him.

We've got lots of experience writing these wrappers. It actually ends up being a separate shared library (libhtdigphp.so) that receives parameters and calls libhtdig.so functions. This is necessary because you don't want to have the main libhtdig.so be dependent on PHP header files to compile.

I haven't gotten in too deep in the htsearch code, but for now, I'll be just repackaging data and breaking-up/reusing existing functions. When the new query parser is integrated I'll change the PHP wrappers to support that.

Again, as I write and test this stuff I'll forward .tgz files with a script to do the setup and diff-ing. Feel free to use it or pipe it to /dev/null if all you want is a web-crawling search engine. ;-)

Thanks.

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
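The open / loop / close pattern for non-web documents could look something like the sketch below. It is a stand-in only: `DocRecord`, `IndexSession` and `index_all` are hypothetical names that collect records in memory, and none of them is the libhtdig API (the real `htdig_index_document` arguments are not given in the message). The record carries the four items mentioned: an ID, a title, a meta-data string and the document text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One non-web document, with the four pieces of data named in the message.
struct DocRecord {
    std::string id;
    std::string title;
    std::string meta;
    std::string body;
};

// Hypothetical stand-in for an index session; it only collects records
// in memory and is not the libhtdig API.
class IndexSession {
public:
    void open() { open_ = true; }
    // Returns false if the session is not open or the record has no ID.
    bool add(const DocRecord& doc) {
        if (!open_ || doc.id.empty()) return false;
        docs_.push_back(doc);
        return true;
    }
    void close() { open_ = false; }
    std::size_t size() const { return docs_.size(); }

private:
    bool open_ = false;
    std::vector<DocRecord> docs_;
};

// Open, loop over the documents, close: the call pattern from the message.
// Returns how many records the session accepted.
std::size_t index_all(IndexSession& session, const std::vector<DocRecord>& docs) {
    session.open();
    std::size_t accepted = 0;
    for (const DocRecord& doc : docs) {
        if (session.add(doc)) ++accepted;
    }
    session.close();
    return accepted;
}
```

Rejecting records without an ID reflects the concern raised elsewhere in the thread that every indexed item needs some stable handle (a URL or equivalent) for search results to point back to.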
From: Geoff H. <ghu...@ws...> - 2002-01-29 17:26:57
Seems like a great idea to me.

-Geoff

---------- Forwarded message ----------
Date: Tue, 29 Jan 2002 10:06:24 -0500 (EST)
From: Don Sizemore <dl...@me...>
To: ghu...@ws...
Subject: ht://dig / ftp / mirroring / other

Hi Geoff,

I know it's old news that Sourceforge has stopped its FTP services, but I'd still like to offer you FTP space on ibiblio for htdig, either mirrored from sourceforge or managed by you or another volunteer.

We already host a sourceforge mirror, so this would be easy and brainless for us. Your users would benefit from FTP services once more.

If you need ibiblio accounts, CVS, mysql, or mailman down the road these are also no problem.

Just let me know?

Thanks,
Donald

ibiblio.org formerly known as SunSITE 919.843.8215 and stoof.
From: Gilles D. <gr...@sc...> - 2002-01-29 17:09:21
According to J. op den Brouw:
> When running 'make install' all's installed well
>
> (.././install ......)
>
> but when doing install-strip, install fails because of
> wrong path
>
> (./install ....)
>
> I did look at the makefile but couldn't locate the problem. Using
> HP-UX 10.20

If I'm not mistaken, this problem would have been in 3.1.x since 3.1.0b3, when the install-strip target was added. The problem is it passes a modified INSTALL_PROGRAM makefile variable to the subdirectories, overriding the ones in those subdirectories' Makefiles. This is OK for absolute paths, but not for relative ones.

I don't have the time today to work out a proper fix for this, but I'll try tomorrow unless someone else can beat me to it. I'm thinking it should be handled with a case statement in the loop that handles INSTALLDIRS in the install target, to add the "../" if INSTALL_PROGRAM is relative.

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
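The rule proposed here (leave absolute install commands alone, prefix relative ones with "../" before handing them to a subdirectory) is small enough to state as code. The actual fix would be a shell `case` statement in the Makefile; the C++ function below only illustrates the logic, and both the function name and the sample command strings are assumptions.

```cpp
#include <string>

// Prefix a relative install command with "../" so it still resolves when
// make runs it from a subdirectory; absolute commands are left unchanged.
std::string install_program_for_subdir(const std::string& install_program) {
    if (install_program.empty() || install_program[0] == '/') {
        return install_program;
    }
    return "../" + install_program;
}
```

With a relative value such as "./install -c -s" this yields ".././install -c -s", which has the same shape as the path that works for plain 'make install' in the report above.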
From: Gilles D. <gr...@sc...> - 2002-01-29 15:20:45
According to Neal Richter:
> I sent a file called libhtdig-3.2.0.b4.tgz to Gilles and asked him
> to stick it in the 'contrib' directory.

OK, it's now in http://www.htdig.org/files/contrib/other/

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From: J. op d. B. <MSQ...@st...> - 2002-01-29 14:39:26
When running 'make install' all's installed well

(.././install ......)

but when doing install-strip, install fails because of wrong path

(./install ....)

I did look at the makefile but couldn't locate the problem. Using HP-UX 10.20

--jesse
--------------------------------------------------------------------
J. op den Brouw            Johanna Westerdijkplein 75
Haagse Hogeschool          2521 EN DEN HAAG
Faculty of Engeneering     Netherlands
Electrical Engeneering     +31 70 4458936
-------------------- J.E...@st... --------------------
Linux - because reboots are for hardware changes
From: Geoff H. <ghu...@ws...> - 2002-01-29 14:34:43
At 6:37 PM -0700 1/28/02, Neal Richter wrote:
>The libhtdig_htdig.cc is a callable replacement for the htdig executable.
>The libhtdig_htmerge.cc is a callable replacement for the htmerge executable.

I'm not sure why you'd want htmerge and not htpurge in this case (not to mention htdump and htload). I would think that in a library, you're less likely to need to merge whole databases together and more likely to want to delete URLs, etc.

> sprintf(htdig_params.configFile, "/etc/htdig/htdig.conf");
> strcpy(htdig_params.credentials,"");
> strcpy(htdig_params.max_hops, ""); //9 digit limit
> strcpy(htdig_params.minimalFile, "");
> strcpy(htdig_params.URL, ""); //stdin HTTP addrs

Maybe this is just me, but I don't see how this is any more useful than using brackets (e.g. in the defaults.cc code) or simply:

    config["authorization"] = "";
    config["max_hops"] = "-1";
    etc.

>to mix and match them. Currently the parser classes receive a Retriever
>object as a parameter and issue callback-style calls to the Retriever
>object.

This seems a little strange to me. Clearly you could do this, but since the Parser and Retriever classes are tied tightly in any model, you would have to write completely new Parsers for any new Retriever you wrote.

If there's something that can be improved in the current system, that's great and let's do that. It probably needs updating and/or refactoring given the new database model we're using. But as many of us can attest, getting the parsers to work "correctly" on all the strange files out there is a difficult task. No one is going to want to write an HTML parser by themselves.

Perhaps it would make more sense to me if you had an example of what sort of Retriever you're thinking about.

>would be more elegant to have a set of PHP wrappers written in C that
>provide an interface back and forth to the core searching code.

It would be more elegant to have htsearch code that supported such an API. At the moment, Torsten has done what he can do--massage the output from htsearch. This is one reason for a new query parser. (And yes, the new query parser that Quim donated could be styled to have different query syntax if you want.)

-Geoff
From: Geoff H. <ghu...@ws...> - 2002-01-29 05:01:16
On Mon, 28 Jan 2002, Neal Richter wrote:

> 1. Any opposition to using GNU indent to pretty up the code a bit? I
> don't care about the specific style options used.

Nope. I think the general feeling is that there shouldn't be any mandated indentation (i.e. indent this way or your patch won't be accepted). But I clean up sections that I find messy myself.

> 2. It would be wonderful if there was a way to disable/enable different
> compile options in the Makefiles or configure script. Use -O0 rather than
> -O2, disable -g, etc. Pretty minor, and I've got it fixed for
> me... anyone else interested?

Sounds like a great idea, but this already works w/o modification:

    CFLAGS="-O2" CXXFLAGS="-O2" ./configure

or my favorite:

    make clean; CFLAGS="-g" CXXFLAGS="-g" make

> 3. Now that I've got my feet wet here, what exactly needs to be done to
> sync-up the current mifluz code with what is in the htdig/db directory?
> Are we picking and choosing or tossing the old stuff or what?

At the moment, I'm spending all my ht://Dig time on getting htdig-3.1.6 out the door. I've written 2002-02-01 in the release notes, so we'll hopefully keep that. After that, I'm going to merge the 3-2-x branch back into the main CVS tree and start in on the mifluz merge.

In the htword/, db/, and htdb/ directories, there are essentially no useful modifications from ht://Dig. (We have some bugfixes, etc., but they're obsolete relative to the mifluz code coming in.) The htlib/ directory requires more careful merging. Then, of course, it's a matter of fixing up any mangling. I skimmed through the API changes this afternoon and I think I understand what Loic has done in the new mifluz versions.

There's also a small matter that the current mifluz requires the iconv library.

-Geoff
From: Neal R. <ne...@ri...> - 2002-01-29 02:40:55
On Mon, 28 Jan 2002, Tod Thomas wrote:

> Can someone please set the record straight on this?
>
> Everything I have come across says that web spiders cannot index javascript since they cannot parse and interpret
> it. My impression is this is particularly true for sites that use JavaScript almost exclusively for things like
> navigation, drop down menus, and the like.
>
> Is this true, or is a solution in place that I don't know about? Can anybody point me to documentation that
> discusses Javascript and how search indexing is affected? Are there commercial solutions available?

That would depend on what exactly you are doing. As far as the javascript code goes, that won't get indexed as javascript code is encapsulated in HTML comment tags... assuming the web-spider's HTML parsing code isn't brain-dead.

If you are talking about a fully dynamic page with lots of parameters and dynamic functionality it becomes very hard. The spider would need to have lots of web-browser functionality incorporated into it to load up a javascript environment and treat the page as a program-to-run, rather than a file-to-parse. The spider would have to understand http GET/POST params and how those parameters influence the page.

This is the DHTML analog of spidering classical CGI apps, most spiders can't 'go through' the forms. There was a company recently that actually got some momentary notoriety by announcing that it had developed spidering technology that would fill in web forms and submit them to spider the pages 'behind' the forms.

Javascript dependent navigation also would need a kind of browser-like spider that treats the page as a program-to-run.

Dynamic PHP pages are a bit easier, especially if all the parameters are on the URL and no forms are used. A spider sees them as standard URLs and follows them, assuming the spider's URL parser doesn't kill off the parameters before it fetches the page. Of course nothing stops you from using PHP to supply javascript menus... using PHP won't help spidering for that situation.

Summary:

CGI with forms: NO (for the most part)

DHTML with Javascript: Depends on the level of usage, the more dynamic (dependent on parameters and javascript code output) the page is, the less successful spidering will be.

PHP (or other server-side scripting lang) w/o hidden parameters: Spiders can do pretty well here.

The more user-side interactive the page is, generally the worse off you will be. Of course a lot of these drawbacks can be somewhat compensated for by designing a web-site/web-app with spidering in mind. That involves lots of testing and good default behavior.. a lot like making your pages visible/usable on now ancient web-browsers like Mozilla 1.0, early versions of Netscape, etc.

After reading the Retriever & HTML parsing code, htdig pretty much treats web pages as documents-to-parse, and not programs-to-run. So without good default behavior, it may not be too successful on pages with dynamic content for navigation purposes.

Feel free to correct me if I missed something. ;-)

Thanks.

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
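To make the document-to-parse versus program-to-run distinction concrete, here is a rough sketch of the kind of link extraction a parsing-only spider performs. It is not ht://Dig's HTML parser: `extract_links` is a hypothetical function, and the regular expression is a deliberate simplification that only handles double-quoted `href` attributes on `<a>` tags.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect the href values of <a> tags, the way a spider that treats the page
// as a document (not a program) would. Nothing here executes any script, so
// navigation that exists only in JavaScript is never seen.
std::vector<std::string> extract_links(const std::string& html) {
    static const std::regex anchor(
        "<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"",
        std::regex::ECMAScript | std::regex::icase);
    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), anchor), end;
         it != end; ++it) {
        links.push_back((*it)[1].str());  // capture group 1 is the URL
    }
    return links;
}
```

Given a page containing `<a href="/about.html">`, `<a href="/faq.php?id=7">` and a `<select onchange="location=this.value">` menu, a function like this returns the two anchor URLs (query string included) and nothing for the menu, which is the behavior summarized above: server-side pages with parameters on the URL are reachable, script-driven navigation is not.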
From: Neal R. <ne...@ri...> - 2002-01-29 01:55:04
Hey again,

Here are a couple comments/questions on the current 3.2.0b4 code.

1. Any opposition to using GNU indent to pretty up the code a bit? I don't care about the specific style options used. I'd like to make & contribute bug fixes, but honestly some of the code is very hard to follow because of formatting reasons. See htdig/htdig.cc lines 583-626 for a mild example of nested-ifs-gone-wild. This may make security auditing a bit more enjoyable ;-)

2. It would be wonderful if there was a way to disable/enable different compile options in the Makefiles or configure script. Use -O0 rather than -O2, disable -g, etc. Pretty minor, and I've got it fixed for me... anyone else interested? I did it by mucking with the configure scripts. Change parameters at the top of the files, run ./configure and voila.

3. Now that I've got my feet wet here, what exactly needs to be done to sync-up the current mifluz code with what is in the htdig/db directory? Are we picking and choosing or tossing the old stuff or what?

Thanks!

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
From: Neal R. <ne...@ri...> - 2002-01-29 01:40:57
Hey,

I sent a file called libhtdig-3.2.0.b4.tgz to Gilles and asked him to stick it in the 'contrib' directory.

It's part of a larger project to restructure the code for these goals:

1. All htdig code will be contained in the library
2. This library will be able to index data & respond to querying APIs from PHP or any other cgi-bin or program.

Unzip it in your base htdig directory. It contains these files:

    libhtdig/prepare.sh
    libhtdig/Makefile
    libhtdig/libhtdig_htdig.cc
    libhtdig/libhtdig_htmerge.cc
    libhtdig/libhtdig_api.h

After doing a configure & make on the latest snapshot, you can run prepare.sh and make within the new 'libhtdig' directory.

The prepare.sh copies the htdig/htdig code files to this directory.
The libhtdig_htdig.cc is a callable replacement for the htdig executable.
The libhtdig_htmerge.cc is a callable replacement for the htmerge executable.

The makefile compiles these files using htdig's standard method, then links these files and all files necessary for building the 6 htdig .so files into ONE libhtdig.3.2.0.so.

Here's an example of its use:

    htdig_parameters_struct htdig_params;

    htdig_params.debug = 1;
    htdig_params.initial = TRUE;
    htdig_params.create_text_database = FALSE;
    htdig_params.report_statistics = FALSE;
    htdig_params.alt_work_area = FALSE;

    sprintf(htdig_params.configFile, "/etc/htdig/htdig.conf");
    strcpy(htdig_params.credentials,"");
    strcpy(htdig_params.max_hops, ""); //9 digit limit
    strcpy(htdig_params.minimalFile, "");
    strcpy(htdig_params.URL, ""); //stdin HTTP addrs

    htdig_index_open(&htdig_params);
    htdig_index_urls();
    htdig_index_close();

Here's the TODO:

1. Generalize the 'Retriever' class in htdig-exe

It would be nicer to have a base class for the Retriever class, the current class would be inherited from the new base class. This would enable developers to create their own retrievers and parsers and be able to mix and match them. Currently the parser classes receive a Retriever object as a parameter and issue callback-style calls to the Retriever object. You could derive new classes from the current Retriever object, but you would carry around all kinds of junk that may be unneeded if you are indexing documents from other sources.

see:

    htdig_index_open(&htdig_params);
    htdig_index_document(.....);
    htdig_index_close();

    htdig_merge(&htmerge_params); //merges new index with existing index.

2. Include some code from htsearch & create PHP wrapper functions for the searching code.

The current htsearch-php3.0.1.1 module written by Torsten Neuer calls the htsearch cgi-bin and repackages the output. PHP is a very flexible web-language with strong string-manipulation ability (much like perl). It would be more elegant to have a set of PHP wrappers written in C that provide an interface back and forth to the core searching code. While I'm not suggesting that this replace the htsearch cgi-bin, a lot of the query parsing code could be replaced with fewer lines of PHP. A developer could even write a different query language for the core searching code (think 'SearchSQL' or something like it). This will be especially powerful once the 'indexable fields' feature is incorporated.

At the end of this process htdig can be integrated with other software in a variety of ways.. in some cases taking the place of a SQL database for storing, querying & optionally displaying documents. We will hopefully be using this as a document archiving tool that will eliminate lots of old/infrequently used documents from a SQL database.

I'm going to try to keep this project as a patch from the current snapshot. Unzip it, run a script file (copies files, diffs code), make and you are done.

Feedback is welcome!

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
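TODO item 1 (a base class that the current Retriever would inherit from) might take a shape like the following. This is a hypothetical skeleton, not code from the tarball: `RetrieverBase`, `MemoryRetriever` and `count_words` are invented names, and the single `next` method is only one possible interface.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical skeleton: names and signatures are illustrative only and are
// not taken from the ht://Dig sources.
class RetrieverBase {
public:
    virtual ~RetrieverBase() = default;
    // Fetch the next document into `text`; return false when none remain.
    virtual bool next(std::string& text) = 0;
};

// A retriever that serves documents from memory instead of over HTTP.
class MemoryRetriever : public RetrieverBase {
public:
    explicit MemoryRetriever(std::vector<std::string> docs)
        : docs_(std::move(docs)) {}

    bool next(std::string& text) override {
        if (pos_ >= docs_.size()) return false;
        text = docs_[pos_++];
        return true;
    }

private:
    std::vector<std::string> docs_;
    std::size_t pos_ = 0;
};

// Code written against the base class works with any retriever.
// Here it simply counts whitespace-separated words across all documents.
std::size_t count_words(RetrieverBase& source) {
    std::size_t words = 0;
    std::string text;
    while (source.next(text)) {
        bool in_word = false;
        for (char c : text) {
            const bool space = (c == ' ' || c == '\n' || c == '\t');
            if (!space && !in_word) ++words;
            in_word = !space;
        }
    }
    return words;
}
```

A derived class carries only what its source needs, which is the "junk" concern raised above about deriving from the existing HTTP-oriented Retriever.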
From: Gilles D. <gr...@sc...> - 2002-01-28 23:15:15
According to Geoff Hutchison:
> Modified Files:
> Tag: htdig-3-1-x
> EndingsDB.cc
> Log Message:
> Implement more flexible test for rx/regex, which
> will check for rxposix.h if --with-rx is supplied, will "fall
> back" to regex test if rxposix.h isn't available and will only use
> the htlib/ code and header for regex compile.
> (should solve Gilles and Joe's problems completely.)

Yes, 3.1.6 now builds properly on Red Hat 4.2, and uses the bundled regex.c automatically. Thanks!

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From: Tod T. <tt...@ch...> - 2002-01-28 15:37:29
Can someone please set the record straight on this?

Everything I have come across says that web spiders cannot index javascript since they cannot parse and interpret it. My impression is this is particularly true for sites that use JavaScript almost exclusively for things like navigation, drop down menus, and the like.

Is this true, or is a solution in place that I don't know about? Can anybody point me to documentation that discusses Javascript and how search indexing is affected? Are there commercial solutions available?

Thanks - Tod
From: Joe R. J. <jj...@cl...> - 2002-01-27 20:34:24
On Thu, 24 Jan 2002, Geoff Hutchison wrote:

> Date: Thu, 24 Jan 2002 14:00:16 -0500 (EST)
> From: Geoff Hutchison <ghu...@ws...>
> To: htd...@li...
> Subject: [htdig-dev] Re: --with-rx configure switch
>
> Attached is a patch for a --with-rx flag to configure which ignores the
> regex test and won't build the regex.o file. It will also switch HtRegex
> to use the rxposix header, which should use rx like Joe was suggesting.
>
> (Joe, I fixed the problem with the previous patch I sent you. This one
> will definitely not build regex.o.)

I did one better today and downloaded htdig-3.1.6-012702. It works great; all I needed to do was to configure it --with-rx. The configuration took a bit longer than before, especially when regex was being checked; it compiled quickly, and dug real fast;)

Thank you Geoff.

Regards,

Joe
--
_/ _/_/_/ _/ ____________ __o _/ _/ _/ _/ ______________ _-\<,_ _/ _/ _/_/_/ _/ _/ ......(_)/ (_) _/_/ oe _/ _/. _/_/ ah jj...@cl...
From: Gilles D. <gr...@sc...> - 2002-01-26 00:11:16
According to Geoff Hutchison:
> On Fri, 25 Jan 2002, Gilles Detillieux wrote:
> > I ran htfuzzy/htfuzzy endings, and it ran further but still segfaulted.
>
> What's the segfault--this obviously could be a different error
> entirely. After all, I'm assuming that broken regex support would show up
> in other programs. Do you have a backtrace?

Not a particularly useful one...

$ gdb htfuzzy/htfuzzy core
GDB is free software and you are welcome to distribute copies of it
 under certain conditions; type "show copying" to see the conditions.
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.16 (i586-unknown-linux), Copyright 1996 Free Software Foundation, Inc...
Core was generated by `htfuzzy/htfuzzy -v endings'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libz.so.1.0.4...done.
Reading symbols from /usr/lib/libstdc++.so.27.1.4...done.
Reading symbols from /lib/libm.so.5.0.6...done.
Reading symbols from /lib/libc.so.5.3.12...done.
Reading symbols from /lib/ld-linux.so.1...done.
#0 0x400ae0e2 in rx_cache_get_superstate ()
(gdb) bt
#0 0x400ae0e2 in rx_cache_get_superstate ()
#1 0x400e3dc8 in __DTOR_END__ ()
#2 0x1 in ?? ()
(gdb)

This is on Red Hat 4.2, a very old, libc5-based Linux system.

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From: Geoff H. <ghu...@ws...> - 2002-01-25 23:55:01
On Fri, 25 Jan 2002, Gilles Detillieux wrote:

Well, this is why we have pre-releases...

> > I think to make this all work, we need to rename the bundled regex.h to
> > something like htregex.h to avoid conflicts, as well as put some hooks
> > in the bundled regex.c code to disable it all if you need to use the
> > C library code instead. What do you think, Geoff?

I remember seeing this, but it's also about the same time I got swamped here, so I'm not surprised it disappeared in my mailbox.

> I don't think I ever got a response to that.

Obviously renaming to gregex.h and cleaning up the #includes in EndingsDB.cc is a good idea. As I stated, there are also minor problems with the current configure test: it doesn't check that there *is* a rxposix.h header (never trust the user) or that the user didn't supply --without-rx (ditto). So I still have to refine it.

> I ran htfuzzy/htfuzzy endings, and it ran further but still segfaulted.

What's the segfault--this obviously could be a different error entirely. After all, I'm assuming that broken regex support would show up in other programs. Do you have a backtrace?

> So, problem 2: we need a better test for regex, that will work if
> htfuzzy works, and will fail if htfuzzy fails. Here is the relevant

We do need to improve the test, though if we rename the included header to gregex.h and use this in the test #include, this should solve the issue with using the bundled code with the system's header. I'll try to get all of this in the snapshot so people can give it a try.

-Geoff
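One way to get a test that "works if htfuzzy works" is to have the check actually compile and execute a pattern at run time, instead of only checking that the sources build. The sketch below uses the standard POSIX `regcomp`/`regexec`/`regfree` calls; the function name `regex_smoke_test` and the idea of using it as a configure-style probe are assumptions, not the test that went into the snapshot. It exercises whichever regex header and library the build actually picks up, so it does not by itself detect which of the two was chosen.

```cpp
#include <regex.h>  // POSIX regcomp/regexec, from whichever header the build selects

// Compile `pattern` and run it against `text`. Returns true only if the
// pattern compiles and matches; a failure (or a crash) here would point at
// an unusable header/library combination.
bool regex_smoke_test(const char* pattern, const char* text) {
    regex_t compiled;
    if (regcomp(&compiled, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        return false;  // pattern did not compile
    }
    const bool matched = (regexec(&compiled, text, 0, nullptr, 0) == 0);
    regfree(&compiled);
    return matched;
}
```

A probe built around this would return a non-zero exit status when, say, a suffix pattern such as "ing$" fails to match "running", which is closer to what htfuzzy's endings rules rely on than a compile-only check.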
From: Gilles D. <gr...@sc...> - 2002-01-25 21:27:34
Stop the presses! We have more regex problems in 3.1.6!

There are at least two problems remaining with the regex tests in 3.1.6 that I just discovered. It turns out that "htfuzzy endings" has been segfaulting on me for some time, and for some reason that fact was never reported to me by the rundig script I was running. Anyway, I found two reasons for it segfaulting. First of all, back on Nov. 29, I wrote...

> I think we need to fix htfuzzy/EndingsDB.cc to check the setting
> of HAVE_BROKEN_REGEX and use the appropriate header file. Come to
> think of it, I think there's a problem with how HAVE_BROKEN_REGEX is
> handled in htlib/HtRegex.h too, because simply using #include <regex.h>
> doesn't guarantee that the compiler won't use the bundled one instead,
> as the Makefile.config file puts a -I../htlib in the compiler flags.
> I think to make this all work, we need to rename the bundled regex.h to
> something like htregex.h to avoid conflicts, as well as put some hooks
> in the bundled regex.c code to disable it all if you need to use the
> C library code instead. What do you think, Geoff?

I don't think I ever got a response to that. This is at least part of the problem. EndingsDB.cc doesn't check HAVE_BROKEN_REGEX, so it always includes <regex.h>, but it turns out that this is still the regex.h in htlib, because of the -I../htlib flag. However, combining the htlib/regex.h with the C library regex.o can be a bad thing, because the data structures may be incompatible. This seems to be the case on my system, causing htfuzzy endings to segfault right at the start.

So, problem 1: we need to fix things up so the bundled regex.h header isn't included at all by compiled code, if we're not using the corresponding bundled regex.c code.

OK, I thought, I'll just rename htlib/regex.h to htlib/gregex.h so it doesn't get included, remove the defective htfuzzy/EndingsDB.o (and also htlib/HtRegex.o for good measure) and re-make. It builds fine, and so I ran htfuzzy/htfuzzy endings, and it ran further but still segfaulted. So, even with the C library regex.h AND regex.o, htfuzzy still doesn't work reliably on my system. This confirms my earlier attempts at trying Joe Jah's FAQ 5.14 fix on my system with earlier 3.1.x releases, when it failed there too.

So, on my system the bundled regex code works (it certainly did in earlier 3.1.x releases where it was used unconditionally), but the C library regex code doesn't work. So, why is ./configure setting HAVE_BROKEN_REGEX, which suggests the bundled regex code failed on my system? Obviously the test failed, even though htfuzzy works correctly with the bundled regex code. (I just confirmed this by forcing 3.1.6 manually to use the bundled regex code.)

So, problem 2: we need a better test for regex, that will work if htfuzzy works, and will fail if htfuzzy fails. Here is the relevant config.log output. It shows that the reason the test fails has nothing to do with any problems specific to regex.c not running on my system, but rather it seems it's because of missing definitions in the test or something like that. I suspect it might be trying to compile the bundled htlib/regex.c using the system's /usr/include/regex.h.

configure:2513: checking if we should use the included regex?
configure:2531: gcc -o conftest -O2 -m486 -fno-strength-reduce conftest.c -lz 1>&5 In file included from configure:2524: ./htlib/regex.c:1833: parse error before `_RE_ARGS' ./htlib/regex.c:1836: parse error before `_RE_ARGS' ./htlib/regex.c:1837: parse error before `_RE_ARGS' ./htlib/regex.c:1839: parse error before `_RE_ARGS' ./htlib/regex.c:1841: parse error before `_RE_ARGS' ./htlib/regex.c:1843: parse error before `_RE_ARGS' ./htlib/regex.c:1846: parse error before `_RE_ARGS' ./htlib/regex.c:1859: parse error before `_RE_ARGS' ./htlib/regex.c:2233: parse error before `_RE_ARGS' ./htlib/regex.c: In function `regex_compile': ./htlib/regex.c:2313: `RE_TRANSLATE_TYPE' undeclared (first use this function) ./htlib/regex.c:2313: (Each undeclared identifier is reported only once ./htlib/regex.c:2313: for each function it appears in.) ./htlib/regex.c:2313: parse error before `translate' ./htlib/regex.c:2405: structure has no member named `used' ./htlib/regex.c:2453: `translate' undeclared (first use this function) ./htlib/regex.c:2465: invalid operands to binary - ./htlib/regex.c:2465: invalid operands to binary - ./htlib/regex.c:2480: invalid operands to binary - ./htlib/regex.c:2480: invalid operands to binary - ./htlib/regex.c:2573: invalid operands to binary - ./htlib/regex.c:2573: invalid operands to binary - ./htlib/regex.c:2601: invalid operands to binary - ./htlib/regex.c:2601: invalid operands to binary - ./htlib/regex.c:2615: invalid operands to binary - ./htlib/regex.c:2615: invalid operands to binary - ./htlib/regex.c:2626: invalid operands to binary - ./htlib/regex.c:2626: invalid operands to binary - ./htlib/regex.c:3105: invalid operands to binary - ./htlib/regex.c:3105: invalid operands to binary - ./htlib/regex.c:3111: invalid operands to binary - ./htlib/regex.c:3111: invalid operands to binary - ./htlib/regex.c:3119: invalid operands to binary - ./htlib/regex.c:3119: invalid operands to binary - ./htlib/regex.c:3671: invalid operands to binary - 
./htlib/regex.c:3673: invalid operands to binary - ./htlib/regex.c:3674: invalid operands to binary - ./htlib/regex.c:3684: invalid operands to binary - ./htlib/regex.c:3685: invalid operands to binary - ./htlib/regex.c:3685: invalid operands to binary - ./htlib/regex.c:3717: invalid operands to binary - ./htlib/regex.c:3717: invalid operands to binary - ./htlib/regex.c:3763: invalid operands to binary - ./htlib/regex.c:3763: invalid operands to binary - ./htlib/regex.c:3779: invalid operands to binary - ./htlib/regex.c:3779: invalid operands to binary - ./htlib/regex.c:3808: invalid operands to binary - ./htlib/regex.c:3808: invalid operands to binary - ./htlib/regex.c:3900: invalid operands to binary - ./htlib/regex.c:3900: invalid operands to binary - ./htlib/regex.c:3921: invalid operands to binary - ./htlib/regex.c:3921: invalid operands to binary - ./htlib/regex.c:4023: invalid operands to binary - ./htlib/regex.c:4023: invalid operands to binary - ./htlib/regex.c:4031: invalid operands to binary - ./htlib/regex.c:4031: invalid operands to binary - ./htlib/regex.c:4038: invalid operands to binary - ./htlib/regex.c:4038: invalid operands to binary - ./htlib/regex.c:4044: invalid operands to binary - ./htlib/regex.c:4044: invalid operands to binary - ./htlib/regex.c:4050: invalid operands to binary - ./htlib/regex.c:4050: invalid operands to binary - ./htlib/regex.c:4056: invalid operands to binary - ./htlib/regex.c:4056: invalid operands to binary - ./htlib/regex.c:4062: invalid operands to binary - ./htlib/regex.c:4062: invalid operands to binary - ./htlib/regex.c:4068: invalid operands to binary - ./htlib/regex.c:4068: invalid operands to binary - ./htlib/regex.c:4086: invalid operands to binary - ./htlib/regex.c:4086: invalid operands to binary - ./htlib/regex.c:4147: invalid operands to binary - ./htlib/regex.c:4147: invalid operands to binary - ./htlib/regex.c:4152: invalid operands to binary - ./htlib/regex.c:4152: invalid operands to binary - 
./htlib/regex.c:4169: `RE_NO_POSIX_BACKTRACKING' undeclared (first use this function) ./htlib/regex.c:4170: invalid operands to binary - ./htlib/regex.c:4170: invalid operands to binary - ./htlib/regex.c:4183: structure has no member named `used' ./htlib/regex.c:4183: invalid operands to binary - ./htlib/regex.c: At top level: ./htlib/regex.c:4245: warning: type mismatch with previous implicit declaration ./htlib/regex.c:4162: warning: previous implicit declaration of `store_op1' ./htlib/regex.c:4245: warning: `store_op1' was previously implicitly declared to return `int' ./htlib/regex.c:4245: warning: `store_op1' was declared implicitly `extern' and later `static' ./htlib/regex.c:4259: warning: type mismatch with previous implicit declaration ./htlib/regex.c:3953: warning: previous implicit declaration of `store_op2' ./htlib/regex.c:4259: warning: `store_op2' was previously implicitly declared to return `int' ./htlib/regex.c:4259: warning: `store_op2' was declared implicitly `extern' and later `static' ./htlib/regex.c:4275: warning: type mismatch with previous implicit declaration ./htlib/regex.c:3780: warning: previous implicit declaration of `insert_op1' ./htlib/regex.c:4275: warning: `insert_op1' was previously implicitly declared to return `int' ./htlib/regex.c:4275: warning: `insert_op1' was declared implicitly `extern' and later `static' ./htlib/regex.c:4295: warning: type mismatch with previous implicit declaration ./htlib/regex.c:3928: warning: previous implicit declaration of `insert_op2' ./htlib/regex.c:4295: warning: `insert_op2' was previously implicitly declared to return `int' ./htlib/regex.c:4295: warning: `insert_op2' was declared implicitly `extern' and later `static' ./htlib/regex.c:4316: warning: type mismatch with previous implicit declaration ./htlib/regex.c:2464: warning: previous implicit declaration of `at_begline_loc_p' ./htlib/regex.c:4316: warning: `at_begline_loc_p' was previously implicitly declared to return `int' 
./htlib/regex.c:4316: warning: `at_begline_loc_p' was declared implicitly `extern' and later `static' ./htlib/regex.c:4335: warning: type mismatch with previous implicit declaration ./htlib/regex.c:2479: warning: previous implicit declaration of `at_endline_loc_p' ./htlib/regex.c:4335: warning: `at_endline_loc_p' was previously implicitly declared to return `int' ./htlib/regex.c:4335: warning: `at_endline_loc_p' was declared implicitly `extern' and later `static' ./htlib/regex.c:4357: warning: type mismatch with previous implicit declaration ./htlib/regex.c:4082: warning: previous implicit declaration of `group_in_compile_stack' ./htlib/regex.c:4357: warning: `group_in_compile_stack' was previously implicitly declared to return `int' ./htlib/regex.c:4357: warning: `group_in_compile_stack' was declared implicitly `extern' and later `static' ./htlib/regex.c:4478: warning: type mismatch with previous implicit declaration ./htlib/regex.c:3181: warning: previous implicit declaration of `compile_range' ./htlib/regex.c:4478: warning: `compile_range' was previously implicitly declared to return `int' ./htlib/regex.c:4478: warning: `compile_range' was declared implicitly `extern' and later `static' ./htlib/regex.c: In function `compile_range': ./htlib/regex.c:4480: parse error before `RE_TRANSLATE_TYPE' ./htlib/regex.c:4525: subscripted value is neither array nor pointer ./htlib/regex.c:4531: subscripted value is neither array nor pointer ./htlib/regex.c:4535: subscripted value is neither array nor pointer ./htlib/regex.c:4535: subscripted value is neither array nor pointer ./htlib/regex.c: In function `re_compile_fastmap': ./htlib/regex.c:4593: structure has no member named `used' ./htlib/regex.c:4617: structure has no member named `can_be_null' ./htlib/regex.c:4626: structure has no member named `can_be_null' ./htlib/regex.c:4651: structure has no member named `can_be_null' ./htlib/regex.c:4728: structure has no member named `can_be_null' ./htlib/regex.c:4829: structure 
has no member named `can_be_null' ./htlib/regex.c:4882: structure has no member named `can_be_null' ./htlib/regex.c: In function `re_search_2': ./htlib/regex.c:4983: syntax error before `translate' ./htlib/regex.c:5001: structure has no member named `used' ./htlib/regex.c:5002: warning: dereferencing `void *' pointer ./htlib/regex.c:5002: void value not ignored as it ought to be ./htlib/regex.c:5004: warning: dereferencing `void *' pointer ./htlib/regex.c:5004: void value not ignored as it ought to be ./htlib/regex.c:5036: structure has no member named `can_be_null' ./htlib/regex.c:5051: `translate' undeclared (first use this function) ./htlib/regex.c:5075: structure has no member named `can_be_null' ./htlib/regex.c: At top level: ./htlib/regex.c:5267: parse error before `_RE_ARGS' ./htlib/regex.c:5270: parse error before `_RE_ARGS' ./htlib/regex.c:5273: parse error before `_RE_ARGS' ./htlib/regex.c:5276: parse error before `_RE_ARGS' ./htlib/regex.c: In function `re_match_2_internal': ./htlib/regex.c:5399: structure has no member named `used' ./htlib/regex.c:5407: `RE_TRANSLATE_TYPE' undeclared (first use this function) ./htlib/regex.c:5407: parse error before `translate' ./htlib/regex.c:5438: `active_reg_t' undeclared (first use this function) ./htlib/regex.c:5438: parse error before `lowest_active_reg' ./htlib/regex.c:5920: `translate' undeclared (first use this function) ./htlib/regex.c:5954: parse error before `r' ./htlib/regex.c:5954: `r' undeclared (first use this function) ./htlib/regex.c:5954: `lowest_active_reg' undeclared (first use this function) ./htlib/regex.c:5954: `highest_active_reg' undeclared (first use this function) ./htlib/regex.c:5968: parse error before `r' ./htlib/regex.c:6340: parse error before `r' ./htlib/regex.c:6525: parse error before `this_reg' ./htlib/regex.c:6525: `this_reg' undeclared (first use this function) ./htlib/regex.c:6595: parse error before `r' ./htlib/regex.c:6679: parse error before `this_reg' ./htlib/regex.c:6736: 
parse error before `this_reg' ./htlib/regex.c:6918: parse error before `dummy_low_reg' ./htlib/regex.c:6923: parse error before `this_reg' ./htlib/regex.c:6923: `dummy_high_reg' undeclared (first use this function) ./htlib/regex.c:6923: parse error before `fail_stack' ./htlib/regex.c:6924: `dummy_low_reg' undeclared (first use this function) ./htlib/regex.c:6924: parse error before `fail_stack' ./htlib/regex.c:6966: parse error before `this_reg' ./htlib/regex.c:6979: parse error before `this_reg' ./htlib/regex.c:7184: parse error before `r' ./htlib/regex.c:7193: parse error before `r' ./htlib/regex.c:7209: parse error before `this_reg' ./htlib/regex.c:7209: parse error before `fail_stack' ./htlib/regex.c:7210: parse error before `fail_stack' ./htlib/regex.c: At top level: ./htlib/regex.c:7276: warning: type mismatch with previous implicit declaration ./htlib/regex.c:6360: warning: previous implicit declaration of `group_match_null_string_p' ./htlib/regex.c:7276: warning: `group_match_null_string_p' was previously implicitly declared to return `int' ./htlib/regex.c:7276: warning: `group_match_null_string_p' was declared implicitly `extern' and later `static' ./htlib/regex.c:7388: warning: type mismatch with previous implicit declaration ./htlib/regex.c:7358: warning: previous implicit declaration of `alt_match_null_string_p' ./htlib/regex.c:7388: warning: `alt_match_null_string_p' was previously implicitly declared to return `int' ./htlib/regex.c:7388: warning: `alt_match_null_string_p' was declared implicitly `extern' and later `static' ./htlib/regex.c:7425: warning: type mismatch with previous implicit declaration ./htlib/regex.c:7409: warning: previous implicit declaration of `common_op_match_null_string_p' ./htlib/regex.c:7425: warning: `common_op_match_null_string_p' was previously implicitly declared to return `int' ./htlib/regex.c:7425: warning: `common_op_match_null_string_p' was declared implicitly `extern' and later `static' ./htlib/regex.c:7513: warning: 
`bcmp_translate' was declared implicitly `extern' and later `static' ./htlib/regex.c: In function `bcmp_translate': ./htlib/regex.c:7515: parse error before `RE_TRANSLATE_TYPE' ./htlib/regex.c:7526: subscripted value is neither array nor pointer ./htlib/regex.c:7526: subscripted value is neither array nor pointer ./htlib/regex.c: In function `re_compile_pattern': ./htlib/regex.c:7549: argument `length' doesn't match prototype /usr/include/rx.h:1737: prototype declaration ./htlib/regex.c: In function `regcomp': ./htlib/regex.c:7698: structure has no member named `used' ./htlib/regex.c:7708: `RE_TRANSLATE_TYPE' undeclared (first use this function) ./htlib/regex.c:7708: parse error before `malloc' ./htlib/regex.c: In function `regfree': ./htlib/regex.c:7898: structure has no member named `used' configure: failed program was: #line 2522 "configure" #include "confdefs.h" #include "./htlib/regex.c" int main() { regex_t re; return regcomp(&re, "ht.*Dig", REG_ICASE); } -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
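A rough sketch of what a run-time check for "problem 2" might look like, assuming the aim is to execute the C library's POSIX regex routines rather than only compile against them. It reuses the `ht.*Dig` pattern and `REG_ICASE` flag from the failed conftest above; the function names `regex_matches` and `regex_smoke_test` and the sample strings are invented for illustration, and this is not the test from any ht://Dig release.

```c
#include <assert.h>
#include <stddef.h>
#include <regex.h>   /* meant to be the system POSIX header, not htlib's copy */

/* Hypothetical helper: returns 1 if `text` matches `pattern`
 * case-insensitively, 0 if it does not, and -1 if the pattern
 * cannot be compiled. */
int regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_ICASE | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}

/* Hypothetical smoke test: returns 0 when the library both compiles
 * and executes the pattern with the expected results, 1 otherwise. */
int regex_smoke_test(void)
{
    if (regex_matches("ht.*Dig", "ht://Dig") != 1)
        return 1;
    if (regex_matches("ht.*Dig", "HTDIG") != 1)
        return 1;
    if (regex_matches("ht.*Dig", "search") != 0)
        return 1;
    return 0;
}
```

In a configure-style test program, `main` would simply return `regex_smoke_test()`, so a wrong answer or a crash both show up as a non-zero exit status. Whether a check this small would trip over the same fault that makes htfuzzy segfault is unknown, since the cause of that segfault has not been identified in this thread.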
From: Geoff H. <ghu...@ws...> - 2002-01-24 20:42:54
|
On Thu, 24 Jan 2002, Gilles Detillieux wrote: > > 1. better configure test for regex problems on BSDi > > 2. way to override "no server" problem > > 3. handle noindex_start & noindex_end as string lists in HTML parser > > 4. a "match all documents" mechanism in htsearch > > 5. a way of specifying relative date ranges in htsearch > > 6. release notes > > 7. merge maindocs updates into htdoc and vice versa > to enhance noindex_start. So, I think all that's missing now is 2, > the rest of 7, and maybe a sample whatsnew.html search form that shows > how to use 4 & 5. If we go the route of an attribute to turn off #2, I can take that in a few minutes. The rest of 7 will take a while, but I'll merge as I go in case anyone else wants to work on that. I still need to hear back from Joe about improvements on #1. > I'd appreciate as many testers as possible for the new 3.1.6 code. > I hope this coming Sunday the snapshot will actually work - it seems as > though nothing happened this past Sunday. Yeah, but since we need a pre-release, I'll do it by hand if things goof up. There are serious problems with running cron jobs involving CVS via SourceForge. I'm hoping they'll fix that soon--right now I have another script that periodically kills hung SSH connections. -Geoff |
From: Gilles D. <gr...@sc...> - 2002-01-24 20:16:13
|
Last week I wrote: > Here's my own checklist: > > 1. better configure test for regex problems on BSDi > 2. way to override "no server" problem > 3. handle noindex_start & noindex_end as string lists in HTML parser > 4. a "match all documents" mechanism in htsearch > 5. a way of specifying relative date ranges in htsearch > 6. release notes > 7. merge maindocs updates into htdoc and vice versa > > I was going to tackle 5 today or maybe tomorrow. I'll see if I can come > up with something for 3 after that. I'd really appreciate it if you did > 1, 4, 6 & 7. I'm not sure about 2, as there isn't an easy fix to make it > work well, so the best that could be done quickly may be just an attribute > to make it revert to the pre-3.1.5 behaviour. OK, so you've done 1, 6 and a part of 7. I've done 4 & 5. I'm thinking we can punt 3 for now, and leave it for 3.2.0b<something>, because I don't have time to work on it, and with the proper handling of <script> and <style> tags in HTML.cc it's probably not that necessary anymore to enhance noindex_start. So, I think all that's missing now is 2, the rest of 7, and maybe a sample whatsnew.html search form that shows how to use 4 & 5. > Some of the testing that's needed most is trying out the last few attributes > I've added, as all I've done so far is make sure the code compiles and runs > with my current set of attributes on my site. So, I haven't really tested > description_meta_tag_names and translate_latin1 at all, nor the new code > for use_doc_date. > > I was also hoping to hear back from Alexander Lebedev about completing the > audit of questionable 3-letter word expansions in english.0 (from L to Z). I e-mailed Alexander twice, and not a peep, so I think we'll leave this file as it is. I don't have time to complete the audit. I'll try to get the whatsnew.html form done tomorrow, and leave 2 & 7 in Geoff's hands. I'd appreciate as many testers as possible for the new 3.1.6 code. 
I hope this coming Sunday the snapshot will actually work - it seems as though nothing happened this past Sunday. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Geoff H. <ghu...@ws...> - 2002-01-24 19:12:46
|
After a few tries at something automatic, I decided to stick to the simple approach for now to improve the regex problems on BSDI and other platforms. One catch is that in 3.1.6, we don't have the host "triplet," so we don't know that it's *-*-bsdi or the like. Attached is a patch for a --with-rx flag to configure which ignores the regex test and won't build the regex.o file. It will also switch HtRegex to use the rxposix header, which should use rx like Joe was suggesting. (Joe, I fixed the problem with the previous patch I sent you. This one will definitely not build regex.o.) For me to make this automatic, I'll need to get a regex test that reliably fails. I can't dig up what causes the segfault with the included regex.o, which would allow me to improve the current test program. I still need to do some minor cleanups on this--it doesn't currently confirm that an rxposix.h header exists and it doesn't handle the case where the user tries --without-rx. But if the approach works, then these are easy fixes. -Geoff |
From: Gilles D. <gr...@sc...> - 2002-01-24 18:31:00
|
According to Geoff Hutchison: > Before I go and try to clean up the changes from the maindocs -> > htdig-3.1.6, I wanted to check on something. In the maindocs, we have the > standard SourceForge logo at the bottom--this is required by SF and is > actually a useful way of gauging # of hits to the htdig websites. > > I think I should leave this off in the htdoc/ version of the files. For > one, I think these files should be completely self-contained and not > require network access to view. For another, I think I'd be upset if I > found out the documentation in a package was "phoning home" to count > accesses. > > Thoughts? Anyone think I should include the SF logo in htdoc/ files? I agree completely that htdoc should be self-contained. If we're going to include the SF logo, then it should be bundled-in and the HTML docs should use local, relative references for any images they include. But, if we have to change all the docs from their maindocs counterparts anyway, may as well leave off the SF logo if we don't need it. It wouldn't bother me at all if we did keep the logo, though, as long as it's bundled. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Geoff H. <ghu...@ws...> - 2002-01-24 18:16:40
|
Before I go and try to clean up the changes from the maindocs -> htdig-3.1.6, I wanted to check on something. In the maindocs, we have the standard SourceForge logo at the bottom--this is required by SF and is actually a useful way of gauging # of hits to the htdig websites. I think I should leave this off in the htdoc/ version of the files. For one, I think these files should be completely self-contained and not require network access to view. For another, I think I'd be upset if I found out the documentation in a package was "phoning home" to count accesses. Thoughts? Anyone think I should include the SF logo in htdoc/ files? -Geoff |
From: Gilles D. <gr...@sc...> - 2002-01-24 01:33:39
|
According to Dan Cutting: > I agree that the patch is not portable enough to be used generally - I was > just offering it for people running something similar to my system which > presumably many are as 3.1.5 is the current stable release (and really, who > uses anything except GNU Linux? ;). Fair enough. I should point out that the official patch archive for htdig is at ftp://ftp.ccsf.org/htdig-patches/, and there has been a words-db.cc-rundig.0 patch in the 3.1.5 subdirectory for quite a while that deals with this problem. I appreciate your efforts and your desire to share your work, but there's not a lot of point in reinventing the wheel. > Anyway, I tried the 3.1.6 development snapshot with LC_LOCALE=C and this > does indeed appear to fix the problem which is great. However, as this is a > dev snapshot, is it actually robust enough to use in a production > environment? When will the stable release of 3.1.6 occur? Yes, in this case it certainly is robust enough - quite a bit more than even 3.1.5, which was probably the most stable release until now. The final release of 3.1.6 should be fairly soon, pending a few more documentation fixes, a few tweaks, and hopefully lots of pre-release testing and feedback. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Dan C. <dan...@so...> - 2002-01-23 23:38:31
|
I agree that the patch is not portable enough to be used generally - I was just offering it for people running something similar to my system which presumably many are as 3.1.5 is the current stable release (and really, who uses anything except GNU Linux? ;). Anyway, I tried the 3.1.6 development snapshot with LC_LOCALE=C and this does indeed appear to fix the problem which is great. However, as this is a dev snapshot, is it actually robust enough to use in a production environment? When will the stable release of 3.1.6 occur? Regards, Dan -----Original Message----- From: Gilles Detillieux To: dan...@so... Cc: htd...@li... Sent: 1/24/02 7:04 AM Subject: Re: [htdig-dev] Sort order of htmerge <snip> OK, there are two known bugs in htmerge at work here, both of which are fixed in the 3.1.6 development snapshot at http://www.htdig.org/files/snapshots/ <snip> There are a couple problems I can think of with this patch, apart from the fact you shouldn't need it with 3.1.6. First of all, it makes htmerge dependent on GNU-style options for sort, which likely won't port well to non-linux systems. Secondly, I'm not completely convinced that if you specify only field 1 as the sort key it will still consistently cluster together the same 'i:' fields for a given word. <snip> Try 3.1.6 and set LC_COLLATE=C before calling htmerge, and see if that doesn't clear up all your merging problems. |
From: Gilles D. <gr...@sc...> - 2002-01-23 20:04:18
|
According to Dan Cutting: > This is my first post and I've only been using Ht://Dig for 2 days, so > please go easy on me. :) > First, a little about my system - I'm running Ht://Dig 3.1.5 on a Red Hat > 7.2 Linux PC. The standard sort command that is installed is 2.0.14. > > I was playing around with 3 databases which all worked fine individually. > However, I wished to merge them into a single database so that I could > search across all 3 at once. Looking through the archive I found a post from > Gilles explaining how to do this with htmerge and the -m switch. I tried > this and all seemed to work beautifully; I was able to search across all > databases. > > However I noticed that some of the documents that SHOULD have shown up in a > particular search did not. They showed up in their individual databases, and > running the htmerge program with full verbosity showed they were being > merged into the conglomerate database. Looking through the db.wordlist file > showed that the word I was searching for, 'support' was not correctly > sorted, as there were also some instances of 'supported' which were mixed in > amongst them. I couldn't work out how this could happen until I tried > playing with the sort command for a while. As one of the posts from the > archive says, sort is assumed to sort across the whole line, but this simply > does not produce the intended effect, at least on my (very common) system. OK, there are two known bugs in htmerge at work here, both of which are fixed in the 3.1.6 development snapshot at http://www.htdig.org/files/snapshots/ The first bug is that in 3.1.5 and below, htmerge -m doesn't write the fields in the merged db.wordlist in the correct order, so a default sort doesn't correctly combine words with the same document ID, so the db.words.db ends up with some incorrect records. This is now fixed. 
The second bug is that with locale-aware sort programs, the accented letters are folded into their unaccented counterparts for determining the sort order, which can cause different words to be interspersed within groups of the same word and that again leads to incorrect db.words.db records. This can be fixed by setting LC_COLLATE=C in the environment before running htmerge, but in 3.1.6 htmerge also takes countermeasures against this if the collation order is wrong. It makes htmerge take longer and it generates a bigger, more fragmented db.words.db than if you set LC_COLLATE=C, but at least the db won't be missing any words. In 3.1.6, rundig also sets LC_COLLATE=C to optimize things. This may not be an issue for you if you only index 7-bit ASCII text and no accented letters. > The reason 'supported' appears above 'support' is because the whole line is > being used as the key and the 'e' in 'supported' comes before the 'i' in the > next field. Yuck! Older versions of sort treated the tab between the word and the 'i' as significant, and tab sorted before the 'e', so words came out in the expected order. This version of sort seems to ignore white space altogether, kind of like a less severe form of the -d option, but by default. I'm appalled that the developers of this program chose to modify the default behaviour in this way, but what can you do? Anyway, setting LC_COLLATE=C fixes this behaviour for the handling of spaces as well as accented letters. In any case, the actual word order doesn't matter so much as having all records with the same word clustered together, and within those clusters having all records with the same 'i:' field together. That's what htmerge expects. I think these conditions will still be met if sort ignores the tabs, but not if it starts treating two different letters as equal. > Below is a patch for htmerge/words.cc that appends a '--key=1,1' parameter > to the sort command in htmerge. This seems to fix the problem. 
Gilles > mentions elsewhere that the intended behaviour is to sort by the first > field, then second, etc. so you may wish to include those parameters also. > No idea if this would work on any systems apart from my own (or even if the > problem exists in different versions, etc). Obviously it would be better if > this parameter were in the Makefile or something and was configured as > necessary by make, but I don't know enough about it. There are a couple problems I can think of with this patch, apart from the fact you shouldn't need it with 3.1.6. First of all, it makes htmerge dependent on GNU-style options for sort, which likely won't port well to non-Linux systems. Secondly, I'm not completely convinced that if you specify only field 1 as the sort key it will still consistently cluster together the same 'i:' fields for a given word. Try 3.1.6 and set LC_COLLATE=C before calling htmerge, and see if that doesn't clear up all your merging problems. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
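To illustrate the ordering issue discussed in this message, here is a small sketch comparing two ways of ordering db.wordlist-style lines: plain byte comparison, which is how sort orders lines under LC_COLLATE=C, and a comparison that skips blanks, which is only a rough stand-in for the locale behaviour Dan ran into. The function names and the sample records are invented for the example, and the real collation rules of any given locale are more involved than this.

```c
#include <assert.h>
#include <string.h>

/* Byte-wise comparison of two whole lines, as in the C locale. */
int cmp_bytes(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b);
}

/* Rough model of a collation that ignores spaces and tabs:
 * blanks are skipped on both sides before each character is compared. */
int cmp_ignoring_blanks(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (;;) {
        while (*a == ' ' || *a == '\t')
            a++;
        while (*b == ' ' || *b == '\t')
            b++;
        if (*a != *b || *a == '\0')
            return (unsigned char)*a - (unsigned char)*b;
        a++;
        b++;
    }
}
```

With two hypothetical records such as `"support\ti:12"` and `"supported\ti:7"`, `cmp_bytes` puts `support` first, because the tab after the word compares lower than the `e` of `supported`. `cmp_ignoring_blanks` puts `supported` first, because once the tab is skipped the `i` of the next field is compared against that `e`. That is the effect described in the quoted text, and it is why a byte-wise collation keeps all records for one word together.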