From: Tod T. <tt...@ch...> - 2001-10-18 12:26:36
|
Quim Sanmarti wrote:
> > My theory was that most new market trends (worth paying attention to) are
> > usually already, or quickly will be, reflected in the open source
> > development community.
>
> Entering deep water -- opinion unreliable from this point :)
> My perception is eventually the inverse. Open-source has been traditionally
> being bound to research and innovation. It's now being used by companies as
> an innovation channel, so that market trends emerge later from there...

Sorry, I wasn't clear on this. I agree completely.

Tod
From: Jamie A. <jam...@sl...> - 2001-10-18 01:54:09
|
Here's another quickie patch to 3.2.x. It's support for a content-type alias
attribute for htdig which sorts out servers that get the content-type wrong in
responses. While getting the server correctly configured is probably a better
solution, this works if the server is maintained by someone else.

usage is something like:

    content_type_aliases: text/plain=text/html

and can be set for specific servers in the server block. Also included is the
patch as an attachment for when the c+p gets mangled. Is the config file entry
in the right format? I've not seen a definitive description of what should be
where, so it's a bit of a guess.

=======================================
diff -rup htdig/htcommon/defaults.cc htdig-patch3/htcommon/defaults.cc
--- htdig/htcommon/defaults.cc  Thu Aug 30 03:43:38 2001
+++ htdig-patch3/htcommon/defaults.cc   Thu Oct 18 14:49:27 2001
@@ -271,6 +271,13 @@ http://www.htdig.org/", " \
         compile time. \
 </p> \
 " }, \
+{ "content_type_aliases", "", \
+        "string list", "htdig", "server", "SLI-special", "Indexing:Where", \
+        "content_type_aliases: text/plain=text/html", " \
+        This attribute tells htdig to use a different parser to that indicated by the content-type \
+        returned by the server. This is occasionally useful for mis-configured servers who server \
+        up dynamic content but don't set the content-type correctly. \
+" }, \
 { "create_image_list", "false", \
         "boolean", "htdig", "", "all", "Extra Output", "create_image_list: yes", " \
         If set to true, a file with all the image URLs that \
@@ -2545,6 +2552,8 @@ form during indexing and translated for
         "string", "all", "", "3.2.0b1", "Extra Output", "wordlist_monitor_output: myfile", " \
         Print monitoring output on file instead of the default stderr. \
 " },
+
+  {0, 0, 0, 0, 0, 0, 0, 0, 0}
 };
Only in htdig-patch3/htcommon: defaults.cc~
diff -rup htdig/htdig/Document.cc htdig-patch3/htdig/Document.cc
--- htdig/htdig/Document.cc     Thu May 17 04:36:44 2001
+++ htdig-patch3/htdig/Document.cc      Thu Oct 18 14:27:37 2001
@@ -629,7 +629,7 @@ Document::RetrieveLocal(HtDateTime date,
 // parsers are external programs that will be used.
 //
 Parsable *
-Document::getParsable()
+Document::getParsable( const String& serverName )
 {
     static HTML *html = 0;
     static Plaintext *plaintext = 0;
@@ -637,6 +637,8 @@ Document::getParsable()
 
     Parsable *parsable = 0;
 
+    ContentTypeAlias( serverName );
+
     if (ExternalParser::canParse(contentType))
     {
         if (externalParser)
@@ -701,4 +703,51 @@ int Document::ShouldWeRetry(Transport::D
         return 1;
 
     return 0;
+}
+
+
+
+void
+Document::ContentTypeAlias( const String& serverName )
+{
+    HtConfiguration* config= HtConfiguration::config();
+    Dictionary content_type_aliases;
+
+    String l;
+    if ( serverName.length() > 0 )
+        l = config->Find("server", serverName, "content_type_aliases");
+    else
+        l = config->Find( "content_type_aliases");
+
+    if ( l.length() == 0 )
+        return;
+
+    String from, *to;
+    char *p = strtok(l, " \t");
+    char *ct_alias= NULL;
+    while (p)
+    {
+        ct_alias = strchr(p, '=');
+        if (! ct_alias )
+        {
+            p = strtok(0, " \t");
+            continue;
+        }
+        *ct_alias++= '\0';
+        from = p;
+        to= new String( ct_alias );
+        content_type_aliases.Add(from.get(), to);
+        // fprintf (stderr, "Alias: %s->%s\n", from.get(), to->get());
+        p = strtok(0, " \t");
+    }
+
+
+    String* new_ct = 0;
+    if ( (new_ct = (String*) content_type_aliases.Find( contentType )) )
+    {
+        if ( debug > 1 )
+            cout << "Translating content type '" << contentType << "' to '" << *new_ct << "'\n";
+        contentType = *new_ct;
+    }
+}
diff -rup htdig/htdig/Retriever.cc htdig-patch3/htdig/Retriever.cc
--- htdig/htdig/Retriever.cc    Tue Oct 16 15:27:16 2001
+++ htdig-patch3/htdig/Retriever.cc     Thu Oct 18 14:28:29 2001
@@ -802,7 +802,7 @@ Retriever::RetrievedDocument(Document &d
         // routines.
         // This will generate the Parsable object as a specific parser
         //
-        Parsable *parsable = doc.getParsable();
+        Parsable *parsable = doc.getParsable( base->host() );
         if (parsable)
             parsable->parse(*this, *base);
         else
=======================================

Jamie Anstice
Search Engineer
S.L.I. Systems
jam...@sl...
ph: 64 961 3262   mobile: 64 21 264 9347
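For anyone who wants the alias logic in isolation, here is a minimal standalone
sketch of what the patch's ContentTypeAlias() does, using std::map and
std::string in place of ht://Dig's Dictionary and String classes; the names and
structure below are illustrative only, not part of the patch.

// Standalone sketch of the alias lookup the patch performs: parse a
// "from=to" list (as in content_type_aliases) and translate a content
// type before a parser is chosen.  Not the patch itself.
#include <iostream>
#include <map>
#include <sstream>
#include <string>

static std::string aliasContentType(const std::string &aliases,
                                    const std::string &contentType)
{
    std::map<std::string, std::string> table;
    std::istringstream in(aliases);
    std::string pair;
    while (in >> pair)                         // entries are whitespace-separated
    {
        std::string::size_type eq = pair.find('=');
        if (eq == std::string::npos)
            continue;                          // malformed entry, skip it
        table[pair.substr(0, eq)] = pair.substr(eq + 1);
    }
    std::map<std::string, std::string>::const_iterator it = table.find(contentType);
    return it != table.end() ? it->second : contentType;
}

int main()
{
    // "text/plain=text/html" asks the indexer to run the HTML parser on
    // documents a misconfigured server labels as plain text.
    std::cout << aliasContentType("text/plain=text/html", "text/plain") << "\n";
    return 0;
}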
From: Geoff H. <ghu...@ws...> - 2001-10-17 21:02:21
|
On Wed, 17 Oct 2001, Gilles Detillieux wrote: > > Are there any particular dos and donts concerning the access of your searcher > > class? Can I use one searcher incarnation for multiple searches? > > > > Any comments appreciated. > > I haven't seen any responses to your question, but I think perhaps either Strange, I thought I included the list on that. So it goes. > Michael Haggerty or Geoff Hutchison should be able to answer. No that's not quite right. Michael was doing some restructuring of htdig, not htsearch. > classes in other programs. With the 3.2.0b4 code as it stands now, I don't > think it's that easy to just strip out what you need. This is correct, though if I actually find "free time," I'm going to try to finish up that htsearch work. -- -Geoff Hutchison Williams Students Online http://wso.williams.edu/ |
From: Geoff H. <ghu...@ws...> - 2001-10-17 21:00:13
|
On Wed, 17 Oct 2001, Gilles Detillieux wrote: > > I've just tagged the files with htdig-3-2-0-b4 > > OK, I'm not that versed in CVS, but doesn't this mean that your commits > will only be seen by people who check out the htdig tree using that tag. Alas, you're both confusing tags and branches slightly. Tagging the files with htdig-3-2-0-b4 isn't good because it's strictly reserved as a label for the 3.2.0b4 RELEASE. Since that hasn't happened yet, you might still change the file before then. On the other hand, tagging files is like sticking a bright yellow label on them in the file cabinet--it doesn't change that they're in the wrong drawer (branch). I think I'll try to move the 3-2-x branch over to the mainline this weekend. -- -Geoff Hutchison Williams Students Online http://wso.williams.edu/ |
From: Gilles D. <gr...@sc...> - 2001-10-17 20:36:03
|
According to Joe R. Jah: > On Wed, 3 Oct 2001, Gilles Detillieux wrote: > > Date: Wed, 3 Oct 2001 09:51:03 -0500 (CDT) > > From: Gilles Detillieux <gr...@sc...> > > To: Joe R. Jah <jj...@cl...> > > Cc: htd...@li... > > Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots > > > > > > > > If you get a chance to run old and new snapshots of htdig with -vvv and > > > > > > compare the outputs, you may be able to track down the source of the > > > > > > different URLs that are parsed in both cases. To do this in a meaningful > > > > > > way, though, you'll need to try a static site, or perhaps a snapshot of > > > > > > your site, so you don't get thrown off in your comparisons by updates > > > > > > to the site between digs. > > > > > > > > > > Yes, I have kept that snapshot for a happy occasion like that;) > > > > > > > > Keep me posted if you get a chance to run this test with both snapshots. > > > > I can't think of any changes to 3.1.6 that would cause it to lose valid > > > > URLs, but it would be good to confirm without a doubt that the lost URLs > > > > on your system are all indeed URLs that should not have been indexed. > > > > > > In the happy hour;))) > > > > It might be best if you're sober when you do this test. ;-) > > The happy hour turned into a couple of unhappy weeks:( > > -r--r--r-- 1 jjah www 24621528 Oct 2 13:20 rundig_vvv.082901 > -r--r--r-- 1 jjah www 20266702 Oct 2 14:15 rundig_vvv.093001 > > I found 82 links from one document with META ROBOT: Noindex tag;) I could > not find an efficient way of hunting down the other 138 links that were > unaccounted for in two 20 meg+ files; however, I must assume that they are > some sort of duplicates;-/ Hmm. Too bad we couldn't get something more definitive. I'm fairly confident that the changes to the HTML parser didn't break anything, but I'd feel much more comfortable if we could explain the missing files you discovered rather than just assuming it's OK. If I recall, there were 88 URLs with doubled slashes that were eliminated in an earlier test, but that still leaves around 50 URLs unaccounted for. If there's any way you can take a snapshot of your site, or a few major subdirectories, and duplicate them somewhere else where they won't get modified, it would be a big help in getting conclusive results. If you index the exact same files with 3.1.5 and 3.1.6, you should be able to diff the output of htdig -vvv from both, and pinpoint exactly where the differences are happening. I know this is asking a lot, but it would be a shame to release 3.1.6 after all the work that's gone into it, only to discover afterward that it introduced a serious bug. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Gilles D. <gr...@sc...> - 2001-10-17 19:58:47
|
According to Dan Langille: > On 12 Oct 2001 at 14:53, Gilles Detillieux wrote: > > According to Dan Langille: > > > I have just committed a fix to the php-wrapper. This may or may not > > > have been a potential exploit. The fix prevents people from including > > > arbitrary HTML or PHP code in their search string. The fix > > > strips such tags from the input string. > > > > OK, but I thought I should point out that you're still committing to > > the main branch of the htdig source tree, which hasn't seen any activity > > since Feb 2000. If you want your wrapper to be included in the 3.2.0b4 > > development, you have to check out the "htdig-3-2-x" branch of the tree > > and commit your changes to that. > > I've just tagged the files with htdig-3-2-0-b4 OK, I'm not that versed in CVS, but doesn't this mean that your commits will only be seen by people who check out the htdig tree using that tag. As far as I know, that tag hasn't been used by anyone else. Is this correct, Geoff, or am I confusing tags and branches? -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Gilles D. <gr...@sc...> - 2001-10-17 19:54:41
|
According to Wolfgang Mueller: > Hi, > I am currently looking into writing a htdig plugin for the GIFT. > > I want people to configure htdig the usual way and (if necessary) to point to > a htdig configuration within the usual GIFT configuration file. > > What I would like to write, is just something short which does not use html > for querying htdig, but which interacts with your internal search functions, > something like a stripped version of the main of htsearch. > > Are there any particular dos and donts concerning the access of your searcher > class? Can I use one searcher incarnation for multiple searches? > > Any comments appreciated. I haven't seen any responses to your question, but I think perhaps either Michael Haggerty or Geoff Hutchison should be able to answer. Both of them are working on, or planning to work on, some pretty major restructuring of the htsearch code, at least in part to make it easier to use the searcher classes in other programs. With the 3.2.0b4 code as it stands now, I don't think it's that easy to just strip out what you need. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Gilles D. <gr...@sc...> - 2001-10-17 19:45:11
|
According to Andrew Daviel: > On Sun, 30 Sep 2001, Geoff Hutchison wrote: > re. logging broken links > > > > I don't really see this as a needed feature to ht://Dig itself since > > you can do this as a script running on top of the htdig output using > > the -s flag. See > > <http://www.htdig.org/files/contrib/scripts/showdead.pl> > > <http://www.htdig.org/files/contrib/scripts/report_missing_pages.pl> > > I played with this briefly. > Htdig listed a broken link (404) on the same server, and a 404 link to > another server. > > It didn't list a 502 (connection refused) or 500 (unknown host). > It also doesn't grab any mail information or owner metadata, so > you'd either have to have a database of sites/URL fragments against > authors, or re-spider the list of referring documents to gather author > information, if you wanted to mail authors or sort the broken list > by author. > My robot produces e.g. http://www.triumf.ca/trsearch/errors2.html > (but I'm too lazy to fix my links - ad...@tr...) Neither htdig 3.1.x nor 3.2.x deal with 500 and 502 error codes right now. Are you indexing through a proxy server? Normally, htdig will detect these two error conditions itself, and set the appropriate internal error status codes to generate the error messages for missing pages, but I don't know what happens when a proxy server detects these error conditions and reports them back to htdig. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Geoff H. <ghu...@ws...> - 2001-10-17 15:11:46
|
On Wed, 17 Oct 2001, Quim Sanmarti wrote: > > That's when I began to wonder if anybody in the htdig > > development community had looked into implementing 'bayesian' > > searching, or if htdig could do 'it', hence my vague post. > > The htdig databases are postitively not prepared to deal with such > techniques, IMHO. They are not intended to. The power of htdig is based in > 'classical' boolean queries. I don't know that I'd call them "classical" anymore, but I'd agree that I'd design a different word database backend if I wanted to do Bayesian queries. But I think you can get pretty high quality results without this--AFAIK, Google doesn't use them and most people see them as the target. > > My theory was that most new market trends (worth paying > > attention to) are usually already, or quickly will be, reflected in the open > > source development community. > My perception is eventually the inverse. Open-source has been traditionally > being bound to research and innovation. It's now being used by companies as > an innovation channel, so that market trends emerge later from there... This depends a lot on the development effort. Certainly gcc has some true innovation and is a great example of getting truly fantastic people together--I doubt you could ever afford to pay for all the development on gcc. In this case, I think parts of ht://Dig could be used for research purposes and I think there are several research-grade algorithms that could be implemented without too much effort (n-gram fuzzy algorithms come to mind). On the other hand, the number of active contributors to the project right now is extremely low and so I think we'd need an infusion of "fresh brains" before this could happen. As Quim pointed out as well, there are other packages which attempted to tackle Bayesian searches. -- -Geoff Hutchison Williams Students Online http://wso.williams.edu/ |
From: Quim S. <qs...@gt...> - 2001-10-17 14:29:42
|
> -----Original Message-----
> From: htd...@li... [mailto:htd...@li...] On behalf of Tod Thomas
> Sent: Wednesday, 17 October 2001 15:23
> To: htd...@li...
> Subject: [htdig-dev] Bayesian Algorithm Part II

[snip]

> A common marketing thread for a number of the vendors was the
> utilization of the 'bayesian algorithm'.

Some vendors, like Autonomy, do use this kind of technique in their search
engines. 'Bayesian', or probabilistic, models are computed from the indexed
documents. The longer the query, the better the results. As with vector-space
similarity measures, this is particularly useful when searching for 'documents
similar to this one' or when classifying documents. Results tend to be poorer
with short queries.

> I have to admit that those products tended to outperform the others,
> particularly when it came to language specific searches. It was almost as
> if the search could understand the concept behind the request, not just
> the words.

In fact, what is evaluated is the 'blend' of the words, i.e. the more or less
roughly estimated probability of finding a given set of words in each
document.

> That's when I began to wonder if anybody in the htdig development
> community had looked into implementing 'bayesian' searching, or if htdig
> could do 'it', hence my vague post.

The htdig databases are positively not prepared to deal with such techniques,
IMHO. They are not intended to. The power of htdig is based on 'classical'
boolean queries.

> My theory was that most new market trends (worth paying attention to) are
> usually already, or quickly will be, reflected in the open source
> development community.

Entering deep water -- opinion unreliable from this point :)
My perception is eventually the inverse. Open source has traditionally been
bound to research and innovation. It's now being used by companies as an
innovation channel, so that market trends emerge later from there...

> I suspected this might be the case with htdig too. Given the individual
> response that I got there is at least some interest so by elaborating a
> little on my original post maybe I can learn more. Comments?

Closely related to the probabilistic stuff is Xapian, a.k.a. Omseek, a.k.a.
Omsee, a.k.a. Open Muscat, an open-source project intended as a probabilistic
search-engine framework. Initially financed by Brightstation, it was left on
its own some time ago and now lives on SourceForge. More on the research side,
there's libbow/rainbow by Andrew McCallum et al. from CMU, including Bayesian
classifiers, vector-space algorithms, and other nice artifacts.

Here at gtd, we're experimenting internally with some new vector-space based
search and classification algorithms. What we have does look quite promising,
but AFAIK it's not going to be open-sourced -- for now.

-- Quim
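As a toy illustration of the probabilistic ranking described above (scoring a
document by the estimated probability that it produces the query words), here
is a small query-likelihood sketch with add-one smoothing. It is not how
Autonomy, Xapian or ht://Dig work internally; every name and constant in it is
an assumption made purely for the example.

// Score a document by the smoothed probability of generating each query
// word; higher (less negative) log-probability means a better match.
#include <cmath>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

static double queryLikelihood(const std::string &doc,
                              const std::vector<std::string> &query)
{
    std::map<std::string, int> tf;             // term frequencies in the document
    int total = 0;
    std::istringstream in(doc);
    std::string w;
    while (in >> w) { ++tf[w]; ++total; }

    double logp = 0.0;
    for (size_t i = 0; i < query.size(); ++i)
    {
        // Add-one smoothing so unseen words don't zero out the whole score;
        // 1000.0 stands in for an assumed vocabulary size.
        double p = (tf[query[i]] + 1.0) / (total + 1000.0);
        logp += std::log(p);
    }
    return logp;
}

int main()
{
    std::vector<std::string> q;
    q.push_back("pizza"); q.push_back("restaurant");
    std::cout << queryLikelihood("pizza restaurant downtown pizza menu", q) << "\n";
    std::cout << queryLikelihood("bridge repair schedule", q) << "\n";
    return 0;
}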
From: Tod T. <tt...@ch...> - 2001-10-17 13:14:38
|
At a helpful list member's prompting, it might help if I present the reason for my original post. My original query was intended to see if anybody was looking into implementing this into htdig. We've just gone through a pretty extensive vendor selection process to find the 'best' search engine to couple with an internal knowledge management initiative. A common marketing thread for a number of the vendors was the utilization of the 'bayesian algorithm'. I have to admit that those products tended to outperform the others, particularly when it came to language specific searches. It was almost as if the search could understand the concept behind the request, not just the words. That's when I began to wonder if anybody in the htdig development community had looked into implementing 'bayesian' searching, or if htdig could do 'it', hence my vague post. My theory was that most new market trends (worth paying attention to) are usually already, or quickly will be, reflected in the open source development community. I suspected this might be the case with htdig too. Given the individual response that I got there is at least some interest so by elaborating a little on my original post maybe I can learn more. Comments? Thanks - Tod |
From: Andrew D. <an...@da...> - 2001-10-16 19:34:44
|
(I've been trying to write this for weeks and keep getting distracted..)

I have been working on a geographic search engine, and as I mentioned in my
earlier "features" message, I am getting fed up with my Perl robot, and am
trying to use htdig. I have been somewhat successful and have a version
without map navigation at http://geotags.com/htdig/

My question is really how much support, if any, I might get from the htdig
community for this, either as mainstream htdig or from anyone else interested
in this kind of thing.

The basic idea is that, given a set of Web pages describing physical objects
such as restaurants, bridges, parks etc., the search engine is capable of
finding the nearest one. This requires metadata on the page explicitly giving
the position being described (though it can sometimes be guessed from things
like US zipcodes in the text), and a search algorithm that can score according
to geographic distance. The search algorithm is not a true geographic one (as
for example "show me all objects completely or partially inside this polygon")
but rather a modification of a text search, e.g. "pizza AND restaurant SORTBY
distance".

The changes required are (so far):

* A database element to store a position for each page (I actually store a
  region code and placename too but don't use them)
* An addition to the HTML parser to get the metadata
* An addition to the CGI parser to get a requested position (map click)
* A weighting algorithm to calculate geographic distance

I also have a config item to essentially force a ROBOTS NONE if there is no
geographic tag on a page, so that I can refrain from indexing untagged pages.
I am also trying to add support for position passed in an experimental HTTP
header, which allows one to dispense with the map and potentially generate
requests based on current position automatically, e.g. using GPS.

This is essentially what is online at http://geotags.com/htdig/, using a fixed
map. I will probably create custom templates and then try to run this version
of htsearch using the original Perl as a wrapper, to enable the map zoom/pan
features, which requires scaling the map clicks according to the currently
displayed map. I could either do the map-click scaling externally, passing
pure Lat/Long to htsearch, or pass the current map extents to htsearch and do
it internally. Doing the whole thing including the map manipulations and
graphical markup is somewhat more complicated, and might require hardwiring
some map-dependent code, unless it could be done in templates. I haven't
really looked at that yet.

-- Andrew Daviel geotags.com etc.
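As a sketch of the distance-weighting step listed above, the following combines
a great-circle (haversine) distance between the query position and a page's geo
tag with a text score. The haversine formula is standard, but the score mapping
and all names here are illustrative guesses, not the actual geotags.com code.

// Great-circle distance plus a simple "nearer is better" score weighting.
#include <cmath>
#include <iostream>

static double haversineKm(double lat1, double lon1, double lat2, double lon2)
{
    const double R = 6371.0;                     // mean Earth radius in km
    const double PI = 3.14159265358979323846;
    const double rad = PI / 180.0;
    double dlat = (lat2 - lat1) * rad;
    double dlon = (lon2 - lon1) * rad;
    double a = std::sin(dlat / 2) * std::sin(dlat / 2) +
               std::cos(lat1 * rad) * std::cos(lat2 * rad) *
               std::sin(dlon / 2) * std::sin(dlon / 2);
    return 2.0 * R * std::asin(std::sqrt(a));
}

// Fold distance into a text-relevance score: halve the score every 50 km
// (the 50 km half-life is an arbitrary choice for the example).
static double geoWeight(double textScore, double distanceKm)
{
    return textScore * std::pow(0.5, distanceKm / 50.0);
}

int main()
{
    double d = haversineKm(49.25, -123.10, 48.43, -123.37);  // Vancouver -> Victoria
    std::cout << d << " km, weighted score " << geoWeight(100.0, d) << "\n";
    return 0;
}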
From: Florian H. <fl...@ha...> - 2001-10-15 16:46:04
|
I just sent this to bugtraq:

On Fri, Oct 12, 2001 at 12:59:13PM -0600, Dave Ahmad wrote:
> On Thu, 11 Oct 2001, bugtraq wrote:
> > http://www.perl.com/search/index.ncsp?sp-q=%3C%69%6D%67%20%73%72%63%3D%68%74%74%70%3A%2F%2F%31%39%39%2E%31%32%35%2E%38%35%2E%34%36%2F%74%69%6D%65%2E%6A%70%67%3E
> Does anyone know which search engine software this is?

I don't know which engine perl.com uses, but if you have the template
parameter WORDS in your templates, htdig 3.1.5 puts the unquoted img-tag into
the result page. Funnily enough, the htdig 3.1.5 on htdig.org encodes the
offending string:

<input type="text" size="30" name="words" value="&lt;img src=http://199.125.85.46/time.jpg&gt;">

while the distributed htdig 3.1.5 (here the Debian version 3.1.5-2) doesn't:

<input type="text" size="30" name="words" value="<img src=http://199.125.85.46/time.jpg>">

(And there is neither a security section on htdig.org nor an email address for
bug reports... so I am crossposting this to htdig-general)

Yours, Florian Hars.
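The difference Florian observes comes down to entity-encoding the search words
before they are substituted into the $(WORDS) template variable. A minimal
sketch of that encoding step, assuming nothing about the actual htdig.org setup
or the eventual patch:

// Entity-encode user input before echoing it into an HTML attribute.
#include <iostream>
#include <string>

static std::string htmlEncode(const std::string &in)
{
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i)
    {
        switch (in[i])
        {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            default:   out += in[i];    break;
        }
    }
    return out;
}

int main()
{
    std::string words = "<img src=http://199.125.85.46/time.jpg>";
    // Safe: the browser shows the tag as text instead of fetching the image.
    std::cout << "<input type=\"text\" size=\"30\" name=\"words\" value=\""
              << htmlEncode(words) << "\">\n";
    return 0;
}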
From: Tod T. <tt...@ch...> - 2001-10-15 11:55:16
|
Has anybody heard of this in relation to search technology and would like to provide some input on the topic? Thanks - Tod |
From: Joe R. J. <jj...@cl...> - 2001-10-14 08:26:39
|
On Wed, 3 Oct 2001, Gilles Detillieux wrote:

> Date: Wed, 3 Oct 2001 09:51:03 -0500 (CDT)
> From: Gilles Detillieux <gr...@sc...>
> To: Joe R. Jah <jj...@cl...>
> Cc: htd...@li...
> Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots
>
> > > > > If you get a chance to run old and new snapshots of htdig with -vvv and
> > > > > compare the outputs, you may be able to track down the source of the
> > > > > different URLs that are parsed in both cases. To do this in a meaningful
> > > > > way, though, you'll need to try a static site, or perhaps a snapshot of
> > > > > your site, so you don't get thrown off in your comparisons by updates
> > > > > to the site between digs.
> > > >
> > > > Yes, I have kept that snapshot for a happy occasion like that;)
> > >
> > > Keep me posted if you get a chance to run this test with both snapshots.
> > > I can't think of any changes to 3.1.6 that would cause it to lose valid
> > > URLs, but it would be good to confirm without a doubt that the lost URLs
> > > on your system are all indeed URLs that should not have been indexed.
> >
> > In the happy hour;)))
>
> It might be best if you're sober when you do this test. ;-)

The happy hour turned into a couple of unhappy weeks:(

-r--r--r--  1 jjah  www  24621528 Oct  2 13:20 rundig_vvv.082901
-r--r--r--  1 jjah  www  20266702 Oct  2 14:15 rundig_vvv.093001

I found 82 links from one document with META ROBOT: Noindex tag;) I could not
find an efficient way of hunting down the other 138 links that were
unaccounted for in two 20 meg+ files; however, I must assume that they are
some sort of duplicates;-/

Regards,

Joe
--
jj...@cl...
From: Geoff H. <ghu...@us...> - 2001-10-14 07:13:46
|
STATUS of ht://Dig branch 3-2-x

RELEASES:
   3.2.0b4: In progress
   3.2.0b3: Released: 22 Feb 2001.
   3.2.0b2: Released: 11 Apr 2000.
   3.2.0b1: Released: 4 Feb 2000.

SHOWSTOPPERS:

KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with wordlist_compress
  set but work fine without wordlist_compress. (the date is definitely stored
  correctly, even with compression on so this must be some sort of weird
  htsearch bug)
* Not all htsearch input parameters are handled properly: PR#648. Use a
  consistant mapping of input -> config -> template for all inputs where it
  makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in the search_algorithms, $(WORDS) is not set
  correctly: PR#650. (The documentation for 3.2.0b1 is updated, but can we fix
  this?)
* META descriptions are somehow added to the database as FLAG_TITLE, not
  FLAG_DESCRIPTION. (PR#859)

PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* MySQL patches to 3.1.x to be forward-ported and cleaned up. (Should really
  only attempt to use SQL for doc_db and related, not word_db)

NEEDED FEATURES:
* Field-restricted searching.
* Return all URLs.
* Handle noindex_start & noindex_end as string lists.
* Handle local_urls through file:// handler, for mime.types support.
* Handle directory redirects in RetrieveLocal.
* Merge with mifluz

TESTING:
* httools programs: (htload a test file, check a few characteristics, htdump
  and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
  argument handling for parser/converter, allowing binary output from an
  external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.

DOCUMENTATION:
* List of supported platforms/compilers is ancient.
* Add thorough documentation on htsearch restrict/exclude behavior (including
  '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
  to template variables. (Relates to PR#648.) Also make sure these config
  attributes are all documented in defaults.cc, even if they're only set by
  input parameters and never in the config file.
* Split attrs.html into categories for faster loading.
* require.html is not updated to list new features and disk space requirements
  of 3.2.x (e.g. phrase searching, regex matching, external parsers and
  transport methods, database compression.)
* TODO.html has not been updated for current TODO list and completions.

OTHER ISSUES:
* Can htsearch actually search while an index is being created? (Does Loic's
  new database code make this work?)
* The code needs a security audit, esp. htsearch
* URL.cc tries to parse malformed URLs (which causes further problems) (It
  should probably just set everything to empty) This relates to PR#348.
From: Dan L. <da...@la...> - 2001-10-13 02:14:16
|
On 12 Oct 2001 at 14:53, Gilles Detillieux wrote: > According to Dan Langille: > > I have just committed a fix to the php-wrapper. This may or may not > > have been a potential exploit. The fix prevents people from including > > arbitrary HTML or PHP code in their search string. The fix > > strips such tags from the input string. > > OK, but I thought I should point out that you're still committing to > the main branch of the htdig source tree, which hasn't seen any activity > since Feb 2000. If you want your wrapper to be included in the 3.2.0b4 > development, you have to check out the "htdig-3-2-x" branch of the tree > and commit your changes to that. I've just tagged the files with htdig-3-2-0-b4 -- Dan Langille The FreeBSD Diary - http://freebsddiary.org/ - practical examples |
From: Gilles D. <gr...@sc...> - 2001-10-12 19:53:10
|
According to Dan Langille: > I have just committed a fix to the php-wrapper. This may or may not > have been a potential exploit. The fix prevents people from including > arbitrary HTML or PHP code in their search string. The fix > strips such tags from the input string. OK, but I thought I should point out that you're still committing to the main branch of the htdig source tree, which hasn't seen any activity since Feb 2000. If you want your wrapper to be included in the 3.2.0b4 development, you have to check out the "htdig-3-2-x" branch of the tree and commit your changes to that. At some point, Geoff is going to fold this branch back into the main branch, but this hasn't happened yet. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Dan L. <da...@la...> - 2001-10-12 18:01:48
|
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I have just committed a fix to the php-wrapper. This may or may not have been
a potential exploit. The fix prevents people from including arbitrary HTML or
PHP code in their search string. The fix strips such tags from the input
string.

To test the exploit, try entering an IMG html tag into your search field, such
as <img src=http://www.htdig.org/htdig_big.gif>. If you see:

    There were no matches for [IMAGE] found on the website.

where [IMAGE] is the htDig image, then you have not patched your system.

- --
Dan Langille
The FreeBSD Diary - http://freebsddiary.org/ - practical examples

-----BEGIN PGP SIGNATURE-----
Version: PGP 6.5.8 -- QDPGP 2.61c
Comment: http://community.wow.net/grt/qdpgp.html

iQA/AwUBO8cv+QoLFxTP+508EQLRdQCg4+FE7xo/NxM+TpvS/0gyT9LYYTYAoOCM
bV1/W/eESdonK1V4rIfoebth
=m89W
-----END PGP SIGNATURE-----
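For illustration, the tag-stripping idea behind the wrapper fix can be sketched
as follows. The committed fix lives in the PHP wrapper, so this C++ stand-in is
only a model of the behaviour, not the actual change:

// Drop everything between '<' and '>' from the search string before it is
// echoed back into the results page.
#include <iostream>
#include <string>

static std::string stripTags(const std::string &in)
{
    std::string out;
    bool inTag = false;
    for (std::string::size_type i = 0; i < in.size(); ++i)
    {
        if (in[i] == '<')       inTag = true;
        else if (in[i] == '>')  inTag = false;
        else if (!inTag)        out += in[i];
    }
    return out;
}

int main()
{
    // Prints "pizza  recipes" -- the img tag is gone before it can be rendered.
    std::cout << stripTags("pizza <img src=http://www.htdig.org/htdig_big.gif> recipes")
              << "\n";
    return 0;
}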
From: Wolfgang M. <Wol...@cu...> - 2001-10-11 12:27:03
|
Hi, I am currently looking into writing a htdig plugin for the GIFT. I want people to configure htdig the usual way and (if necessary) to point to a htdig configuration within the usual GIFT configuration file. What I would like to write, is just something short which does not use html for querying htdig, but which interacts with your internal search functions, something like a stripped version of the main of htsearch. Are there any particular dos and donts concerning the access of your searcher class? Can I use one searcher incarnation for multiple searches? Any comments appreciated. Wolfgang -- Dr. Wolfgang Müller, assistant == teaching assistant Personal |
From: Gilles D. <gr...@sc...> - 2001-10-09 21:29:14
|
According to Thomas Netousek:
> I am running htdig-3.2.0-0.b3.4 from the RedHat linux 7.1 distribution and
> I am indexing documents which have all types of funny characters like e.g.
> single quotes spelled as &#8217;
>
> I have seen other reports about the parser failing for &amp, so I am
> wondering if this could be sort of a similar problem ?
>
> Btw, I am also running htdig-3.1.5 on another machine with translate_...
> set to true and it works like a charm there.

I believe 3.2.0b3 will not translate numeric entities where the number is
larger than 255. 3.1.5 does, but it's a bug, because it only used 8 bit
characters internally, so it only keeps the bottom 8 bits of this number.

Because in 3.2.0b3 the numeric entity isn't converted, the "&#8217;" goes into
the excerpt literally, and its "&" is turned into an "&amp;" entity on output,
so it should display literally as "&#8217;" -- you will see the numeric
entity. Given the 8-bit character set limitations in both 3.1 and 3.2, I think
that 3.2's behaviour is the lesser of two evils when it comes to handling
numeric entities above 255.

--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
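A small sketch of the two behaviours contrasted above, for a decimal entity
such as &#8217; in an 8-bit build. The function and its names are purely
illustrative, not htdig's parser code:

// Decode "&#NNNN;" only when it fits in 8 bits; otherwise either truncate
// to the low 8 bits (the 3.1.5-style bug) or leave the entity text alone
// (the 3.2.0b3 behaviour, so the literal entity survives into the excerpt).
#include <cstdlib>
#include <iostream>
#include <string>

static std::string decodeEntity(const std::string &digits, bool keepLow8Bits)
{
    // Expects the digits of a decimal entity, e.g. "8217" from "&#8217;".
    long code = std::atol(digits.c_str());
    if (code <= 255)
        return std::string(1, static_cast<char>(code));
    if (keepLow8Bits)
        return std::string(1, static_cast<char>(code & 0xff));  // truncation bug
    return "&#" + digits + ";";                  // leave it untranslated
}

int main()
{
    std::cout << decodeEntity("8217", true)  << "\n";  // whatever 8217 & 0xff happens to be
    std::cout << decodeEntity("8217", false) << "\n";  // prints "&#8217;"
    return 0;
}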
From: Geoff H. <ghu...@ws...> - 2001-10-07 21:07:05
|
* Name: ht://Dig (htsearch CGI)

* Versions affected: 3.1.0b2 and more recent, including 3.1.5 and 3.2.0b3

* Vulnerability: (Potential remote exposure. Denial of Service.)

* Details: The htsearch CGI runs as both a CGI and as a command-line program.
  The command-line program accepts -c [filename] to read in an alternate
  configuration file. On the other hand, no filtering is done to stop the CGI
  program from taking command-line arguments, so a remote user can force the
  CGI to stall until it times out (resulting in a DoS) or read in a different
  configuration file. For a remote exposure, a specified configuration file
  would need to be readable via the webserver UID (e.g. via anonymous FTP with
  upload enabled, or Samba; world-readable log files are the possible targets)
  to potentially retrieve files readable by the webserver UID, e.g.

  nothing_found_file: /path/to/the/file/we/steal

* Potential exploit:
  http://your.host/cgi-bin/htsearch?-c/dev/zero
  http://your.host/cgi-bin/htsearch?-c/path/to/my.file

* Fix: Upgrade to current prerelease versions of 3.1.6 or 3.2.0b4, or apply
  attached patches. Prerelease versions are available from
  <http://www.htdig.org/files/snapshots/>
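In spirit, the fix is to stop honouring -c when htsearch runs as a CGI. A
minimal sketch of that idea, assuming CGI mode can be detected via the
REQUEST_METHOD environment variable (the actual patches may do this
differently, and the default path below is hypothetical):

// Ignore -c from the query string when invoked as a CGI; honour it on the
// command line.
#include <cstdlib>
#include <iostream>
#include <string>

static bool runningAsCgi()
{
    // Web servers set REQUEST_METHOD for CGI invocations.
    return std::getenv("REQUEST_METHOD") != 0;
}

int main(int argc, char **argv)
{
    std::string configFile = "/etc/htdig/htdig.conf";   // hypothetical default
    bool cgi = runningAsCgi();

    for (int i = 1; i < argc; ++i)
    {
        std::string arg = argv[i];
        if (arg.compare(0, 2, "-c") == 0)
        {
            if (cgi)
                continue;                        // CGI mode: silently ignore -c
            if (arg.size() > 2)
                configFile = arg.substr(2);      // "-c/path/to/file"
            else if (i + 1 < argc)
                configFile = argv[++i];          // "-c /path/to/file"
        }
    }
    std::cout << "using config: " << configFile << "\n";
    return 0;
}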
From: Geoff H. <ghu...@us...> - 2001-10-07 07:13:44
|
STATUS of ht://Dig branch 3-2-x

RELEASES:
   3.2.0b4: In progress
   3.2.0b3: Released: 22 Feb 2001.
   3.2.0b2: Released: 11 Apr 2000.
   3.2.0b1: Released: 4 Feb 2000.

SHOWSTOPPERS:

KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with wordlist_compress
  set but work fine without wordlist_compress. (the date is definitely stored
  correctly, even with compression on so this must be some sort of weird
  htsearch bug)
* Not all htsearch input parameters are handled properly: PR#648. Use a
  consistant mapping of input -> config -> template for all inputs where it
  makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in the search_algorithms, $(WORDS) is not set
  correctly: PR#650. (The documentation for 3.2.0b1 is updated, but can we fix
  this?)
* META descriptions are somehow added to the database as FLAG_TITLE, not
  FLAG_DESCRIPTION. (PR#859)

PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* MySQL patches to 3.1.x to be forward-ported and cleaned up. (Should really
  only attempt to use SQL for doc_db and related, not word_db)

NEEDED FEATURES:
* Field-restricted searching.
* Return all URLs.
* Handle noindex_start & noindex_end as string lists.
* Handle local_urls through file:// handler, for mime.types support.
* Handle directory redirects in RetrieveLocal.
* Merge with mifluz

TESTING:
* httools programs: (htload a test file, check a few characteristics, htdump
  and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
  argument handling for parser/converter, allowing binary output from an
  external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.

DOCUMENTATION:
* List of supported platforms/compilers is ancient.
* Add thorough documentation on htsearch restrict/exclude behavior (including
  '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
  to template variables. (Relates to PR#648.) Also make sure these config
  attributes are all documented in defaults.cc, even if they're only set by
  input parameters and never in the config file.
* Split attrs.html into categories for faster loading.
* require.html is not updated to list new features and disk space requirements
  of 3.2.x (e.g. phrase searching, regex matching, external parsers and
  transport methods, database compression.)
* TODO.html has not been updated for current TODO list and completions.

OTHER ISSUES:
* Can htsearch actually search while an index is being created? (Does Loic's
  new database code make this work?)
* The code needs a security audit, esp. htsearch
* URL.cc tries to parse malformed URLs (which causes further problems) (It
  should probably just set everything to empty) This relates to PR#348.