|
From: Geoff H. <ghu...@ws...> - 2002-11-08 15:57:24
|
On Fri, 8 Nov 2002, Lachlan Andrew wrote:

> Regarding the flags, I can see why it makes sense to store
> the information, but it doesn't need to be as a bit-field.

I do think it makes sense to have a bit field. Remember that we're not just planning a database for HTML documents. Yes, some of the current bits are exclusive, but I can imagine that some XML documents might want combined bits, e.g.:

<foo ...> text to be indexed <bar> more text </bar> </foo>

Yes, some of the current flags could be in a lookup, but some (e.g. FLAG_CAPITAL) are clearly a bitfield. I could also see some situations where FLAG_AUTHOR and FLAG_KEYWORDS are combined, and conceivably the parser should be smart enough to decide if FLAG_LINK_TEXT and FLAG_URL should be combined, e.g.

<a href="http://foo.com/">foo.com</a>

Yes, you might argue these are somewhat contrived. But when we were first planning the database format for 3.2, we considered that arbitrary documents and XML might be included in a "3.2" release with user-defined bits and field-restricted searching.

> can thank Mr Gates for that one... However, it could also
> be treated as "level 3 heading", unless it is already given
> extra weight somehow.

It is not given extra weight currently. Again, the catch would be with field-restricted searches. If we treat things as a level-3 heading or whatever, then we have to block a search at that level, as you'll get more than you asked for.

-Geoff
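The argument for a bit field above can be illustrated with a small sketch. The flag names FLAG_CAPITAL, FLAG_AUTHOR, FLAG_KEYWORDS, FLAG_LINK_TEXT and FLAG_URL come from the message; the numeric values, the WordOccurrence struct and the inField() helper are assumptions made for illustration and are not taken from the ht://Dig source.

```cpp
#include <cstdint>
#include <string>

// Hypothetical bit assignments; the real values in ht://Dig may differ.
constexpr std::uint32_t FLAG_CAPITAL   = 1u << 0;
constexpr std::uint32_t FLAG_KEYWORDS  = 1u << 1;
constexpr std::uint32_t FLAG_AUTHOR    = 1u << 2;
constexpr std::uint32_t FLAG_LINK_TEXT = 1u << 3;
constexpr std::uint32_t FLAG_URL       = 1u << 4;

// Hypothetical record for one indexed word occurrence.
struct WordOccurrence {
    std::string   word;
    std::uint32_t flags;
};

// True when every bit in `field` is set on the occurrence.
// This is the kind of test a field-restricted search would need.
bool inField(const WordOccurrence& w, std::uint32_t field) {
    return (w.flags & field) == field;
}
```

With independent bits, the link text in `<a href="http://foo.com/">foo.com</a>` could carry `FLAG_LINK_TEXT | FLAG_URL` and still match a search restricted to either field alone, which a single mutually exclusive code could not express.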
|
From: Budd, S. <s....@ic...> - 2002-11-08 11:40:15
|
Using Solaris 8, gcc 2.95.2, htdig-3.2.0b4-20021103.

It compiles, digs and searches. Haven't examined results in detail.

-----Original Message-----
From: Geoff Hutchison [mailto:ghu...@ws...]
Sent: Thursday, November 07, 2002 11:40 PM
To: Brian White
Cc: htd...@li...
Subject: Re: [htdig-dev] defaults.xml

On Wed, 6 Nov 2002, Brian White wrote:

> I haven't heard anything back about this since I posted
> my patch - I just wanted to check that it was on the radar.

It's on the radar, but:
a) I'm not sure if it includes the recent changes to defaults.cc from Lachlan.
b) I haven't heard *anything* from anyone about the recent snapshots with a variety of changes, including the zlib change from Neil.

Obviously if we'd like to release 3.2.0b4 soon, I'd like to know if the snapshots seem stable before I toss defaults.xml in there. (Not that I expect it'll cause any breakage, but it's still the way I'd like to do it.)

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
|
From: Gabriele B. <an...@ti...> - 2002-11-08 06:46:06
|
>We also need to remember that there are another dozen or so
>features and attributes from 3.1.6 to be added before
>3.2.0b5 can be released. This means:
>a) We won't be in a "feature freeze" for a while, and more
>importantly
>b) It would be nice to know for certain whether they should
>be added to defaults.cc or defaults.xml :)

I agree. I'll take a look at Brian's utility for converting the .cc file to XML and vice versa, and build the .cc file again with just the two attributes (name, value).

However, we'd need to change the Configuration.[h,cc] files in htlib, if I am not wrong. I volunteer to do that this weekend, if Geoff confirms that this is what needs to be done.

Also, I still have to apply the 'execution time' feature from the 3.1.6 (after release) code (well, I think so; I can't check the code right now).

Let me know, and thanks a lot Brian (and Lachlan).

Ciao
-Gabriele
--
Gabriele Bartolini - Web Programmer - ht://Dig & IWA Member - ht://Check maintainer
Current Location: Prato, Tuscany, Italia
an...@ti... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> 'Since I was born, Italy is worth mentioning not just for Pizza'
|
From: Lachlan A. <lh...@ee...> - 2002-11-08 05:35:58
|
Thank you both for your replies. I had missed the "t->Release()" call in Release()...

Regarding the flags, I can see why it makes sense to store the information, but it doesn't need to be as a bit-field. For example, it is impossible for something to be simultaneously meta data (keyword, description) and a heading. If the "heading" bit is specified, the other two fields could be the level: 1-4. We could save the "plain text" bit by treating "heading level 4" as plain text. This more compact encoding could be an index to a lookup table of the full bit pattern (a la nano-code). This would give complete flexibility in the actual data stored.

Regarding the use of <b> etc. instead of <h1>, I think we can thank Mr Gates for that one... However, it could also be treated as "level 3 heading", unless it is already given extra weight somehow.

However, I realise that this is all very low priority stuff, and if it ain't broke, don't fix it...

Cheers,
Lachlan

On Fri, 8 Nov 2002 16:09, Geoff Hutchison wrote:
> On Thursday, November 7, 2002, at 10:12 PM, Gilles Detillieux wrote:
> > put all headings into one factor was to
> > reduce the number
> > of bits the flag would take by 5. We're going to have to
> > increase this
> > anyway, to accomodate custom fields,
>
> there were 6 slots for headings under 3.1, and it
> seems like a huge waste of bits considering most won't be
> used--even with 3-bit encoding. Some other document
> formats also don't make much distinction between heading
> levels. Do people really think that markup beyond h1, h2
> and h3 occurs? A lot of HTML I see these days uses
> <strong> or <b> or <i> tagging (or worse, <font>).
>
> Keep in mind that every bit we add to the flags adds more
> space to every word.

--
Lachlan Andrew  Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg    CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA      00116K
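The compact encoding proposed above (a heading bit whose two companion bits double as the heading level, expanded through a lookup table) might look roughly like the following. This is only a sketch of the idea: the FULL_* patterns, CODE_HEADING, kExpand, headingCode() and expand() are hypothetical names, and the exact layout of the 3-bit code is an assumption.

```cpp
#include <array>
#include <cstdint>

// Hypothetical "full" bit patterns, one bit per word type.
constexpr std::uint32_t FULL_KEYWORDS    = 1u << 0;
constexpr std::uint32_t FULL_DESCRIPTION = 1u << 1;
constexpr std::uint32_t FULL_H1          = 1u << 2;
constexpr std::uint32_t FULL_H2          = 1u << 3;
constexpr std::uint32_t FULL_H3          = 1u << 4;
constexpr std::uint32_t FULL_PLAIN       = 1u << 5;

// Assumed 3-bit compact code: bit 2 = "heading"; bits 0-1 are the
// keyword/description bits, reused as (level - 1) when bit 2 is set.
constexpr std::uint8_t CODE_HEADING = 0x4;

// Lookup table from compact code to full bit pattern.
const std::array<std::uint32_t, 8> kExpand = {{
    0,                                 // 000: no flags
    FULL_KEYWORDS,                     // 001
    FULL_DESCRIPTION,                  // 010
    FULL_KEYWORDS | FULL_DESCRIPTION,  // 011
    FULL_H1,                           // 100: heading level 1
    FULL_H2,                           // 101: heading level 2
    FULL_H3,                           // 110: heading level 3
    FULL_PLAIN                         // 111: "level 4" treated as plain text
}};

// Compact code for a heading of the given level (clamped to 1..4).
std::uint8_t headingCode(int level) {
    if (level < 1) level = 1;
    if (level > 4) level = 4;
    return static_cast<std::uint8_t>(CODE_HEADING | (level - 1));
}

// Expand a stored compact code into the full bit pattern.
std::uint32_t expand(std::uint8_t code) {
    return kExpand[code & 0x7];
}
```

The stored value stays at 3 bits per word while the table remains free to map each code to any full pattern, which is the flexibility the message describes. It does not by itself address combined flags, which would each need their own table entry.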
|
From: Geoff H. <ghu...@ws...> - 2002-11-08 05:10:06
|
On Thursday, November 7, 2002, at 10:12 PM, Gilles Detillieux wrote:

> I think there are some cases where that's true, but not necessarily in
> all cases, so I don't know how much you can optimize this. E.g., for
> certain keyword tags we allow the form <meta foo="bar">, but the
> configurable keyword names must be of the form
> <meta name="foo" contents="bar">.
> I don't know that we'd want to fully generalize this, but I'm open to
> suggestions/recommendations from others.

Keep in mind that the form <meta name="foo" contents="bar"> is the definitive W3C standard, whereas the other form is an older, deprecated case. I don't see much HTML like this anymore. Whether we want to completely ignore them or not is hard to say.

> I'm sure there'd be a fair bit of discussion about this in the htdig-dev
> archives of 2-3 years ago. I don't think it ever got formally
> documented elsewhere (yet). The reason was to allow "scoring on the fly".

As well, it allows restricting word searches based on the "field" or tags that contain the words.

> The decision to put all headings into one factor was to reduce the
> number of bits the flag would take by 5, so the flags can fit in a
> single byte. We're going to have to increase this anyway, to
> accommodate custom fields, so it might make sense to reintroduce the
> distinction between heading

No, the flags never were supposed to be a single byte. There happen to be 8 bits currently defined, but more than this should be actually stored for custom fields (and ideally to keep the database format identical).

OTOH, there were 6 slots for headings under 3.1, and it seems like a huge waste of bits considering most won't be used--even with 3-bit encoding. Some other document formats also don't make much distinction between heading levels. Do people really think that markup beyond h1, h2 and h3 occurs? A lot of HTML I see these days uses <strong> or <b> or <i> tagging (or worse, <font>).

Keep in mind that every bit we add to the flags adds more space to every word. Right now, I've specified 8 bits, including author and URL text which aren't currently used.

-Geoff
|
From: Gilles D. <gr...@sc...> - 2002-11-08 04:21:31
|
According to Manuel Jesús Aguilera Castro:

> Thanks for your help, but I'm trying to execute the command line using
> '&' and ';' and the output is always the first of several pages. I don't
> know how I can obtain the second, third, etc. pages :(
>
> ----- Original Message -----
> From: Torsten Neuer
...
> On Thursday 07 November 2002 12:16, Manuel Jesús Aguilera Castro wrote:
> > Hello friends! I need your help.
> >
> > I'm using a PHP script that executes a command line like this:
> >
> > htsearch -c /etc/htdig.db2002.conf words=ciudad
> >
> > Why the result page is the same when it executes "htsearch -c
> > /etc/htdig.db2002.conf words=ciudad;page=2" ?
> >
> > Can you explain me the syntax of htsearch querystring in the command line?
> > The "page" parameter seems like doesn't work.
>
> Ht://Dig may not honor the (more or less) new parameter delimiter character
> ";", depending on the version of Ht://Dig you are running. Instead, you
> should use the traditional "&" to separate parameters on the command line.
>
> Since you are using a script to post-process the output of htsearch, the
> difference of using "&" instead of ";" would not be visible to the outside
> world, so your script still can use ";".

Whether you use the new ";" delimiter or the old "&" delimiter, you have to be careful when using either of these from the shell (either in a shell script or from the command line). They are both command delimiters to the shell, so they MUST be quoted if they are to be passed literally to the program being called. E.g.:

    htsearch -c /etc/htdig.db2002.conf "words=ciudad;page=2"

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
|
From: Gilles D. <gr...@sc...> - 2002-11-08 04:12:25
|
According to Lachlan Andrew:

> - What is the difference between Dictionary::Destroy()
> and Dictionary::Release() ?

Dictionary entries associate a particular name or keyword with a pointer to an Object. Generally, but not necessarily always, when adding items to a Dictionary, you allocate a new Object to set the value field. E.g.:

    dict.Add(name, new String(value));

When you delete the dictionary, all the objects in it are deleted too. If you want to empty a dictionary, deleting all objects, then you'd use dict.Destroy(), but if you want to empty the dictionary and keep all the objects around (assuming there are other "live" pointers to these objects), then you can use dict.Release() to release the objects from the dictionary. Once released, the Dictionary itself can be deleted without harming the objects that were once contained in it.

At least, that's my interpretation of the Dictionary class code in 3.1. Whether the code that uses the Dictionary class actually uses this properly, and whether all this is done the same way in 3.2, I couldn't say for sure. I think the changes to the destructors for DictionaryEntry and Dictionary in 3.2 are to avoid excessive recursion, which makes sense.

> - Have the "factor slots" changed at some time? The
> keyword and description slots seem out of sync between
> ExternalParser.cc (slots 10 and 11, respectively) and
> HTML.cc (slots 9 and 10).
>
> (Since I'm rather new to this project, I'm hesitant to
> change any functionality that hasn't been flagged as a bug.)

Yes, they have changed, and yes, ExternalParser.cc is indeed out of sync. It's a bug. Good eye! It should be using slot 9 for keywords and 10 for meta description.

> - Is there something in the test suite to test parsing,
> especially of META tags?

Good question. There are META tags in the HTML test files in test/htdocs/set1, but they don't contain every type of meta tag we'd want to test. Also, it seems the htdig tests in the test directory just make sure htdig finds all the URLs it's supposed to, but there don't seem to be any checks that it finds all the words it's supposed to.

> - It seems that <meta foo="bar"> is usually treated the
> same as <meta name="foo" contents="bar">. Can it always
> be? If so, we could avoid some code duplication (or
> triplication, since much of it is currently in both
> ExternalParser.cc and HTML.cc!).

I think there are some cases where that's true, but not necessarily in all cases, so I don't know how much you can optimize this. E.g., for certain keyword tags we allow the form <meta foo="bar">, but the configurable keyword names must be of the form <meta name="foo" contents="bar">. I don't know that we'd want to fully generalize this, but I'm open to suggestions/recommendations from others.

> - The handling of <meta name="date"...> seems to have
> disappeared from ExternalParser.cc. Has it been moved
> somewhere, been deliberately removed, or just got lost?

Bear in mind that 3.1 and 3.2 have been developed in parallel for almost 3 years now, so some changes in one don't necessarily make it to the other. Sometimes that's deliberate, but sometimes they just fall through the cracks. In this case, the date handling was added in 3.1.6, and that's one of the features that still needs to be forward ported to 3.2. So, it never disappeared from 3.2 - it just hasn't appeared there yet.

> - Where can I read about the reason for changing "factor
> slots" from explicit factors to bit masks? I assume that
> there was a change to the database format which required
> it... It would be really nice to be able to specify
> heading factors depending on the heading level again!

I'm sure there'd be a fair bit of discussion about this in the htdig-dev archives of 2-3 years ago. I don't think it ever got formally documented elsewhere (yet). The reason was to allow "scoring on the fly". In 3.1, the score is calculated by htdig, so if you change the factors, you need to reindex. In 3.2, because the word database needs to keep all instances of each word, for phrase matching, it only makes sense to also keep a flag indicating word type, so that the score calculations for different word types can be deferred to the search phase.

The decision to put all headings into one factor was to reduce the number of bits the flag would take by 5, so the flags can fit in a single byte. We're going to have to increase this anyway, to accommodate custom fields, so it might make sense to reintroduce the distinction between heading levels. Given that a word can't be in more than one heading level at once, this could be encoded in 3 bits, with only a minor complication of the score calculation.

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
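The Destroy()/Release() distinction described in the first answer of this message can be sketched with a toy container. This is not the ht://Dig Dictionary class: ToyDictionary, Text and the live-object counter are hypothetical, and only the ownership behaviour described in the message (Destroy deletes the stored objects, Release forgets them without deleting) is modelled.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Minimal stand-in for the Object base class mentioned in the message.
struct Object {
    virtual ~Object() = default;
};

// A value type that counts live instances so ownership is observable.
struct Text : Object {
    explicit Text(std::string v) : value(std::move(v)) { ++live; }
    ~Text() override { --live; }
    std::string value;
    static int live;
};
int Text::live = 0;

// Toy dictionary that owns the objects added to it, unless released.
class ToyDictionary {
public:
    ToyDictionary() = default;
    ToyDictionary(const ToyDictionary&) = delete;
    ToyDictionary& operator=(const ToyDictionary&) = delete;
    ~ToyDictionary() { Destroy(); }  // deleting the dictionary deletes its objects

    // Takes ownership of obj; an existing entry under the same name is deleted.
    void Add(const std::string& name, Object* obj) {
        auto it = table_.find(name);
        if (it != table_.end()) {
            delete it->second;
            it->second = obj;
        } else {
            table_[name] = obj;
        }
    }

    // Empty the dictionary and delete every stored object.
    void Destroy() {
        for (auto& entry : table_) delete entry.second;
        table_.clear();
    }

    // Empty the dictionary but leave the objects alive for their other owners.
    void Release() { table_.clear(); }

    std::size_t Count() const { return table_.size(); }

private:
    std::map<std::string, Object*> table_;
};
```

After Release(), the caller is responsible for deleting any object it still points to; after Destroy(), any outside pointer to a stored object is dangling.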
|
From: Geoff H. <ghu...@ws...> - 2002-11-08 03:59:06
|
On Thursday, November 7, 2002, at 05:51 AM, Manuel Jesús Aguilera Castro wrote:

> Thanks for your help, but I'm trying to execute the command line
> using '&' and ';' and the output is always the first of several pages.
> I don't know how I can obtain the second, third, etc. pages :(

Set the matches_per_page attribute to something quite high in your config file, or the matchesperpage form variable:

http://www.htdig.org/attrs.html#matches_per_page
http://www.htdig.org/hts_form.html

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
|
From: Lachlan A. <lh...@ee...> - 2002-11-08 03:44:52
|
Greetings,

Most of my changes to defaults.cc haven't made it into CVS yet. (I submitted a different patch before they had been included, and wasn't sure which version I should base it on...) Anyway, I rewrote them as XML before I read that Brian planned to regenerate the XML from the C++ at some stage. I'll post the patch once I get it from home.

We also need to remember that there are another dozen or so features and attributes from 3.1.6 to be added before 3.2.0b5 can be released. This means:
a) We won't be in a "feature freeze" for a while, and more importantly
b) It would be nice to know for certain whether they should be added to defaults.cc or defaults.xml :)

$0.02
Lachlan

--
Lachlan Andrew  Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg    CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA      00116K
|
From: Brian W. <bw...@st...> - 2002-11-08 01:11:27
|
At 10:39 8/11/2002, Geoff Hutchison wrote:
>On Wed, 6 Nov 2002, Brian White wrote:
>
> > I haven't heard anything back about this since I posted
> > my patch - I just wanted to check that it was on the radar.
>
>It's on the radar, but:
>a) I'm not sure if it includes the recent changes to defaults.cc from
>Lachlan.

It probably doesn't - but if you tell me how to get hold of it, I can make sure it does.

>b) I haven't heard *anything* from anyone about the recent snapshots with
>a variety of changes, including the zlib change from Neil.

Well, if it helps, I can automatically create a version of defaults.xml from a version of defaults.cc *very* quickly.

>Obviously if we'd like to release 3.2.0b4 soon, I'd like to know if the
>snapshots seem stable before I toss defaults.xml in there. (Not that I
>expect it'll cause any breakage, but it's still the way I'd like to do
>it.)

Fair enough - I would probably do the same.

If it helps - I have attached my tool for generating defaults.xml from defaults.cc. That should make it easier to transition merged versions.

Ok

Regs
Brian
|
From: Lachlan A. <lh...@ee...> - 2002-11-08 00:37:49
|
Greetings,

I have some questions:

- What is the difference between Dictionary::Destroy() and Dictionary::Release() ?

- Have the "factor slots" changed at some time? The keyword and description slots seem out of sync between ExternalParser.cc (slots 10 and 11, respectively) and HTML.cc (slots 9 and 10).

  (Since I'm rather new to this project, I'm hesitant to change any functionality that hasn't been flagged as a bug.)

- Is there something in the test suite to test parsing, especially of META tags?

- It seems that <meta foo="bar"> is usually treated the same as <meta name="foo" contents="bar">. Can it always be? If so, we could avoid some code duplication (or triplication, since much of it is currently in both ExternalParser.cc and HTML.cc!).

- The handling of <meta name="date"...> seems to have disappeared from ExternalParser.cc. Has it been moved somewhere, been deliberately removed, or just got lost?

- Where can I read about the reason for changing "factor slots" from explicit factors to bit masks? I assume that there was a change to the database format which required it... It would be really nice to be able to specify heading factors depending on the heading level again!

Thanks :)
Lachlan

--
Lachlan Andrew  Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg    CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA      00116K
|
From: Geoff H. <ghu...@ws...> - 2002-11-07 23:42:16
|
On Mon, 4 Nov 2002, Neal Richter wrote:

> > mapping and you're consistently doing comparisons for *every* key,
> > you're fine. The ordering will be different on the compressed strings,
> > but as long as everything is unique, I can't think of a problem.
>
> This is a lot of calls!

Yes. But as I said, I don't remember enough of the Berkeley DB details to know if that's the only method for comparing keys. If we change it, will things stay consistent?

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
|
From: Geoff H. <ghu...@ws...> - 2002-11-07 23:41:19
|
On Wed, 6 Nov 2002, Brian White wrote:

> I haven't heard anything back about this since I posted
> my patch - I just wanted to check that it was on the radar.

It's on the radar, but:
a) I'm not sure if it includes the recent changes to defaults.cc from Lachlan.
b) I haven't heard *anything* from anyone about the recent snapshots with a variety of changes, including the zlib change from Neil.

Obviously if we'd like to release 3.2.0b4 soon, I'd like to know if the snapshots seem stable before I toss defaults.xml in there. (Not that I expect it'll cause any breakage, but it's still the way I'd like to do it.)

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
|
From: Torsten N. <tn...@in...> - 2002-11-07 11:37:52
|
On Thursday 07 November 2002 12:16, Manuel Jesús Aguilera Castro wrote:
> Hello friends! I need your help.
>
> I'm using a PHP script that executes a command line like this:
>
> htsearch -c /etc/htdig.db2002.conf words=ciudad
>
> Why the result page is the same when it executes "htsearch -c
> /etc/htdig.db2002.conf words=ciudad;page=2" ?
>
> Can you explain me the syntax of htsearch querystring in the command line?
> The "page" parameter seems like doesn't work.

Ht://Dig may not honor the (more or less) new parameter delimiter character ";", depending on the version of Ht://Dig you are running. Instead, you should use the traditional "&" to separate parameters on the command line.

Since you are using a script to post-process the output of htsearch, the difference of using "&" instead of ";" would not be visible to the outside world, so your script still can use ";".

hth,
Torsten

--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                Tel: +49-4101-403605
D-25474 Ellerbek                Fax: +49-4101-403606
E-Mail: in...@in...             Internet: http://www.inwise.de
|
From: Lachlan A. <lh...@ee...> - 2002-11-07 00:14:16
|
Thanks for your annotations, Gilles.

There were lots of things not yet on the TODO list -- that preliminary list was just to say "these are the files I've checked through"... The list at <http://www.ee.mu.oz.au/staff/lha/htdig-3.1.6-3.2-ports> should now be roughly complete (including the one in your next post) but no promises.

I bags all the htdig/* changes. (Or "Dibs me the...", depending on which primary school you went to :)

Cheers,
Lachlan

On Thu, 7 Nov 2002 07:26, you wrote:
> Here are my annotations to your TODO list...
> [snip]
>
> Here are additional items, which
> didn't make it above...

--
Lachlan Andrew  Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg    CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA      00116K
|
From: Gilles D. <gr...@sc...> - 2002-11-06 15:59:45
|
I found another change that's still needed in 3.2...

* htsearch/htsearch.cc (main): Fixed to only show file names in error messages when REQUEST_METHOD not set and -v option given, for security.

This is the "filenameok" stuff in 3.1.6's htsearch.cc

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
|
From: Brian W. <bw...@st...> - 2002-11-06 00:14:07
|
I haven't heard anything back about this since I posted my patch - I just wanted to check that it was on the radar.

regs
Brian

-------------------------
Brian White
Step Two Designs Pty Ltd
Knowledge Management Consultancy, SGML & XML
Phone: +612-93197901
Web: http://www.steptwo.com.au/
Email: bw...@st...

Content Management Requirements Toolkit
112 CMS requirements, ready to cut-and-paste
|
From: Gilles D. <gr...@sc...> - 2002-11-05 23:45:33
|
According to Lachlan Andrew:

> Can I suggest that you post the list of changes to be made
> and tell everyone the order you plan to make them? That
> way others of us will be able to work in parallel without
> duplicating effort.
>
> I've found that the ChangeLog entries are sufficiently
> different that it is easier to work with ChangeLog for
> 3.1.6, htdig-3.1.5-3.1.6.diff and the 3.2 source. I've
> been going through and deleting the changes that have
> already been made (or obviated by other changes), and a
> very preliminary list is at
> <http://www.ee.mu.oz.au/staff/lha/htdig-3.1.6-3.2-ports>
> If you have a more comprehensive list, I'd love to see it.

Good work! Thanks for your efforts. Here are my annotations to your TODO list...

TODO:

htcommon/HTML.cc:
    metadatetags, descriptionMatch
-- latter is for description_meta_tag_names attribute
    strlen(skip_start) taken out of loop
-- no longer applicable, as skip_start now String type
    ignore_alt_text
    remove 'which = -1; //What does it do?' on line 948...

htlib/Dictionary.cc:
    add 'e->next = NULL' after line 251
-- I also wonder if a more careful audit of the Dictionary class isn't needed. It seems some changes were made to it in 3.2, which in retrospect may be unnecessary after this fix is implemented, and may lead to memory leaks. (e.g. Release no longer releases anything.) Apart from the addition of "const" keywords in 3.2, these two versions of the class should be the same, but which one is correct?

htsearch/Display.cc:
    displayHTTPheaders
-- for search_results_contenttype attribute
    Remove "ANCHOR"
    save instance of URL for star_patterns and template_patterns
-- I think this bit is unique to 3.1.6 URL handling
    HtURLRewriter
-- part of search_rewrite_rules handling
    max_excerpts attribute
    anchor_target attribute
    relative dates
-- i.e. make startyear et al. handle relative date ranges in Display.cc

htsearch/parser.cc:
    boolean_syntax_errors and boolean_keywords internationalisation
    multimatch_factor (listed below as 'multimatch_method')
-- oops! that typo made it into RELEASE.html too.
-- note: still VERY buggy in 3.1.6 - check
    changes to prefix_match_character, now in QueryLexer.cc
-- related to "list-all" feature below

htsearch/htsearch.cc:
    boolean_keywords internationalisation

Here are additional items, which were mentioned in the list I sent to Jessica in August and Jim in October, but didn't make it above...

htcommon/defaults.cc:
> - add startyear et al. to defaults.cc
-- need full descriptions of these

installdir/english.0:
> - fuzzy endings patch and updated english.0 file
-- fixes to htfuzzy/EndingsDB.cc, htfuzzy/Endings.cc already done, but 2 rounds of changes to english.0 missing

contrib/conv_doc.pl, contrib/doc2html/*:
> - get updated external parser scripts into contrib directory
>   (fix eof handling bug in .pl scripts)
-- if a newer doc2html release is available, we should use it

htsearch/htsearch.cc, htsearch/parser.cc:
> - list-all feature in htsearch for a query of * or prefix_match_character
-- this will be tricky for 3.2, because of the changes in the databases

htdig/Retriever.cc:
> - ignore_dead_servers attribute

... plus the two I said I'd handle...

htcommon/HtSGMLCodec.cc:
> - translate_latin1 attribute, with hooks into SGMLCodec class

htdig/htdig.cc:
> - better handling of htdig -m option

It should also go without saying that any new attribute should also be added and described in defaults.cc in 3.2.

I hope to do a more thorough ChangeLog audit before 3.2.0b5 is released, but this TODO list should give us a lot to chew on in the meantime.

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
|
From: Gilles D. <gr...@sc...> - 2002-11-05 22:23:13
|
According to Martin Mačok:

> On Sat, Oct 26, 2002 at 10:06:12PM -0500, Geoff Hutchison wrote:
> > Instead, I'd suggest using the SourceForge bug tracker for ht://Dig
> > http://sourceforge.net/tracker/?atid=104593&group_id=4593&func=browse
>
> OK, I've tried to avoid it ;-) If this doesn't get resolved soon
> I will submit it.
>
> > >I found it with wwwoffle cache indexing scripts. htdig 3.1.x worked
> > >well but after upgrading to 3.2.0b4-072201 it broke. The cached
> > >pages are under "/search/index" directory and "/index" is
> > >disallowed. You can see that 3.2.0b rejects "/search/index" in
> > >debug output:
> >
> > Yes. I can't see anything in particular that would have solved this
> > in the meantime (which surprises me since I seem to remember this
> > before). For my own benefit, could you confirm that it fails for you
> > on the current snapshot?
>
> Hm, I've got sucking slow and expensive dialup here (Czech Republic,
> monopolistic phone operator ... you know) so I would like to avoid
> downloading extra 2MB ...

I don't think you need to do this. I'm pretty sure, by looking at the code, that it's still a problem in current snapshots. I don't recall any recent changes to this part of things. The problem is that while 3.1 used StringList::Compare, which does an "anchored" match (it doesn't search for a substring like StringList::FindFirst does), the 3.2 code uses HtRegex::match, which does an unanchored match. So, in 3.2 it matches the substring anywhere in the URL, instead of just at the start of the path component of the URL.

> Back to topic - I've got ht://Dig 3.2.0b4-072201 source code here and
> I tried to fix it after some short time of looking at the code. See
> the attachment and review it cause I'm not too familiar with htdig
> code internals, this is just a quick-try-hack, but it seems to be
> working here but not heavily tested though...
>
> By the way, I think that using regular expressions here is a way too
> big hammer for this simple task (i.e. just for testing if one string
> is equal to or just an extension of another). Robots.txt is not
> defined to contain regular expressions but htdig handles disallow
> lines as if they are regexps. Are you sure that won't cause any
> problems if somebody puts some "weird" characters in it?

I believe your patch will fix this problem correctly. But I think you're right about "weird" characters causing problems. The pattern that's built from disallow lines is given to HtRegex::set, not HtRegex::setEscaped, so regular expression meta characters are taken as operational. You're also right about this being too big a hammer for the task. The old way did the job according to the standard, so that's ultimately what we should go back to.

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
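The difference between an anchored and an unanchored match on Disallow lines, as described in this message, can be shown in a few lines. This sketch does not use the ht://Dig StringList or HtRegex classes; startsWith(), isDisallowed() and containsAnywhere() are hypothetical helpers that only contrast the two behaviours.

```cpp
#include <string>
#include <vector>

// Anchored, literal comparison: true when `path` begins with `prefix`.
bool startsWith(const std::string& path, const std::string& prefix) {
    return path.compare(0, prefix.size(), prefix) == 0;
}

// Prefix test against each Disallow rule. Rules are compared as plain
// strings, so characters such as '.' or '*' carry no special meaning.
bool isDisallowed(const std::string& path,
                  const std::vector<std::string>& disallow) {
    for (const auto& rule : disallow) {
        // An empty Disallow value conventionally means "allow everything".
        if (!rule.empty() && startsWith(path, rule)) return true;
    }
    return false;
}

// Unanchored substring test, shown only to reproduce the reported bug.
bool containsAnywhere(const std::string& path,
                      const std::vector<std::string>& disallow) {
    for (const auto& rule : disallow) {
        if (!rule.empty() && path.find(rule) != std::string::npos) return true;
    }
    return false;
}
```

With a rule of "/index", the anchored test leaves "/search/index" crawlable while still blocking "/index.html"; the unanchored test rejects both, which matches the behaviour reported for 3.2.0b4.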
|
From: Neal R. <ne...@ri...> - 2002-11-05 00:25:29
|
> > Do we REALLY need this function to goto the complexity of calling
> > 'Unpack' and comparing the keys? Why not treat the entire key
> > bitstream after the word-string as a binary compare and return?
>
> That depends on how the packing is done. Udi Manber (among others
> probably) outlined various strategies for comparing compressed strings
> w/o unpacking. In this case, I think as long as you have a one-to-one
> mapping and you're consistently doing comparisons for *every* key,
> you're fine. The ordering will be different on the compressed strings,
> but as long as everything is unique, I can't think of a problem.

FYI: For a small dataset of ~630 documents with nearly 1000 word-rows per document added to WordDB:

  713123  0.00  0.00  word_db_cmp(__db_dbt const *, __db_dbt const *)
  713123  0.00  0.00  WordKey::Compare(char const *, int, char const *, int)
  580392  0.00  0.00  WordKey::UnpackNumber(unsigned char const *, int, unsigned int &, int, int)

This is a lot of calls! The number of calls to word_db_cmp won't change, but the others can!

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485 |
|
From: Gilles D. <gr...@sc...> - 2002-11-04 21:45:04
|
According to Gabriele Bartolini:
> as Hans Sandsdalen posted early today, he is encountering a problem
> with base URLs tags: at a first glance, it looks like the 'port' part of
> URL is not considered.
>
> This is the HTML instruction he's got in the document:
>
> <base href="http://www2.spacetec.no:8080/www2/docs/Rutiner/Adm-rutiner/"
>
> and here is what htdig shows:
>
> Tag: base href="http://www2.spacetec.no/www2/docs/Rutiner/Adm-rutiner/"
>
> As you can easily notice, the port info has been lost.
....
> Again I don't know whether this could solve the problem or not (I
> also feel stupid because I didn't ask him immediately the version he's
> using - this bug could already be solved).

As Hans reported, a newer snapshot did solve the problem. I suspect this set of changes from just over a year ago was the cure...

Fri Sep 14 22:12:56 2001  Gilles Detillieux  <gr...@sc...>

	* htcommon/URL.h: Moved DefaultPort() from private to public
	for use in HtHTTP.cc.

Fri Sep 14 09:25:20 2001  Gilles Detillieux  <gr...@sc...>

	* htnet/HtHTTP.cc (SetRequestCommand): Add port to Host: header
	when port is not default, as per RFC2616(14.23). Fixes bug #459969.

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada) |
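The rule behind the second ChangeLog entry can be sketched as follows. This is only an illustration of "add the port when it is not the default"; hostHeader is a hypothetical helper, not the SetRequestCommand code, and the default of 80 is an assumption for plain HTTP.

```cpp
#include <string>

// Hypothetical helper, not the HtHTTP.cc implementation. It builds a Host:
// request header and appends the port only when it differs from the default
// (assumed here to be 80 for plain HTTP).
std::string hostHeader(const std::string& host, int port, int defaultPort = 80)
{
    std::string header = "Host: " + host;
    if (port != defaultPort)
        header += ":" + std::to_string(port);
    return header + "\r\n";
}
```

For the server in this thread, hostHeader("www2.spacetec.no", 8080) yields a header that names port 8080, whereas a request to port 80 carries the bare host name.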
|
From: Geoff H. <ghu...@ws...> - 2002-11-04 14:58:31
|
Sorry, I've been really busy and haven't had much time to comment on this.

On Saturday, November 2, 2002, at 08:29 PM, Gilles Detillieux wrote:
> How much of this database fragmentation would be due to the fact that
> there are records of different lengths, and how much would be due to
> updating a given record from one length to a larger length.

It's the latter.

> this information in memory as it did in 3.1, and then just dumped
> out all the records like above after the whole document is parsed.
> That way, none of the records ever have to be updated and lengthened.

No, words are only sent out once the document is parsed. It's not writing "along the way" if you will. But remember that while the document itself has the vast majority of the words, we still would have to update the records when we're indexing another document and happen to hit a link back to the first one. :-(

> I implemented this type of caching very quickly using the STL with
> slight modifications to the WordDB object. Mifluz contains a
> WordDBCache object, but 3.2 doesn't use it, and it's excessively
> complicated in my opinion.

Basically, it does two things. One, it caches up to a certain amount of data--and this is already used by 3.2 by default. Two, the mifluz code will write out small temporary files along the way. As I wrote the mifluz-merge code, it collapsed the temporary files every 10 documents (and I've checked performance by varying that number). There is a decided performance improvement because the data can be stuffed back into the main DB with better ordering. There's also better performance on the initial writes because you're adding to a temporary file with a smaller lookup penalty.

> Do we REALLY need this function to goto the complexity of calling
> 'Unpack' and comparing the keys? Why not treat the entire key
> bitstream after the word-string as a binary compare and return?

That depends on how the packing is done. Udi Manber (among others probably) outlined various strategies for comparing compressed strings w/o unpacking. In this case, I think as long as you have a one-to-one mapping and you're consistently doing comparisons for *every* key, you're fine. The ordering will be different on the compressed strings, but as long as everything is unique, I can't think of a problem.

-Geoff |
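A minimal sketch of comparing packed keys without unpacking them is given below. The key layout is an assumption made for illustration (a NUL-terminated word followed by fixed-width big-endian fields) and is not the real WordKey format; packKey, appendBigEndian and comparePacked are hypothetical names. With this particular layout a plain byte comparison happens to give the same order as comparing the unpacked numbers; with other packings the order may differ, which is fine as long as it is consistent and keys stay unique.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Append a 32-bit value most-significant byte first, so that byte order and
// numeric order agree.
void appendBigEndian(std::string& out, std::uint32_t value)
{
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<char>((value >> shift) & 0xFF));
}

// Hypothetical key layout: word, NUL terminator, DocID, location.
std::string packKey(const std::string& word, std::uint32_t docId, std::uint32_t location)
{
    std::string key = word;
    key.push_back('\0');
    appendBigEndian(key, docId);
    appendBigEndian(key, location);
    return key;
}

// Compare two packed keys as raw bytes; no field is unpacked.
int comparePacked(const std::string& a, const std::string& b)
{
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    const int c = std::memcmp(a.data(), b.data(), n);
    if (c != 0)
        return c;
    if (a.size() == b.size())
        return 0;
    return a.size() < b.size() ? -1 : 1;
}
```

A comparison function of this shape costs one memcmp per call, so the per-key unpacking calls would disappear from a profile of the comparison path.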
|
From: Gabriele B. <g.b...@co...> - 2002-11-04 09:46:31
|
Ciao guys,

as Hans Sandsdalen posted earlier today, he is encountering a problem with base URL tags: at first glance, it looks like the 'port' part of the URL is not considered.

This is the HTML instruction he's got in the document:

<base href="http://www2.spacetec.no:8080/www2/docs/Rutiner/Adm-rutiner/"

and here is what htdig shows:

Tag: base href="http://www2.spacetec.no/www2/docs/Rutiner/Adm-rutiner/"

As you can easily notice, the port info has been lost.

I had a quick look at the code, and the instructions that concern this are in the HTML.cc file, precisely:

<<<
    case 23:    // base
    {
        if (!attrs["href"].empty())
        {
            URL tempBase(transSGML(attrs["href"]));
            *base = tempBase;
        }
        break;
    }
>>>

The first thing that I noticed is that we are using the assignment operator, which is not defined in the URL class; I know this doesn't mean it is a *bad* thing (it should work with a simple attribute-by-attribute copy), I just feel more relaxed by providing our own. Thus, I changed the code of the URL.h and .cc files in order to manage it.

Again I don't know whether this could solve the problem or not (I also feel stupid because I didn't ask him immediately the version he's using - this bug could already be solved).

Ciao
-Gabriele

--
Gabriele Bartolini - Web Programmer
Comune di Prato - Prato - Tuscany - Italy
g.b...@co... | http://www.comune.prato.it
> find bin/laden -name osama -exec rm {} ; |
|
From: Neal R. <ne...@ri...> - 2002-11-03 23:23:34
|
> How much of this database fragmentation would be due to the fact that
> there are records of different lengths, and how much would be due to
> updating a given record from one length to a larger length.
>
> E.g., if instead of having a whole bunch of entries like this...
>
> word // DocID // flags // location -> anchor
>
> what if we had entries like this...
>
> word // DocID -> flags/location/anchor flags/location/anchor ...

Keeping the location code in the key eliminates duplicate keys, which probably helps BDB a little. The rest can go into the value.

> but instead of making database updates each time another word is parsed
> (as is done now in 3.2, if I'm not mistaken), how about if htdig stored
> this information in memory as it did in 3.1, and then just dumped
> out all the records like above after the whole document is parsed.
> That way, none of the records ever have to be updated and lengthened.
> They're just written once.

I implemented this type of caching very quickly using the STL with slight modifications to the WordDB object. Mifluz contains a WordDBCache object, but 3.2 doesn't use it, and it's excessively complicated in my opinion.

I'm still evaluating the results of this kind of caching, but at first glance it seems to help a lot. I also added a few lines of code to flush this cache every 250 documents, which makes it even faster.

> I think even optimizations like this become easier if we don't dump out
> any of the db.words.db records until a document is fully parsed, and then
> dump them all out at once. Am I wrong? I know that 3.2 is supposed to
> allow indexing a live database on the fly, and still have it be searchable,
> but that doesn't mean the DB needs to be updated a word at a time. Doing
> it a document at a time should make sense, just as db.docdb is updated.

I'll try to submit a patch early this week with the caching added to the WordDB object, flushing the cache every X documents, and improving the efficiency of the word_db_cmp function.

See my previous post on word_db_cmp, which I wrote after having a few beers.. the writing is a bit tangled ;-)

Thanks!

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485 |
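The caching idea can be sketched roughly as follows. This is not the patch mentioned above: WordCache, its methods and the string record format are all hypothetical, and a std::vector stands in for the database. The sketch only shows the shape of the approach, which is to buffer word occurrences in an ordered STL container and write them out in key order every N documents.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a WordDB-side cache; names and record format are
// assumptions made for illustration.
class WordCache
{
public:
    explicit WordCache(std::size_t flushEvery) : flushEvery_(flushEvery) {}

    // Buffer one word occurrence in memory instead of writing it at once.
    void add(const std::string& word, int docId, int location)
    {
        pending_[word].push_back({docId, location});
    }

    // Call once per parsed document; flushes every flushEvery_ documents.
    void endDocument()
    {
        if (++docsSinceFlush_ >= flushEvery_)
            flush();
    }

    // Write all buffered records in word order, then empty the buffer.
    void flush()
    {
        for (const auto& entry : pending_)
            for (const auto& occ : entry.second)
                written_.push_back(entry.first + "/" + std::to_string(occ.docId) +
                                   "/" + std::to_string(occ.location));
        pending_.clear();
        docsSinceFlush_ = 0;
    }

    // Records "written" so far (the vector stands in for the database).
    const std::vector<std::string>& written() const { return written_; }

private:
    struct Occurrence { int docId; int location; };

    std::size_t flushEvery_;
    std::size_t docsSinceFlush_ = 0;
    std::map<std::string, std::vector<Occurrence>> pending_;
    std::vector<std::string> written_;
};
```

Because std::map keeps its keys sorted, each flush hands the records over in word order, which is the kind of ordering that tends to suit a B-tree backed store. A final flush() is still needed after the last document.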
|
From: Lachlan A. <lh...@ee...> - 2002-11-03 22:55:02
|
Greetings Jim,

Can I suggest that you post the list of changes to be made and tell everyone the order you plan to make them? That way others of us will be able to work in parallel without duplicating effort.

I've found that the ChangeLog entries are sufficiently different that it is easier to work with ChangeLog for 3.1.6, htdig-3.1.5-3.1.6.diff and the 3.2 source. I've been going through and deleting the changes that have already been made (or obviated by other changes), and a very preliminary list is at
<http://www.ee.mu.oz.au/staff/lha/htdig-3.1.6-3.2-ports>
If you have a more comprehensive list, I'd love to see it.

Cheers,
Lachlan

> From: Gilles Detillieux <gr...@sc...>
> Subject: Re: 3.2.0b4 release
> To: gre...@yg... (Jim Cole)
> Date: Wed, 30 Oct 2002 14:56:27 -0600 (CST)
>
> Comparing ChangeLog entries between 3.1.6 and the 3.2 cvs
> would be the first step, and would find most of the
> missing stuff. Note that there may be differences in
> wording between the two, especially if someone other than
> me makes the entry in the 3.2 ChangeLog. If you can't
> find anything close in the 3.2 ChangeLog to a given 3.1.6
> entry, then comparing the specific changes in the source
> against the 3.2 source would be the next step. I'd be
> glad to answer any questions you have at that stage,
> including punting a few ChangeLog entries my way for my
> verification.
>
> Potentially all entries since 3.1.5 was released would
> need to be checked. That may seem like a lot, given that
> 3.1.5 was released almost 2 full years before 3.1.6, but
> the CVS tree for 3.1.x was dormant for a long time after
> 3.1.5.

--
Lachlan Andrew  Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg      CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA      00116K |