From: Geoff H. <ghu...@us...> - 2002-09-22 07:13:50
|
STATUS of ht://Dig branch 3-2-x
RELEASES:
3.2.0b4: In progress
(mifluz merge essentially finished, contact Geoff for patch to test)
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.
SHOWSTOPPERS:
KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with
wordlist_compress set, but working fine without wordlist_compress.
(The date is definitely stored correctly, even with compression on,
so this must be some sort of weird htsearch bug.)
* Not all htsearch input parameters are handled properly: PR#648. Use a
consistent mapping of input -> config -> template for all inputs where
it makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in the search_algorithms, $(WORDS) is not set
correctly: PR#650. (The documentation for 3.2.0b1 is updated, but can
we fix this?)
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#859)
PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
NEEDED FEATURES:
* Field-restricted searching.
* Return all URLs.
* Handle noindex_start & noindex_end as string lists.
TESTING:
* httools programs:
(htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
argument handling for parser/converter, allowing binary output from an
external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.
DOCUMENTATION:
* List of supported platforms/compilers is ancient.
* Add thorough documentation on htsearch restrict/exclude behavior
(including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#648.) Also make sure these config
attributes are all documented in defaults.cc, even if they're only set by
input parameters and never in the config file.
* Split attrs.html into categories for faster loading.
* require.html is not updated to list new features and disk space
requirements of 3.2.x (e.g. phrase searching, regex matching,
external parsers and transport methods, database compression.)
* TODO.html has not been updated for current TODO list and completions.
OTHER ISSUES:
* Can htsearch actually search while an index is being created?
(Does Loic's new database code make this work?)
* The code needs a security audit, esp. htsearch
* URL.cc tries to parse malformed URLs (which causes further problems)
(It should probably just set everything to empty) This relates to
PR#348.
|
|
From: Gilles D. <gr...@sc...> - 2002-09-21 02:45:29
|
Hi again, folks. Another bug I discovered in htdig while I was
experimenting with different approaches to indexing the Geocrawler
archives was that its removal of double slashes in URL::normalizePath()
may cause problems. Here's the code in question:
//
// Furthermore, get rid of "//". This could also cause loops
//
while ((i = _path.indexOf("//")) >= 0 && i < pathend)
{
String newPath;
newPath << _path.sub(0, i).get();
newPath << _path.sub(i + 1).get();
_path = newPath;
pathend = _path.indexOf('?');
if (pathend < 0)
pathend = _path.length();
}
The problem with this is that it assumes the path refers to a standard
hierarchical filesystem where a null path component is taken as the same
as a ".". That assumption can break down when the path is processed by a
script rather than by the filesystem. (There was a rant about a similar
bug in URL handling in Office XP in Woody's Office Watch some time ago,
just before Office XP's release.) The easy fix I can think of would be
to prefix the while loop above with this:
if (config.Boolean("remove_double_slash", 1))
and set that attribute to true by default in htcommon/defaults.cc.
Setting it to false in your htdig.conf would turn off this feature when
it causes problems. It's still a good feature to have in most cases of
normalizing conventional filesystem paths. The other approach I thought
of, which would take more work, would be to have a StringList attribute
called remove_url_path_part or something of the sort, which would define
all the substrings to be stripped out. The complication is that not all
the stuff that URL::normalizePath() strips out is simple substrings. I'm
leaning to the simpler fix. Thoughts?
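Purely as an illustration of that simpler fix, the guarded loop might end
up looking like this (assuming the proposed remove_double_slash attribute,
with its default added to htcommon/defaults.cc as described above):

//
// Get rid of "//" only if the (proposed) remove_double_slash
// attribute is set; some scripts treat a null path component
// as significant, so this needs to be possible to turn off.
//
if (config.Boolean("remove_double_slash", 1))
{
    while ((i = _path.indexOf("//")) >= 0 && i < pathend)
    {
        String newPath;
        newPath << _path.sub(0, i).get();
        newPath << _path.sub(i + 1).get();
        _path = newPath;
        pathend = _path.indexOf('?');
        if (pathend < 0)
            pathend = _path.length();
    }
}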
The way I stumbled into this problem with Geocrawler was when I used a
start_url like
http://www.geocrawler.com/archives/3/8822/2002/8/
to index the Aug 2002 htdig-general archives. Normally, the path component
after the month is a starting document number for a page of 50 messages in
reverse chronological order. E.g., for August,
http://www.geocrawler.com/archives/3/8822/2002/8/0/
is the last 50 messages of the month,
http://www.geocrawler.com/archives/3/8822/2002/8/50/
is the next 50, and so on, leading to URLs for the individual messages
like this:
http://www.geocrawler.com/archives/3/8822/2002/8/100/9269993/
But if you omit the starting document number, as in the first URL above,
Geocrawler generates URLs for messages with a null starting number, as
http://www.geocrawler.com/archives/3/8822/2002/8//9269993/
which work fine until htdig removes one of the slashes, at which point
Geocrawler tries a non-existent starting number, rather than recognising
the last number as a message ID (because the position in the path is
wrong), so you'd end up indexing a lot of "No Results Found" pages.
The workaround in this case was easy enough - I just used a starting
number of 0/ at the end of the start_url, and all was good. I also
used url_rewrite_rules (and later an external converter script which
also had other "cleanups") to normalize the starting number in all
message URLs to 0, so that you wouldn't get the same message indexed
under multiple start numbers. However, not all such situations will
have such an easy workaround.
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
|
|
From: Gilles D. <gr...@sc...> - 2002-09-20 22:54:36
|
Hi, folks. I've been giving some thought to how htmerge -m works when it
finds the same URL in both databases. Right now, it just tosses out the
older docdb record and keeps the newer one. However, it occurs to me that
this could cause a loss of information that's collected during a full dig,
if you then merge in a more recent partial dig. A full dig would likely
harvest more link descriptions for a given URL, and a higher backlink
count, than would a partial dig. So, if the record from the partial dig is
more recent, it will clobber the more complete information from the
corresponding record in the full dig.

It seems to me that htmerge should look at both DocumentRef records and
take the higher backlink count, as well as combining all the link
description text (weeding out duplicates, presumably). I guess it would
then also need to generate new wordlist entries for any new description
words for the new DocID. Does this make sense?

This occurred to me as I was thinking about how htdig handles HTTP
redirects. In that case, it transfers all the old pre-redirect
descriptions to the new redirected URL's DocumentRef. It also takes the
smaller of the two hop counts, but it doesn't take the larger of the two
backlink counts, which strikes me as a bit of a bug there too. Am I wrong?
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
|
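To make the proposed merge rule concrete, here is a small self-contained
model of it; the DocInfo struct below is only a stand-in for illustration,
not the real DocumentRef class in htcommon/DocumentRef.cc:

#include <set>
#include <string>
#include <algorithm>

// Stand-in for the fields of interest; the real DocumentRef carries far
// more state and has a different interface.
struct DocInfo
{
    int backlink_count;
    int hop_count;
    std::set<std::string> descriptions;   // link description text
};

// Proposed htmerge behavior: instead of discarding the older record,
// keep the richer information from both.
static void
merge_doc_info(DocInfo &newer, const DocInfo &older)
{
    // Take the higher backlink count (a full dig usually saw more links).
    newer.backlink_count = std::max(newer.backlink_count, older.backlink_count);

    // Take the smaller hop count, as the redirect-handling code already does.
    newer.hop_count = std::min(newer.hop_count, older.hop_count);

    // Combine link descriptions, weeding out duplicates; any genuinely new
    // description words would still need wordlist entries generated for
    // the surviving DocID.
    newer.descriptions.insert(older.descriptions.begin(), older.descriptions.end());
}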
|
From: Geoff H. <ghu...@ws...> - 2002-09-20 01:41:08
|
On Thu, 19 Sep 2002, Neal Richter wrote:
> We would also be able to avoid any dynamic resizing of the LOCATION
> Value-field in BDB by making it a fixed width.
>
> Ex: Let's say this LOCATION-value is 'Full' @ 32 characters. Further
> locations of 'affect' in doc 400 get new rows

As I said before, this is probably a good idea. But it's going to take
some work to get the right balance. Since the keys need to be unique, you
have to introduce at least some "padding" by putting a row field in the
key:

word // doc id // row

OK, now you have a fixed-length record for:

(location // field // anchor) (location // field // anchor) ...

So the trick will be to find:
a) A short field length for "row" to minimize overhead.
b) The "right" fixed-length record for a "row."

(a) is partially offset by the reduction in the BDB control structures if
you cut down on the number of keys. But you don't want to make it too
small, since you don't know how many words will be in long documents.
Fortunately, we can make guesses. E.g., I just did counting from Project
Gutenberg's text versions of:

"Adventures of Huckleberry Finn" (563KB)
  most frequent: "and" 6138 times
"20,000 Leagues Under the Sea" (567KB)
  most frequent: "the" 7469 times
King James Bible (Old & New Test.) (4240KB)
  most freq.: "the" 62162 times, "and" 38611 times, "of" 34506 times

(b) is trickier. For short rows, you'll waste space since you've reserved
this record you aren't using. But the shorter it is, the fewer keys you
can condense. So the question is how often we'd waste space (and how much)
on short rows, versus how much we (might) regain from BDB control
structures. Experimentation will be needed.

Neal, do you think we can actually save bits across just the key/record
pair? It seems like you'll need to add bits for the row location.

-Geoff
|
|
From: Geoff H. <ghu...@ws...> - 2002-09-20 00:46:56
|
On Thu, 19 Sep 2002, Gilles Detillieux wrote:
> Sounds like an excellent idea to me. I'm rather surprised they didn't do
> that in mifluz already (or is something like this in the newer code?).

OK, so to make this clear, there's a difference between mifluz (which is
more backend) and the way we use it. We can set whatever key specification
we want to mifluz. So we currently use a key specification like:

word // doc id // location // flags

Now we should also remember that Loic was essentially the *only* mifluz
developer.

> I'm kind of being conservative. Based on the total lack of
> recent progress in mifluz, the very quiet mailing list, and the much
> smaller user base I have some worries about just how good the new mifluz
> code is.

Keep in mind that the version of htword/mifluz we're using is 0.13. I know
for *certain* there are bugs in it, and reading the ChangeLog for mifluz,
I know they're not just in the compression code. Keep in mind that there
are some fairly decent regression tests for mifluz. I don't see inactivity
on development as necessarily indicative of instability! (Otherwise, I'd
really worry about bugs whenever I use LaTeX.)

The problems with the merge were due to differences in the mifluz
*interfaces* (and my lack of free time to do the merge) as well as Loic's
disappearance. I haven't had to make significant changes to the mifluz
code, which does pass regression tests. I performed the merge outside the
CVS tree so that we _can_ get testing of reliability. But this is
completely distinct from how to handle _our_ key/record implementation.

-Geoff
|
|
From: Neal R. <ne...@ri...> - 2002-09-19 23:22:01
|
On Thu, 19 Sep 2002, Gilles Detillieux wrote:
> According to Neal Richter:
> > 1. Add a new config verb to let users use zlib WordDB-page compression.
> > This would be an option to let users who run into this error:
> >
> > FATAL ERROR:Compressor::get_vals invalid comptype
> > FATAL ERROR at file:WordBitCompress.cc line:827 !!!
> >
> > If you look into the db/mp_cmpr.c code (Loic's Compressed BDB page code)
> > you'll find these two functions:
> > CDB___memp_cmpr_inflate(..)
> > CDB___memp_cmpr_deflate(...)
> ...
> > Merging Loic's latest mifluz is supposed to fix this problem (Geoff
> > and I have been working on this), but so far the merge is fairly complex
> > and needs much more work and long term testing. This is a decent
> > solution.
>
> Sounds reasonable as an interim solution. I wonder, though, if
> it wouldn't be a quicker/easier fix to backport just the inflate and
> deflate code from the latest mifluz package to the existing 3.2.0b4 code.
> Would that fix this particular problem without all the headaches of
> merging in all the latest mifluz code?

I tried to do just that (independent of Geoff). Unfortunately Loic
basically reimplemented and restructured so much code that it's very hard
to divide the merge.

> > 2. The inverted index is not very efficient in general.
> >
> > The current scheme:
> >
> > WORD DOCID LOCATION
> > affect 323 43
> > affect 323 53
> ...
> > A more efficient inverted system
> >
> > affect 323 43, 53
> ...
> > If the fixed width Location field was around 256 characters, this would
> > allow roughly 40-50 1,2,3 & 4 digit location codes... likely meaning that
> > the vast majority of the time a second row is not needed. For large
> > documents, this would change but still be much more efficient.
> >
> > Eh? Feedback?
>
> Sounds like an excellent idea to me. I'm rather surprised they didn't do
> that in mifluz already (or is something like this in the newer code?).
> This does mean a deviation from the mifluz code base, but it seems
> that's inevitable anyway, given the efforts to crowbar the latest code
> into ht://Dig, and the lack of support from the mifluz developers.

I'm kind of being conservative. Based on the total lack of recent progress
in mifluz, the very quiet mailing list, and the much smaller user base, I
have some worries about just how good the new mifluz code is. I would like
to see parallel development for a while till we get the mifluz-merge tree
VERY solid. Maybe even finish 3.2 without the merge and start the merge in
3.3?

> I guess it also means making the change twice - once in the current
> ht://Dig code and again after the mifluz code merge. Or is all this
> at a level that can be done with minimal changes after the merge?

Probably both, but the second port should be pretty straightforward.

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: Geoff H. <ghu...@ws...> - 2002-09-19 23:14:09
|
On Thu, 19 Sep 2002, Neal Richter wrote:
> Merging Loic's latest mifluz is supposed to fix this problem (Geoff
> and I have been working on this), but so far the merge is fairly complex
> and needs much more work and long term testing. This is a decent
> interim solution.

Obviously I'm more concerned with the mifluz merge and figuring out the
lousy performance. But if you've seen that switching to zlib or the newer
codec seems to solve the database bugs, then I'm happy with this as an
interim solution. We could use this for a 3.2.0b4 (which we need) and then
work on the mifluz merge for 3.2.0b5.

> WORD DOCID LOCATION
> affect 323 43

So first off, I should point out that it's not quite as bad as this. Loic
and I worked on "key compression," which means that the database doesn't
actually store multiple keys when they're only slightly different. There's
also a rationale behind this system--it was faster to keep all these keys
than to change the length of the records:

> affect 323 43, 53
> affect 336 14, 148, 155

> Value-field in BDB by making it a fixed width.
>
> Ex: Let's say this LOCATION-value is 'Full' @ 32 characters. Further
> locations of 'affect' in doc 400 get new rows

OK, so having a fixed width and multiple rows may be a reasonable idea,
but your description isn't very workable. For one, the keys need to be
unique. So you'd want something like:

Key: WORD DOCID ROW
Record: Location/Flags/Anchor designation list

The key would be to come up with a compact binary representation of these.
Using characters to store integers is a bit inefficient. :-) More on that
later, perhaps.

-Geoff
|
|
From: Gilles D. <gr...@sc...> - 2002-09-19 21:19:27
|
According to Neal Richter:
> 1. Add a new config verb to let users use zlib WordDB-page compression.
> This would be an option to let users who run into this error:
>
> FATAL ERROR:Compressor::get_vals invalid comptype
> FATAL ERROR at file:WordBitCompress.cc line:827 !!!
>
> If you look into the db/mp_cmpr.c code (Loic's Compressed BDB page code)
> you'll find these two functions:
> CDB___memp_cmpr_inflate(..)
> CDB___memp_cmpr_deflate(...)
...
> Merging Loic's latest mifluz is supposed to fix this problem (Geoff
> and I have been working on this), but so far the merge is fairly complex
> and needs much more work and long term testing. This is a decent
> solution.

Sounds reasonable as an interim solution. I wonder, though, if it wouldn't
be a quicker/easier fix to backport just the inflate and deflate code from
the latest mifluz package to the existing 3.2.0b4 code. Would that fix
this particular problem without all the headaches of merging in all the
latest mifluz code?

> 2. The inverted index is not very efficient in general.
>
> The current scheme:
>
> WORD DOCID LOCATION
> affect 323 43
> affect 323 53
...
> A more efficient inverted system
>
> affect 323 43, 53
...
> If the fixed width Location field was around 256 characters, this would
> allow roughly 40-50 1,2,3 & 4 digit location codes... likely meaning that
> the vast majority of the time a second row is not needed. For large
> documents, this would change but still be much more efficient.
>
> Eh? Feedback?

Sounds like an excellent idea to me. I'm rather surprised they didn't do
that in mifluz already (or is something like this in the newer code?).
This does mean a deviation from the mifluz code base, but it seems that's
inevitable anyway, given the efforts to crowbar the latest code into
ht://Dig, and the lack of support from the mifluz developers.

I guess it also means making the change twice - once in the current
ht://Dig code and again after the mifluz code merge. Or is all this at a
level that can be done with minimal changes after the merge?

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
|
|
From: Neal R. <ne...@ri...> - 2002-09-19 20:17:57
|
Hey all,
I've got two proposals here for the WordDB:
1. Add a new config verb to let users use zlib WordDB-page compression.
This would be an option to let users who run into this error:
FATAL ERROR:Compressor::get_vals invalid comptype
FATAL ERROR at file:WordBitCompress.cc line:827 !!!
If you look into the db/mp_cmpr.c code (Loic's Compressed BDB page code)
you'll find these two functions:
CDB___memp_cmpr_inflate(..)
CDB___memp_cmpr_deflate(...)
They are drop-in zlib-based replacements for the
(*cmpr_info->uncompress) & (*cmpr_info->compress) function-pointer calls.
Yes, the compression isn't as good as the ad-hoc bit-stream compression
in WordDBCompress, WordBitCompress, and WordDBPage. The advantage is that
it's fairly bulletproof (zlib) and better than turning off WordDB
compression altogether with the 'wordlist_compress' config verb.
Merging Loic's latest mifluz is supposed to fix this problem (Geoff
and I have been working on this), but so far the merge is fairly complex
and needs much more work and long term testing. This is a decent
solution.
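For illustration only, the zlib calls involved look roughly like the
sketch below; the buffer handling and the actual hook signatures in
db/mp_cmpr.c differ, so treat this strictly as a sketch of the idea:

#include <zlib.h>

// Sketch: zlib-based page compression of the kind the
// CDB___memp_cmpr_inflate/deflate replacements provide.
static int
page_deflate(const unsigned char *page, unsigned long page_len,
             unsigned char *out, unsigned long *out_len)
{
    // *out_len must hold the capacity of 'out' on entry; compress2()
    // updates it to the compressed size.
    return compress2(out, out_len, page, page_len, Z_BEST_SPEED) == Z_OK ? 0 : -1;
}

static int
page_inflate(const unsigned char *comp, unsigned long comp_len,
             unsigned char *out, unsigned long *out_len)
{
    // *out_len must hold the (known) uncompressed page size on entry.
    return uncompress(out, out_len, comp, comp_len) == Z_OK ? 0 : -1;
}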
2. The inverted index is not very efficient in general.
The current scheme:
WORD DOCID LOCATION
affect 323 43
affect 323 53
affect 336 14
affect 336 148
affect 336 155
affect 351 43
affect 358 370
affect 399 51
affect 400 10
affect 400 86
affect 400 95
affect 400 139
affect 400 215
affect 400 222
affect 400 229
A more efficient inverted system
affect 323 43, 53
affect 336 14, 148, 155
affect 351 43
affect 358 370
affect 399 51
affect 400 10, 86, 95, 139, 215, 222, 229
We would need to augment the WordDB and associated classes to support the
value parsing..
We would also be able to avoid any dynamic resizing of the LOCATION
Value-field in BDB by making it a fixed width.
Ex: Let's say this LOCATION-value is 'Full' @ 32 characters. Further
locations of 'affect' in doc 400 get new rows
affect 400 10, 86, 95, 139, 215, 222, 229
affect 400 300, 322, 395, 439, 516
The objects would keep track of the field lengths and create new rows as
needed.
If the fixed width Location field was around 256 characters, this would
allow roughly 40-50 1,2,3 & 4 digit location codes... likely meaning that
the vast majority of the time a second row is not needed. For large
documents, this would change but still be much more efficient.
Eh? Feedback?
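To make the fixed-width idea concrete, one possible packed key/record
layout might look like the sketch below; the field names and widths are
illustrative guesses, not anything in the ht://Dig or mifluz code:

#include <stdint.h>

// Sketch of a fixed-width key/record pair for the proposed scheme.
// A "row" field keeps keys unique when one document needs more than
// one record's worth of locations.
struct WordKey
{
    char     word[24];   // word text, NUL-padded
    uint32_t docid;      // document ID
    uint16_t row;        // 0, 1, 2, ... for overflow rows
};

struct WordRecord
{
    uint8_t  count;          // how many location slots are in use
    uint32_t location[12];   // fixed number of location slots per row
    uint8_t  flags[12];      // per-location flags (title, heading, ...)
};

With a dozen slots per row, most words in a typical document would fit in
a single record, and only very frequent words in long documents would
spill into extra rows.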
EXTRA NOTE: Memory leak detection:
I also wanted to make the developers aware (if you aren't already) of
Valgrind. It's a nice open-source memory error checking tool.
In general you use it like this:
valgrind htdig xxx xxx xxx
It seems to be pretty comparable to both Purify and Insure. It's not
going to get you as good a result as compile- or link-time code
instrumentation, but it's better than nothing at all. Interestingly,
Insure ships a program called 'Chaperon' that you use in the same way as
Valgrind on debug binaries. I haven't looked into it in detail, but my
guess is that both build on the native memory debugging facilities of
glibc.
Valgrind Home Page
http://developer.kde.org/~sewardj/
KDE GUI frontend to Valgrind
http://www.weidendorfers.de/kcachegrind/index.html
--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: Geoff H. <ghu...@ws...> - 2002-09-18 15:11:20
|
On Thu, 12 Sep 2002, Brian White wrote:
> Well, all except the "until they have granularity" bit -
> what does that mean?

In theory, we can do multi-write to the databases by locking *parts* of
the database, rather than the full file. This would be through the
Berkeley DB code and would obviously make file locking obsolete.

> 1) What do we do about "dangling" locks? ( where
> lock files get left behind due to a crash or some
> other kind of bug)

I think we'd want some sort of timeout. Using PIDs wouldn't help the
typical problem, which is running multiple programs on the same file.
Yes, out-of-sync clocks are a problem, but if we pick a long enough
timeout and document the behavior (so sysadmins can kill the lockfiles
if there are problems) it should be OK.

> 2) If locks have to be valid across machines, the
> TMPDIR isn't going to work unless it is explicitly
> put into a common area?

True. Of course the rundig scripts typically set TMPDIR to the database
directory, but you're right that we can't count on this.

> 3) Also - do locks need to work between different
> programs? What if they have different
> permissions
> Another solution is to use the process id - but that

No, I don't think we want locks to be tied to the PID. The same PID is
already prevented from opening the same file twice--you'd need to have
two copies of DocumentDB, etc.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
|
|
From: Stefan S. <Tal...@in...> - 2002-09-18 08:54:24
|
On 18.09.2002 1:33 Uhr, Jim Cole <gre...@yg...> wrote:
> Going back to the basics, how are you indexing? Are you using
> the rundig script? Are you aware that by default rundig always
> calls htdig with the -i option, causing the current databases
> to be deleted before it starts indexing?
>
> I apologize in advance if I am suggesting the obvious to you
> here; but a couple symptoms you describe make it sound like the
> databases are being deleted and then rebuilt from scratch.

Holy DIG! I feel soooo stupid now! Of course I used rundig, but since I
called it with -a, I figured it should not reindex the whole site.
Looking at the rundig script, of course there is this tiny little "-i"
flag in there. Bummer.

I apologize "deeply bend" to all of you for not seeing this. Usually I am
quite good at first looking at the most basic solution for a problem
(like connecting the power plug to a printer first before debugging the
printer driver) ;-)

Sorry again!
--
Stefan Seiz <http://www.StefanSeiz.com>
Spamto: <bi...@im...>
|
|
From: Jim C. <gre...@yg...> - 2002-09-17 23:34:03
|
Stefan Seiz's bits of Thu, 12 Sep 2002 translated to:
>I ran into a fairly nasty problem running htdig on:
>Mac OS X Server 10.1.2 Server (5P68) [ Mac OS X 10.1.2 (5P48)

I am also unable to duplicate the problem. I am currently using Mac OS X
10.2 (6C115).

>I realized htdig is always REINDEXING my complete site no matter if a
>document's last modification date matched the server's last mod date or not.
>Tracking down things, I found that ref->DocTime() ALWAYS returns 0 even if
>the given url has a mod date (m:XXXXXXX) value in the database (verified
>with htdump).

Going back to the basics, how are you indexing? Are you using the rundig
script? Are you aware that by default rundig always calls htdig with the
-i option, causing the current databases to be deleted before it starts
indexing?

I apologize in advance if I am suggesting the obvious to you here; but a
couple symptoms you describe make it sound like the databases are being
deleted and then rebuilt from scratch.

Jim
|
|
From: Gilles D. <gr...@sc...> - 2002-09-17 22:58:27
|
According to Lachlan Andrew:
> Here is a patch (url-format.patch) to allow the format of
> external_protocol URLs to be <protocol>:<path>, rather than
> <protocol>://<host>/<path>. It seems to work in the cases
> I've tested, but I'm not sure how to try it on the test
> suite, so I hope I haven't broken anything else... Please
> let me know if it needs more work.
>
> The patch is relative to 3.2.0b4-20020616. Let me know if
> you need it against a more recent snapshot.
>
> The HTML table for the description of the output expected
> from the external transport was also poorly formatted in
> this snapshot, so I've included another patch
> (return-field-table.patch) to fix that, if it hasn't
> already been done. If you apply this patch, do so first.

Thanks for the patches. I've gotten rid of the rowspans in the table in
external_protocols' description, as per your 2nd patch. For the 1st one,
I don't have time to test it myself, but I'd like to wait for
comments/testing from other developers if any are forthcoming.

To answer some of the questions you ask in the code of your patch, the
.get() after a sub seemed to be needed with some compilers, to avoid
warnings. For whatever reason, we couldn't just assign the result of a
.sub() to another String, even though .sub() is supposed to return a
String. The .get() gives us a (char *) from that, and everything seems to
work well that way.

On line 324 of URL.cc, you say "(should also check the slashes are
actually there...)". I agree, the code should do this. The way the slash
count is encoded in the Dictionary entries is kludgy, but I think there
are similar kludges elsewhere in the code. It's also self-contained in one
method of the URL class, so I don't have a problem with it.

Without actually testing it yet, that's about all I can suggest.
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
|
|
From: Gilles D. <gr...@sc...> - 2002-09-17 20:55:31
|
According to Stefan Seiz:
> On 17.9.2002 20:21 Uhr, Geoff Hutchison <ghu...@ws...> wrote:
> > 2) Where do you determine that ref->DocTime() is returning 0?
> > I ask, in part because htdump is going to access this as well:
> > fprintf(fl, "\tm:%d", (int) ref->DocTime());
> I added a trace print to Retriever.cc inside the Retriever::parse_url(URLRef
> &urlRef) routine like so:
>
> --- snip ---
> if (ref)
> {
> //
> // We already have an entry for this document in our database.
> // This means we can get the document ID and last modification
> // time from there.
> //
> current_id = ref->DocID();
> date = ref->DocTime();
> if (debug > 2)
> {
> cout << "\nDOC MATCHED DB!!! \n" << endl;
> cout << "DocTime Date is: " << date << endl;
> }
> --- snap ---
I think you might need a cast up there, i.e.:
cout << "DocTime Date is: " << (int) date << endl;
I don't know if there's a (ostream) << (time_t) operator defined, and it
might not be automatically casting date to (int) on its own. See if that
makes a difference.
> > 3) Are you sure the server is returning a Last-Modified header for files?
> Yes, I snooped the wire ;-)
>
> > 4) Does the server properly handle the If-Modified-Since header?
> > (To see that this header is sent, check in Document.cc line 525 or so for
> > the output sent by htdig.)
> It's Apache 1.3.26, so I guess it should. But I think htdig only sends the
> If-Modified-Since header if it finds a date for a URL in the current
> database, and as I assume that doesn't happen, the If-Modified-Since
> header never makes its way out.
>
> Here's an example url from my htdump file to prove a date is in there:
>
> 0 u:http://www.CENSORED.com/YADDA.html t:CENSORED a:0
> m:873819058 s:280 H: CENSORED h: l:1031854616
> L:0 b:1 c:0 g:0 e: n: S: d: A:
Well, that "m:" value is definitely non-zero, and definitely not as large
as the current time, so it does seem to be getting, parsing, and storing
Last-Modified header dates. But in your reply to my message on [htdig],
you said...
According to Stefan Seiz:
> On 17.9.2002 20:23 Uhr, Gilles Detillieux <gr...@sc...> wrote:
> > If that's the part you suspect is failing, then you should be able
> > to confirm that by running htdig -vvv. Look for the messages where
> > it outputs the Last-Modified header, and then says something like
> > "Converted ... to ...", which shows the original and regenerated date
> > string after parsing. If the second one is wrong, then you are right in
> > that the problem is somewhere in the parsing. In that case, try adding
> > trace prints in the parsedate() function in htdig/Document.cc (minimal
> > programming skills required, just look at how other debug output is done).
>
> I already tested this but unfortunately I don't get any Dates output when
> running with -vvv
That doesn't add up. If you are indeed running htdig version 3.1.6
with -vvv, then it MUST be showing the Last-Modified headers if your
server is returning them. Are you sure you're running a vanilla 3.1.6
installation, and not some severely modified variant of this, or another
version altogether? Can you show us a complete excerpt of htdig -vvv
output for one file, from one "Retrieval command for ..." message to
the next?
If you're not seeing those either, is it possible you're retrieving
via local_urls? In this case, there's no date parsing involved, as
htdig gets the modtime as a time_t already from the local filesystem.
But you did say you snooped the wire, so I'd guess this isn't the case.
> > If the second date string is fine, it could be a problem related to
> > refetching this info from the database, or some memory leak somewhere.
> > I thought you mentioned that an htdump showed correct, non-zero modtimes.
> > Such a problem would be harder to track down.
>
> Yes, datestamps (seconds since epoch) are in the dumped file.
>
> I'll add debug prints (already did some and always got a date of 0) to the
> files you mentioned and report back.
>
> Could you tell me which subroutine is responsible for parsing the timestamp
> from the local database (I guess reading and parsing that one is the
> problem)?
It's already parsed by the time it gets into the database. It starts out
as a date string from the server, in RFC850 or RFC1123 format, and the
parsedate() function in htdig/Document.cc converts that to a time_t, i.e.
a 32-bit integer representing seconds since Jan 1, 1970, 00:00:00 GMT.
It goes through some encoding and decoding as it gets stored in the
database (see DocumentRef::Serialize() and DocumentRef::Deserialize() in
htcommon/DocumentRef.cc), but then it goes through those same routines
when you get the number via htdump. It doesn't get converted back into
a date string until htsearch processes it, using strftime().
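As a standalone illustration of that conversion (not the actual
parsedate() code, which also handles the RFC850 and asctime() formats and
platforms without timegm()), the RFC1123 case looks roughly like this:

#include <time.h>
#include <string.h>

// Rough sketch only: convert an RFC1123 Last-Modified value such as
// "Tue, 17 Sep 2002 19:09:02 GMT" into a time_t.
static time_t
rfc1123_to_time_t(const char *s)
{
    struct tm tm;
    memset(&tm, 0, sizeof(tm));
    if (strptime(s, "%a, %d %b %Y %H:%M:%S GMT", &tm) == NULL)
        return (time_t) 0;
    return timegm(&tm);   // interpret the broken-down fields as UTC
}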
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
|
|
From: Stefan S. <tal...@in...> - 2002-09-17 19:09:02
|
On 17.9.2002 20:21 Uhr, Geoff Hutchison <ghu...@ws...> wrote:
> At least at the moment, I cannot reproduce this, which is why I have not
> responded sooner. So let me at least ask a few questions which might help:
>
> 1) Do you see actual, formatted dates in htsearch results? (These
> obviously need to do the same access.)
No, but I guess this is due to my templates being set so they don't display
any:
<strong><a href="$&(URL)">$&(TITLE)</a></strong> $(STARSLEFT)<br>
<ul>$(EXCERPT)</ul>
I'll try adding dates there and see what happens.
> 2) Where do you determine that ref->DocTime() is returning 0?
> I ask, in part because htdump is going to access this as well:
> fprintf(fl, "\tm:%d", (int) ref->DocTime());
I added a trace print to Retriever.cc inside the Retriever::parse_url(URLRef
&urlRef) routine like so:
--- snip ---
if (ref)
{
//
// We already have an entry for this document in our database.
// This means we can get the document ID and last modification
// time from there.
//
current_id = ref->DocID();
date = ref->DocTime();
if (debug > 2)
{
cout << "\nDOC MATCHED DB!!! \n" << endl;
cout << "DocTime Date is: " << date << endl;
}
--- snap ---
> 3) Are you sure the server is returning a Last-Modified header for files?
Yes, I snooped the wire ;-)
> 4) Does the server properly handle the If-Modified-Since header?
> (To see that this header is sent, check in Document.cc line 525 or so for
> the output sent by htdig.)
It's Apache 1.3.26, so I guess it should. But I think htdig only sends the
If-Modified-Since header if it finds a date for a URL in the current
database, and as I assume that doesn't happen, the If-Modified-Since
header never makes its way out.
Here's an example url from my htdump file to prove a date is in there:
0 u:http://www.CENSORED.com/YADDA.html t:CENSORED a:0
m:873819058 s:280 H: CENSORED h: l:1031854616
L:0 b:1 c:0 g:0 e: n: S: d: A:
--
<http://www.StefanSeiz.com>
Spamto: <bi...@im...>
|
|
From: Geoff H. <ghu...@ws...> - 2002-09-17 18:19:44
|
> I realized htdig is always REINDEXING my complete site no matter if a
> document's last modification date matched the server's last mod date or not.
> Tracking down things, I found that ref->DocTime() ALWAYS returns 0 even if
> the given url has a mod date (m:XXXXXXX) value in the database (verified
> with htdump).
> I guess, the problem lies in the conversion of the m:xxxxxx (seconds since
> epoch) value maybe somewhere in mktime.c or such???
At least at the moment, I cannot reproduce this, which is why I have not
responded sooner. So let me at least ask a few questions which might help:
1) Do you see actual, formatted dates in htsearch results? (These
obviously need to do the same access.)
2) Where do you determine that ref->DocTime() is returning 0?
I ask, in part because htdump is going to access this as well:
fprintf(fl, "\tm:%d", (int) ref->DocTime());
3) Are you sure the server is returning a Last-Modified header for files?
4) Does the server properly handle the If-Modified-Since header?
(To see that this header is sent, check in Document.cc line 525 or so for
the output sent by htdig.)
-Geoff
|
|
From: Lachlan A. <lh...@ee...> - 2002-09-16 00:59:30
|
Greetings Gilles,

Here is a patch (url-format.patch) to allow the format of
external_protocol URLs to be <protocol>:<path>, rather than
<protocol>://<host>/<path>. It seems to work in the cases I've tested,
but I'm not sure how to try it on the test suite, so I hope I haven't
broken anything else... Please let me know if it needs more work.

The patch is relative to 3.2.0b4-20020616. Let me know if you need it
against a more recent snapshot.

The HTML table for the description of the output expected from the
external transport was also poorly formatted in this snapshot, so I've
included another patch (return-field-table.patch) to fix that, if it
hasn't already been done. If you apply this patch, do so first.

Thanks for your help and advice :)

Lachlan

On Thu, 12 Sep 2002 02:50, Gilles Detillieux wrote:
> Would you be
> willing to implement it? If you can provide patches, I
> can make sure they make it into the CVS tree.
--
Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678
Dept of Electrical and Electronic Engg CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA 00116K
|
|
From: Geoff H. <ghu...@us...> - 2002-09-15 07:13:51
|
STATUS of ht://Dig branch 3-2-x
RELEASES:
3.2.0b4: In progress
(mifluz merge essentially finished, contact Geoff for patch to test)
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.
SHOWSTOPPERS:
KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with
wordlist_compress set, but working fine without wordlist_compress.
(The date is definitely stored correctly, even with compression on,
so this must be some sort of weird htsearch bug.)
* Not all htsearch input parameters are handled properly: PR#648. Use a
consistent mapping of input -> config -> template for all inputs where
it makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in the search_algorithms, $(WORDS) is not set
correctly: PR#650. (The documentation for 3.2.0b1 is updated, but can
we fix this?)
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#859)
PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
NEEDED FEATURES:
* Field-restricted searching.
* Return all URLs.
* Handle noindex_start & noindex_end as string lists.
TESTING:
* httools programs:
(htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
argument handling for parser/converter, allowing binary output from an
external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.
DOCUMENTATION:
* List of supported platforms/compilers is ancient.
* Add thorough documentation on htsearch restrict/exclude behavior
(including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#648.) Also make sure these config
attributes are all documented in defaults.cc, even if they're only set by
input parameters and never in the config file.
* Split attrs.html into categories for faster loading.
* require.html is not updated to list new features and disk space
requirements of 3.2.x (e.g. phrase searching, regex matching,
external parsers and transport methods, database compression.)
* TODO.html has not been updated for current TODO list and completions.
OTHER ISSUES:
* Can htsearch actually search while an index is being created?
(Does Loic's new database code make this work?)
* The code needs a security audit, esp. htsearch
* URL.cc tries to parse malformed URLs (which causes further problems)
(It should probably just set everything to empty) This relates to
PR#348.
|
|
From: Stefan S. <tal...@in...> - 2002-09-12 19:45:15
|
Hi,
sorry for the crosspost to both htdig lists, I wasn't sure which one
suited best.
I ran into a fairly nasty problem running htdig on:
Mac OS X Server 10.1.2 Server (5P68) [ Mac OS X 10.1.2 (5P48)
I realized htdig is always REINDEXING my complete site no matter if a
document's last modification date matched the server's last mod date or not.
Tracking down things, I found that ref->DocTime() ALWAYS returns 0 even if
the given url has a mod date (m:XXXXXXX) value in the database (verified
with htdump).
I must confess, I am not a programmer and can only find my way in c source
to track down the problem and put some debugging output in there - no more
no less.
I guess, the problem lies in the conversion of the m:xxxxxx (seconds since
epoch) value maybe somewhere in mktime.c or such???
Does anyone have any tips on how to fix this?
Here some parts which might be important to know from config.cache on my
platform:
ac_cv_func_localtime_r=${ac_cv_func_localtime_r=no}
ac_cv_func_timegm=${ac_cv_func_timegm=no}
ac_cv_header_sys_time_h=${ac_cv_header_sys_time_h=yes}
ac_cv_header_time=${ac_cv_header_time=yes}
ac_cv_struct_tm=${ac_cv_struct_tm=time.h}
I'd appreciate any tips, as I really can't afford to always kind of start
from scratch when indexing my site, as it consumes too much bandwidth and
time (lots of large PDF files).
Thanks a lot.
--
<http://www.StefanSeiz.com>
Spamto: <bi...@im...>
|
|
From: Lachlan A. <lh...@ee...> - 2002-09-12 10:16:32
|
On Thu, 12 Sep 2002 02:50, Gilles Detillieux wrote:
> According to Lachlan Andrew:
> > I was thinking of having a list of
> > known services, specifying the number of leading
> > slashes:
> That sounds like a great idea to me. Would you be
> willing to implement it? If you can provide patches, I
> can make sure they make it into the CVS tree.

I'll gladly give it a go, and let you know how I get on...

Cheers,
Lachlan
--
Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678
Dept of Electrical and Electronic Engg CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA 00116K
|
|
From: Brian W. <bw...@st...> - 2002-09-12 08:42:03
|
At 12:46 12/09/2002, Geoff Hutchison wrote:
>On Thu, 12 Sep 2002, Brian White wrote:
>
> > the "Yes" and "No" responses as expected ) but I have
> > no idea how to propagate my AC_DEFINE down to the
> > C++ code.
>
>Run "autoheader" and it will update the include/htconfig.h.in , which
>becomes htconfig.h which is #included in the code. You should get your
>DEFINE set, and then you can #ifdef, #ifndef, etc.
Cool. That worked.
> > Are there any problems with doing this? Are there
> > any conventions I should be following that I am not?
>
>Personally, if I don't have a system-level file-locking system, I'd
>implement one in user code, say with a .lock file in the same directory
>(if writable) or using TMPDIR and something like
>
>$TMPDIR/path-to-file.lock
>
>for locking /path/to/file.
>
>I've actually felt this might not be a bad addition to the htdig code
>anyway, since the databases really should be locked against multi-write
>until they have granularity.
>
>Does that make any sense?
Well, all except the "until they have granularity" bit -
what does that mean?
Well, I can write something for this if you like.
Issues include
1) What do we do about "dangling" locks? ( where
lock files get left behind due to a crash or some
other kind of bug)
Do it specify a timeout time ( which requires accurate
clocks if it has to work across machines ) and how is
it specified? Or could we use unix PIDs ( which only
work
2) If locks have to be valid across machines, the
TMPDIR isn't going to work unless it is explicitly
put into a common area?
3) Also - do locks need to work between different
programs? What if they have different
permissions - how do we
The more I think about it, actually, the more I think
using TMPDIR is a bad idea. Anything that needs to create
a lock which matters is going to be doing it for a write
file, which generally will give them write access to
the directory.
Either that or locks ALWAYS go in TMPDIR ( which still leaves
you with issue 2)
Brian
One solution is to specify a timeout of some kind - though
that kind of depends on being confident that all processes
that use the lockfile have the same time reference, whether
that is because they all run on the same machine, or they
run on machines that are at least close to all having the
same time.
Another solution is to use the process id - but that
only works on unix, and you either need to be running
on the same machine or have the ability to access the
other machine's process list. There is also the lurking
danger caused by the fact that PIDs are recycled.
Any thoughts?
2) Writing to TMPDIR will probably only work for
processes running on the same machine.
>-Geoff
-------------------------
Brian White
Step Two Designs Pty Ltd
Knowledge Management Consultancy, SGML & XML
Phone: +612-93197901
Web: http://www.steptwo.com.au/
Email: bw...@st...
Content Management Requirements Toolkit
112 CMS requirements, ready to cut-and-paste
|
|
From: Geoff H. <ghu...@ws...> - 2002-09-12 04:32:13
|
On Wed, 11 Sep 2002, Geoff Hutchison wrote:
> AFAICT, some of the htword/ destructors seem to set _config to NULL, which
> then kills the global HtConfiguration::config pointer. This is mostly a
> problem in htsearch currently, where things hit Display and go splat.

No, I was just wrong. This seems to work, so I'll clean things up a bit
and release another snapshot soon (probably tomorrow night or Friday).

-Geoff
|
|
From: Jim C. <gre...@yg...> - 2002-09-12 03:49:33
|
Brian White's bits of Thu, 12 Sep 2002 translated to:
>I have been playing with my Adjustable Logging patch.
>I now have it working for 3.2.0b3, but I am stuck on

If possible, it might be more useful to create a patch against a b4
snapshot. It is usually recommended that people don't use the old b3
release anymore, so the patch might be of limited usefulness in that
sense.

Jim
|
|
From: Geoff H. <ghu...@ws...> - 2002-09-12 02:54:57
|
> Is there any reason we can't move to using the global
>
> _config = HtConfiguration::config();

In the merge, I'm trying to go on the rule of "least possible changes,"
which is still quite a lot, even in the ht://Dig code (rather than db/
or htword/).

AFAICT, some of the htword/ destructors seem to set _config to NULL,
which then kills the global HtConfiguration::config pointer. This is
mostly a problem in htsearch currently, where things hit Display and go
splat. Then again, I've been hunting after the same problem in Display
when the *copy* of _config gets hammered and htsearch segfaults, so who
can say it's worse...

-Geoff
|
|
From: Geoff H. <ghu...@ws...> - 2002-09-12 02:46:23
|
On Thu, 12 Sep 2002, Brian White wrote: > the "Yes" and "No" responses as expected ) but I have > no idea how to propogate my AC_DEFINE down to the > C++ code. Run "autoheader" and it will update the include/htconfig.h.in , which becomes htconfig.h which is #included in the code. You should get your DEFINE set, and then you can #ifdef, #ifndef, etc. > Are there any problems with doing this? Are there > any conventions I should be following that I am not? Personally, if I don't have a system-level file-locking system, I'd implement one in user code, say with a .lock file in the same directory (if writable) or using TMPDIR and something like $TMPDIR/path-to-file.lock for locking /path/to/file. I've actually felt this might not be a bad addition to the htdig code anyway, since the databases really should be locked against multi-write until they have granularity. Does that make any sense? -Geoff |
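Purely as a sketch of the .lock-file idea discussed above (not code from
the htdig tree; the helper names and the stale-lock policy are left as
assumptions), an exclusively-created lock file next to the database could
look like this:

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>

// Sketch: advisory lock via an exclusively-created lock file, e.g.
// "db.docdb.lock" next to db.docdb. Returns the open fd on success,
// -1 if someone else already holds the lock.
static int
acquire_lockfile(const char *db_path, char *lockname, size_t len)
{
    snprintf(lockname, len, "%s.lock", db_path);
    // O_CREAT|O_EXCL makes creation atomic; a stale file left behind by
    // a crash would need the timeout/cleanup policy discussed above.
    int fd = open(lockname, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0 && errno == EEXIST)
        return -1;              // lock already held (or stale)
    return fd;
}

static void
release_lockfile(int fd, const char *lockname)
{
    close(fd);
    unlink(lockname);
}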