

#122 db.urls causing htdig to run forever

cannot reproduce
htdig (103)


I activated the generation of the db.urls file. The script I'm using also calls htdig with the "-a"
option, which causes the tool to generate *.work files
while running. It does this ... except for db.urls.

So on my 2.2.18 kernel it will run into the 2 GB file size
problem, because the script did not delete this
file before calling htdig. I have now added this. But I would like
to see a .work file created for this file as well,
so that I can move it into the place of the old file after
everything is finished.
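
The workaround described above (moving db.urls out of the way before each run) could be sketched roughly as follows. This is only an illustration: the function name `rotate_urls_file`, the `.old` suffix, and the 1 GiB threshold are assumptions made for the example and are not part of ht://Dig.

```python
import os

# Hypothetical threshold: rotate once the file passes 1 GiB, well below
# the 2 GB limit mentioned in the report.
ROTATE_AT_BYTES = 1 << 30


def rotate_urls_file(path, limit=ROTATE_AT_BYTES):
    """Move `path` to `path + '.old'` if it is larger than `limit`.

    Returns True if the file was rotated, False otherwise.
    """
    try:
        size = os.path.getsize(path)
    except FileNotFoundError:
        return False  # nothing to rotate yet
    if size <= limit:
        return False
    # os.replace overwrites any previous .old copy in a single rename.
    os.replace(path, path + ".old")
    return True
```

An indexing script would call this on its db.urls path just before starting htdig, so the next run begins with a fresh file.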




  • Logged In: YES

    htdig -a doesn't add a .work suffix to db.urls because there isn't any compelling reason to do so. The file isn't used
    by any other program, so there's no need to maintain separate .work and non-work versions of the file. However,
    when you run htdig without -i, it will always append to this file, so if you're concerned about overflowing it, then you
    should probably process it to remove duplicates, or just move or remove it, between indexing runs.
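
As a rough sketch of the "remove duplicates" option, the following assumes db.urls is a plain text file with one URL per line (an assumption, not something confirmed in this thread). The helper name `dedupe_lines` and the `.tmp` suffix are made up for the example.

```python
import os


def dedupe_lines(path):
    """Rewrite `path`, keeping only the first occurrence of each line.

    Assumes one URL per line. The distinct lines are held in memory, so
    this suits files whose set of unique URLs fits in RAM.
    Returns the number of lines kept.
    """
    seen = set()
    tmp_path = path + ".tmp"
    with open(path, "rb") as src, open(tmp_path, "wb") as dst:
        for line in src:
            key = line.rstrip(b"\r\n")
            if not key or key in seen:
                continue  # skip blank lines and repeats
            seen.add(key)
            dst.write(key + b"\n")
    # Swap the cleaned file into place only after it is fully written.
    os.replace(tmp_path, path)
    return len(seen)
```

Run between indexing passes, this keeps the file from growing with repeated entries, although it does nothing about the total size of the unique URLs themselves.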

    It wouldn't make sense for htdig -a to make a .work version of db.urls, and for your script to replace the older
    db.urls with that file, because the latter file is incomplete, as it only contains the URLs that were seen in updated
    documents, and not those seen in documents still in the index but not updated. I think you need to decide what you
    want to do with this file and do it in your indexing script. I don't think the create_urls_list feature is commonly used in
    a production environment like this, as it's more for debugging purposes, as far as I know.

    In any case, I don't see why this would cause htdig to run forever. htdig doesn't read this file, it only writes to it, so if
    it overflows, htdig would get write errors (which it would ignore) and move on. Maybe you have another problem
    that's causing htdig not to stop when you want it to. Running htdig with a -v option will show you the URLs it's
    actually fetching and indexing, as opposed to the ones it merely sees in documents (which is what db.urls contains).

    As this seems to be more of a configuration problem than a bug, any followups should be on the htdig-general
    mailing list rather than the bug database.

    • assigned_to: nobody --> grdetil
    • status: open --> closed
  • Logged In: YES

    I would like you to check before just guessing that there is no
    bug but a misinterpretation of the functions, and so I don't
    understand why you can close this bug report so easily and

    Try it, hang it, remove the bug.

    I'm now deleting the file before each new run of htdig. And I get
    all URLs into this file anyway, because our site is not sending the
    "LAST_MODIFIED_SINCE" header. So deleting it is not a
    problem for me.

    So once again: Test it before just guessing that it's only my



    • milestone: --> 103281
    • status: closed --> pending
  • Logged In: YES

    Please try to understand that with the little amount of time that the few of us developers have to debug and
    develop ht://Dig, we have to prioritize where that time is spent, and can't go on wild goose chases every time a
    user suspects that a problem they've run into is caused by a bug rather than a misconfiguration. What that means
    is that when we get a clearly worded bug report that makes a compelling case for a problem being fairly solidly
    pinned down to a bug in the code, those are the reports we will tend to act on first. Next priority will be given
    to those reports which, while not clearly pinning down a bug, will at least report what seems plausibly to be a
    bug, or that fairly clearly rule out a configuration problem.

    When we get the impression from a bug report that it's more likely to be a configuration problem than a clear bug,
    we will usually ask the person reporting the bug to rule out any configuration problems before taking valuable
    time to look for a bug. We'll also ask for more details about the problem so that we have something solid to go
    on. Perhaps I was hasty in closing the bug report, so I've now marked it as pending, while we await some solid
    data, but there is insufficient information in this report to justify even attempting to reproduce the problem at
    this point.

    If you want us to take time to track down what you seem convinced is a bug and not a misconfiguration, you will
    have to convince us of this, by providing some sort of evidence that you have actually ruled out a
    misconfiguration, plus you will have to give us enough information so that we will have any hope at all of
    actually reproducing the error.

    Given that there are tons of possible configuration problems that could cause htdig to go on longer than expected,
    can you provide any information at all that would suggest you've actually looked into this possibility? Have you:

    1) Looked at any htdig -v output to see that htdig is indexing only URLs you want indexed, and not going on to
    URLs you don't want?

    2) Determined clearly that the 2 GB file size limit is indeed the cause of the problem? Your bug report didn't
    indicate the exact size of your db.urls file, before or after htdig seemed to run "forever". Does htdig
    consistently end its run when you think it should when db.urls is small or non-existent, and consistently fail or
    run on when db.urls hits the 2 GB limit?

    3) Determined what you mean by "forever"? How many documents are you indexing, and how did you determine that
    htdig is clearly running far longer than is reasonable for that number of documents?

    4) Determined how the create_urls_list attribute and/or the -a option to htdig contribute to the problem? Assuming
    you've done the tests in (2) above, do the results change if you don't use the -a option? Does htdig run to
    completion in the expected amount of time when you disable the create_urls_list attribute?

    In other words, to throw your own words back at you, test it before just guessing that it's only our fault. The
    evidence you've given us so far doesn't give any indication that it's likely to be a bug rather than a
    configuration problem. The onus is on you to rule out the latter.

    If you've looked into these possibilities, and the evidence then points to a bug, then we would, at the very
    least, need to know which version of htdig you're running, so we can try to reproduce the problem on our machines.
    I don't personally have any that run the 2.2.18 kernel, but I do have two Red Hat Linux 6.2 systems, one with a
    2.2.14 kernel and one with 2.2.19. If the problem is specific to the 2.2.18 kernel, though, and doesn't happen on
    other kernel versions, then the answer would be pretty simple - change your kernel version. However, for purposes
    of trying to reproduce an alleged bug in ht://Dig, it would be ridiculous to attempt this without even knowing
    which version you're running (and making sure that the latest version still fails the same way).

    • status: pending --> open
  • Logged In: YES


    Answer to question 1) I did not run htdig with the -v option, but
    the report tells me that everything with indexing is ok:

    htmerge: Total word count: 259292
    htmerge: Total documents: 28958
    htmerge: Total size of documents (in K): 402742

    The number of documents matches what I expect.

    Answer to question 2) I'm investigating now to deliver the
    needed information. But this will take some days, because I
    will not fill the db.urls file with nonsense data to reproduce the
    bug earlier.

    For Q2 and Q3) When the db.urls file is small or non-existent,
    ht://Dig does its job and ends after indexing our site
    for about 3.5 hours. When the db.urls file is too big, ht://Dig tries
    to deliver the data but cannot append it and hangs for days.
    So each day a new ht://Dig is started by cron and also hangs
    and consumes processor time. Also take a look at the answers
    above for the number of documents and the time for indexing.
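
Independently of the underlying cause, the pile-up of daily cron jobs described above can be avoided with a lock that a wrapper takes before starting the indexer. The following is a minimal sketch, assuming a Unix system where Python's `fcntl` module is available; `try_lock`, `release_lock`, and the lock file path are illustrative names, not part of ht://Dig.

```python
import fcntl
import os


def try_lock(lock_path):
    """Return a file descriptor holding an exclusive lock on `lock_path`,
    or None if the lock is already held elsewhere."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release_lock(fd):
    """Drop the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A cron wrapper would call `try_lock` first and exit straight away when it returns None, so a hung run from the previous day blocks new runs instead of competing with them. The kernel drops the lock when the holding process exits, so a crashed run does not leave a stale lock behind.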

    When ht://Dig's create_urls_list attribute was deactivated, it did
    its job. I started to use it some months ago to get a list of all
    external URLs.

    And at last: The ht://Dig version we're using is 3.1.6.

    P.s.: Closed again?



  • Logged In: YES


    Ahh, not closed again. The information at the bottom of
    the page was unclear.


  • Logged In: YES

    Just to keep you updated ...

    I'm still trying to reproduce the bug. But it's very hard,
    because the bigger the db.urls file grows, the more the ht://Dig
    process starts to fail somewhere while indexing the server:

    ---- snip ----
    htdig: Errors to take note of:
    Unknown host:
    Unable to contact server:
    ---- snap ----

    Every day I get a cron report mail of between 600k and 1000k in
    size. That means that the db.urls file only grows slowly:

    ---- snip ----
    1844873016 Jul 25 04:56 db.urls
    1834000864 Jul 24 05:26 db.urls
    1813749360 Jul 23 05:11 db.urls
    1798281658 Jul 22 05:10 db.urls
    1782762961 Jul 21 05:04 db.urls
    1129240811 Jun 29 04:46 db.urls
    1121530624 Jun 28 07:01 db.urls
    1058434095 Jun 27 04:44 db.urls
    320350246 Jun 9 07:30 db.urls
    241763669 Jun 8 07:38 db.urls
    163264744 Jun 7 07:34 db.urls
    84943567 Jun 6 07:40 db.urls
    6767328 Jun 5 04:46 db.urls
    ---- snap ----
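
For a rough idea of how long it would take to reach the limit at this pace, the five most recent samples from the listing can be fed into a simple linear estimate. This is a sketch only: it assumes the limit is 2^31 - 1 bytes (the largest offset a signed 32-bit file offset can hold), and since the listing carries no year, the `year` parameter is an arbitrary placeholder used purely for date arithmetic.

```python
from datetime import datetime

LIMIT = 2**31 - 1  # assumed limit: largest signed 32-bit file offset

# The five most recent samples from the listing, oldest first.
SAMPLES = [
    (1782762961, "Jul 21"),
    (1798281658, "Jul 22"),
    (1813749360, "Jul 23"),
    (1834000864, "Jul 24"),
    (1844873016, "Jul 25"),
]


def average_daily_growth(samples, year=2000):
    """Average bytes per day between the first and last sample.

    `year` is a placeholder; the listing does not include one.
    """
    first_size, first_day = samples[0]
    last_size, last_day = samples[-1]
    fmt = "%b %d %Y"
    start = datetime.strptime(f"{first_day} {year}", fmt)
    end = datetime.strptime(f"{last_day} {year}", fmt)
    return (last_size - first_size) / (end - start).days


def days_until_limit(samples, limit=LIMIT):
    """Linear estimate of days left before the file reaches `limit`."""
    return (limit - samples[-1][0]) / average_daily_growth(samples)
```

With these samples the average growth is about 15.5 MB per day, which gives an estimate of roughly 19 days. Growth in the listing is uneven, so this is an order-of-magnitude guide and not a prediction.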



  • Logged In: YES


    Faster than expected, the result has shown up!

    Yesterday ht://Dig filled up db.urls to the 2 GB file size and
    is now blocked. This morning a second ht://Dig process was started:

    ---- snip ----
    31123 ? S 0:00 /bin/sh -c nice -n19
    >/var/log/dig.log 2>/var/log/dig.err
    31124 ? SN 0:00 sh
    31153 ? RN 1293:59 \_
    /home/a/ -s -a -c
    21771 ? S 0:00 /bin/sh -c nice -n19
    >/var/log/dig.log 2>/var/log/dig.err
    21772 ? SN 0:00 sh
    21801 ? RN 62:45 \_
    /home/a/ -s -a -c
    ---- snap ----

    And a listing of the db dir gives the following result:

    ---- snip ----
    -rw-r--r-- 1 amiganew amiganew 117592064 Jul 29 08:53
    -rw-r--r-- 1 amiganew amiganew 47990784 Jul 30 05:43
    -rw-r--r-- 1 amiganew amiganew 4451328 Jul 29 08:53
    -rw-r--r-- 1 amiganew amiganew 2147483645 Jul 31 06:44
    -rw-r--r-- 1 amiganew amiganew 189888832 Jul 29 08:50
    -rw-r--r-- 1 amiganew amiganew 88429122 Jul 31 04:30
    -rw-r--r-- 1 amiganew amiganew 163634176 Jul 29 08:50
    ---- snap ----

    I will now delete the db.urls file to see if the processes
    continue to run again. But as you can see, the db.urls file is
    exactly 2 GB in size.
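
To put a number on "exactly 2 GB": assuming the limit in question is 2^31 - 1 bytes (the largest offset a signed 32-bit file offset can hold, which is an assumption about this system and not something stated in the thread), the 2147483645-byte entry in the listing above sits just under it. The helper name `headroom` is made up for this sketch.

```python
LIMIT = 2**31 - 1  # 2147483647, assumed maximum file size in bytes


def headroom(size, limit=LIMIT):
    """Bytes that can still be appended before `size` reaches `limit`."""
    return limit - size


# Size of the largest entry in the directory listing above.
print(headroom(2147483645))  # prints 2
```

With only two bytes left, any further append of a whole URL would fail, which is consistent with the file having stopped growing.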


