From: <no...@so...> - 2002-03-04 15:44:22

Bugs item #467941, was opened at 2001-10-04 20:42
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=467941&group_id=12104

Category: None
Group: None
Status: Open
Resolution: None
Priority: 3
Submitted By: Alex Babanin (babanin)
Assigned to: Nobody/Anonymous (nobody)
Summary: raw GIF data in the crawler log

Initial Comment:
Getting the subject, see below. I did not see it before the creation of the exclusion file:

D * * * * A * www.todmaffin.com * *

Still, it may be a coincidence.

Cheers,
Alex

18:18:08 OASIS_CRAWLER Connected to www.todmaffin.com!
18:18:08 OASIS_CRAWLER Sending command [GET /speaking/autodealers.gif HTTP/1.0
Proxy-Connection: Keep-Alive
User-Agent: Mozilla/4.04 [en] (X11; I; SunOS 5.6 i86pc)
Pragma: no-cache
HOST: www.todmaffin.com
Accept: */*
Accept-Language: en; ru
Accept-Charset: *
] ...
18:18:08 OASIS_CRAWLER 221 bytes were send of 221
18:18:08 OASIS_CRAWLER Continue reading: 0(+1448) of 11018; attempts: 0
18:18:08 OASIS_CRAWLER Continue reading: 1448(+1448) of 11018; attempts: 0
18:18:09 OASIS_CRAWLER Continue reading: 2896(+1448) of 11018; attempts: 0
18:18:09 OASIS_CRAWLER Continue reading: 4344(+1448) of 11018; attempts: 0
18:18:09 OASIS_CRAWLER Continue reading: 5792(+1448) of 11018; attempts: 0
18:18:09 OASIS_CRAWLER Continue reading: 7240(+1448) of 11018; attempts: 0
...
18:18:09 OASIS_CRAWLER Total recieved: 11018
18:18:09 OASIS_CRAWLER ===== PAGE CONTENT ====
GIF89a... [raw binary GIF data]

----------------------------------------------------------------------

>Comment By: Igor Nekrestyanov (nis)
Date: 2002-03-04 18:38

Message:
Logged In: YES
user_id=91810

I do not think that putting raw data into the log file is really a big problem. In fact, this is exactly the expected behavior: the content of the fetched page is placed in the log. We could probably add "pretty printing" here, but that is definitely a new feature.

Moreover, this time your exclusion file permits fetching gif files from this site. This can easily be avoided by adding a deny rule for gif files from any site. However, I agree it would be nice to implement support for "accepted/rejected" MIME types.

----------------------------------------------------------------------

Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-10-07 00:26

Message:
Logged In: YES
user_id=82947

I see several different issues here.

1. The MIME type returned by the server is totally ignored. In fact, loads of things can be done as filters for certain types (e.g., conversion of PDF to text).

2. Detection of binary files even when they are reported as text due to misconfiguration of a Web server. It is fairly easy to filter out files like the previously provided example. This comes hand in hand with charset and language detection.

I suggest making a separate feature request out of the first issue and leaving the second one as a bug.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=467941&group_id=12104
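
A minimal sketch of the "accepted/rejected" MIME-type support suggested above, checking the Content-Type response header against a reject list. The names rejected_types and mime_type_rejected are illustrative, not part of the crawler:

    /* Sketch only: a deny list for MIME types; names are hypothetical. */
    #include <string.h>
    #include <strings.h>

    static const char *rejected_types[] = { "image/gif", "image/jpeg", NULL };

    /* Return 1 if a response with this Content-Type should be skipped. */
    static int mime_type_rejected(const char *content_type)
    {
        int i;
        for (i = 0; rejected_types[i] != NULL; i++) {
            size_t n = strlen(rejected_types[i]);
            /* Match the type prefix, so "image/gif; charset=..." counts. */
            if (strncasecmp(content_type, rejected_types[i], n) == 0)
                return 1;
        }
        return 0;
    }

The same check, run on the header before the body is read, would also keep raw image bytes out of the log.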
From: <no...@so...> - 2002-03-04 15:29:29

Bugs item #522597, was opened at 2002-02-25 22:24
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522597&group_id=12104

Category: system-wide
Group: None
Status: Open
Resolution: None
Priority: 3
Submitted By: Mikhail Bessonov (mbessonov)
>Assigned to: Ekaterina Pavlova (katyapav)
Summary: --enable-isearch-as-lib breaks crawler

Initial Comment:
When --enable-isearch-as-lib=<path> is specified for the configure script, the crawler build fails:

/usr/bin/gcc -o crawler main.o libcrawler.a \
../common/QueryProtocol/lib-common-qp.a \
../common/HtmlParser/lib-common-parser.a \
../common/profiles/lib-common-profiles.a \
../common/stemmers/lib-common-stem.a \
../common/liboasisse.a -L/usr/local/lib -lilu-c -lilu \
-lpthread -lz -ltucs -lIsearch -lstdc++ -lm \
-lcrypt -L/usr/local/lib -lrainbow -lbow
/usr/bin/ld: cannot find -lIsearch

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522597&group_id=12104
From: <no...@so...> - 2002-03-04 15:29:05

Bugs item #522545, was opened at 2002-02-25 20:52
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522545&group_id=12104

Category: crawler
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 3
Submitted By: Mikhail Bessonov (mbessonov)
>Assigned to: Igor Nekrestyanov (nis)
Summary: some libbow headers not found

Initial Comment:
The configure script looks for kl.h and friends under $with_libbow_home/include. The default 'make install' from libbow, however, copies only libbow.h there. So either the patch for the libbow makefile should copy the other headers to $with_libbow_home/include, or the configure script should look in the source directory of libbow; just $with_libbow_home is OK.

----------------------------------------------------------------------

>Comment By: Igor Nekrestyanov (nis)
Date: 2002-03-04 18:28

Message:
Logged In: YES
user_id=91810

I have changed the recommended patch (see crawler/README) to copy the missing header files. Also, it now patches Makefile.in instead of the generated Makefile itself :)

Hopefully this problem is gone now. Please reopen if not.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522545&group_id=12104
From: <no...@so...> - 2002-02-28 22:43:07

Bugs item #524079, was opened at 2002-02-28 23:43
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=524079&group_id=12104

Category: crawler
Group: None
Status: Open
Resolution: None
Priority: 9
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: crawler does not recover after segfault

Initial Comment:
The new filter subsystem does not survive the second fork(). The crawler used to fork a new harvester when the old one died for some reason (e.g., weird HTML in the downloaded page). Now the last screams in the log are:

Selected [bow_filter] filtering method as defaut
bow_words_read_from_fp: The vocabulary map has already been created.
22:07:21 OASIS_CRAWLER Parent got signal SIGABRT

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=524079&group_id=12104
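
For context, a sketch of the fork-and-respawn pattern the report describes (the crawler forking a new harvester when the old one dies). run_harvester() is a hypothetical stand-in, not crawler code; the comment marks where inherited filter state would have to be reinitialized to avoid the abort seen in the log above:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern void run_harvester(void);    /* hypothetical worker */

    int main(void)
    {
        int status;
        for (;;) {
            pid_t pid = fork();
            if (pid < 0) { perror("fork"); exit(1); }
            if (pid == 0) {
                /* Child: global filter state (here, the libbow vocabulary)
                 * must be (re)initialized, not inherited, or the second
                 * child can abort as in the log above. */
                run_harvester();
                _exit(0);
            }
            waitpid(pid, &status, 0);   /* parent waits for the child */
            if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                break;                  /* clean exit: all done */
            fprintf(stderr, "harvester died (status %d), respawning\n",
                    status);
        }
        return 0;
    }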
From: <no...@so...> - 2002-02-25 19:24:42

Bugs item #522597, was opened at 2002-02-25 11:24
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522597&group_id=12104

Category: system-wide
Group: None
Status: Open
Resolution: None
Priority: 3
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: --enable-isearch-as-lib breaks crawler

Initial Comment:
When --enable-isearch-as-lib=<path> is specified for the configure script, the crawler build fails:

/usr/bin/gcc -o crawler main.o libcrawler.a \
../common/QueryProtocol/lib-common-qp.a \
../common/HtmlParser/lib-common-parser.a \
../common/profiles/lib-common-profiles.a \
../common/stemmers/lib-common-stem.a \
../common/liboasisse.a -L/usr/local/lib -lilu-c -lilu \
-lpthread -lz -ltucs -lIsearch -lstdc++ -lm \
-lcrypt -L/usr/local/lib -lrainbow -lbow
/usr/bin/ld: cannot find -lIsearch

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522597&group_id=12104
From: <no...@so...> - 2002-02-25 17:52:43

Bugs item #522545, was opened at 2002-02-25 09:52
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522545&group_id=12104

Category: crawler
Group: None
Status: Open
Resolution: None
Priority: 3
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: some libbow headers not found

Initial Comment:
The configure script looks for kl.h and friends under $with_libbow_home/include. The default 'make install' from libbow, however, copies only libbow.h there. So either the patch for the libbow makefile should copy the other headers to $with_libbow_home/include, or the configure script should look in the source directory of libbow; just $with_libbow_home is OK.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522545&group_id=12104
From: <no...@so...> - 2002-01-31 15:53:27

Bugs item #486088, was opened at 2001-11-27 09:51
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=486088&group_id=12104

Category: crawler
Group: CRAWLER_1_0
Status: Open
Resolution: None
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
>Assigned to: Mikhail Bessonov (mbessonov)
Summary: ModificationDate attribute incorrect

Initial Comment:
The ModificationDate attribute is always set to the same value, something like '1.10.98'. By the way, what is the correct format? yyyy-mm-dd?

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=486088&group_id=12104
From: <no...@so...> - 2002-01-31 15:51:13

Bugs item #467935, was opened at 2001-10-04 09:34
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=467935&group_id=12104

>Category: crawler
Group: None
>Status: Closed
>Resolution: Duplicate
Priority: 9
Submitted By: Alex Babanin (babanin)
Assigned to: Nobody/Anonymous (nobody)
Summary: Failed to resolve hostname www.yahoo.com

Initial Comment:
Hi! There is some problem (probably because of a timeout) with resolving hosts. I am getting repeating messages in the log like:

].DERR: 16:34:34 OASIS_CRAWLER Failed to resolve hostname [www.yahoo.com
].DERR: 16:34:34 OASIS_CRAWLER Failed to resolve hostname [www.yahoo.com
].DERR: 16:34:34 OASIS_CRAWLER Failed to resolve hostname [www.yahoo.com
].DERR: 16:43:52 OASIS_CRAWLER Failed to resolve hostname [www.pursuitwatch.com
].DERR: 16:43:52 OASIS_CRAWLER Failed to resolve hostname [www.pursuitwatch.com
].DERR: 16:43:52 OASIS_CRAWLER Failed to resolve hostname [www.pursuitwatch.com
].DERR: 16:44:09 OASIS_CRAWLER Failed to resolve hostname [www.wirelessint.com
].DERR: 16:44:09 OASIS_CRAWLER Failed to resolve hostname [www.wirelessint.com
].DERR: 16:44:09 OASIS_CRAWLER Failed to resolve hostname [www.wirelessint.com
].DERR: 16:44:10 OASIS_CRAWLER Failed to resolve hostname [www.rcrnews.com
].DERR: 16:44:10 OASIS_CRAWLER Failed to resolve hostname [www.rcrnews.com
].DERR: 16:44:10 OASIS_CRAWLER Failed to resolve hostname [www.rcrnews.com
].DERR: 16:44:18 OASIS_CRAWLER Failed to resolve hostname [www.cellzone.com.au
].DERR: 16:44:18 OASIS_CRAWLER Failed to resolve hostname [www.cellzone.com.au
].DERR: 16:44:18 OASIS_CRAWLER Failed to resolve hostname [www.cellzone.com.au
STDERR: 16:47:19 OASIS_CRAWLER Failed to resolve hostname [www.wirelessadassociation.org].
STDERR: 16:47:19 OASIS_CRAWLER Failed to resolve hostname [www.wirelessadassociation.org].
STDERR: 16:47:19 OASIS_CRAWLER Failed to resolve hostname [www.wirelessadassociation.org].
STDERR: 16:59:44 OASIS_CRAWLER Failed to open URLs file [/var/state/oasis/OASIS-Crawler/state/open/Temp_URLs.Bayer] in crawler_store_URL_list
STDERR: 16:59:44 OASIS_CRAWLER Failed to open visited_robots_txt file [/var/state/oasis/OASIS-Crawler/state/robots-txt/Temp_Visited_Robots_txt.Bayer] in crawler_store_robots_txt_hash in write mode
STDERR: 17:07:38 OASIS_CRAWLER Failed to resolve hostname [www.tetramou.net].
STDERR: 17:08:53 OASIS_CRAWLER Failed to resolve hostname [www.tetramou.net].
STDERR: 17:10:08 OASIS_CRAWLER Failed to resolve hostname [www.tetramou.net].
...

----------------------------------------------------------------------

>Comment By: Mikhail Bessonov (mbessonov)
Date: 2002-01-31 07:51

Message:
Logged In: YES
user_id=82947

The problem is another instance of the newline (\n) or CR (\r) in domain names problem, bug #497614. It is supposedly fixed, and closed today. All hosts in the bug description except the last one, www.tetramou.net, end in a newline. www.tetramou.net really does not exist.

----------------------------------------------------------------------

Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-10-06 13:42

Message:
Logged In: YES
user_id=82947

gethostbyname() is used to resolve names (which is probably right due to better portability than the <resolv.h> family). However, the value of errno is not checked in order to see if a transient or permanent error occurred. Keeping an open TCP connection to a name server is probably a better idea than using UDP datagrams (as happens now with default resolver configurations). sethostent(1) must be called to switch to TCP, and endhostent() to disconnect.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=467935&group_id=12104
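
A sketch of the retry logic the second comment asks for. One detail to hedge: gethostbyname() reports failures through h_errno rather than errno, with TRY_AGAIN marking transient errors; sethostent(1) and endhostent() switch the resolver to a persistent TCP connection as the comment describes. The function is illustrative, not crawler code:

    #include <netdb.h>
    #include <stddef.h>

    static struct hostent *resolve(const char *name, int max_retries)
    {
        int i;
        sethostent(1);                  /* keep a TCP connection open */
        for (i = 0; i <= max_retries; i++) {
            struct hostent *he = gethostbyname(name);
            if (he != NULL) {
                endhostent();
                return he;
            }
            if (h_errno != TRY_AGAIN)   /* permanent failure: give up */
                break;
        }
        endhostent();                   /* drop the TCP connection */
        return NULL;
    }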
From: <no...@so...> - 2002-01-31 14:00:51

Bugs item #497614, was opened at 2001-12-29 07:22
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=497614&group_id=12104

Category: crawler
Group: None
>Status: Closed
Resolution: Fixed
Priority: 7
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Igor Nekrestyanov (nis)
Summary: newline in URL causes confusion

Initial Comment:
When a newline occurs in a hyperlink, like <a href="www.host^Mname.com/index.html">, the newline causes lots of confusion. The hostname with a newline inside is looked up in the DNS. Much worse, such a URL is put into the list of open URLs with a newline inside. Thus its parts before and after the newline are treated as TWO different URLs by the code that restores queues from files. The "second URL" has no protocol part, is thus invalid, and causes a crash (btw, a segfault ;-) ).

----------------------------------------------------------------------

Comment By: Igor Nekrestyanov (nis)
Date: 2002-01-04 13:28

Message:
Logged In: YES
user_id=91810

Hopefully this problem is now fixed by truncating the broken link during parsing in the HtmlParser code. This should prevent broken URLs from getting into the crawler queues.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=497614&group_id=12104
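
A sketch of the truncation fix the comment describes, cutting a parsed link at the first control character; the function name is illustrative:

    #include <ctype.h>

    /* Cut a link such as "www.host\nname.com/index.html" at the first
     * control character (\n, \r, ...) so no broken URL reaches the
     * queues. */
    static void truncate_broken_link(char *url)
    {
        char *p;
        for (p = url; *p != '\0'; p++) {
            if (iscntrl((unsigned char)*p)) {
                *p = '\0';
                break;
            }
        }
    }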
From: <no...@so...> - 2002-01-30 17:44:28

Feature Requests item #510806, was opened at 2002-01-30 09:44
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=362104&aid=510806&group_id=12104

Category: crawler
Group: None
Status: Open
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: open URL queue gets too long

Initial Comment:
When the crawler runs for extended periods of time (as it normally should), the URL queue gets extremely long. Most entries have expected scores of 0 or extremely close to 0; in fact, they can be treated as random links. They are very unlikely to ever be retrieved. However, they are manipulated in the same way as the ones near the top of the queue, which takes CPU cycles and time. Is there some reasonable way to truncate the queue of open URLs?

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=362104&aid=510806&group_id=12104
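
One possible answer to the closing question, sketched under assumed types: keep the queue sorted by descending score, cap its length, and drop the near-zero tail. The struct, threshold, and function are hypothetical, not crawler code:

    #include <stddef.h>

    struct url_entry { char *url; double score; };

    /* q is assumed sorted by descending score; returns the new length.
     * Entries past max_len or scoring below min_score are dropped. */
    static size_t truncate_queue(struct url_entry *q, size_t len,
                                 size_t max_len, double min_score)
    {
        size_t n = len < max_len ? len : max_len;
        while (n > 0 && q[n - 1].score < min_score)
            n--;                        /* drop the near-random tail */
        return n;
    }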
From: <no...@so...> - 2002-01-04 21:29:54

Bugs item #490294, was opened at 2001-12-07 08:55
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104

Category: crawler
Group: None
>Status: Closed
>Resolution: Wont Fix
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: default ports are shown in URLs

Initial Comment:
crawler_normalize_url() does not check if the port is the default for the given protocol (e.g., 80 for http). The displayed URLs look like http://slashdot.org:80/index.html, which is too strange.

----------------------------------------------------------------------

>Comment By: Igor Nekrestyanov (nis)
Date: 2002-01-04 13:29

Message:
Logged In: YES
user_id=91810

Closing as discussed.

----------------------------------------------------------------------

Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-12-20 03:04

Message:
Logged In: YES
user_id=82947

Are there any places in the code where a normalised URL is string-compared with a non-normalised one? If so, it is wrong; both should be normalised before comparison. Normalisation should be _consistent_, i.e., _always_ normalise or discard the default port. Discarding is better since the URL looks sane.

----------------------------------------------------------------------

Comment By: Igor Nekrestyanov (nis)
Date: 2001-12-07 09:53

Message:
Logged In: YES
user_id=91810

This is absolutely correct and was done on purpose - otherwise an explicitly specified port 80 would cause identity problems. It is better to use a completely uniform representation to store URLs internally. If this form of URL is too strange for the user, then it is the responsibility of the tool that displays the URL to remove/mask the default port. Propose to close this as INVALID.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104
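
A sketch of the display-side masking the discussion settles on: keep the uniform ":80" form internally, strip it only when showing the URL. The helper is hypothetical and handles only the plain "http://host:80/path" case:

    #include <string.h>

    /* Writes "http://host/path" to out given "http://host:80/path";
     * out must be at least as large as url. */
    static void display_url(const char *url, char *out)
    {
        const char *p = strstr(url, ":80/");
        /* The ":80" must sit in the authority part, i.e., immediately
         * before the first slash after the scheme. */
        if (strncmp(url, "http://", 7) == 0 && p != NULL &&
            strchr(url + 7, '/') == p + 3) {
            size_t n = (size_t)(p - url);
            memcpy(out, url, n);
            strcpy(out + n, p + 3);     /* skip over the ":80" */
        } else {
            strcpy(out, url);
        }
    }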
From: <no...@so...> - 2002-01-04 21:29:53

Bugs item #490294, was opened at 2001-12-07 08:55
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104

Category: crawler
Group: None
Status: Closed
Resolution: Wont Fix
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
>Assigned to: Igor Nekrestyanov (nis)
Summary: default ports are shown in URLs

Initial Comment:
crawler_normalize_url() does not check if the port is the default for the given protocol (e.g., 80 for http). The displayed URLs look like http://slashdot.org:80/index.html, which is too strange.

----------------------------------------------------------------------

>Comment By: Igor Nekrestyanov (nis)
Date: 2002-01-04 13:29

Message:
Logged In: YES
user_id=91810

Closing as discussed.

----------------------------------------------------------------------

Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-12-20 03:04

Message:
Logged In: YES
user_id=82947

Are there any places in the code where a normalised URL is string-compared with a non-normalised one? If so, it is wrong; both should be normalised before comparison. Normalisation should be _consistent_, i.e., _always_ normalise or discard the default port. Discarding is better since the URL looks sane.

----------------------------------------------------------------------

Comment By: Igor Nekrestyanov (nis)
Date: 2001-12-07 09:53

Message:
Logged In: YES
user_id=91810

This is absolutely correct and was done on purpose - otherwise an explicitly specified port 80 would cause identity problems. It is better to use a completely uniform representation to store URLs internally. If this form of URL is too strange for the user, then it is the responsibility of the tool that displays the URL to remove/mask the default port. Propose to close this as INVALID.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104
From: <no...@so...> - 2002-01-04 21:28:29

Bugs item #497614, was opened at 2001-12-29 07:22
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=497614&group_id=12104

Category: crawler
Group: None
Status: Open
>Resolution: Fixed
Priority: 7
Submitted By: Mikhail Bessonov (mbessonov)
>Assigned to: Igor Nekrestyanov (nis)
Summary: newline in URL causes confusion

Initial Comment:
When a newline occurs in a hyperlink, like <a href="www.host^Mname.com/index.html">, the newline causes lots of confusion. The hostname with a newline inside is looked up in the DNS. Much worse, such a URL is put into the list of open URLs with a newline inside. Thus its parts before and after the newline are treated as TWO different URLs by the code that restores queues from files. The "second URL" has no protocol part, is thus invalid, and causes a crash (btw, a segfault ;-) ).

----------------------------------------------------------------------

>Comment By: Igor Nekrestyanov (nis)
Date: 2002-01-04 13:28

Message:
Logged In: YES
user_id=91810

Hopefully this problem is now fixed by truncating the broken link during parsing in the HtmlParser code. This should prevent broken URLs from getting into the crawler queues.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=497614&group_id=12104
From: <no...@so...> - 2002-01-01 16:47:43

Bugs item #464775, was opened at 2001-09-25 05:00
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=464775&group_id=12104

Category: crawler
Group: CRAWLER_1_0
>Status: Closed
Resolution: Fixed
Priority: 7
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Mikhail Bessonov (mbessonov)
Summary: crawler crash on same URL after restart

Initial Comment:
The current implementation, when restarted, reads the open URL queue (i.e., a text file) and fetches documents from the head of the queue. If a particular URL causes a crash, it is read again upon restart, even when it is in the list of visited URLs. This makes automatic restart of a crashed crawler (e.g., by a script) impossible, which is unfortunate. A proper implementation should put the URL in the visited list (or some other persistent storage) _before_ retrieving it. Upon restart the visited URLs must be skipped.

----------------------------------------------------------------------

Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-12-04 10:50

Message:
Logged In: YES
user_id=82947

Supposedly fixed, but not tested yet. The status should be changed to 'closed' when the tests are positive.

----------------------------------------------------------------------

Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-09-25 11:56

Message:
Logged In: YES
user_id=82947

Igor is right, the queues are rewritten after a bunch of URLs is processed. For this reason it is hard to figure out which particular URL has caused the crash. This URL must be clearly reported and then investigated. However, this should not stop the crawler process. The most convenient way of running a crawler is nightly. It is too inefficient to lose the whole night because of several problematic URLs.

----------------------------------------------------------------------

Comment By: Igor Nekrestyanov (nis)
Date: 2001-09-25 05:19

Message:
Logged In: YES
user_id=91810

Also, please post URLs that caused crawler failures as separate bugs. They need to be investigated. The actual problem may be caused by the whole crawler state at that point; then skipping to the next URL may actually not be helpful. BTW, as far as I remember, the crawler actually saves the visited/open queues not after every document visited - so we always restore from some state in the past, and the problematic document may not be on top of the queue! I guess we should at least store the current position after every document retrieval, but this needs more investigation.

----------------------------------------------------------------------

Comment By: Igor Nekrestyanov (nis)
Date: 2001-09-25 05:15

Message:
Logged In: YES
user_id=91810

I disagree with the proposed solution. If the crash happens not because of a crawler bug (say, because the computer was shut down or the crawler was simply killed), then the proposed solution (in fact a workaround :) will result in an inconsistent state. I guess the better approach is to have a script do several retries, and if the failure always happens on the same URL, then start from the next URL in the queue. (For this we can extend the crawler command line parameters to allow starting not from the first URL in the state, and we probably need a helper utility that reports what the first URL in a given saved crawler state is.)

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=464775&group_id=12104
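
A sketch of the fix the report asks for, recording a URL on the visited list on disk before fetching it; the file name and helper are illustrative. Once fclose() returns, the line has reached the OS, so it survives a segfault of the crawler process and the URL is skipped after restart:

    #include <stdio.h>

    static int mark_visited_before_fetch(const char *url)
    {
        FILE *f = fopen("visited.urls", "a");   /* hypothetical path */
        if (f == NULL)
            return -1;
        fprintf(f, "%s\n", url);
        fclose(f);                      /* flushed to the OS before the
                                         * fetch is even attempted */
        return 0;
    }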
From: <no...@so...> - 2002-01-01 16:44:26

Bugs item #489952, was opened at 2001-12-06 10:56
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=489952&group_id=12104

Category: crawler
Group: CRAWLER_1_0
>Status: Closed
Resolution: Fixed
Priority: 9
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Mikhail Bessonov (mbessonov)
Summary: crawler exits on sighup

Initial Comment:
The crawler exits on SIGHUP from the controlling terminal even when in daemon mode, i.e., when it has nothing to do with the terminal anymore.

----------------------------------------------------------------------

Comment By: Mikhail Bessonov (mbessonov)
Date: 2002-01-01 08:42

Message:
Logged In: YES
user_id=82947

The patch works for me; please re-activate if you encounter a problem.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=489952&group_id=12104
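
The patch itself is not quoted in the report; as a sketch of the usual fix, a daemon forks, starts a new session with setsid() to drop the controlling terminal, and ignores SIGHUP:

    #include <signal.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void become_daemon(void)
    {
        pid_t pid = fork();
        if (pid < 0)
            exit(1);                    /* fork failed */
        if (pid > 0)
            exit(0);                    /* parent exits */
        setsid();                       /* child: new session, no
                                         * controlling terminal */
        signal(SIGHUP, SIG_IGN);        /* and ignore terminal hangup */
    }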
From: <no...@so...> - 2002-01-01 16:42:52

Bugs item #489952, was opened at 2001-12-06 10:56
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=489952&group_id=12104

Category: crawler
Group: CRAWLER_1_0
Status: Open
>Resolution: Fixed
Priority: 9
Submitted By: Mikhail Bessonov (mbessonov)
>Assigned to: Mikhail Bessonov (mbessonov)
Summary: crawler exits on sighup

Initial Comment:
The crawler exits on SIGHUP from the controlling terminal even when in daemon mode, i.e., when it has nothing to do with the terminal anymore.

----------------------------------------------------------------------

>Comment By: Mikhail Bessonov (mbessonov)
Date: 2002-01-01 08:42

Message:
Logged In: YES
user_id=82947

The patch works for me; please re-activate if you encounter a problem.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=489952&group_id=12104
From: <no...@so...> - 2001-12-29 15:22:43

Bugs item #497614, was opened at 2001-12-29 07:22
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=497614&group_id=12104

Category: crawler
Group: None
Status: Open
Resolution: None
Priority: 7
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: newline in URL causes confusion

Initial Comment:
When a newline occurs in a hyperlink, like <a href="www.host^Mname.com/index.html">, the newline causes lots of confusion. The hostname with a newline inside is looked up in the DNS. Much worse, such a URL is put into the list of open URLs with a newline inside. Thus its parts before and after the newline are treated as TWO different URLs by the code that restores queues from files. The "second URL" has no protocol part, is thus invalid, and causes a crash (btw, a segfault ;-) ).

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=497614&group_id=12104
From: <no...@so...> - 2001-12-20 11:04:27

Bugs item #490294, was opened at 2001-12-07 08:55
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104

Category: crawler
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: default ports are shown in URLs

Initial Comment:
crawler_normalize_url() does not check if the port is the default for the given protocol (e.g., 80 for http). The displayed URLs look like http://slashdot.org:80/index.html, which is too strange.

----------------------------------------------------------------------

>Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-12-20 03:04

Message:
Logged In: YES
user_id=82947

Are there any places in the code where a normalised URL is string-compared with a non-normalised one? If so, it is wrong; both should be normalised before comparison. Normalisation should be _consistent_, i.e., _always_ normalise or discard the default port. Discarding is better since the URL looks sane.

----------------------------------------------------------------------

Comment By: Igor Nekrestyanov (nis)
Date: 2001-12-07 09:53

Message:
Logged In: YES
user_id=91810

This is absolutely correct and was done on purpose - otherwise an explicitly specified port 80 would cause identity problems. It is better to use a completely uniform representation to store URLs internally. If this form of URL is too strange for the user, then it is the responsibility of the tool that displays the URL to remove/mask the default port. Propose to close this as INVALID.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104
From: <no...@so...> - 2001-12-11 10:51:39

Bugs item #490030, was opened at 2001-12-06 13:48
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490030&group_id=12104

Category: crawler
Group: CRAWLER_1_0
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
>Assigned to: Mikhail Bessonov (mbessonov)
Summary: crawler treats mailto: urls as relative

Initial Comment:
mailto:whoever@whereever becomes http://<basename>mailto:whoever@whereever

----------------------------------------------------------------------

>Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-12-11 02:51

Message:
Logged In: YES
user_id=82947

Seems to work fine; please re-raise if the bug shows up again.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490030&group_id=12104
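
A sketch of the scheme test behind such a fix: a link is resolved against the base URL only when it has no scheme of its own, so mailto: (and news:, ftp:, ...) links are never treated as relative. The helper follows the RFC 2396 scheme grammar and is illustrative:

    #include <ctype.h>

    /* RFC 2396: scheme = alpha *( alpha | digit | "+" | "-" | "." ) ":" */
    static int has_scheme(const char *link)
    {
        const char *p = link;
        if (!isalpha((unsigned char)*p))
            return 0;
        while (isalnum((unsigned char)*p) ||
               *p == '+' || *p == '-' || *p == '.')
            p++;
        return *p == ':';
    }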
From: <no...@so...> - 2001-12-07 17:53:22

Bugs item #490294, was opened at 2001-12-07 08:55
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104

Category: crawler
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: default ports are shown in URLs

Initial Comment:
crawler_normalize_url() does not check if the port is the default for the given protocol (e.g., 80 for http). The displayed URLs look like http://slashdot.org:80/index.html, which is too strange.

----------------------------------------------------------------------

>Comment By: Igor Nekrestyanov (nis)
Date: 2001-12-07 09:53

Message:
Logged In: YES
user_id=91810

This is absolutely correct and was done on purpose - otherwise an explicitly specified port 80 would cause identity problems. It is better to use a completely uniform representation to store URLs internally. If this form of URL is too strange for the user, then it is the responsibility of the tool that displays the URL to remove/mask the default port. Propose to close this as INVALID.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104
From: <no...@so...> - 2001-12-07 16:56:22

Bugs item #490294, was opened at 2001-12-07 08:55
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104

Category: crawler
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
>Summary: default ports are shown in URLs

Initial Comment:
crawler_normalize_url() does not check if the port is the default for the given protocol (e.g., 80 for http). The displayed URLs look like http://slashdot.org:80/index.html, which is too strange.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104
From: <no...@so...> - 2001-12-07 16:55:27

Bugs item #490294, was opened at 2001-12-07 08:55
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104

Category: crawler
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: default ports are shown URLs

Initial Comment:
crawler_normalize_url() does not check if the port is the default for the given protocol (e.g., 80 for http). The displayed URLs look like http://slashdot.org:80/index.html, which is too strange.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104
From: <no...@so...> - 2001-12-06 21:48:31

Bugs item #490030, was opened at 2001-12-06 13:48
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490030&group_id=12104

Category: crawler
Group: CRAWLER_1_0
Status: Open
Resolution: None
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: crawler treats mailto: urls as relative

Initial Comment:
mailto:whoever@whereever becomes http://<basename>mailto:whoever@whereever

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490030&group_id=12104
From: <no...@so...> - 2001-12-06 21:45:17

Bugs item #490027, was opened at 2001-12-06 13:45
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490027&group_id=12104

Category: crawler
Group: CRAWLER_1_0
Status: Open
Resolution: None
Priority: 7
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: crawler: connect() waits forever

Initial Comment:
A blocking call is used for connect(). This is a bad idea, since the waiting time is not specified and in fact equals several minutes. I don't know how to influence the timeout; I haven't seen any appropriate option in setsockopt(). If anyone knows, please tell me. The use of a non-blocking call and select() with a timeout seems safe.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490027&group_id=12104
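
A sketch of the non-blocking connect() with a select() timeout that the report proposes; error handling and restoring the socket flags are trimmed for brevity:

    #include <errno.h>
    #include <fcntl.h>
    #include <sys/select.h>
    #include <sys/socket.h>

    /* Returns 0 on success, -1 on error or timeout. */
    static int connect_timeout(int fd, const struct sockaddr *addr,
                               socklen_t len, int seconds)
    {
        fd_set wfds;
        struct timeval tv;
        int err = 0;
        socklen_t elen = sizeof(err);
        int flags = fcntl(fd, F_GETFL, 0);

        fcntl(fd, F_SETFL, flags | O_NONBLOCK);
        if (connect(fd, addr, len) == 0)
            return 0;                   /* connected immediately */
        if (errno != EINPROGRESS)
            return -1;

        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        tv.tv_sec = seconds;
        tv.tv_usec = 0;
        if (select(fd + 1, NULL, &wfds, NULL, &tv) <= 0)
            return -1;                  /* timeout or select error */

        /* The socket becomes writable on success and on failure alike;
         * the real result is in SO_ERROR. */
        getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &elen);
        return err == 0 ? 0 : -1;
    }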
From: <no...@so...> - 2001-12-06 18:59:45

Bugs item #489952, was opened at 2001-12-06 10:56
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=489952&group_id=12104

Category: crawler
Group: CRAWLER_1_0
Status: Open
Resolution: None
Priority: 9
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: crawler exits on sighup

Initial Comment:
The crawler exits on SIGHUP from the controlling terminal even when in daemon mode, i.e., when it has nothing to do with the terminal anymore.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=489952&group_id=12104