From: <no...@so...> - 2002-03-04 15:44:22

Bugs item #467941, was opened at 2001-10-04 20:42
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=467941&group_id=12104

Category: None
Group: None
Status: Open
Resolution: None
Priority: 3
Submitted By: Alex Babanin (babanin)
Assigned to: Nobody/Anonymous (nobody)
Summary: raw GIF data in the crawler log

Initial Comment:
Getting the subject, see below. I did not see it before the creation of the exclusion file:

D * * * * A * www.todmaffin.com * *

Still, it may be a coincidence.

Cheers,
Alex

18:18:08 OASIS_CRAWLER Connected to www.todmaffin.com!
18:18:08 OASIS_CRAWLER Sending command [GET /speaking/autodealers.gif HTTP/1.0
Proxy-Connection: Keep-Alive
User-Agent: Mozilla/4.04 [en] (X11; I; SunOS 5.6 i86pc)
Pragma: no-cache
HOST: www.todmaffin.com
Accept: */*
Accept-Language: en; ru
Accept-Charset: *
] ...
18:18:08 OASIS_CRAWLER 221 bytes were send of 221
18:18:08 OASIS_CRAWLER Continue reading: 0(+1448) of 11018; attempts: 0
18:18:08 OASIS_CRAWLER Continue reading: 1448(+1448) of 11018; attempts: 0
18:18:09 OASIS_CRAWLER Continue reading: 2896(+1448) of 11018; attempts: 0
18:18:09 OASIS_CRAWLER Continue reading: 4344(+1448) of 11018; attempts: 0
18:18:09 OASIS_CRAWLER Continue reading: 5792(+1448) of 11018; attempts: 0
18:18:09 OASIS_CRAWLER Continue reading: 7240(+1448) of 11018; attempts: 0
...
18:18:09 OASIS_CRAWLER Total recieved: 11018
18:18:09 OASIS_CRAWLER ===== PAGE CONTENT ====
GIF89a... [raw binary GIF data]

----------------------------------------------------------------------

>Comment By: Igor Nekrestyanov (nis)
Date: 2002-03-04 18:38

Message:
Logged In: YES
user_id=91810

I do not think that putting raw data into the log file is really a big problem. In fact, this is exactly the expected behavior: the content of the fetched page is placed in the log. We could probably add "pretty printing" here, but that is definitely a new feature.

Moreover, this time your exclusion file permits fetching gif files from this site. This can easily be avoided by adding a deny rule for gif files from any site. However, I agree it would be nice to implement support for "accepted/rejected" MIME types.

----------------------------------------------------------------------

Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-10-07 00:26

Message:
Logged In: YES
user_id=82947

I see several different issues here.

1. The MIME type returned by the server is totally ignored. In fact, loads of things can be done as filters for certain types (e.g., conversion of PDF to text).

2. Detection of binary files even when they are reported as text due to misconfiguration of a Web server. It is fairly easy to filter out files like the previously provided example. This comes hand in hand with charset and language detection.

I suggest making a separate feature request out of the first issue and leaving the second one as a bug.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=467941&group_id=12104
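
A minimal sketch of the "accepted/rejected" MIME-type support suggested above, checking the Content-Type response header against a reject list. The names rejected_types and mime_type_rejected are illustrative, not part of the crawler:

    /* Sketch only: a deny list for MIME types; names are hypothetical. */
    #include <string.h>
    #include <strings.h>

    static const char *rejected_types[] = { "image/gif", "image/jpeg", NULL };

    /* Return 1 if a response with this Content-Type should be skipped. */
    static int mime_type_rejected(const char *content_type)
    {
        int i;
        for (i = 0; rejected_types[i] != NULL; i++) {
            size_t n = strlen(rejected_types[i]);
            /* Match the type prefix, so "image/gif; charset=..." counts. */
            if (strncasecmp(content_type, rejected_types[i], n) == 0)
                return 1;
        }
        return 0;
    }

The same check, run on the header before the body is read, would also keep raw image bytes out of the log.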
From: <no...@so...> - 2002-03-04 15:29:29

Bugs item #522597, was opened at 2002-02-25 22:24
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522597&group_id=12104

Category: system-wide
Group: None
Status: Open
Resolution: None
Priority: 3
Submitted By: Mikhail Bessonov (mbessonov)
>Assigned to: Ekaterina Pavlova (katyapav)
Summary: --enable-isearch-as-lib breaks crawler

Initial Comment:
When --enable-isearch-as-lib=<path> is specified for the configure script, the crawler build fails:

/usr/bin/gcc -o crawler main.o libcrawler.a \
../common/QueryProtocol/lib-common-qp.a \
../common/HtmlParser/lib-common-parser.a \
../common/profiles/lib-common-profiles.a \
../common/stemmers/lib-common-stem.a \
../common/liboasisse.a -L/usr/local/lib -lilu-c -lilu \
-lpthread -lz -ltucs -lIsearch -lstdc++ -lm \
-lcrypt -L/usr/local/lib -lrainbow -lbow
/usr/bin/ld: cannot find -lIsearch

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522597&group_id=12104
From: <no...@so...> - 2002-03-04 15:29:05

Bugs item #522545, was opened at 2002-02-25 20:52
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522545&group_id=12104

Category: crawler
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 3
Submitted By: Mikhail Bessonov (mbessonov)
>Assigned to: Igor Nekrestyanov (nis)
Summary: some libbow headers not found

Initial Comment:
The configure script looks for kl.h and friends under $with_libbow_home/include. The default 'make install' from libbow, however, copies only libbow.h there. So either the patch for the libbow makefile should copy the other headers to $with_libbow_home/include, or the configure script should look in the source directory of libbow; just $with_libbow_home is OK.

----------------------------------------------------------------------

>Comment By: Igor Nekrestyanov (nis)
Date: 2002-03-04 18:28

Message:
Logged In: YES
user_id=91810

I have changed the recommended patch (see crawler/README) to copy the missing header files. Also, it now patches Makefile.in instead of the generated Makefile itself :)

Hopefully this problem is gone now. Please reopen if not.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522545&group_id=12104
From: <no...@so...> - 2002-02-28 22:43:07

Bugs item #524079, was opened at 2002-02-28 23:43
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=524079&group_id=12104

Category: crawler
Group: None
Status: Open
Resolution: None
Priority: 9
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: crawler does not recover after segfault

Initial Comment:
The new filter subsystem does not survive the second fork(). The crawler used to fork a new harvester when the old one died for some reason (e.g., weird HTML in the downloaded page). Now the last screams in the log are:

Selected [bow_filter] filtering method as defaut
bow_words_read_from_fp: The vocabulary map has already been created.
22:07:21 OASIS_CRAWLER Parent got signal SIGABRT

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=524079&group_id=12104
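
For context, a sketch of the fork-and-respawn pattern the report describes (the crawler forking a new harvester when the old one dies). run_harvester() is a hypothetical stand-in, not crawler code; the comment marks where inherited filter state would have to be reinitialized to avoid the abort seen in the log above:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern void run_harvester(void);    /* hypothetical worker */

    int main(void)
    {
        int status;
        for (;;) {
            pid_t pid = fork();
            if (pid < 0) { perror("fork"); exit(1); }
            if (pid == 0) {
                /* Child: global filter state (here, the libbow vocabulary)
                 * must be (re)initialized, not inherited, or the second
                 * child can abort as in the log above. */
                run_harvester();
                _exit(0);
            }
            waitpid(pid, &status, 0);   /* parent waits for the child */
            if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                break;                  /* clean exit: all done */
            fprintf(stderr, "harvester died (status %d), respawning\n",
                    status);
        }
        return 0;
    }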
From: <no...@so...> - 2002-02-25 19:24:42

Bugs item #522597, was opened at 2002-02-25 11:24
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522597&group_id=12104

Category: system-wide
Group: None
Status: Open
Resolution: None
Priority: 3
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: --enable-isearch-as-lib breaks crawler

Initial Comment:
When --enable-isearch-as-lib=<path> is specified for the configure script, the crawler build fails:

/usr/bin/gcc -o crawler main.o libcrawler.a \
../common/QueryProtocol/lib-common-qp.a \
../common/HtmlParser/lib-common-parser.a \
../common/profiles/lib-common-profiles.a \
../common/stemmers/lib-common-stem.a \
../common/liboasisse.a -L/usr/local/lib -lilu-c -lilu \
-lpthread -lz -ltucs -lIsearch -lstdc++ -lm \
-lcrypt -L/usr/local/lib -lrainbow -lbow
/usr/bin/ld: cannot find -lIsearch

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522597&group_id=12104
From: <no...@so...> - 2002-02-25 17:52:43

Bugs item #522545, was opened at 2002-02-25 09:52
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522545&group_id=12104

Category: crawler
Group: None
Status: Open
Resolution: None
Priority: 3
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: some libbow headers not found

Initial Comment:
The configure script looks for kl.h and friends under $with_libbow_home/include. The default 'make install' from libbow, however, copies only libbow.h there. So either the patch for the libbow makefile should copy the other headers to $with_libbow_home/include, or the configure script should look in the source directory of libbow; just $with_libbow_home is OK.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=522545&group_id=12104
From: <no...@so...> - 2002-01-31 15:53:27

Bugs item #486088, was opened at 2001-11-27 09:51
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=486088&group_id=12104

Category: crawler
Group: CRAWLER_1_0
Status: Open
Resolution: None
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
>Assigned to: Mikhail Bessonov (mbessonov)
Summary: ModificationDate attribute incorrect

Initial Comment:
The ModificationDate attribute is always set to the same value, something like '1.10.98'. By the way, what is the correct format? yyyy-mm-dd?

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=486088&group_id=12104
From: <no...@so...> - 2002-01-31 15:51:13

Bugs item #467935, was opened at 2001-10-04 09:34
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=467935&group_id=12104

>Category: crawler
Group: None
>Status: Closed
>Resolution: Duplicate
Priority: 9
Submitted By: Alex Babanin (babanin)
Assigned to: Nobody/Anonymous (nobody)
Summary: Failed to resolve hostname www.yahoo.com

Initial Comment:
Hi! There is some problem (probably because of a timeout) with resolving hosts. I am getting repeating messages in the log like:

].DERR: 16:34:34 OASIS_CRAWLER Failed to resolve hostname [www.yahoo.com
].DERR: 16:34:34 OASIS_CRAWLER Failed to resolve hostname [www.yahoo.com
].DERR: 16:34:34 OASIS_CRAWLER Failed to resolve hostname [www.yahoo.com
].DERR: 16:43:52 OASIS_CRAWLER Failed to resolve hostname [www.pursuitwatch.com
].DERR: 16:43:52 OASIS_CRAWLER Failed to resolve hostname [www.pursuitwatch.com
].DERR: 16:43:52 OASIS_CRAWLER Failed to resolve hostname [www.pursuitwatch.com
].DERR: 16:44:09 OASIS_CRAWLER Failed to resolve hostname [www.wirelessint.com
].DERR: 16:44:09 OASIS_CRAWLER Failed to resolve hostname [www.wirelessint.com
].DERR: 16:44:09 OASIS_CRAWLER Failed to resolve hostname [www.wirelessint.com
].DERR: 16:44:10 OASIS_CRAWLER Failed to resolve hostname [www.rcrnews.com
].DERR: 16:44:10 OASIS_CRAWLER Failed to resolve hostname [www.rcrnews.com
].DERR: 16:44:10 OASIS_CRAWLER Failed to resolve hostname [www.rcrnews.com
].DERR: 16:44:18 OASIS_CRAWLER Failed to resolve hostname [www.cellzone.com.au
].DERR: 16:44:18 OASIS_CRAWLER Failed to resolve hostname [www.cellzone.com.au
].DERR: 16:44:18 OASIS_CRAWLER Failed to resolve hostname [www.cellzone.com.au
STDERR: 16:47:19 OASIS_CRAWLER Failed to resolve hostname [www.wirelessadassociation.org].
STDERR: 16:47:19 OASIS_CRAWLER Failed to resolve hostname [www.wirelessadassociation.org].
STDERR: 16:47:19 OASIS_CRAWLER Failed to resolve hostname [www.wirelessadassociation.org].
STDERR: 16:59:44 OASIS_CRAWLER Failed to open URLs file [/var/state/oasis/OASIS-Crawler/state/open/Temp_URLs.Bayer] in crawler_store_URL_list
STDERR: 16:59:44 OASIS_CRAWLER Failed to open visited_robots_txt file [/var/state/oasis/OASIS-Crawler/state/robots-txt/Temp_Visited_Robots_txt.Bayer] in crawler_store_robots_txt_hash in write mode
STDERR: 17:07:38 OASIS_CRAWLER Failed to resolve hostname [www.tetramou.net].
STDERR: 17:08:53 OASIS_CRAWLER Failed to resolve hostname [www.tetramou.net].
STDERR: 17:10:08 OASIS_CRAWLER Failed to resolve hostname [www.tetramou.net].
...

----------------------------------------------------------------------

>Comment By: Mikhail Bessonov (mbessonov)
Date: 2002-01-31 07:51

Message:
Logged In: YES
user_id=82947

The problem is another instance of the newline (\n) or CR (\r) in domain names problem, bug #497614. It is supposedly fixed, and closed today. All hosts in the bug description except the last one, www.tetramou.net, end in a newline. www.tetramou.net really does not exist.

----------------------------------------------------------------------

Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-10-06 13:42

Message:
Logged In: YES
user_id=82947

gethostbyname() is used to resolve names (which is probably right due to better portability than the <resolv.h> family). However, the value of errno is not checked in order to see if a transient or permanent error occurred. Keeping an open TCP connection to a name server is probably a better idea than using UDP datagrams (as happens now with default resolver configurations). sethostent(1) must be called to switch to TCP, and endhostent() to disconnect.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=467935&group_id=12104
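
A sketch of the retry logic the second comment asks for. One detail to hedge: gethostbyname() reports failures through h_errno rather than errno, with TRY_AGAIN marking transient errors; sethostent(1) and endhostent() switch the resolver to a persistent TCP connection as the comment describes. The function is illustrative, not crawler code:

    #include <netdb.h>
    #include <stddef.h>

    static struct hostent *resolve(const char *name, int max_retries)
    {
        int i;
        sethostent(1);                  /* keep a TCP connection open */
        for (i = 0; i <= max_retries; i++) {
            struct hostent *he = gethostbyname(name);
            if (he != NULL) {
                endhostent();
                return he;
            }
            if (h_errno != TRY_AGAIN)   /* permanent failure: give up */
                break;
        }
        endhostent();                   /* drop the TCP connection */
        return NULL;
    }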
From: <no...@so...> - 2002-01-31 14:00:51

Bugs item #497614, was opened at 2001-12-29 07:22
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=497614&group_id=12104

Category: crawler
Group: None
>Status: Closed
Resolution: Fixed
Priority: 7
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Igor Nekrestyanov (nis)
Summary: newline in URL causes confusion

Initial Comment:
When a newline occurs in a hyperlink, like <a href="www.host^Mname.com/index.html">, the newline causes lots of confusion. The hostname with a newline inside is looked up in the DNS. Much worse, such a URL is put into the list of open URLs with a newline inside. Thus its parts before and after the newline are treated as TWO different URLs by the code that restores queues from files. The "second URL" has no protocol part, is thus invalid, and causes a crash (btw, a segfault ;-) ).

----------------------------------------------------------------------

Comment By: Igor Nekrestyanov (nis)
Date: 2002-01-04 13:28

Message:
Logged In: YES
user_id=91810

Hopefully this problem is now fixed by truncating the broken link during parsing in the HtmlParser code. This should prevent broken URLs from getting into the crawler queues.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=497614&group_id=12104
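
A sketch of the truncation fix the comment describes, cutting a parsed link at the first control character; the function name is illustrative:

    #include <ctype.h>

    /* Cut a link such as "www.host\nname.com/index.html" at the first
     * control character (\n, \r, ...) so no broken URL reaches the
     * queues. */
    static void truncate_broken_link(char *url)
    {
        char *p;
        for (p = url; *p != '\0'; p++) {
            if (iscntrl((unsigned char)*p)) {
                *p = '\0';
                break;
            }
        }
    }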
From: <no...@so...> - 2002-01-30 17:44:28

Feature Requests item #510806, was opened at 2002-01-30 09:44
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=362104&aid=510806&group_id=12104

Category: crawler
Group: None
Status: Open
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: open URL queue gets too long

Initial Comment:
When the crawler runs for extended periods of time (as it normally should), the URL queue gets extremely long. Most entries have expected scores of 0 or extremely close to 0; in fact, they can be treated as random links. They are very unlikely to ever be retrieved. However, they are manipulated in the same way as the ones near the top of the queue, which takes CPU cycles and time. Is there some reasonable way to truncate the queue of open URLs?

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=362104&aid=510806&group_id=12104
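
One possible answer to the closing question, sketched under assumed types: keep the queue sorted by descending score, cap its length, and drop the near-zero tail. The struct, threshold, and function are hypothetical, not crawler code:

    #include <stddef.h>

    struct url_entry { char *url; double score; };

    /* q is assumed sorted by descending score; returns the new length.
     * Entries past max_len or scoring below min_score are dropped. */
    static size_t truncate_queue(struct url_entry *q, size_t len,
                                 size_t max_len, double min_score)
    {
        size_t n = len < max_len ? len : max_len;
        while (n > 0 && q[n - 1].score < min_score)
            n--;                        /* drop the near-random tail */
        return n;
    }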
From: <no...@so...> - 2002-01-04 21:29:54

Bugs item #490294, was opened at 2001-12-07 08:55
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104

Category: crawler
Group: None
>Status: Closed
>Resolution: Wont Fix
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: default ports are shown in URLs

Initial Comment:
crawler_normalize_url() does not check if the port is the default for the given protocol (e.g., 80 for http). The displayed URLs look like http://slashdot.org:80/index.html, which is too strange.

----------------------------------------------------------------------

>Comment By: Igor Nekrestyanov (nis)
Date: 2002-01-04 13:29

Message:
Logged In: YES
user_id=91810

Closing as discussed.

----------------------------------------------------------------------

Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-12-20 03:04

Message:
Logged In: YES
user_id=82947

Are there any places in the code where a normalised URL is string-compared with a non-normalised one? If so, it is wrong; both should be normalised before comparison. Normalisation should be _consistent_, i.e., _always_ normalise or discard the default port. Discarding is better since the URL looks sane.

----------------------------------------------------------------------

Comment By: Igor Nekrestyanov (nis)
Date: 2001-12-07 09:53

Message:
Logged In: YES
user_id=91810

This is absolutely correct and was done on purpose - otherwise an explicitly specified port 80 would cause identity problems. It is better to use a completely uniform representation to store URLs internally. If this form of URL is too strange for the user, then it is the responsibility of the tool that displays the URL to remove/mask the default port. Propose to close this as INVALID.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104
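
A sketch of the display-side masking the discussion settles on: keep the uniform ":80" form internally, strip it only when showing the URL. The helper is hypothetical and handles only the plain "http://host:80/path" case:

    #include <string.h>

    /* Writes "http://host/path" to out given "http://host:80/path";
     * out must be at least as large as url. */
    static void display_url(const char *url, char *out)
    {
        const char *p = strstr(url, ":80/");
        /* The ":80" must sit in the authority part, i.e., immediately
         * before the first slash after the scheme. */
        if (strncmp(url, "http://", 7) == 0 && p != NULL &&
            strchr(url + 7, '/') == p + 3) {
            size_t n = (size_t)(p - url);
            memcpy(out, url, n);
            strcpy(out + n, p + 3);     /* skip over the ":80" */
        } else {
            strcpy(out, url);
        }
    }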
From: <no...@so...> - 2002-01-04 21:29:53

Bugs item #490294, was opened at 2001-12-07 08:55
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104

Category: crawler
Group: None
Status: Closed
Resolution: Wont Fix
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
>Assigned to: Igor Nekrestyanov (nis)
Summary: default ports are shown in URLs

Initial Comment:
crawler_normalize_url() does not check if the port is the default for the given protocol (e.g., 80 for http). The displayed URLs look like http://slashdot.org:80/index.html, which is too strange.

----------------------------------------------------------------------

>Comment By: Igor Nekrestyanov (nis)
Date: 2002-01-04 13:29

Message:
Logged In: YES
user_id=91810

Closing as discussed.

----------------------------------------------------------------------

Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-12-20 03:04

Message:
Logged In: YES
user_id=82947

Are there any places in the code where a normalised URL is string-compared with a non-normalised one? If so, it is wrong; both should be normalised before comparison. Normalisation should be _consistent_, i.e., _always_ normalise or discard the default port. Discarding is better since the URL looks sane.

----------------------------------------------------------------------

Comment By: Igor Nekrestyanov (nis)
Date: 2001-12-07 09:53

Message:
Logged In: YES
user_id=91810

This is absolutely correct and was done on purpose - otherwise an explicitly specified port 80 would cause identity problems. It is better to use a completely uniform representation to store URLs internally. If this form of URL is too strange for the user, then it is the responsibility of the tool that displays the URL to remove/mask the default port. Propose to close this as INVALID.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104
From: <no...@so...> - 2002-01-04 21:28:29

Bugs item #497614, was opened at 2001-12-29 07:22
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=497614&group_id=12104

Category: crawler
Group: None
Status: Open
>Resolution: Fixed
Priority: 7
Submitted By: Mikhail Bessonov (mbessonov)
>Assigned to: Igor Nekrestyanov (nis)
Summary: newline in URL causes confusion

Initial Comment:
When a newline occurs in a hyperlink, like <a href="www.host^Mname.com/index.html">, the newline causes lots of confusion. The hostname with a newline inside is looked up in the DNS. Much worse, such a URL is put into the list of open URLs with a newline inside. Thus its parts before and after the newline are treated as TWO different URLs by the code that restores queues from files. The "second URL" has no protocol part, is thus invalid, and causes a crash (btw, a segfault ;-) ).

----------------------------------------------------------------------

>Comment By: Igor Nekrestyanov (nis)
Date: 2002-01-04 13:28

Message:
Logged In: YES
user_id=91810

Hopefully this problem is now fixed by truncating the broken link during parsing in the HtmlParser code. This should prevent broken URLs from getting into the crawler queues.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=497614&group_id=12104
From: <no...@so...> - 2002-01-01 16:47:43

Bugs item #464775, was opened at 2001-09-25 05:00
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=464775&group_id=12104

Category: crawler
Group: CRAWLER_1_0
>Status: Closed
Resolution: Fixed
Priority: 7
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Mikhail Bessonov (mbessonov)
Summary: crawler crash on same URL after restart

Initial Comment:
The current implementation, when restarted, reads the open URL queue (i.e., a text file) and fetches documents from the head of the queue. If a particular URL causes a crash, it is read again upon restart, even when it is in the list of visited URLs. This makes automatic restart of a crashed crawler (e.g., by a script) impossible, which is unfortunate. A proper implementation should put the URL in the visited list (or some other persistent storage) _before_ retrieving it. Upon restart the visited URLs must be skipped.

----------------------------------------------------------------------

Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-12-04 10:50

Message:
Logged In: YES
user_id=82947

Supposedly fixed, but not tested yet. The status should be changed to 'closed' when the tests are positive.

----------------------------------------------------------------------

Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-09-25 11:56

Message:
Logged In: YES
user_id=82947

Igor is right, the queues are rewritten after a bunch of URLs is processed. For this reason it is hard to figure out which particular URL has caused the crash. This URL must be clearly reported and then investigated. However, this should not stop the crawler process. The most convenient way of running a crawler is nightly. It is too inefficient to lose the whole night because of several problematic URLs.

----------------------------------------------------------------------

Comment By: Igor Nekrestyanov (nis)
Date: 2001-09-25 05:19

Message:
Logged In: YES
user_id=91810

Also, please post URLs that caused crawler failures as separate bugs. They need to be investigated. The actual problem may be caused by the whole crawler state at that point; then skipping to the next URL may actually not be helpful. BTW, as far as I remember, the crawler actually saves the visited/open queues not after every document visited - so we always restore from some state in the past, and the problematic document may not be on top of the queue! I guess we should at least store the current position after every document retrieval, but this needs more investigation.

----------------------------------------------------------------------

Comment By: Igor Nekrestyanov (nis)
Date: 2001-09-25 05:15

Message:
Logged In: YES
user_id=91810

I disagree with the proposed solution. If the crash happens not because of a crawler bug (say, because the computer was shut down or the crawler was simply killed), then the proposed solution (in fact a workaround :) will result in an inconsistent state. I guess the better approach is to have a script do several retries, and if the failure always happens on the same URL, then start from the next URL in the queue. (For this we can extend the crawler command line parameters to allow starting not from the first URL in the state, and we probably need a helper utility that reports what the first URL in a given saved crawler state is.)

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=464775&group_id=12104
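
A sketch of the fix the report asks for, recording a URL on the visited list on disk before fetching it; the file name and helper are illustrative. Once fclose() returns, the line has reached the OS, so it survives a segfault of the crawler process and the URL is skipped after restart:

    #include <stdio.h>

    static int mark_visited_before_fetch(const char *url)
    {
        FILE *f = fopen("visited.urls", "a");   /* hypothetical path */
        if (f == NULL)
            return -1;
        fprintf(f, "%s\n", url);
        fclose(f);                      /* flushed to the OS before the
                                         * fetch is even attempted */
        return 0;
    }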
From: <no...@so...> - 2002-01-01 16:44:26

Bugs item #489952, was opened at 2001-12-06 10:56
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=489952&group_id=12104

Category: crawler
Group: CRAWLER_1_0
>Status: Closed
Resolution: Fixed
Priority: 9
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Mikhail Bessonov (mbessonov)
Summary: crawler exits on sighup

Initial Comment:
The crawler exits on SIGHUP from the controlling terminal even when in daemon mode, i.e., when it has nothing to do with the terminal anymore.

----------------------------------------------------------------------

Comment By: Mikhail Bessonov (mbessonov)
Date: 2002-01-01 08:42

Message:
Logged In: YES
user_id=82947

The patch works for me; please re-activate if you encounter a problem.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=489952&group_id=12104
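
The patch itself is not quoted in the report; as a sketch of the usual fix, a daemon forks, starts a new session with setsid() to drop the controlling terminal, and ignores SIGHUP:

    #include <signal.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void become_daemon(void)
    {
        pid_t pid = fork();
        if (pid < 0)
            exit(1);                    /* fork failed */
        if (pid > 0)
            exit(0);                    /* parent exits */
        setsid();                       /* child: new session, no
                                         * controlling terminal */
        signal(SIGHUP, SIG_IGN);        /* and ignore terminal hangup */
    }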
From: <no...@so...> - 2002-01-01 16:42:52

Bugs item #489952, was opened at 2001-12-06 10:56
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=489952&group_id=12104

Category: crawler
Group: CRAWLER_1_0
Status: Open
>Resolution: Fixed
Priority: 9
Submitted By: Mikhail Bessonov (mbessonov)
>Assigned to: Mikhail Bessonov (mbessonov)
Summary: crawler exits on sighup

Initial Comment:
The crawler exits on SIGHUP from the controlling terminal even when in daemon mode, i.e., when it has nothing to do with the terminal anymore.

----------------------------------------------------------------------

>Comment By: Mikhail Bessonov (mbessonov)
Date: 2002-01-01 08:42

Message:
Logged In: YES
user_id=82947

The patch works for me; please re-activate if you encounter a problem.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=489952&group_id=12104
From: <no...@so...> - 2001-12-29 15:22:43

Bugs item #497614, was opened at 2001-12-29 07:22
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=497614&group_id=12104

Category: crawler
Group: None
Status: Open
Resolution: None
Priority: 7
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: newline in URL causes confusion

Initial Comment:
When a newline occurs in a hyperlink, like <a href="www.host^Mname.com/index.html">, the newline causes lots of confusion. The hostname with a newline inside is looked up in the DNS. Much worse, such a URL is put into the list of open URLs with a newline inside. Thus its parts before and after the newline are treated as TWO different URLs by the code that restores queues from files. The "second URL" has no protocol part, is thus invalid, and causes a crash (btw, a segfault ;-) ).

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=497614&group_id=12104
From: <no...@so...> - 2001-12-20 11:04:27

Bugs item #490294, was opened at 2001-12-07 08:55
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104

Category: crawler
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: default ports are shown in URLs

Initial Comment:
crawler_normalize_url() does not check if the port is the default for the given protocol (e.g., 80 for http). The displayed URLs look like http://slashdot.org:80/index.html, which is too strange.

----------------------------------------------------------------------

>Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-12-20 03:04

Message:
Logged In: YES
user_id=82947

Are there any places in the code where a normalised URL is string-compared with a non-normalised one? If so, it is wrong; both should be normalised before comparison. Normalisation should be _consistent_, i.e., _always_ normalise or discard the default port. Discarding is better since the URL looks sane.

----------------------------------------------------------------------

Comment By: Igor Nekrestyanov (nis)
Date: 2001-12-07 09:53

Message:
Logged In: YES
user_id=91810

This is absolutely correct and was done on purpose - otherwise an explicitly specified port 80 would cause identity problems. It is better to use a completely uniform representation to store URLs internally. If this form of URL is too strange for the user, then it is the responsibility of the tool that displays the URL to remove/mask the default port. Propose to close this as INVALID.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104
From: <no...@so...> - 2001-12-11 10:51:39

Bugs item #490030, was opened at 2001-12-06 13:48
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490030&group_id=12104

Category: crawler
Group: CRAWLER_1_0
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
>Assigned to: Mikhail Bessonov (mbessonov)
Summary: crawler treats mailto: urls as relative

Initial Comment:
mailto:whoever@whereever becomes http://<basename>mailto:whoever@whereever

----------------------------------------------------------------------

>Comment By: Mikhail Bessonov (mbessonov)
Date: 2001-12-11 02:51

Message:
Logged In: YES
user_id=82947

Seems to work fine; please re-raise if the bug shows up again.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490030&group_id=12104
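
A sketch of the scheme test behind such a fix: a link is resolved against the base URL only when it has no scheme of its own, so mailto: (and news:, ftp:, ...) links are never treated as relative. The helper follows the RFC 2396 scheme grammar and is illustrative:

    #include <ctype.h>

    /* RFC 2396: scheme = alpha *( alpha | digit | "+" | "-" | "." ) ":" */
    static int has_scheme(const char *link)
    {
        const char *p = link;
        if (!isalpha((unsigned char)*p))
            return 0;
        while (isalnum((unsigned char)*p) ||
               *p == '+' || *p == '-' || *p == '.')
            p++;
        return *p == ':';
    }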
From: <no...@so...> - 2001-12-07 17:53:22

Bugs item #490294, was opened at 2001-12-07 08:55
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104

Category: crawler
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: default ports are shown in URLs

Initial Comment:
crawler_normalize_url() does not check if the port is the default for the given protocol (e.g., 80 for http). The displayed URLs look like http://slashdot.org:80/index.html, which is too strange.

----------------------------------------------------------------------

>Comment By: Igor Nekrestyanov (nis)
Date: 2001-12-07 09:53

Message:
Logged In: YES
user_id=91810

This is absolutely correct and was done on purpose - otherwise an explicitly specified port 80 would cause identity problems. It is better to use a completely uniform representation to store URLs internally. If this form of URL is too strange for the user, then it is the responsibility of the tool that displays the URL to remove/mask the default port. Propose to close this as INVALID.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104
From: <no...@so...> - 2001-12-07 16:56:22

Bugs item #490294, was opened at 2001-12-07 08:55
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104

Category: crawler
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
>Summary: default ports are shown in URLs

Initial Comment:
crawler_normalize_url() does not check if the port is the default for the given protocol (e.g., 80 for http). The displayed URLs look like http://slashdot.org:80/index.html, which is too strange.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104
From: <no...@so...> - 2001-12-07 16:55:27

Bugs item #490294, was opened at 2001-12-07 08:55
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104

Category: crawler
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: default ports are shown URLs

Initial Comment:
crawler_normalize_url() does not check if the port is the default for the given protocol (e.g., 80 for http). The displayed URLs look like http://slashdot.org:80/index.html, which is too strange.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490294&group_id=12104
From: <no...@so...> - 2001-12-06 21:48:31

Bugs item #490030, was opened at 2001-12-06 13:48
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490030&group_id=12104

Category: crawler
Group: CRAWLER_1_0
Status: Open
Resolution: None
Priority: 5
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: crawler treats mailto: urls as relative

Initial Comment:
mailto:whoever@whereever becomes http://<basename>mailto:whoever@whereever

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490030&group_id=12104
From: <no...@so...> - 2001-12-06 21:45:17

Bugs item #490027, was opened at 2001-12-06 13:45
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490027&group_id=12104

Category: crawler
Group: CRAWLER_1_0
Status: Open
Resolution: None
Priority: 7
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: crawler: connect() waits forever

Initial Comment:
A blocking call is used for connect(). This is a bad idea, since the waiting time is not specified and in fact equals several minutes. I don't know how to influence the timeout; I haven't seen any appropriate option in setsockopt(). If anyone knows, please tell me. The use of a non-blocking call and select() with a timeout seems safe.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=490027&group_id=12104
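
A sketch of the non-blocking connect() with a select() timeout that the report proposes; error handling and restoring the socket flags are trimmed for brevity:

    #include <errno.h>
    #include <fcntl.h>
    #include <sys/select.h>
    #include <sys/socket.h>

    /* Returns 0 on success, -1 on error or timeout. */
    static int connect_timeout(int fd, const struct sockaddr *addr,
                               socklen_t len, int seconds)
    {
        fd_set wfds;
        struct timeval tv;
        int err = 0;
        socklen_t elen = sizeof(err);
        int flags = fcntl(fd, F_GETFL, 0);

        fcntl(fd, F_SETFL, flags | O_NONBLOCK);
        if (connect(fd, addr, len) == 0)
            return 0;                   /* connected immediately */
        if (errno != EINPROGRESS)
            return -1;

        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        tv.tv_sec = seconds;
        tv.tv_usec = 0;
        if (select(fd + 1, NULL, &wfds, NULL, &tv) <= 0)
            return -1;                  /* timeout or select error */

        /* The socket becomes writable on success and on failure alike;
         * the real result is in SO_ERROR. */
        getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &elen);
        return err == 0 ? 0 : -1;
    }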
From: <no...@so...> - 2001-12-06 18:59:45

Bugs item #489952, was opened at 2001-12-06 10:56
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=489952&group_id=12104

Category: crawler
Group: CRAWLER_1_0
Status: Open
Resolution: None
Priority: 9
Submitted By: Mikhail Bessonov (mbessonov)
Assigned to: Nobody/Anonymous (nobody)
Summary: crawler exits on sighup

Initial Comment:
The crawler exits on SIGHUP from the controlling terminal even when in daemon mode, i.e., when it has nothing to do with the terminal anymore.

----------------------------------------------------------------------

You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=112104&aid=489952&group_id=12104