From: Neal C. <nc...@me...> - 2004-02-06 19:00:28
I've just run through your comments, thanks. I had originally commented out
the version number, and I have now separated the indexes out. The error is
still occurring :(

>>Attempt to free unreferenced scalar at ./indexer.pl line 253
>>Segmentation fault

ls -la gives the file size as 264 bytes. D'you reckon this is the problem, or
a red herring? Can I get further debug info for you?

Cheers,
Neal

-----Original Message-----
From: spr...@li... [mailto:spr...@li...] On Behalf Of Eric Anderson
Sent: 06 February 2004 2:06 PM
To: Neal Chant
Cc: spr...@li...; MSupport
Subject: Re: [Sprawler-general] Error - "Unknown char % in body"

Neal Chant wrote:
> Hi Eric,
>
> Thanks for the reply.
> Yes, it gives the same error for more than one file, originally tried with
> 2400 files.
> My test .htm is attached - very simple created in Dreamweaver.

Ok - I've run it through my test indexer, and it works ok, so it must be
something simple :)

One thing I had to do is edit the $version line in the indexer, since it
looks like cvs mangled it. Here's what it looks like now (line 42 in
indexer.pl):

$version = qq~$Revision: 1.2\$ $Name\$ ~;

Most likely I'll have to remove that line altogether, but you can replace it
for now, until rev 1.1 of sprawler-lite comes out.

>>Seems to have a problem going through the htm file "body Unknown char % in
>>body: 0"

I forgot to mention that this isn't an error - it's status info. It means
"I found x percent unknown (non-roman) characters in the body" - it helps
with language detection. The zero means it found only roman/english chars
(0% of chars were unknown).

>>
>>index path: /data2/IT/CONTRACTS/
>>document paths: /data2/IT/CONTRACTS/
[..snip..]

Ok, it's a good idea to separate the index store area from the indexable
data area, so I would do something like:

index path: /data2/IT/CONTRACTS/INDEXES/
document paths: /data2/IT/CONTRACTS/

and just make sure you have made the INDEXES directory and that it is, of
course, writable by you.

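The advice above (keep the index store apart from the documents, and make sure the index directory exists and is writable) can be checked before running the indexer. The following is a minimal sketch in Python, not part of sprawler itself; the function name `prepare_index_dir` and the demo paths are hypothetical, and the demo runs in a throwaway temporary directory rather than under /data2.

```python
import os
import tempfile
from pathlib import Path


def prepare_index_dir(index_dir: Path) -> Path:
    """Create index_dir if it is missing and confirm the current user can write to it.

    Hypothetical helper for illustration; sprawler does not ship this function.
    """
    index_dir.mkdir(parents=True, exist_ok=True)
    if not os.access(index_dir, os.W_OK):
        raise PermissionError(f"{index_dir} is not writable")
    return index_dir


if __name__ == "__main__":
    # Demo layout (assumed): documents in CONTRACTS/, indexes in CONTRACTS/INDEXES/.
    with tempfile.TemporaryDirectory() as tmp:
        docs_dir = Path(tmp) / "CONTRACTS"
        index_dir = prepare_index_dir(docs_dir / "INDEXES")
        print(f"index dir ready: {index_dir.is_dir()}")
```

If the check raises PermissionError, fixing the directory's ownership or mode before indexing is likely simpler than debugging a failure part-way through a run.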
>>Loading stopwords list from /data2/IT/CONTRACTS/stopwords.czech.txt
>>Loading stopwords list from /data2/IT/CONTRACTS/stopwords.danish.txt
>>Loading stopwords list from /data2/IT/CONTRACTS/stopwords.dutch.txt
>>Loading stopwords list from /data2/IT/CONTRACTS/stopwords.english.txt
>>Loading stopwords list from /data2/IT/CONTRACTS/stopwords.french.txt
>>Loading stopwords list from /data2/IT/CONTRACTS/stopwords.german.txt
>>Loading stopwords list from /data2/IT/CONTRACTS/stopwords.hungarian.txt
>>Loading stopwords list from /data2/IT/CONTRACTS/stopwords.italian.txt
>>Loading stopwords list from /data2/IT/CONTRACTS/stopwords.norwegian.txt
>>Loading stopwords list from /data2/IT/CONTRACTS/stopwords.polish.txt
>>Loading stopwords list from /data2/IT/CONTRACTS/stopwords.portugese.txt
>>Loading stopwords list from /data2/IT/CONTRACTS/stopwords.spanish.txt
>>Loading stopwords list from /data2/IT/CONTRACTS/stopwords.turkish.txt
>>Begin indexing documents
>>One # = 0 documents
>>0% 50% 100%
>>[Indexing (1/1) /data2/IT/CONTRACTS/Untitled-2.htm at 1075807682
>>Title: test htm
>>Filesize: 264

Here's something strange - my indexer reports the Filesize as 251. Where'd
those other 13 bytes go? If I do an ls -al on the file:

-rw-r--r-- 1 anderson wheel 251 Feb 6 07:48 Untitled-2.htm

I see that it truly is 251 bytes. What does yours say?

>>Has 8 words - 8 total document words
>>checking words in document and removing stopwords
>>Unknown char % in body: 0
>>Language Selection: unknown ->> 0 / Charratio: 0 - Reason: () stage 3
>>Attempt to free unreferenced scalar at ./indexer.pl line 253
>>Segmentation fault

Here's what it should look like:

index path: /indexes/
document paths: /tmp/sprawler/
url locations: doc1/
reindex interval (mins): 1440
indexable extensions: html htm
known languages: czech danish dutch english french german hungarian italian norwegian polish portugese spanish turkish
Building index list..
/usr/bin/find /tmp/sprawler/ -iname '*.html' -print -fstype local -type f
/usr/bin/find /tmp/sprawler/ -iname '*.htm' -print -fstype local -type f
Successfully added 1 documents to queue
Loading stopwords list from /indexes/stopwords.czech.txt
Loading stopwords list from /indexes/stopwords.danish.txt
Loading stopwords list from /indexes/stopwords.dutch.txt
Loading stopwords list from /indexes/stopwords.english.txt
Loading stopwords list from /indexes/stopwords.french.txt
Loading stopwords list from /indexes/stopwords.german.txt
Loading stopwords list from /indexes/stopwords.hungarian.txt
Loading stopwords list from /indexes/stopwords.italian.txt
Loading stopwords list from /indexes/stopwords.norwegian.txt
Loading stopwords list from /indexes/stopwords.polish.txt
Loading stopwords list from /indexes/stopwords.portugese.txt
Loading stopwords list from /indexes/stopwords.spanish.txt
Loading stopwords list from /indexes/stopwords.turkish.txt
Begin indexing documents
One # = 0 documents
0% 50% 100%
[Indexing (1/1) /tmp/sprawler/Untitled-2.htm at 1076076065
Title: test htm
Filesize: 251
Has 8 words - 8 total document words
checking words in document and removing stopwords
Unknown char % in body: 0
Language Selection: unknown ->> 0 / Charratio: 0 - Reason: () stage 3
#(9 total words to be saved)
Flushing word data to disk...
Done with /indexes/doc.db-new file
Making directory: /indexes/words-new/ .. DONE!
--------------------------------------------
] DONE!
Total documents indexed: 1
Total bytes indexed: 251
Total words indexed: 8
Total words in index: -1

Eric

--
------------------------------------------------------------------
Eric Anderson Sr. Systems Administrator Centaur Technology
Today is the tomorrow you worried about yesterday.
------------------------------------------------------------------

_______________________________________________
Sprawler-general mailing list
Spr...@li...
https://lists.sourceforge.net/lists/listinfo/sprawler-general
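The thread does not establish where the 13-byte difference between the two copies of Untitled-2.htm (264 versus 251 bytes) comes from. One possibility, which is purely an assumption here and not confirmed by either poster, is a line-ending difference (CRLF on one machine, LF on the other), since that adds one byte per line. The sketch below is a small Python diagnostic, separate from indexer.pl, that reports a file's size alongside its CR and LF byte counts so the two copies can be compared; the function name `size_report` and the demo file are hypothetical.

```python
import os
import tempfile


def size_report(path: str) -> dict:
    """Return the on-disk size plus CR and LF byte counts for one file.

    Hypothetical diagnostic helper; not part of sprawler.
    """
    # Binary mode so that no newline translation hides carriage returns.
    with open(path, "rb") as fh:
        data = fh.read()
    return {
        "stat_size": os.path.getsize(path),
        "bytes_read": len(data),
        "cr_count": data.count(b"\r"),
        "lf_count": data.count(b"\n"),
    }


if __name__ == "__main__":
    # Demo on a throwaway file with CRLF line endings (made-up content).
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.htm")
        with open(demo, "wb") as fh:
            fh.write(b"<html>\r\n<body>test</body>\r\n</html>\r\n")
        print(size_report(demo))
```

Running this on both copies of the file would show whether the larger one contains carriage returns that the smaller one lacks. If the CR counts match, the difference lies elsewhere and the line-ending guess can be dropped.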