|
From: alexis a. <alx...@ya...> - 2006-03-30 03:48:00
|
Hi,

I ran into a problem while searching WERA with Chinese characters as my query. The search results always contain a WMV file. Initially I thought Nutch was indexing WMV files, but Stack clarified that it does not. Stack suggested that I check anchors.jsp to see whether there are anchors pointing to that file. However, anchors.jsp shows nothing for the file.

I consulted the crawl.log file and found these lines:

2006-03-18T04:10:33.604Z 200 1056898 http://info.channelnewsasia.com/rovingdv/boundaries/shthong.wmv LE http://www.channelnewsasia.com/boundaries/videos.htm text/plain #030 20060318041033072+335 AAYVISCYYWBMBZWKBSFSB7Z6XXAKWC2F 3t
2006-03-18T04:10:36.251Z 200 2700490 http://info.channelnewsasia.com/rovingdv/boundaries/asha.wmv LE http://www.channelnewsasia.com/boundaries/videos.htm text/plain #008 20060318041034946+792 CB44IW2LSWLMD6AWIYWOXZFK3HZ4AABU -
2006-03-18T04:10:38.954Z 200 1393066 http://info.channelnewsasia.com/rovingdv/boundaries/drtan.wmv LE http://www.channelnewsasia.com/boundaries/videos.htm text/plain #001 20060318041038254+436 INDK5OOERVNJXO464GLIIT2EPUDIOXZP -
2006-03-18T04:10:41.199Z 200 891220 http://info.channelnewsasia.com/rovingdv/boundaries/chiam.wmv LE http://www.channelnewsasia.com/boundaries/videos.htm text/plain #042 20060318041040700+334 5T3Q2FSH2BSOCI7WSUZKYBYZ6RXLF74H -

When I looked at the HTML page that references the files, I found this markup for one of them:

<object id="MediaPlayer" classid="CLSID:22d6f312-b0f6-11d0-94ab-0080c74c7e95"
  codebase="http://activex.microsoft.com/activex/controls/mplayer/en/nsmp2inf.cab#Version=5,1,52,701"
  standby="Loading Microsoft Windows Media Player components..."
  type="application/x-oleobject" width="200" height="190" border="0" vspace="0" hspace="0" align="top">
  <param name="FileName" value="http://info.channelnewsasia.com/rovingdv/boundaries/chiam.wmv">
  <param name="AnimationatStart" value="true">
  <param name="TransparentatStart" value="true">
  <param name="AutoStart" value="false">
  <param name="ShowControls" value="1">
  <embed type="application/x-mplayer2"
    pluginspage="http://www.microsoft.com/windows95/downloads/contents/wurecommended/s_wufeatured/mediaplayer/default.asp"
    showcontrols=1 width=200 height=190
    src="http://info.channelnewsasia.com/rovingdv/boundaries/chiam.wmv"
    border="0" vspace="0" hspace="0" align="top">
  </embed>
</object>

It seems to me that Heritrix was not able to categorize the file correctly (the crawl.log lines above record text/plain for the .wmv files), and that is why it gets indexed. However, Stack mentioned that Nutch does some further testing to check whether a file is indeed text/html.

I hope you guys can double-check my problem.

Best Regards,
Alexis Artes
|
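
For anyone who wants to look for the same symptom in their own crawl.log, here is a minimal sketch in Python. It assumes the whitespace-separated field order visible in the lines quoted above (timestamp, status, size, URL, discovery path, referrer, MIME type, ...). The function name `find_mislabeled_media` and the extension list are hypothetical and only for illustration; they are not part of Heritrix, Nutch, or WERA.

```python
from urllib.parse import urlparse

# Hypothetical list of media extensions to check; adjust as needed.
MEDIA_EXTENSIONS = (".wmv", ".wma", ".avi", ".mpg", ".mp3")


def find_mislabeled_media(lines, extensions=MEDIA_EXTENSIONS):
    """Return (url, mime) pairs whose URL path ends in a media extension
    but whose recorded MIME type is a text/* type.

    Assumes each line is whitespace-separated with the URL in field 4
    and the MIME type in field 7, as in the crawl.log lines quoted above.
    """
    suspects = []
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        url, mime = fields[3], fields[6]
        path = urlparse(url).path.lower()
        if path.endswith(extensions) and mime.startswith("text/"):
            suspects.append((url, mime))
    return suspects


# Two of the lines from the crawl.log excerpt above.
sample = [
    "2006-03-18T04:10:33.604Z 200 1056898 "
    "http://info.channelnewsasia.com/rovingdv/boundaries/shthong.wmv LE "
    "http://www.channelnewsasia.com/boundaries/videos.htm text/plain #030 "
    "20060318041033072+335 AAYVISCYYWBMBZWKBSFSB7Z6XXAKWC2F 3t",
    "2006-03-18T04:10:41.199Z 200 891220 "
    "http://info.channelnewsasia.com/rovingdv/boundaries/chiam.wmv LE "
    "http://www.channelnewsasia.com/boundaries/videos.htm text/plain #042 "
    "20060318041040700+334 5T3Q2FSH2BSOCI7WSUZKYBYZ6RXLF74H -",
]

for url, mime in find_mislabeled_media(sample):
    print(url, mime)
```

To scan a real log, the same function can be given an open file, for example `find_mislabeled_media(open("crawl.log"))`. This only flags candidates by file extension; it does not inspect the content of the fetched files.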