|
From: <kau...@cs...> - 2005-11-01 12:47:07
|
Is it possible to rank or sort search results by relevance? That is, show first the results where the search term is in the HTML title or appears several times in the text (versus those where the search term appears once, late in the text, or in a link name).

Does NutchWAX index link names within HTML files? If there's a link http://www.something.net/storm.gif within the HTML, could I search for 'storm' and get this image into the result list?

*Kaisa |
|
From: Kaisa K. <kau...@cs...> - 2005-11-01 13:17:06
|
Sorry, I'll correct myself: If there is an html file http://www.something.net/story.html which contains an inline image with name ...storm.gif could I search for storm and get http://www.something.net/story.html into search results :)

On Tue, 1 Nov 2005 kau...@cs... wrote:
> Does nutchwax index link names within html files? If there's a link
> http://www.something.net/storm.gif withing html , could I search for
> 'storm'
> and get this image into result list?
>
> *Kaisa
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by the JBoss Inc.
> Get Certified Today * Register for a JBoss Training Course
> Free Certification Exam for All Training Attendees Through End of 2005
> Visit http://www.jboss.com/services/certification for more information
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
|
From: stack <st...@ar...> - 2005-11-02 00:19:18
|
kau...@cs... wrote:
>Is it possible to rank/sort search results by relevance? Show first
>results
>where search term is in html title, or appears several times in text.
>(versus those where search term appears once, late in text, or
>in a link name).

It should be doing this for you, Kaisa. In general, are you not seeing the most significant links showing first in results?

I just added a little FAQ on ranking with some notes on how Nutch is doing it. I'll repeat the note here:

By default, at query time, the following fields are boosted as follows:

  query.url.boost, 4.0f
  query.anchor.boost, 2.0f
  query.title.boost, 1.5f
  query.host.boost, 2.0f
  query.phrase.boost, 1.0f

From the above, terms found in a URL are scored high, with anchor text next, then title. You can change the above boosts by editing your nutch-site.xml, but in general the defaults seem to work well for most collections.

Anchor text can make a large contribution to a document's ranking score. You can see the anchor text for a page by browsing to its 'explain' page, then editing the URL to put 'anchors.jsp' in place of 'explain.jsp'.

>Does nutchwax index link names within html files? If there's a link
>http://www.something.net/storm.gif withing html , could I search for
>'storm'
>and get this image into result list?

This is an interesting question, Kaisa. I just took a look. It doesn't look like it (see below for how I figured this out). Do you need this feature?

Here's how I looked at what was in a particular Nutch segment:

  % ./bin/nutch segread -fix -nocontent -dump nutch-data/segments/debord2005-11-01-155531/

This dumps out what Nutch has per resource. It will list the text it parsed from the document, the list of outlinks found in the document, the page hash, etc. I compared what was in Nutch to what was in the indexed ARC (I zcat'd the ARC).

Yours,
St.Ack

>*Kaisa |
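To make the effect of the per-field boosts listed above more concrete, here is a minimal sketch in Python. It is a toy model only and is not Nutch's or Lucene's actual scoring formula: the `score` and `tokens` functions and the sample documents are hypothetical, and the phrase boost is left out. Only the boost values themselves come from the FAQ note quoted in the message.

```python
import re

# Default query-time boosts as quoted in the message (phrase boost omitted).
DEFAULT_BOOSTS = {
    "url": 4.0,     # query.url.boost
    "anchor": 2.0,  # query.anchor.boost
    "title": 1.5,   # query.title.boost
    "host": 2.0,    # query.host.boost
}


def tokens(text):
    """Split text into lowercase alphanumeric tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(term, doc_fields, boosts=DEFAULT_BOOSTS):
    """Toy score: add up the boost of every field that contains the term."""
    term = term.lower()
    return sum(
        boost
        for field, boost in boosts.items()
        if term in tokens(doc_fields.get(field, ""))
    )


# Two made-up documents: one matches in the URL, the other in the title.
in_url = {"url": "http://www.example.org/storm.html", "title": "Weather"}
in_title = {"url": "http://www.example.org/page1.html", "title": "Storm warning"}

print(score("storm", in_url))    # 4.0
print(score("storm", in_title))  # 1.5

# Lowering one boost, as one might do in nutch-site.xml, changes the order.
lowered = {**DEFAULT_BOOSTS, "url": 1.0}
print(score("storm", in_url, lowered))  # 1.0
```

In this toy model a URL match outranks a title match under the defaults, and lowering the URL boost reverses that, which is the kind of change the nutch-site.xml overrides are meant to allow.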
|
From: <kau...@cs...> - 2005-11-05 11:49:47
|
Hi, and thanks for the addition to the FAQ.

Many news sites have a structure which may distort the ranking of search results. On these sites each page focuses on one current news item, but it also has loads of tiny links to other news, like 'other top stories', 'news of previous day' or 'stories of this week'.

In an indexed archive, you could search for 'earth quake' and find, close to the top of the search results, a page with the headline 'cricket results'. The reason is that the cricket page has one link to the earth quake news of the previous day.

This feature is noticeable when there are few pages with real earth quake reports but lots of other pages having links to them.

Link texts should have a low priority in indexing. I can probably make this happen once I find the correct parameters.

kaisa

On 11/2/2005, "stack" <st...@ar...> wrote:
> It should be doing this for you Kaisa. In general, are you not seeing
> the most significant links showing first in results? |
|
From: stack <st...@ar...> - 2005-11-07 20:47:26
|
kau...@cs... wrote:
>Hi and thanks for the addition to faq.
>
>Many news sites have a structure which may distort ranking of search
>results. On these sites each page focuses on one current news item but
>it also has loads of tiny links to other news, like 'other top
>stories' , 'news of previous day' or 'stories of this week'.

Just to be clear, in the above, you mean that the outlink anchor text says 'other top stories' and 'stories of this week'?

>In an indexed archive, you could search for 'earth quake' and find
>close to top search results a page with headline 'cricket results'.
>Reason being that the cricket page has one link to earth quake news of
>previous day.
>
>This feature is noticeable when there are few pages with real earth quake
>reports but lots of other pages having links to them.

Yes. Makes sense.

>Link texts should have a low priority in indexing .. probably I can make
>this happen when I find the correct parameters.

You can set the boost on inlink anchor text (see the just-added FAQ), but it looks like you want to be able to set separately the boost on a document's outlink anchor text. Looking at the Nutch HTML parser code, currently the outlink anchor text just gets added to the StringBuffer accumulating all of the document's parsed 'text'; the outlink anchor text is not distinguished in any way from the general text of the document. There is currently no means of making its boost different from that of the general document 'text'.

Should we add such a feature, Kaisa?

Yours,
St.Ack

>kaisa
>
>On 11/2/2005, "stack" <st...@ar...> wrote:
>
>>It should be doing this for you Kaisa. In general, are you not seeing
>>the most significant links showing first in results? |
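As a rough illustration of the feature discussed above, the sketch below keeps outlink anchor text in a separate buffer from the rest of the parsed text, so that the two could in principle be boosted differently. It is written in Python with the standard-library `html.parser` and is not the Nutch HTML parser (which is Java); the class name `SplitTextParser`, the helper `split_text`, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser


class SplitTextParser(HTMLParser):
    """Collect text inside <a href=...> separately from all other text."""

    def __init__(self):
        super().__init__()
        self.body_text = []    # text outside of links
        self.anchor_text = []  # text inside outlinks
        self._in_link = 0      # depth of open <a href> elements

    def handle_starttag(self, tag, attrs):
        # Only anchors that carry an href count as outlinks.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            target = self.anchor_text if self._in_link else self.body_text
            target.append(text)


def split_text(html):
    """Return (body text, outlink anchor text) for an HTML string."""
    parser = SplitTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.body_text), " ".join(parser.anchor_text)


# Made-up page modelled on the news-site layout described in this thread.
SAMPLE = """
<h1>Cricket results</h1>
<p>Match report text.</p>
<h2>Other top stories</h2>
<a href="http://www.example.org/news/earthquake">Earthquake report</a>
"""

body, anchors = split_text(SAMPLE)
print(body)     # Cricket results Match report text. Other top stories
print(anchors)  # Earthquake report
```

With the two buffers kept apart, an indexer could store them as separate fields and give the outlink anchor field a lower weight, so that a search for 'earthquake' would no longer be pulled towards the cricket page by its link text alone.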
|
From: <kau...@cs...> - 2005-11-09 12:43:22
|
On 11/7/2005, "stack" <st...@ar...> wrote:
> >Many news sites have a structure which may distort ranking of search
> >results. On these sites each page focuses on one current news item but
> >it also has loads of tiny links to other news, like 'other top
> >stories' , 'news of previous day' or 'stories of this week'.
> Just to be clear, in the above, you mean that the outlink anchor text
> says 'other top stories' and 'stories of this week'?

Hi, I tried to describe pages and links like

------- Page start -------
<h1>English cricket results 9.11.2005</h1>
........ lots of text .....
<h2>Other top stories now</h2>
<a href="http://www.tekstia.com/news/top/earthquake+timbuktu/107994">
Earthquake in Timbuktu second time this year</a>
<a href="http://www.tekstia.com/news/top/tokyo+stocks+explode/107996">
Tokyo stocks explode</a>
10 further links to different subjects ..
---- Page end -------------

In the page above, the links have the form <a href=url>text</a>, and both the text and the URL contain words which don't actually belong to the body text of the cricket news article.

> >In an indexed archive, you could search for 'earth quake' and find
> >close to top search results a page with headline 'cricket results'.
> >Reason being that the cricket page has one link to earth quake news of
> >previous day.
> >
> >This feature is noticeable when there are few pages with real earth quake
> >reports but lots of other pages having links to them.
>
> Yes. Makes sense.
>
> You can set the boost on inlink anchor text -- see the just-added FAQ --
> but looks like you want to be able to set separately the boost on a
> documents' outlink anchor text. Looking at the Nutch html parser code,
> currently the outlink anchor text just gets added to the StringBuffer
> accumulating all document parsed 'text'; the outlink anchor text is not
> distingushed in any way from the general text of the document. There is
> currently no means of making its boost be different from that of the
> general document 'text'.
>
> Should we add such a feature Kaisa? |