From: Manuel Z. <ml...@ya...> - 2006-01-29 21:41:39
Let's try to make an example: could you try to define when two news
items, e.g. one in Italian and the other in English, are related?
(Let's skip the obvious case of proper names.)

--- Pino Calzo <pi...@ca...> wrote:

> Ciao Manuel,
>
> looking at our users' behaviour I would say inter-language (btw - we
> know the language per news item). We could even narrow it down by
> category - but that could also hinder possible interesting results.
>
> I'm imagining that if something like this is done there will be clouds
> of inter-language related headlines. Some articles could be
> part of multiple clouds.
>
> So if I see a news headline from a feed I'm subscribed to, I could on
> request also see which articles are related to it (from other
> sources). Another interesting thing is to evaluate the size of the
> clouds and how that size changes over time (that way it would be quite
> nice to see which clouds are "hot topics" ("grow fast in the last x
> hours") and which ones are not anymore).
>
> Appended is an example of how the info is stored in our database.
>
> <document>
>   <id> 125312416 </id>
>   <title> <b>Berlusconi</b> swears off sex until election </title>
>   <info> http://www.news.com.au/story/0,10117,17979986-401,00.html?from=rss </info>
>   <rate> 1138550792 </rate>
>   <spectags>
>     <adddate> 2006-01-29 17:10:00 </adddate>
>     <source> 6298 </source>
>     <lang> en </lang>
>     <ftwords> hpecatchall nkcat1 nkcat10 nkcat46 nkcat110 nklangen nksource6298 </ftwords>
>     <popularity> 0.2929137 </popularity>
>     <shortdate> 060129 </shortdate>
>     <src_name> NEWS.com.au: The World </src_name>
>     <src_desc> The top stories from around the world through
>       Australian eyes. Features reports from correspondents in
>       Bangkok, Beijing, Jakarta, London, Los Angeles, New York, Tokyo,
>       Washington and Wellington and the Australian Associated Press,
>       Associated Press and...
>     </src_desc>
>   </spectags>
>   <text> ITALIAN Prime Minister Silvio Berlusconi is famous for his
>     ambitious promises, but he is unlikely to be called to task if he
>     breaks his latest pledge: not to have sex before the April 9
>     general election. <title> Berlusconi swears off sex until
>     election </title>
>   </text>
> </document>
>
> First ID tag is database internal ID. Source is the ID of Source Name.
> Lang is the 2-letter code of the language. FTWords are keywords we
> currently use (a source can belong to multiple sources).
>
> cheers
>
> Pino Calzo
> pi...@ca...
>
> Sunday, January 29, 2006, 2:45:39 PM, you wrote:
>
> > Actually we are using Isobel for news classification
> > and I think it's feasible to use it to spot news that
> > could be considered 'related'.
>
> > My first question is: what language are you interested
> > in?
>
> > Are you interested in an intra-language relation or
> > inter-language relations?
>
> > I'm asking because there is no universal language tool
> > to perform classification or any other kind of
> > analysis based on text.
>
> > Regards
> > Manuel
>
> > --- Pino Calzo <pi...@ca...> wrote:
>
> >> Ciao Manuel,
> >>
> >> well - the characteristics of our site are best described as a
> >> "river of news". There are a lot of news headlines coming in
> >> across many languages in a short timeframe:
> >>
> >> - 20'118 sources
> >> - 30 languages (including Arabic etc.)
> >> - 4'079'951 documents (a document is mostly: headline, link and
> >>   link description)
> >> - by hour we add almost 10'000 headlines (and delete 1000)
> >>
> >> We keep a 30-day archive - therefore the deletions. Most of our
> >> users are interested in what's currently happening - and not
> >> really in the archives.
> >>
> >> So - having said this I guess you understand why I'm talking about
> >> a "river of news". The problem we have is that there are many
> >> headlines which might be related, but we currently have no
> >> automated way to see these relations.
> >> As we classify every source manually in a category ("channel") we
> >> sometimes know that a headline belongs to "soccer" or
> >> "celebrities" because the source itself in general publishes
> >> headlines from this category.
> >>
> >> I would actually be interested to go beyond that and have some
> >> kind of "headlines clustering" (e.g. show "related headlines"
> >> behind a headline). This could be quite interesting and a
> >> world-first (at least I don't know of good implementations of
> >> something like this in an international way).
> >>
> >> Is Isobel the right tool for something like this? Would you see
> >> other possibilities/implementations?
> >>
> >> Thanks, and talk soon
> >> Pino
> >>
> >> Pino Calzo
> >> pi...@ca...
> >>
> >> Friday, January 27, 2006, 11:17:37 AM, you wrote:
> >>
> >> > Ciao,
> >> >
> >> >> I'm wondering if you'd be interested in an integration of the
> >> >> NewsIsFree news headline database with isobel.
> >>
> >> > It sounds really interesting! We would really
> >> > appreciate any opportunity to collaborate.
> >>
> >> >> I'm not sure if this would be possible at all - but the idea is
> >> >> interesting. NewsIsFree gets thousands of new headlines per
> >> >> hour which should be analyzed and related to each other. We
> >> >> constantly spider 20'000 news sources, covering over 20
> >> >> languages.
> >> >>
> >> >> As far as I understood this analysis should be in
=== message truncated ===
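
The "hot topics" idea raised in the quoted message (clouds that grow fast in the last x hours versus clouds that no longer do) could be sketched roughly as below. This is only an illustration under stated assumptions: the thread does not say how clouds would be built or stored, so the `(cloud_id, added_at)` pairs, the function names `cloud_growth` and `hot_topics`, and the default 6-hour window are all hypothetical, and nothing here reflects how Isobel or NewsIsFree actually work.

```python
from collections import Counter
from datetime import datetime, timedelta


def cloud_growth(memberships, now, window_hours=6):
    """Per cloud: headlines added in the last window minus the window before.

    memberships is an iterable of (cloud_id, added_at) pairs (hypothetical
    layout). A headline that belongs to several clouds appears once per cloud.
    """
    window = timedelta(hours=window_hours)
    recent, previous = Counter(), Counter()
    for cloud_id, added_at in memberships:
        age = now - added_at
        if timedelta(0) <= age < window:
            recent[cloud_id] += 1
        elif window <= age < 2 * window:
            previous[cloud_id] += 1
    # Positive value: the cloud is growing faster than one window ago.
    return {c: recent[c] - previous[c] for c in set(recent) | set(previous)}


def hot_topics(memberships, now, window_hours=6, top=3):
    """Return up to `top` cloud ids with the largest positive growth."""
    growth = cloud_growth(memberships, now, window_hours)
    # Sort by growth (descending), then by id so ties are deterministic.
    ranked = sorted(growth.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cloud for cloud, delta in ranked[:top] if delta > 0]
```

Comparing two adjacent windows, rather than looking at total cloud size, is one simple way to separate topics that are currently accelerating from large clouds that have gone quiet; other measures (for example a ratio, or a decay-weighted count) would be equally plausible.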