[Isobel-users] Re: Re[4]: isobel


Let's try to work through an example:

Could you try to define when two news items, e.g. one in
Italian and the other in English, are related?

(Let's skip the obvious case of proper names.)

--- Pino Calzo <pi...@ca...> wrote:

> Ciao Manuel,
> 
> Looking at our users' behaviour, I would say inter-language (btw, we
> know the language of each news item). We could even narrow it down by
> category, but that could also hide possibly interesting results.
> 
> I imagine that if something like this is done, there will be clouds
> of inter-language related headlines. Some articles could be
> part of multiple clouds.
> 
> So if I see a news headline from a feed I'm subscribed to, I could on
> request also see which articles are related to it (from other
> sources). Another interesting thing is to evaluate the size of the
> clouds and how their size changes over time. That way it would be quite
> nice to see which clouds are "hot topics" (growing fast in the last x
> hours) and which ones are not anymore.
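A minimal sketch of the "hot topics" idea, assuming each headline has already been assigned to a cloud by some clustering step that is not shown here. The function names, the `(cloud_id, added_at)` input shape, and the `window_hours` and `min_recent` parameters are hypothetical, chosen only for illustration:

```python
from collections import Counter
from datetime import datetime, timedelta


def cloud_growth(items, now, window_hours=6):
    """Return {cloud_id: (total_size, size_added_in_window)}.

    items is an iterable of (cloud_id, added_at) pairs; how a headline
    gets its cloud_id is outside the scope of this sketch.
    """
    cutoff = now - timedelta(hours=window_hours)
    total = Counter()
    recent = Counter()
    for cloud_id, added_at in items:
        total[cloud_id] += 1
        if added_at >= cutoff:
            recent[cloud_id] += 1
    return {cid: (total[cid], recent[cid]) for cid in total}


def hot_topics(items, now, window_hours=6, min_recent=2):
    """List (cloud_id, recent_count) for clouds that grew in the window,
    fastest-growing first."""
    growth = cloud_growth(items, now, window_hours)
    hot = [(cid, recent) for cid, (_, recent) in growth.items()
           if recent >= min_recent]
    return sorted(hot, key=lambda pair: pair[1], reverse=True)
```

Comparing the recent count with the total size is one simple way to tell a cloud that is still growing from one that has gone quiet; other growth measures would work just as well.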
> 
> Appended is an example of how the info is stored in our database.
> 
>  <document>
>            <id> 125312416 </id>
>            <title> <b>Berlusconi</b> swears off sex until election </title>
>            <info> http://www.news.com.au/story/0,10117,17979986-401,00.html?from=rss </info>
>            <rate> 1138550792 </rate>
>            <spectags>
>                      <adddate> 2006-01-29 17:10:00 </adddate>
>                      <source> 6298 </source>
>                      <lang> en </lang>
>                      <ftwords> hpecatchall nkcat1 nkcat10 nkcat46 nkcat110 nklangen nksource6298 </ftwords>
>                      <popularity> 0.2929137 </popularity>
>                      <shortdate> 060129 </shortdate>
>                      <src_name> NEWS.com.au: The World </src_name>
>                      <src_desc> The top stories from around the world through Australian eyes.
>                        Features reports from correspondents in Bangkok, Beijing, Jakarta, London,
>                        Los Angeles, New York, Tokyo, Washington and Wellington and the Australian
>                        Associated Press, Associated Press and... </src_desc>
>            </spectags>
>            <text> ITALIAN Prime Minister Silvio Berlusconi is famous for his ambitious promises,
>              but he is unlikely to be called to task if he breaks his latest pledge: not to have
>              sex before the April 9 general election.
>              <title> Berlusconi swears off sex until election </title>
>            </text>
> </document>
> 
> The first ID tag is the database-internal ID. Source is the ID of the
> source name. Lang is the 2-letter code of the language. FTWords are
> the keywords we currently use (a source can belong to multiple
> categories).
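A small sketch of how a record in this shape could be read with Python's standard `xml.etree.ElementTree`, using a trimmed copy of the sample document. The `parse_document` and `flat_text` helpers are hypothetical, and reading the `nkcat` prefix in `ftwords` as a category ID is an assumption drawn only from the sample values, not a documented format:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample record from the message above.
SAMPLE = """<document>
  <id> 125312416 </id>
  <title> <b>Berlusconi</b> swears off sex until election </title>
  <spectags>
    <adddate> 2006-01-29 17:10:00 </adddate>
    <source> 6298 </source>
    <lang> en </lang>
    <ftwords> hpecatchall nkcat1 nkcat10 nkcat46 nkcat110 nklangen nksource6298 </ftwords>
  </spectags>
</document>"""


def flat_text(elem):
    """Collect all text inside elem (including inline tags such as <b>)
    and normalise whitespace."""
    return " ".join("".join(elem.itertext()).split())


def parse_document(xml_string):
    root = ET.fromstring(xml_string)
    tags = root.find("spectags")
    ftwords = flat_text(tags.find("ftwords")).split()
    return {
        "id": int(flat_text(root.find("id"))),
        "title": flat_text(root.find("title")),
        "lang": flat_text(tags.find("lang")),
        "source": int(flat_text(tags.find("source"))),
        # Assumption: "nkcat<N>" keywords encode category IDs.
        "categories": [int(w[len("nkcat"):]) for w in ftwords
                       if w.startswith("nkcat")],
    }
```

Calling `parse_document(SAMPLE)` gives a plain dictionary whose `lang` and `categories` fields could then be used to group or filter headlines before any relatedness analysis.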
> 
> 
> cheers
> 
> Pino Calzo
> pi...@ca...
> 
> 
> 
> Sunday, January 29, 2006, 2:45:39 PM, you wrote:
> 
> > Actually we are using Isobel for news classification
> > and I think it's feasible to use it to spot news that
> > could be considered 'related'.
> 
> > My first question is: what language are you interested in?
> 
> > Are you interested in intra-language relations or
> > inter-language relations?
> 
> > I'm asking because there is no universal language tool
> > to perform classification or any other kind of
> > analysis based on text.
> 
> > Regards
> > Manuel
> 
> 
> > --- Pino Calzo <pi...@ca...> wrote:
> 
> >> Ciao Manuel,
> >> 
> >>  Well, our site is best described as a "river
> >>  of news": a lot of news headlines come in across many
> >>  languages in a short timeframe.
> >> 
> >>  - 20'118 sources
> >>  - 30 languages (including Arabic etc.)
> >>  - 4'079'951 documents (a document is mostly: headline, link and link description)
> >>  - every hour we add almost 10'000 headlines (and delete 1000)
> >> 
> >>  We keep a 30-day archive, hence the deletions. Most of our users are
> >>  interested in what's currently happening, and not really in the
> >>  archives.
> >> 
> >>  So, having said this, I guess you understand why I'm talking about a
> >>  "river of news". The problem we have is that there are many headlines
> >>  which might be related, but we currently have no way to detect these
> >>  relations automatically.
> >>  As we manually classify every source into a category ("channel"), we
> >>  sometimes know that a headline belongs to "soccer" or "celebrities"
> >>  because the source itself generally publishes headlines from this
> >>  category.
> >> 
> >>  I would actually be interested in going beyond that and having some kind
> >>  of "headline clustering" (e.g. showing "related headlines" behind a
> >>  headline). This could be quite interesting and a world first (at
> >>  least I don't know of good implementations of something like this in an
> >>  international way).
> >> 
> >>  Is Isobel the right tool for something like this? Would you see other
> >>  possibilities/implementations?
> >> 
> >> Thanks and see you soon
> >> Pino
> >> 
> >> 
> >> Pino Calzo
> >> pi...@ca...
> >> 
> >> 
> >> 
> >> Friday, January 27, 2006, 11:17:37 AM, you wrote:
> >> 
> >> >  Ciao,
> >> >  
> >> >> I'm wondering if you'd be interested in an
> >> >> integration of the
> >> >> NewsIsFree news headline database with isobel.
> >> 
> >> 
> >> > It sounds really interesting! We would really
> >> > appreciate any opportunity to collaborate.
> >> 
> >> 
> >> >> I'm not sure if this
> >> >> would be possible at all - but the idea is
> >> >> interesting. NewsIsFree
> >> >> gets thousands of new headlines per hour which
> >> >> should be analyzed and
> >> >> related to each other. We constantly spider
> >> >> 20'000 news sources,
> >> >> covering over 20 languages.
> >> >> 
> >> >> As far as I understood this analysis should be in
> 
=== message truncated ===
