From: Leo S. <leo...@df...> - 2009-12-16 15:17:08
Hi Aperturians,

in the organik-project.eu, and for other things, it would be awesome to have an "aperture crawling server": an installable WAR file that crawls some configured datasources and then does something with the crawled RDF.

For example, we have this company that has a fileshare on a windows server. Now I would like to install the "aperture crawling server" on some machine as a WAR file, instruct it via a web interface to crawl a "windows share" datasource (and maybe some internal websites using the webcrawler, and maybe some newsletters using IMAP), and off it goes. Then I would like to configure the "SOLR crawling handler" or the "drupal crawling handler" to tell the aperture crawling server what to do with the RDF.

I know that the people who did the SOLR integration of Aperture probably built something exactly like this - do we have some open source code available for it now?

Have you written an aperture-based crawling server that you can share with me? (It doesn't have to be completely open source; just looking at the code and the architecture could teach me a lot.)

Do you think this is a rocking idea, and would you join a new subproject in Aperture for this? (Just say yes and then we go on from there, forming a group with requirements, etc.)

Has anyone written a windows fileshare crawler? What Java libraries are there to crawl windows/samba shares? (That would be awesome.)

best
Leo

--
Dr. Leo Sauermann                    http://www.dfki.de/~sauermann
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122, P.O. Box 2080, D-67663 Kaiserslautern, Germany
Fon: +43 6991 gnowsis   Fax: +49 631 20575-102   Mail: leo...@df...
Geschaeftsfuehrung: Prof.Dr.Dr.h.c.mult. Wolfgang Wahlster (Vorsitzender), Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
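On the samba question: the usual Java answer is the jCIFS library (jcifs.samba.org), which can list and read SMB shares directly. As a minimal, self-contained sketch of the crawling step itself, the following assumes the share is already mounted locally and just walks it with plain java.io; the class and method names are illustrative, not Aperture API.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch: recursively collect regular files under a (locally mounted)
// windows share, i.e. the files a crawler would hand to Aperture's
// extractors. For direct SMB access without mounting, jCIFS offers an
// SmbFile API with a very similar recursive pattern.
public class ShareWalker {

    public static List<File> listCrawlableFiles(File dir) {
        List<File> result = new ArrayList<File>();
        File[] children = dir.listFiles();
        if (children == null) {
            return result; // not a directory, or unreadable
        }
        for (File child : children) {
            if (child.isDirectory()) {
                result.addAll(listCrawlableFiles(child));
            } else {
                result.add(child); // a real crawler would emit a DataObject here
            }
        }
        return result;
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        System.out.println(listCrawlableFiles(root).size() + " files found");
    }
}
```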
From: Darren G. <da...@on...> - 2009-12-16 18:57:45
I'm still playing around, but in my crawler handler's objectNew method I have something like this to pass the text/metadata along to Solr:

    try {
        String text = object.getMetadata().getString(NIE.plainTextContent);
        server = new CommonsHttpSolrServer(url);
        SolrInputDocument doc1 = new SolrInputDocument();
        UUID uuid = UUID.randomUUID();
        doc1.addField("id", uuid.toString(), 1.0f);
        String title = object.getMetadata().getString(NIE.title);
        doc1.addField("name", title, 1.0f);
        doc1.addField("text", text);
        // Further fields can be added, but they either have to be declared
        // in Solr's schema.xml or match a dynamic field.
        Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
        docs.add(doc1);
        server.add(docs);
        server.commit();
    } catch (SolrServerException ex) {
        Logger.getLogger(UserDefinedStoreCrawlerHandler.class.getName()).log(Level.SEVERE, null, ex);
    } catch (IOException ex) {
        Logger.getLogger(UserDefinedStoreCrawlerHandler.class.getName()).log(Level.SEVERE, null, ex);
    }
From: Antoni M. <ant...@gm...> - 2009-12-21 11:11:17
Leo, Aperturians,

The idea is great; if there is more interest in using Aperture with SOLR, then we could expand in this direction. What is needed is feedback on which data sources you would like to use with SOLR (or use already). If there is a need, we could think about expanding in that direction.

Some ideas off the top of my head:

- elevate the MimeSubCrawler into an mbox subcrawler that crawls plain-text mailing list archives properly, or make proper mbox crawling results appear in the FileSystemCrawler output
- extend the output of the flickr/delicious/bibsonomy subcrawlers, letting them extract photo comments or publication abstracts
- think about a SambaCrawler or FTPCrawler, or pick up the abandoned WebdavCrawler, to crawl remote folders without having to mount them locally first
- LdapCrawler? XMLDbCrawler?

The sky is the limit.

Antoni
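The mbox idea is, at its core, a matter of splitting an archive on the classic "From " envelope separator lines. A rough sketch of just that splitting step (this is not Aperture's SubCrawler API, and real mbox handling also needs ">From " unescaping):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the core of an mbox subcrawler: split an mbox archive into
// individual messages on the "From " separator lines that start each
// message in the mbox format.
public class MboxSplitter {

    public static List<String> split(String mbox) {
        List<String> messages = new ArrayList<String>();
        StringBuilder current = null;
        for (String line : mbox.split("\n", -1)) {
            if (line.startsWith("From ")) {
                // envelope line: starts a new message
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }
}
```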
From: Leo S. <leo...@df...> - 2009-12-21 14:27:57
Hi,

Ok, so crawling many datasources is exactly "spot on" for aperture. How about an aperture server software that makes the whole thing a "product" and not just a "library"?

best
Leo
From: Christian R. <chr...@gm...> - 2009-12-21 15:47:23
Attachments:
signature.asc
Oh yes - it would be a very interesting and useful scenario to have this possibility. Entry points for using it could be, e.g.:

- controlling the crawling process with RPCs (running the existing interface as a service)
- configuring the data sources that should be crawled periodically
- a status interface for things like what the server is currently doing, or statistics on what it has done (independent of the index/persistence layer used)
- configuring the persistence layer(s) that should be used (e.g. SOLR, Lucene, database, RDF store, etc.)

I imagine that currently more or less every project that uses aperture has to do similar stuff on its own. (This is certainly true for DynaQ.) Helping people lower their entry barrier is always a good idea for an open source project. But we have to be careful not to lose flexibility: big monolithic blocks are a real pain.

My 2 cts,
Chris
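A status interface of the kind described above needs very little code. As a sketch, using the HTTP server built into the JDK (com.sun.net.httpserver); the /status path and the JSON field names are invented for illustration, and a real server would read the values from the running crawlers:

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Sketch of a crawl-status endpoint for an "aperture crawling server".
// Endpoint path and JSON shape are hypothetical.
public class StatusServer {

    // Build the status payload; kept as a separate method so it is testable.
    public static String statusJson(String state, int crawled) {
        return "{\"state\":\"" + state + "\",\"objectsCrawled\":" + crawled + "}";
    }

    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/status", new HttpHandler() {
            public void handle(HttpExchange exchange) throws IOException {
                // In a real server these values would come from the crawlers.
                byte[] body = statusJson("crawling", 42).getBytes("UTF-8");
                exchange.sendResponseHeaders(200, body.length);
                OutputStream os = exchange.getResponseBody();
                os.write(body);
                os.close();
            }
        });
        server.start();
        return server;
    }
}
```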
From: Berwanger, C. <chr...@lo...> - 2009-12-21 15:52:08
Hi,

Wasn't this the idea behind SMILA - to provide a generic architecture and those commonly used features?

How would this server be integrated with SMILA?

Christian
From: Leo S. <leo...@df...> - 2009-12-21 19:41:42
Hi,

Of course, that would also be part of SMILA, but they have their own crawling architecture, and the plans to use the aperture crawling architecture in SMILA are only slowly moving forward. They have only used the Aperture extractors, which is only half the thing. Also, SMILA is a massively scaling enterprise architecture with a distributed architecture and other niceties for enterprise computing. I need something that works "in a month", and without adding any more fuss.

btw: the list is called aperture-devel; check your mail client, you wrote something else which I personally do not like very much.

best
Leo
From: Antoni M. <ant...@gm...> - 2009-12-21 21:26:38
|
Hi,

The "product" would have to be an "Aperture Server", which would be a distribution of SOLR with the Aperture integration bundled and enabled by default. Or? It would be a very simple task if
- we learn a bit about SOLR architecture and packaging (or find someone who'd be interested in this)
- and have any necessary code that glues Aperture with SOLR written and tested

Antoni

Leo Sauermann pisze:
> Hi
>
> Ok, so crawling many datasources is exactly "spot on" for aperture - how about an aperture server software that makes the whole thing a "product" and not just a "library"?
>
> best
> Leo
>
> It was Antoni Mylka who said at the right time 21.12.2009 12:10 the following words:
>> Leo, Aperturians
>>
>> The idea is great; if there is more interest in using Aperture in SOLR, then we could expand in this direction. What is needed is feedback on which data sources you would like to use with SOLR (or use already). If there is need, we could think about expanding in that direction.
>>
>> Some ideas off the top of my head:
>>
>> - elevate the MimeSubCrawler into an mbox subcrawler that would crawl plain-text mailing list archives properly, or make proper mbox crawling results appear in the FileSystemCrawler output
>> - extend the output of the flickr/delicious/bibsonomy subcrawlers, letting them extract photo comments or publication abstracts
>> - think about some SambaCrawler or FTPCrawler, or pick up the abandoned WebdavCrawler, that would crawl remote folders without having to mount them locally first
>> - LdapCrawler? XMLDbCrawler?
>>
>> sky is the limit
>>
>> Antoni
|
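All of the crawlers in the idea list above share the same shape: walk a source and report each item found to a pluggable handler. A toy illustration, with a local directory standing in for the remote folder (the class and method names here are invented for illustration, not Aperture's actual API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;
import java.util.stream.Stream;

// Toy crawler: walks a folder tree and reports every regular file it
// finds to a pluggable handler - which is where SOLR/Drupal/RDF-store
// integrations would hook in. A hypothetical SambaCrawler or FTPCrawler
// would keep this exact shape and only change how the listing is obtained.
class FolderCrawler {
    void crawl(Path root, Consumer<Path> handler) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(handler);
        }
    }
}
```

The point of the sketch is the separation: the crawler knows how to enumerate a source, the handler decides what happens to each result.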
From: Christiaan F. <chr...@ad...> - 2009-12-21 22:16:29
|
I'm not a SOLR expert, so I hope I am not saying anything stupid here.

Part of the beauty of SOLR, as I understand it, is that it provides only functionality directly related to Lucene and nothing more. This enables very flexible deployment strategies.

Rather than integrating Aperture into SOLR, I would create a similar server layer around Aperture that can do crawling and extraction and has one or more output handlers (basically CrawlerHandler implementations) that determine what is to be done with the results. One of these plugins could post the results to a SOLR server.

One reason for this: Aperture crawling may have different requirements than the app that will use the extracted metadata. For example, Aperture can crawl an enterprise file server, but SOLR will redirect users to those files via a web server. Separating the two functionalities will let you deploy SOLR "a few firewalls away" - this will keep a lot of sysadmins happy. Also, crawling and extraction are very CPU-intensive: for large datasets you don't want to run them on the same machine as your front-end.

Finally, SOLR is just one output medium; there may be a lot more, so why limit ourselves by design to SOLR?

Regards,

Chris #3 :)
|
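The output-handler idea could look roughly like this - a minimal sketch where `OutputHandler` is a hypothetical stand-in for Aperture's CrawlerHandler callback surface, and the SOLR handler simply renders crawled objects as a SOLR-style XML update message (the field names are assumptions, not a real schema):

```java
import java.util.Map;

// Hypothetical stand-in for the callback surface a crawler would use:
// the crawler reports each data object it finds, and the handler decides
// what to do with the extracted metadata.
interface OutputHandler {
    void objectNew(String uri, Map<String, String> metadata);
}

// One pluggable handler: collect crawled objects and render them as a
// SOLR-style <add> update message that could be POSTed to /solr/update.
// A DrupalOutputHandler would implement the same interface differently.
class SolrOutputHandler implements OutputHandler {
    private final StringBuilder docs = new StringBuilder();

    @Override
    public void objectNew(String uri, Map<String, String> metadata) {
        docs.append("  <doc>\n");
        docs.append("    <field name=\"id\">").append(escape(uri)).append("</field>\n");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            docs.append("    <field name=\"").append(e.getKey()).append("\">")
                .append(escape(e.getValue())).append("</field>\n");
        }
        docs.append("  </doc>\n");
    }

    String toUpdateXml() {
        return "<add>\n" + docs + "</add>\n";
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

This keeps the crawl side ignorant of SOLR: the handler is the only component that knows where the results go, which is exactly the decoupling argued for above.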
From: Antoni M. <ant...@gm...> - 2009-12-21 23:44:22
|
Hi,

So, combining the mails of cfmfluit with the one from reuschling (I hope you won't mind my using the SF login names), this means the following requirements:

- a server that manages data sources and crawlers
- configurable remotely
- separate from any frontend
- and from any data storage implementation

IMHO the easiest way to jumpstart it would be to take the last snapshot of the aperture datawrapper from nepomuk.semanticdesktop.org and build around it. The basic stuff is there - the code to start crawlers and monitor their progress - so we would only need the "output handlers", some remote configuration facility (XML-RPC? SOAP? REST?), and a basic WAR to host it all. It has another advantage: it hides 99% of all Aperture functionality behind a single interface, which is also usable in normal single-JVM plain Java applications.

With proper design the same code could be usable, as it was, in Eclipse-based NEPOMUK-like applications; in our remotely-accessible Aperture Server; and in a normal desktop application whose author just wants to learn a single interface that just works, instead of 50 pages of tutorials.

The only problem with that is that it duplicates the ideas laid out for the ApertureRuntime class (one facade to rule them all). With ApertureDataWrapper we'd have two.

What do you think?

Antoni
|
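A REST-style remote configuration facility could start as small as this - a sketch using the JDK's built-in `com.sun.net.httpserver` server, where registered sources live in an in-memory map (the endpoint path, body format, and class name are all made up; a real Aperture Server would persist DataSource configurations and schedule crawls):

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a remotely configurable crawl server:
// POST "id=type" to /datasources to register a source, GET to list them.
class ConfigServer {
    final Map<String, String> dataSources = new ConcurrentHashMap<>();
    private HttpServer server;

    int start() throws IOException {
        server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/datasources", this::handle);
        server.start();
        return server.getAddress().getPort();   // ephemeral port
    }

    void stop() { server.stop(0); }

    // Parse an assumed "id=type" body, e.g. "share1=filesystem".
    void register(String body) {
        String[] kv = body.split("=", 2);
        dataSources.put(kv[0], kv.length > 1 ? kv[1] : "");
    }

    private void handle(HttpExchange ex) throws IOException {
        String response;
        if ("POST".equals(ex.getRequestMethod())) {
            String body = new String(ex.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
            register(body);
            response = "registered";
        } else {
            response = String.join("\n", dataSources.keySet());
        }
        byte[] bytes = response.getBytes(StandardCharsets.UTF_8);
        ex.sendResponseHeaders(200, bytes.length);
        try (OutputStream os = ex.getResponseBody()) { os.write(bytes); }
    }
}
```

Whether the wire format ends up being REST, XML-RPC, or SOAP, the core is the same: a small registry of data-source configurations behind a remotely reachable interface.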
From: Leo S. <leo...@df...> - 2010-01-05 14:46:53
|
Hi Aperturelles,

ok, so we have a plan, and a starting point: we already have a web server for aperture, which I will extend:

https://aperture.svn.sourceforge.net/svnroot/aperture/aperture-webserver

I made a tag of it as of today:

https://aperture.svn.sourceforge.net/svnroot/aperture/aperture-webserver/tags/aperture-webserver-2010-01-beforeTHESERVER

and will start hacking away on it NOW. I will make it work first for Drupal, and then we can add a layer for SOLR and other stores. (We need this right now - crawl something, add it as Drupal nodes. It's urgent.)

This is fine and maps to what Antoni said:

* I will take some of the code from the nepomuk datawrapper; it is a working starting point for crawling. (Note: I will also sort out copyright issues for this.)
* Drupal, SOLR, ... whatever - they are just CrawlerHandlers to me. Integration is always done with some client library. For SOLR, it should always be used remotely.
* The interface is going to be REST for now, with a limited GUI.

I think for the moment I can't use any help, because I know all the code involved and have a problem to solve quickly. Once that works, I will need some help to move it further. (In another mail I will ask for some help on the crawler.)

I guesstimate that something will work by mid-February.

best
Leo
|