From: Darren G. <da...@on...> - 2009-01-27 13:15:36
Hi,

I'm new to Aperture and think it's great. If I scan individual files with Aperture, get their individual RDF metadata, and combine those triples in a triple store, will the result be equal to the single RDF model that Aperture would have generated if it had crawled all the files at once? The reason I ask is that I'm looking to do distributed processing of files and can't crawl them all from the same thread.

Thanks for any tips,
Darren
From: Antoni M. <ant...@gm...> - 2009-01-28 16:32:59
Darren Govoni writes:
> If I scan individual files with Aperture, get their individual RDF
> metadata, and combine those triples in a triple store, will the result
> be equal to the single RDF model that Aperture would have generated if
> it had crawled all the files at once?

Probably. The FilesystemCrawler will also get you the folders, thus replicating the entire folder structure, but if you only need the files then you can call the FileAccessor directly on each file you need, and then process the DataObject with an appropriate Extractor. That's what the FilesystemCrawler does anyway. It's really simple, only about 300 lines of code (mostly comments); most of the actual work happens inside the FileAccessor and the Extractors, and you can use those without the crawler.

> The reason I ask is that I'm looking to do distributed processing of
> files and can't crawl them all from the same thread.

That's true. The crawlers are essentially single-threaded: you should process all returned DataObjects on the same thread, otherwise things get fiendishly complicated (connection management in IMAP, the entire SubCrawler stack with files within files within files, and so on). If you must process files on different threads, the crawler won't work. What you can do, though, is run many crawlers on many threads, each crawling a different portion of the folder tree. The crawler has some features (like handling folders with 100K files, or discarding symbolic links) that you might not want to have to reinvent.

Hope this helps.

Antoni Myłka
ant...@gm...
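For reference, a minimal per-file extraction sketch along the lines Antoni describes, modeled on the example code that ships with Aperture. The class and method names below (MagicMimeTypeIdentifier, DefaultExtractorRegistry, IOUtil, RDFContainerImpl and friends) are written from memory of the Aperture 1.x API and may differ in your version, so treat it as an illustration rather than a verified recipe:

    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Set;

    import org.ontoware.rdf2go.RDF2Go;
    import org.ontoware.rdf2go.model.Model;
    import org.ontoware.rdf2go.model.node.URI;
    import org.ontoware.rdf2go.model.node.impl.URIImpl;
    import org.semanticdesktop.aperture.extractor.Extractor;
    import org.semanticdesktop.aperture.extractor.ExtractorFactory;
    import org.semanticdesktop.aperture.extractor.ExtractorRegistry;
    import org.semanticdesktop.aperture.extractor.impl.DefaultExtractorRegistry;
    import org.semanticdesktop.aperture.mime.identifier.MimeTypeIdentifier;
    import org.semanticdesktop.aperture.mime.identifier.magic.MagicMimeTypeIdentifier;
    import org.semanticdesktop.aperture.rdf.RDFContainer;
    import org.semanticdesktop.aperture.rdf.impl.RDFContainerImpl;
    import org.semanticdesktop.aperture.util.IOUtil;

    public class SingleFileExtractor {

        // extracts the metadata of one file into an RDF2Go model, without a crawler
        public Model extract(File file) throws Exception {
            MimeTypeIdentifier identifier = new MagicMimeTypeIdentifier();
            ExtractorRegistry registry = new DefaultExtractorRegistry();

            // sniff the MIME type from the first bytes of the file
            InputStream stream = new BufferedInputStream(new FileInputStream(file));
            byte[] bytes = IOUtil.readBytes(stream, identifier.getMinArrayLength());
            String mimeType = identifier.identify(bytes, file.getPath(), null);
            stream.close();

            // the RDFContainer that collects the extracted triples for this file
            URI uri = new URIImpl(file.toURI().toString());
            Model model = RDF2Go.getModelFactory().createModel();
            model.open();
            RDFContainer container = new RDFContainerImpl(model, uri);

            // pick the first Extractor registered for this MIME type and run it
            Set factories = registry.get(mimeType);
            if (factories != null && !factories.isEmpty()) {
                ExtractorFactory factory = (ExtractorFactory) factories.iterator().next();
                Extractor extractor = factory.get();
                InputStream content = new BufferedInputStream(new FileInputStream(file));
                extractor.extract(uri, content, null, mimeType, container);
                content.close();
            }
            return model;
        }
    }

Each worker machine can run something like this independently; the returned model holds the triples for a single file and can then be shipped to a central store.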
From: Leo S. <leo...@df...> - 2009-01-28 16:37:50
Two more bits:

Look into the example code; there is a FileInspector example which shows a lot.

If you want to do hardcore multi-CPU clustered indexing of more than a million files, you should peek into what the SMILA folks do with Aperture: http://www.eclipse.org/smila

Best,
Leo

--
DI Leo Sauermann, DFKI GmbH
http://www.dfki.de/~sauermann
leo...@df...
From: Darren G. <da...@on...> - 2009-01-28 16:48:01
Thank you for the suggestions.

I have a distributed set of computers working on separate files. So my thought was to extract the RDF for each file a machine works on and store the RDF in a central database like openrdf/Sesame, where I hope all the triples will recreate a single model as if everything had been crawled at once (i.e. no loss of semantic relations).

So this would mean that the crawler does not do any backtracking over its crawled files to infer new relationships? Otherwise, getting the RDF individually and recombining it may not produce the same RDF model. I will have to try this and see what it produces.

Darren
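As a rough sketch of the "central database" part: each worker can serialize the model it extracted for a file and add it to a remote Sesame repository over HTTP. The class name CentralStoreUploader and the server URL / repository ID parameters are made up for illustration; the HTTPRepository, RepositoryConnection and RDF2Go serialization calls are from memory of the Sesame 2.x and RDF2Go APIs and should be checked against the versions you actually deploy:

    import java.io.StringReader;

    import org.ontoware.rdf2go.model.Model;
    import org.ontoware.rdf2go.model.Syntax;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.http.HTTPRepository;
    import org.openrdf.rio.RDFFormat;

    public class CentralStoreUploader {

        private final HTTPRepository repository;

        // serverUrl and repositoryId are placeholders for your own Sesame setup,
        // e.g. "http://central-host:8080/openrdf-sesame" and "aperture"
        public CentralStoreUploader(String serverUrl, String repositoryId) throws Exception {
            repository = new HTTPRepository(serverUrl, repositoryId);
            repository.initialize();
        }

        // serialize the per-file RDF2Go model and add its triples to the
        // central repository; triples from files processed on different
        // machines accumulate in the same store
        public void upload(Model model) throws Exception {
            String rdfXml = model.serialize(Syntax.RdfXml);
            RepositoryConnection connection = repository.getConnection();
            try {
                connection.add(new StringReader(rdfXml), "", RDFFormat.RDFXML);
            } finally {
                connection.close();
            }
        }
    }

One caveat: blank nodes produced by different workers stay distinct when merged, so if an extractor emits blank nodes the combined graph may not be byte-for-byte identical to a single-crawl graph, even if the information content is essentially the same.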
From: Antoni M. <ant...@gm...> - 2009-01-28 17:25:12
Darren Govoni writes:
> So this would mean that the crawler does not do any backtracking over
> its crawled files to infer new relationships?

Just to clear things up: Aperture extracts basic file metadata (size, last-modified date) and the metadata stored inside the file (like the author's name and keywords in a Word document). It doesn't try to infer any abstract "semantic" relationships such as "these files refer to a single project" or "this photo depicts the Pyramids of Egypt". Also, no relations between files are extracted (like "these files are similar" or "these files have the same content"). If you need this kind of processing, you will need other tools that work on top of the metadata extracted by Aperture. That's why extracting RDF from individual files by yourself will give you more or less the same information you would get if you had used a crawler.

The NEPOMUK project explored these ideas with its "Local Data Alignment" components: http://dev.nepomuk.semanticdesktop.org/wiki/LocalDataAlignment

You can find more info - and pointers to other tools - on the NEPOMUK wiki.

Hope this helps.

Antoni Myłka
ant...@gm...
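To give an idea of the shape of the data involved, the metadata Aperture extracts for a single file typically looks something like the following Turtle, using properties from the NIE/NFO ontologies. The file path and literal values are invented, and the exact set of properties depends on the file type and the Aperture version, so this is purely illustrative:

    @prefix nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#> .
    @prefix nfo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#> .

    # hypothetical file and values, for illustration only
    <file:///home/darren/report.doc>
        a nfo:FileDataObject ;
        nfo:fileName "report.doc" ;
        nfo:fileSize "104448"^^<http://www.w3.org/2001/XMLSchema#integer> ;
        nfo:fileLastModified "2009-01-20T10:15:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        nie:title "Quarterly Report" ;
        nie:plainTextContent "..." .

Because each file is described against its own absolute URI, descriptions produced independently on different machines can be loaded into one repository without stepping on each other.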