From: Aaron B. <aa...@ar...> - 2011-03-05 21:49:48
Gerard Suades i Méndez <gs...@ce...> writes:

> 1.- We have a new set of ARC that we would like to include in full
> text search. We were wondering if there is any special procedure to
> update the already existing NutchWAX indexes with the new crawls. Any
> idea for the merge process? Do we need to keep segments of old crawls
> in order to generate the indexes of the new crawls before merging all
> together?

Yes, for *building* the indexes you need to keep the segments. Only
the TNH search service does not need the segments, as the index has
all the information in it needed for search services.

There are basically two ways to merge indexes; which one you choose
depends on your de-duplication strategy.

If you have two Lucene indexes A and B, you can just use the
IndexMerger command in TNH to merge them together. TNH provides a
simple command-line wrapper around the Lucene index merging API call.
Since TNH is a webapp, you have to un-jar it to be able to use the
Java command-line wrappers, for example

  $ mkdir tnh
  $ cd tnh
  $ jar xf ~/tnh.war
  $ export CLASSPATH=WEB-INF/classes:WEB-INF/lib/lucene-core-*.jar
  $ java IndexMerger <merged> <index-A> <index-B>

This simply calls the Lucene library index-merge function, so it does
*not* know anything about de-duplication. If you have the same record
in both index A and index B, then you will have them both in the
merged index.

So, if you already have an index for your existing collection and
then get some new (W)ARC files, you can index those separately and
then merge the two indexes together.

Another approach is to re-build the entire index, giving as inputs the
initial NutchWAX segments and the new NutchWAX segment for the new
(W)ARCs. Then you will have one single index with everything in it.
In this case, any duplicate records can be detected and merged when
the combined index is being built.

The merging of duplicate records during index-building was a feature
put into a minor revision of NutchWAX 0.13.
I'll have to look up the specific SVN revision.

With regard to indexing, there is a side-project of mine similar to
TNH which does a better job of index-building than NutchWAX. This
project is called "The JBs", which was the name of the band for the
famous musician James Brown.

One of the many improvements in The JBs is "accented letter
collapsing": words with accented characters are indexed so that they
can be found with or without the accent mark. For example, with
NutchWAX "Méndez" is put into the index exactly as "Méndez". If
someone searches for "Mendez", it will not be found. But if the index
is built with The JBs, then both "Méndez" and "Mendez" can be found.

The JBs also performs merging of duplicates when building a single
index from multiple NutchWAX segments.

But this email is getting rather long already, with more below, so I
will conclude this section on The JBs. We can discuss further if you
are interested.

> 2.- The size of the index which self-contained the segments
> information is a linear growth size related to the ARC? at this moment
> index represents pretty much 7.5% of the whole collection ARCs size.

It depends on the mix of file types in the original ARC files. Only
text types are put into the full-text search, so things like JPG, MP3,
AVI, ZIP, etc. are omitted. Your 7.5% number does not seem unusual to
me.

In our full-text search for Archive-It.org, there are just over 1
billion documents in the index, the on-disk index size is ~3.5TB, and
the size of all the (W)ARC files is somewhere around 100TB. But I know
there are lots of large binary files, including lots of YouTube video,
in the Archive-It collection.

> 3.- Is it possible to install TNH in several tomcats sharing the same
> index? in other words, does TNH block index while searching as Wayback
> used to?

I don't remember if that specific use-case was tested. It should work.
TNH is built on Lucene and, when TNH opens the index, it uses the
Lucene API call to open the index in read-only mode, so there should
be no exclusive locking and multiple TNH web application instances
should be able to open the same index.

However, TNH and the Lucene library do cache parts of the index in
memory, so if you have multiple instances of the TNH web application,
you will have multiple instances of the caches as well.

An alternative approach might be to use a multi-index setup in a
single TNH instance and use the "i=<indexname>" URL parameter to
select which index to search.

Maybe you can describe what you are trying to do with multiple TNH
webapp instances reading the same index and I can provide some
suggestions on how to implement it.

> 4.- Based on the results of our tests we are thinking of using TNH for
> full text search instead of WERA. Is there any roadmap or a major
> release planned for the future?

No, there isn't any roadmap.

Well, the roadmap is to migrate everything to Apache SOLR, which
merged projects with Lucene last year and is now considered *the*
open-source full-text search platform.

Unfortunately, there are some features missing from SOLR which are
required for full-text search on web archives. Also, we don't know yet
how SOLR will scale, especially in a multi-server configuration.

I produced a report for the IIPC covering the issues with migrating
from NutchWAX to SOLR.

  http://archive.org/~aaron/iipc/

So, that leaves us in an intermediate state where NutchWAX's search
service performance is not sufficient, but SOLR is not quite ready for
full-scale migration.

The Internet Archive needs to decide if we commit to supporting TNH
(with an official release) as an intermediate step in the migration
path to SOLR. And if people are finding TNH useful and an adequate
replacement for the NutchWAX search service, then we would have a
stronger case to commit the resources to support an official TNH
release.
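
For readers who want to see what the "accented letter collapsing"
mentioned earlier could look like, here is a minimal sketch in plain
Java. It is an illustration only: the message does not show how The
JBs implements this, and the class and method names (AccentFold, fold,
indexTerms) are hypothetical. It assumes that collapsing means Unicode
decomposition followed by removal of combining marks, and that both
the original and the collapsed form are indexed.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not code from The JBs, NutchWAX or TNH.
class AccentFold {

    // Decompose accented letters into base letter + combining mark (NFD),
    // then delete the combining marks (Unicode category M).
    public static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Terms to put into the index for one token: the token as written,
    // plus its accent-free form. A set avoids storing the same term
    // twice when the token has no accents.
    public static Set<String> indexTerms(String token) {
        Set<String> terms = new LinkedHashSet<>();
        terms.add(token);
        terms.add(fold(token));
        return terms;
    }

    public static void main(String[] args) {
        // "M\u00e9ndez" is "Mendez" with an acute accent on the first e.
        System.out.println(indexTerms("M\u00e9ndez"));
        System.out.println(indexTerms("Suades"));
    }
}
```

With this approach the accented token yields two index terms (with and
without the accent), while an unaccented token yields only one, so a
query for either spelling can match. A real analyzer would likely also
handle case folding and tokenization, which are left out here.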
--
Aaron Binns
Senior Software Engineer, Web Group, Internet Archive
Program Officer, IIPC
aa...@ar...