From: Brad T. <br...@ar...> - 2006-09-26 20:20:28
Hi Alex,

Good questions, all of them.

First off, your collection is larger than any collection we've implemented using the current WM, but we are in the process, right now, of creating an installation of about 5TB, or about 50K ARCs, so you're not completely out in front of the crowd.

The BDBJE has performance issues at larger scales when records are inserted in random order, both during insert and in subsequent lookup. We haven't yet done serious performance analysis on this. Our solution has been to externally sort the index data before inserting it. This makes insert performance linear, and lookup performance has been good on BDBJEs created this way (see the answer to #2 below for a few more hints on implementing this). I'll add some notes on how we've been implementing this to the online User Manual in the near future.

0.8.0, which will hopefully be available soon, will include modules for distributing an index across multiple nodes, in alphabetic regions. This code is mostly done now, but is not checked in. 0.8.0 will also include several new index-related features, including:

  - the capability to use sorted flat files as a Wayback index (which will allow external sort tools to be used to generate the index; long term (1.0.0), we're planning on using Hadoop for this)
  - the capability to merge results found from multiple index sources, which could involve multiple sorted flat files and a BDBJE, for example.

We expect that the combination of these features will allow indexes of arbitrarily large sizes to be created and searched efficiently. Today, 48K ARCs is pushing the edge.

I can probably check in most of the functionality I've described above in the next few days, if you're interested in helping to test this new software.

Specific answers to your questions below.

Alex Wu wrote:
> Hi,
>
> We have a project with about 48000 ARC files, and would like inputs on
> the best way to implement the wayback machine 0.6.0
>
> Our setup is Tomcat 5.5.17, JDK 1.5, 1GB memory for JVM.
> We have only 6000 ARCs indexed at this point over a 1 week period. We would like to
> increase this rate significantly.
>
> Some questions we have are:
>
> 1. Suggested environment setup for this number of ARC files and greater.

Your current setup should be fine for this, but when the distributed index option is available, it would be advisable to move to this configuration.

> 2. Parallel indexing option for the current version or additional
> tools that will allow for this.

The pipeline-client command line tool has a new option to generate a flat-file version of the index data on STDOUT. This process could be executed in parallel across multiple nodes, and their outputs sorted, and merged together to form a single flat-file. This flat-file can be used today with the BDBJE option, by manually placing the file into the "toBeMerged" directory on the host holding the index. We've seen acceptable performance inserting large sorted files in this manner.

With the new flat-file binary searching ResourceIndex code, this sorted flat-file could be used as-is, bypassing the BDBJE altogether. I'll let you know when it's checked in.

> 3. The index is tied to the machine name. How to avoid this.

Not sure what you mean. Do you mean there is data internal to the BDBJE that is aware of the host where it was created and cannot be used on other hosts? Can you elaborate?

> 4. Is it possible to have multiple wayback installations, each with
> its own JVM, use the same arc files and/or index.

Yes. We have a couple of installations that include front end UIs for Proxy, Timeline, and Archival URL replay modes on top of the same index, where each installation uses a RemoteCDXIndex. I'll add some documentation to the User Manual outlining this configuration in the next day or two.

> 5.
> The user manual at
> http://archive-access.sourceforge.net/projects/wayback/user_manual.html
> mentions a non-LocalBDBResourceIndex resource implementation that
> communicates with a remote wayback installation. The user manual does
> not cover the preparation of the index data. What are the steps for
> this setup, including index data preparation.

As mentioned in #4, I'll outline this configuration in the User Manual, but the basics are: set up one webapp with a LocalBDBResourceIndex, making sure it has a QueryUI with the QueryXMLUI jsps set up. This will allow HTTP-XML queries of the index. Then set up one or more webapps, using whatever replay modes you prefer, each using the RemoteCDXIndex ResourceIndex implementation to connect to the HTTP-XML exported ResourceIndex.

> 6. Is there a limitation to the number of ARCs wayback will handle.

With the 0.8.0 features, we expect the WM to be able to scale to arbitrarily large numbers of ARC files. Generating indexes for larger installations will be handled offline, and will be a manual process until the 1.0.0 release.

Thanks for the feedback and questions. We're very interested in your experiences and making this software as easy to use as possible.

Brad

>
> Thank you for your input.
>
> Alex Wu
> 858-534-5074
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
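
For illustration only, here is a rough sketch of the idea in the answer to #2: per-node index outputs are each sorted, merged into one sorted flat file, and then searched by binary search. It is not Wayback code. The line layout (a URL key, a timestamp, and an ARC name separated by spaces), the sample records, and the function names `merge_sorted_indexes` and `lookup` are all assumptions made for the example; the real flat-file format is not described in this message.

```python
import heapq
from bisect import bisect_left


def merge_sorted_indexes(parts):
    """Merge already-sorted iterables of index lines into one sorted list."""
    # heapq.merge streams its inputs, so each part only has to be sorted
    # on its own beforehand; duplicate lines are kept.
    return list(heapq.merge(*parts))


def lookup(sorted_lines, url_key):
    """Return all lines whose first space-delimited field equals url_key."""
    prefix = url_key + " "
    # Binary search for the first line >= prefix, then scan forward
    # while the lines still start with that prefix.
    i = bisect_left(sorted_lines, prefix)
    hits = []
    while i < len(sorted_lines) and sorted_lines[i].startswith(prefix):
        hits.append(sorted_lines[i])
        i += 1
    return hits


# Hypothetical per-node outputs (made-up records), each sorted independently.
node_a = sorted([
    "example.org/ 20060102030405 a.arc.gz",
    "archive.org/ 20050101000000 a.arc.gz",
])
node_b = sorted([
    "example.org/ 20060901120000 b.arc.gz",
    "example.org/about 20060901120500 b.arc.gz",
])

merged = merge_sorted_indexes([node_a, node_b])
print(lookup(merged, "example.org/"))
```

The sketch holds everything in memory to stay short; for files of the size discussed here, the same merge would be streamed from disk. Python compares strings by code point, so an external sort feeding a lookup like this would need to use the same ordering (for GNU sort, running with LC_ALL=C).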