Thread: [Archive-access-discuss] Env Setup Help for Large Set of ARCs

[Archive-access-discuss] Env Setup Help for Large Set of ARCs

From: Alex Wu <aw...@sd...> - 2006-09-26 19:32:49

Hi,

We have a project with about 48000 ARC files, and would like input
on the best way to implement the wayback machine 0.6.0.

Our setup is Tomcat 5.5.17, JDK 1.5, 1GB memory for JVM. We have only  
6000 ARCs indexed at this point over a 1 week period. We would like  
to increase this rate significantly.


Some questions we have are:

1. Suggested environment setup for this number of ARC files and greater.

2. Parallel indexing option for the current version or additional  
tools that will allow for this.

3. The index is tied to the machine name. How to avoid this.

4. Is it possible to have multiple wayback installations, each with  
its own JVM, use the same arc files and/or index.

5. The user manual at
http://archive-access.sourceforge.net/projects/wayback/user_manual.html
mentions a non-LocalBDBResourceIndex resource implementation that
communicates with a remote wayback installation. The user manual does
not cover the preparation of the index data. What are the steps for
this setup, including index data preparation.

6. Is there a limitation to the number of ARCs wayback will handle.


Thank you for your input.

Alex Wu
858-534-5074

Re: [Archive-access-discuss] Env Setup Help for Large Set of ARCs

From: Brad T. <br...@ar...> - 2006-09-26 20:20:28

Hi Alex,

Good questions, all of them. First off, your collection is larger than 
any collection we've implemented using the current WM, but we are in the 
process, right now, of creating an installation of about 5TB, or about 
50K ARCs, so you're not completely out in front of the crowd.

The BDBJE has performance issues at larger scales when records are 
inserted in random order, both during insert and in subsequent lookups. 
We haven't yet done serious performance analysis on this. Our solution 
has been to sort the index data externally before inserting it. This 
makes insert performance linear, and lookup performance has been good on 
BDBJEs created this way (see the answer to #2 below for a few more hints 
on implementing this, or the online User Manual in the near future).

I'll add some notes on how we've been implementing this to the User Manual.

0.8.0, which will hopefully be available soon, will include modules for 
distributing an index across multiple nodes, in alphabetic regions. This 
code is mostly done now, but is not checked in. 0.8.0 will also include 
several new index-related features, including the capability to use 
sorted flat files as a Wayback index (which will allow external sort 
tools to be used to generate the index; long term, for 1.0.0, we're 
planning on using Hadoop for this) and the capability to merge results 
found from multiple index sources, which could involve multiple sorted 
flat files and a BDBJE, for example. We expect that the combination of 
these features will allow indexes of arbitrarily large sizes to be 
created and searched efficiently.
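
To illustrate why a sorted flat file can work as an index on its own: when the lines are in byte order, a lookup can binary-search over file offsets instead of going through a database. The sketch below is only an illustration under stated assumptions, not the Wayback implementation. The class name, the idea that each line begins with its lookup key, and the restriction to ASCII text with newline-terminated lines are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: prefix lookup in a flat file whose lines are sorted in byte order. */
final class SortedFlatFileLookup {

    /** Returns the first line that starts at offset pos or later, or null at end of file. */
    private static String firstLineAtOrAfter(RandomAccessFile file, long pos) throws IOException {
        if (pos > 0) {
            file.seek(pos - 1);
            file.readLine(); // discard the rest of the line that contains byte pos-1
        } else {
            file.seek(0);
        }
        return file.readLine();
    }

    /** Returns every line that starts with prefix. Assumes ASCII content (an assumption of this sketch). */
    static List<String> findByPrefix(String path, String prefix) throws IOException {
        List<String> matches = new ArrayList<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long lo = 0;
            long hi = file.length();
            // Find the smallest offset whose following line sorts at or after the prefix.
            while (lo < hi) {
                long mid = (lo + hi) >>> 1;
                String line = firstLineAtOrAfter(file, mid);
                if (line == null || line.compareTo(prefix) >= 0) {
                    hi = mid;
                } else {
                    lo = mid + 1;
                }
            }
            // Matching lines are contiguous in a sorted file, so scan forward until they stop.
            String line = firstLineAtOrAfter(file, lo);
            while (line != null && line.startsWith(prefix)) {
                matches.add(line);
                line = file.readLine();
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // Usage (hypothetical): java SortedFlatFileLookup.java <sorted-file> <key-prefix>
        for (String line : findByPrefix(args[0], args[1])) {
            System.out.println(line);
        }
    }
}
```

Each probe reads at most two lines, so the number of reads grows with the logarithm of the file size. RandomAccessFile.readLine is unbuffered, which keeps the sketch short but would be worth replacing in anything performance-sensitive.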

Today, 48K ARCs is pushing the edge. I can probably check in most of the 
functionality I've described above in the next few days, if you're 
interested in helping to test this new software.

Specific answers to your questions below.

Alex Wu wrote:
> Hi,
>
> We have a project with about 48000 ARC files, and would like inputs on 
> the best way to implement the wayback machine 0.6.0
>
> Our setup is Tomcat 5.5.17, JDK 1.5, 1GB memory for JVM. We have only 
> 6000 ARCs indexed at this point over a 1 week period. We would like to 
> increase this rate significantly.
>
>
> Some questions we have are:
>
> 1. Suggested environment setup for this number of ARC files and greater.
>
Your current setup should be fine for this, but when the distributed 
index option is available, it would be advisable to move to this 
configuration.

> 2. Parallel indexing option for the current version or additional 
> tools that will allow for this.
>

The pipeline-client command line tool has a new option to generate a 
flat-file version of the index data on STDOUT. This process could be 
executed in parallel across multiple nodes, and their outputs sorted, 
and merged together to form a single flat-file. This flat-file can be 
used today with the BDBJE option, by manually placing the file into the 
"toBeMerged" directory on the host holding the index. We've seen 
acceptable performance inserting large sorted files in this manner.
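
The merge step described above can be sketched as a k-way merge: each node's output is sorted on its own, and a priority queue repeatedly emits the smallest line among the current heads of all inputs. This is a minimal sketch under assumptions, not the pipeline-client code. The class and method names are hypothetical, and it assumes the per-node files were sorted in the same order that String.compareTo uses (true for plain ASCII sorted bytewise).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

/** Hypothetical sketch: merge several individually sorted line streams into one sorted stream. */
final class SortedIndexMerge {

    /** One input stream together with the line currently at its head. */
    private static final class Head implements Comparable<Head> {
        final String line;
        final Iterator<String> rest;

        Head(String line, Iterator<String> rest) {
            this.line = line;
            this.rest = rest;
        }

        @Override
        public int compareTo(Head other) {
            return line.compareTo(other.line);
        }
    }

    /** Passes every line from all inputs to sink in sorted order; each input must already be sorted. */
    static void merge(List<Iterator<String>> sortedInputs, Consumer<String> sink) {
        PriorityQueue<Head> heads = new PriorityQueue<>();
        for (Iterator<String> input : sortedInputs) {
            if (input.hasNext()) {
                heads.add(new Head(input.next(), input));
            }
        }
        while (!heads.isEmpty()) {
            Head smallest = heads.poll();
            sink.accept(smallest.line);
            if (smallest.rest.hasNext()) {
                heads.add(new Head(smallest.rest.next(), smallest.rest));
            }
        }
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for two nodes' sorted outputs.
        List<Iterator<String>> inputs = new ArrayList<>();
        inputs.add(Arrays.asList("a.example/ 1", "c.example/ 1").iterator());
        inputs.add(Arrays.asList("b.example/ 1", "d.example/ 1").iterator());
        merge(inputs, System.out::println);
    }
}
```

For real files, each input could be supplied as the iterator of a BufferedReader's lines() stream, so only one line per input is held in memory at a time.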

With the new flat-file binary searching ResourceIndex code, this sorted 
flat-file could be used as-is, bypassing the BDBJE altogether. I'll let 
you know when it's checked in.

> 3. The index is tied to the machine name. How to avoid this.
>

Not sure what you mean. Do you mean there is data internal to the BDBJE 
that is aware of the host where it was created and cannot be used on 
other hosts? Can you elaborate?

> 4. Is it possible to have multiple wayback installations, each with 
> its own JVM, use the same arc files and/or index.
>
Yes. We have a couple of installations that include front end UIs for 
Proxy, Timeline, and Archival URL replay modes on top of the same index, 
where each installation uses a RemoteCDXIndex. I'll add some 
documentation to the User Manual outlining this configuration in the 
next day or two.

> 5. The user manual at 
> http://archive-access.sourceforge.net/projects/wayback/user_manual.html 
> mentions a non-LocalBDBResourceIndex resource implementation that 
> communicates with a remote wayback installation. The user manual does 
> not cover the preparation of the index data. What are the steps for 
> this setup, including index data preparation.
>
As mentioned in #4, I'll outline this configuration in the User Manual, 
but here are the basics: set up one webapp with a LocalBDBResourceIndex, 
making sure it has a QueryUI with the QueryXMLUI jsps set up. This will 
allow HTTP-XML queries of the index. Then set up one or more webapps 
with whatever replay modes you prefer, each using the RemoteCDXIndex 
ResourceIndex implementation to connect to the HTTP-XML exported 
ResourceIndex.

> 6. Is there a limitation to the number of ARCs wayback will handle.
>
With the 0.8.0 features, we expect the WM to be able to scale to 
arbitrarily large numbers of ARC files. Generating indexes for larger 
installations will be handled offline, and will be a manual process 
until the 1.0.0 release.

Thanks for the feedback and questions. We're very interested in your 
experiences and making this software as easy to use as possible.

Brad
>
> Thank you for your input.
>
> Alex Wu
> 858-534-5074
>

Re: [Archive-access-discuss] Env Setup Help for Large Set of ARCs

From: Alex Wu <aw...@sd...> - 2006-10-24 19:06:20

Hi Brad,

Another update. Thank you for the input.

We modified org.archive.wayback.cdx.RemoteCDXIndex.java to search  
multiple remote indexes. One wayback instance (the frontend) has been  
configured to use RemoteCDX in the web.xml and the other wayback  
instances are using LocalBDB configuration.


[code block]
// Query endpoints of the LocalBDB-backed wayback instances.
private String[] sdscSearchUrlBases = {
    "http://machine:9000/wayback/xmlquery",
    "http://machine:9002/wayback/xmlquery",
    ...
};

public SearchResults query(WaybackRequest wbRequest) {
    ...
    SearchResults searchResults = new SearchResults();
    // Send the same request to each remote index in turn.
    for (int i = 0; i < sdscSearchUrlBases.length; i++) {
        try {
            doc = queryOneUrl(sdscSearchUrlBases[i], wbRequest);
            ...
        }
        ...
    }
    ...
}
[end code block]
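
For the failure modes Brad mentions in the quoted message below (timeouts, failed connections), one possible approach is to treat each remote index independently, keep whatever results arrive, and record which sources failed. The following is a self-contained sketch of that idea only. IndexSource, Outcome, and queryAll are hypothetical names for illustration and are not Wayback classes, and results are modeled as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: query several index sources and combine whatever succeeds. */
final class CompositeIndexQuery {

    /** Stand-in for one remote index; not a Wayback interface. */
    interface IndexSource {
        List<String> query(String request) throws Exception;
    }

    /** Combined results plus the positions of the sources that failed. */
    static final class Outcome {
        final List<String> results = new ArrayList<>();
        final List<Integer> failedSources = new ArrayList<>();
    }

    static Outcome queryAll(List<IndexSource> sources, String request) {
        Outcome outcome = new Outcome();
        for (int i = 0; i < sources.size(); i++) {
            try {
                outcome.results.addAll(sources.get(i).query(request));
            } catch (Exception e) {
                // One unreachable or slow node should not sink the whole query.
                outcome.failedSources.add(i);
            }
        }
        Collections.sort(outcome.results); // present the combined results in one order
        return outcome;
    }

    public static void main(String[] args) {
        IndexSource healthy = request -> Arrays.asList(request + " 2003", request + " 2001");
        IndexSource broken = request -> {
            throw new java.io.IOException("connection refused");
        };
        Outcome outcome = queryAll(Arrays.asList(healthy, broken), "example.org/");
        System.out.println(outcome.results + " failed sources: " + outcome.failedSources);
    }
}
```

Whether a partial answer is acceptable, or whether a failed source should fail the whole request, is a policy decision this sketch leaves to the caller through failedSources.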







On Oct 18, 2006, at 2:58 PM, Brad Tofel wrote:

> Sorry for the delayed response -- lots of balls in the air right now..
>
> I just did a large check-in of the software that supports sorted  
> flat files, but will not have time to update the docs for another  
> week or so. There are some comments in the code and in the new  
> web.xml (which has changed pretty significantly) that might be  
> enough to make sense of how to use the new functionality.
>
>
> Currently, there is no support for querying multiple remote  
> indexes, but it seems like this should be relatively  
> straightforward, in the good-guy case, using the new software:  
> you'd just need to make a RemoteSearchResultSource out of the  
> RemoteResourceIndex, and modify the SearchResultSourceFactory to  
> build a composite from several of them...
>
> I say "in the good-guy case" because the failure modes might get  
> complicated in terms of timeouts, failed connections, etc. However,  
> if your hardware is stable, then the "easy solution" I outlined  
> might be good enough.
>
> I'll drop you another line when the documentation has been updated.
>
> Brad
>
> Bing Zhu wrote:
>> Dear Mr. Brad,
>>
>> This is Bing Zhu from University of California: San Diego.
>>
>> We really appreciate you taking the time to answer our questions.
>>
>> Is it possible for a Wayback machine to query multiple index sources
>> (e.g. index info in multiple Wayback machines) when using
>> RemoteCDXIndex? If yes, would you let us know how to do so? Many thanks.
>>
>> Sincerely,
>> Bing
>>
>>
>>
>>

Alex Wu
858-534-5074

[Archive-access-discuss] Update: Env Setup Help for Large Set of ARCs

From: Alex Wu <aw...@sd...> - 2006-10-10 19:11:16

Hi Brad,

Thank you for your input. Wanted to give an update on our experience  
with the wayback application within Tomcat.

We tried one setup where, on one machine, we ran 12 instances of the
wayback application, each in its own Tomcat container, and gave
about 2,700 ARC files to each instance. Each Tomcat was allocated
1GB memory. This was done over the weekend, and over 30,000 ARCs were
processed.

Another setup was tried on the same machine, where 3 Tomcat instances
were run, each with 6 wayback applications. Each wayback application
handles 2,700 ARC files. Each Tomcat was allocated 1 to 3 GB memory.
Within the instance of Tomcat with 3GB allocated, the result in about
48 hours was just over 3,000 ARCs processed. The other two Tomcat
instances are mostly idle, having indexed/merged their respective sets
of ARCs almost completely over the weekend.

We are experimenting with different setups that involve many  
variables, such as the varying size of ARC files, non-wayback load on  
the machine, etc., so it's difficult to give a more accurate  
performance comparison without controlling the variables more.

We slightly modified org/archive/wayback/cdx/indexer/IndexPipeline.class
so that the indexing, queuing, and merging are running in separate
threads, and sleeping at different intervals. With this change,
3 indexing threads are running.
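
As a rough illustration of running the stages in separate threads, here is a sketch that is not the modified IndexPipeline. It swaps the sleep intervals for blocking queues, so each stage wakes as soon as work arrives, and it uses one thread per stage. The class name, the placeholder "indexed:" records, and the use of an empty Optional as an end-of-input marker are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: queuing, indexing and merging as three threads joined by queues. */
final class StagedIndexer {

    /** Runs all three stages and returns the merged records in arrival order. */
    static List<String> run(List<String> arcNames) {
        BlockingQueue<Optional<String>> toIndex = new LinkedBlockingQueue<>();
        BlockingQueue<Optional<String>> toMerge = new LinkedBlockingQueue<>();
        List<String> merged = new ArrayList<>();

        // Stage 1: queue ARC names; an empty Optional marks the end of input.
        Thread queuer = new Thread(() -> {
            for (String arc : arcNames) {
                toIndex.add(Optional.of(arc));
            }
            toIndex.add(Optional.empty());
        });
        // Stage 2: turn each ARC name into a placeholder index record.
        Thread indexer = new Thread(() -> {
            try {
                for (Optional<String> arc = toIndex.take(); arc.isPresent(); arc = toIndex.take()) {
                    toMerge.add(Optional.of("indexed:" + arc.get()));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                toMerge.add(Optional.empty()); // always let the merger finish
            }
        });
        // Stage 3: collect records; a real merger would write them into the index.
        Thread merger = new Thread(() -> {
            try {
                for (Optional<String> rec = toMerge.take(); rec.isPresent(); rec = toMerge.take()) {
                    merged.add(rec.get());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        queuer.start();
        indexer.start();
        merger.start();
        try {
            queuer.join();
            indexer.join();
            merger.join(); // after join, the merger's writes to 'merged' are visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the pipeline", e);
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a.arc.gz", "b.arc.gz")));
    }
}
```

The queues here are unbounded for brevity. A bounded queue would add back-pressure so a fast indexing stage cannot run far ahead of a slow merge.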

Lastly, I was not able to view the CVS at
http://crawltools.archive.org:8080/cruisecontrol/buildresults/HEAD-archive-access
("Firefox can't establish a connection to the server at
crawltools.archive.org:8080.")

Thank you again,
Alex

