|
From: Ignacio G. <igc...@gm...> - 2007-10-31 14:03:27
|
Hello Brad, everyone,

I have been playing around with Wayback 1.0 for a couple of weeks, since it got released, and here is a list of my comments, questions and issues. I will start by saying that I really like the changes that have been made, especially in the configuration aspect of the tool. It is now much easier to configure, to understand what each section does, and to set up the environment. I have been able to set up several AccessPoints (3) that access different collections (3), and they all seem to work as expected. They are set up on port 8088, so changing the port is not an issue and can be done easily using the AccessPoint configuration. All three collections use CDX indexes, so this also works perfectly.

However, I was only able to make Wayback work using version 1.0.0 under the ROOT context. I downloaded and tried version 1.0.1, but it did not start due to errors in the configuration (even using the default setup). I do not think that using the ROOT context is a big issue, since the AccessPoints provide path control and differentiation, but it would be good if we could deploy Wayback under different contexts. Also, I have found that if you try to access an AccessPoint location without the trailing slash '/' it will not work; a Not-Found (404) error is displayed instead. This means that typing http://xyz.com/myCollection/ displays the Wayback interface successfully, but using http://xyz.com/myCollection will not. I do not know whether this is something that should be corrected in the server configuration rather than a Wayback issue, but I thought I should let you know.

My next comments are regarding the exclusion and restriction mechanisms. Keep in mind that I am using version 1.0.0, so I do not know if a working 1.0.1 has these issues resolved. I was able to successfully implement an IP-based restriction on one of my collections, and it did block content for all IPs outside of the specified range. However, I had some problems when trying to specify more than one <value> element in the IP <list>. I wanted to use two IP ranges, and there were some issues. I will have to test this more extensively, because it might be a problem of Wayback not updating properly after a simple restart.

I also tried to implement a static exclusion using a plain text file, and I have to say that I was not able to make this work at all. I added this code section to my wayback.xml file, by itself, outside any AccessPoint or Collection:

<bean name="2004-exclusion-list" class="org.archive.wayback.accesscontrol.staticmap.StaticMapExclusionFilterFactory">
  <property name="file" value="/vol/webcapture/wayback_indexes/el2004/exclude.txt" />
  <property name="checkInterval" value="10" />
</bean>

Then, inside the desired AccessPoint, I added the following:

<property name="exclusionFactory" ref="2004-exclusion-list" />

The Catalina log does not show any information regarding Wayback accessing the file, so I believe the configuration file parsed correctly but the exclusion was ignored, and that is why it is not being applied.

My last question has to do with the integration of these two exclusion/restriction mechanisms. In some of my AccessPoints, I would like to be able to block some URLs, but only for those users that are outside of the IP range provided. Will I have to create two AccessPoints, one with the IP restriction that will allow users to view the complete collection, and a different one that will block the contents for everyone, or can I put the two together in a single AccessPoint? Since I could not implement the static exclusion, I was not able to test whether these properties could be nested one inside the other, but I think that this would be a very important option. Otherwise, we would have to implement server-side redirection based on IP addresses to point users to the correct AccessPoint, and that would eliminate most of the benefit of integrating IP recognition inside Wayback.

This is what I have experienced up to this point. I will keep testing other aspects that we might use and report back with my findings. Thank you. |
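A note on the static exclusion described above: the contents of the exclude.txt file are not shown in the thread, so the sketch below is only a guess at the expected format (one excluded URL or URL prefix per line) and should be checked against the Wayback documentation; if the file is in a format the StaticMapExclusionFilterFactory does not expect, that could be consistent with the configuration parsing cleanly while the exclusion is silently ignored. The file path is the one from the message; the URLs inside are hypothetical placeholders.

```text
# /vol/webcapture/wayback_indexes/el2004/exclude.txt
# Assumed format (unverified): one URL or URL prefix to exclude per line.
http://www.example.com/private/
http://www.example.org/reports/draft-2004.html
example.net/embargoed/
```

Whether an IP restriction and an exclusionFactory can be combined inside a single AccessPoint is exactly the open question raised in the message, so no wiring for that combination is sketched here.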
|
From: Martin B. <xb...@fi...> - 2007-10-27 20:53:16
|
Hi,

I have managed to index documents using NutchWax in distributed mode several times before, but now there is a problem I cannot cope with. This time not all computers are under the same domain: the machine that hosts the namenode and jobtracker is under the domain webarchiv.cz, while all datanodes are under fi.muni.cz (by the way, all computers are in the same building). When the 5th job starts (dedup 1: urls by time), 'info' messages are combined with 'warn' ones in the logs, like these:

jobtracker:

INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_0005_m_000003_3: java.lang.ArrayIndexOutOfBoundsException: -1
  at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:109)
  at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:177)
  at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:203)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:215)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1388)

datanodes:

WARN org.apache.hadoop.dfs.DataNode: DataXCeiver java.io.IOException: Block blk_-7402203219236206647 has already been started (though not completed), and thus cannot be created.
  at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:437)
  at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:721)
  at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:550)
  at java.lang.Thread.run(Thread.java:619)

WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-7402203219236206647 to nymfe01/147.251.53.11:50010 java.net.SocketException: Broken pipe
  at java.net.SocketOutputStream.socketWrite0(Native Method)
  at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
  at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
  at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
  at java.io.DataOutputStream.write(DataOutputStream.java:90)
  at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:974)
  at java.lang.Thread.run(Thread.java:619)

WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-5576786832054029538 to nymfe05/147.251.53.15:50010 java.net.SocketException: Connection reset
  at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
  at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
  at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
  at java.io.DataOutputStream.write(DataOutputStream.java:90)
  at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:974)
  at java.lang.Thread.run(Thread.java:619)

Then it crashes and my terminal says:

07/10/27 22:20:40 INFO indexer.DeleteDuplicates: Dedup: adding indexes in: output/indexes
07/10/27 22:20:43 INFO mapred.JobClient: Running job: job_0005
07/10/27 22:20:44 INFO mapred.JobClient: map 0% reduce 0%
07/10/27 22:21:05 INFO mapred.JobClient: map 100% reduce 100%
Exception in thread "main" java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
  at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:433)
  at org.archive.access.nutch.Nutchwax.doDedup(Nutchwax.java:257)
  at org.archive.access.nutch.Nutchwax.doAll(Nutchwax.java:156)
  at org.archive.access.nutch.Nutchwax.doJob(Nutchwax.java:389)
  at org.archive.access.nutch.Nutchwax.main(Nutchwax.java:674)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:585)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:149)

Can anybody help? Thanks, Martin Bella |
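Since the failures above involve block transfers between datanodes that sit in a different DNS domain from the namenode/jobtracker, one cheap first step is to rule out basic reachability and name-resolution problems before digging into the Lucene ArrayIndexOutOfBoundsException itself. A minimal diagnostic sketch, assuming standard Unix tools and the DataNode data-transfer port 50010 that appears in the log; the hostnames are taken from the log and may need adjusting:

```sh
# Run from the namenode/jobtracker host (webarchiv.cz side) and again from each datanode:
# check that every DataNode's data-transfer port is reachable.
for host in nymfe01.fi.muni.cz nymfe05.fi.muni.cz; do
  nc -zv "$host" 50010 || echo "cannot reach $host:50010"
done

# Also check that the short names Hadoop reports (nymfe01, nymfe05, ...) resolve
# to the same addresses on every machine:
getent hosts nymfe01 nymfe05
```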
|
From: Ignacio G. <igc...@gm...> - 2007-10-26 19:36:28
|
Hello all, I was able to run the indexing job to completion thanks to the config help and pointers you provided. As Andrea pointed out, the problem was based on the PermGem, and after increasing the limits using the HADOOP_OPTS I had no problems to execute and run the process. However, now that everything is done, I am alarmed by the result. I was indexing a collection of aprox. 107 Gb, with over 100 million documents, and the size of the index, including segments and everything is larger than the collection itself (122Gb). Also, while indexing, the size of the temporary data used by the process reached the 200Gb limit, which is almost twice the size of the entire collection. Do anyone of you have any data that shows the relation between the collection size and the final size of the index? Is there any way to reduce the final size? I used the ALL option when invoking the nutchwax process, so every step of the process was executed. I did not leave out any of the steps (reduce, map, dedup...), so I am confused. My next test would involve an 800Gb collection, but I don't even want to try it if the resulting index is not going to be at least 5 to 10 times smaller than the source. Thanks. On 10/17/07, Michael Stack <st...@du...> wrote: > > Ignacio Garcia wrote: > > Hello Michael, > > > > Where can I find the tasktracker log?? Is it under hadoop? nutchwax? > > or in a temp location? > Its at $HADOOP_LOGS_DIR. Default location is $HADOOP_HOME/logs. > > > Also, I tried using JConsole to track the memory management on the > > process, but unfortunately the hadoop process does not have the > > "management agent" activated, so it cannot be tracked by JConsole. > > Is there any way to activate it using java options? > > > There is. Add these system properties to HADOOP_OPTS and to the child > jvm args: > > com.sun.management.jmxremote.authenticate=false > com.sun.management.jmxremote.ssl=false > com.sun.management.jmxremote.port=/portNum/ > > See http://java.sun.com/j2se/1.5.0/docs/guide/management/agent.html for > general overview. You are probably running multiple JVMs on the one > machine -- for instance, a tasktracker + its children at a minimum -- so > be careful setting ports so they do not clash (and ensure only one child > per tasktracker otherwise when second starts, it will complain port is > already in use). > > You could also enable verbose garbage collection if you want to watch > JVM flailing. Run 'java -X' and look for the loggc command. > > But before you try any of the above, check the tasktracker and child > logs. That it happens close to startup would seem to point at some > basic config. issue (hopefully). Otherwise, I'd suspect a massive or > corrupted record in segments. > > Good luck Ignacio, > St.Ack > > > I will use the environment variables Michael pointed me to, to try to > > increase the Perm Gem size that way. > > > > Thank you. > > > > On 10/15/07, *Michael Stack* <st...@du... > > <mailto:st...@du...>> wrote: > > > > Ignacio Garcia wrote: > > > Hello Andrea, > > > > > > I tried increasing the PermGem size, but it still failed with > > the same > > > error... > > > > > > I modified the following settings on "hadoop-default.xml ": > > > > > > <name>mapred.child.java.opts</name> > > > <value>-Xmx2048m -Xms1024m -XX:PermSize=256m > > > -XX:MaxPermSize=512m</value> > > > > > > That is the only place I could find where I could include Java > > Opts... > > > Should I increase it even more or is this property ignored when > > doing > > > the indexing? 
> > > > The OOME looks to be in the startup of the update task. The > error.txt > > log you pasted was from the command-line. Have you tried looking > > in the > > remote tasktracker log? It might have more info on where the OOME > is > > happening. > > > > The above setting is for each child process run by each of the > > tasktrackers of your cluster. The child process does the > > heavy-lifting > > so I'm guessing its where you are seeing the OOME'ing. > > > > Regards how to set the memory for tasktrackers, etc., the notes here > > still apply I believe > > http://archive-access.sourceforge.net/projects/nutch/faq.html#env > > (Do a > > search for the referred-to environment variables). > > > > St.Ack > > > > > > > > > > Any help would be greatly appreciated. Thank you. > > > > > > On 10/5/07, *Ignacio Garcia* <igc...@gm... > > <mailto:igc...@gm...> > > > <mailto:igc...@gm... <mailto:igc...@gm...>> > > wrote: > > > > > > I will try increasing the PermGem space as shown in the > > reference > > > you provided. > > > However, in my case the process is not acting as a webapp, so > it > > > does not related completely to the information displayed in > the > > > article. > > > > > > Do you think that shutting down every java application and > just > > > running the nutchwax job would have any benefits in this case? > > > Since I cannot control the number of class loaders created > (I'm > > > just running the code, I did not modify it in any way), I do > > not > > > have any control over this problem. > > > > > > Thank you for the pointers. > > > > > > > > > On 10/5/07, *Andrea Goethals* < an...@hu... > > <mailto:an...@hu...> > > > <mailto:an...@hu... > > <mailto:an...@hu...>>> wrote: > > > > > > On Fri, 5 Oct 2007 13:11:28 -0400, Ignacio Garcia wrote > > > > That might work, but it is not the way that I would > > like to > > > use Nutchwax. > > > > > > > > If I am forced to divide up one of my small collections > > > (~100Gb), I don't > > > > want to even think how many partitions the big > > collections > > > are going > > > > to require. Which means, time wasted partitioning, > > starting > > > several > > > > jobs, merging the created indexes and more... > > > > > > > > I even tried increasing the heap size to 4Gb, the max > > size of > > > RAM in > > > > my system, and that did not work. > > > > > > > > I have attached the last lines of the output provided by > > > Nutchwax, > > > > to see if you can point me to a possible solution to > this > > > problem. > > > > > > Your output shows that the error is > > > java.lang.OutOfMemoryError : PermGen space > > > > > > Is that always the case? If so I don't think that > increasing > > > the heap size is > > > going to help. This page explains the PermGen space well: > > > > > > http://blogs.sun.com/fkieviet/entry/classloader_leaks_the_dreaded_java > > > > > > Andrea > > > > > > > > > > > Also... is there any way to know if it crashed on a > > particular > > > > record / arc file or action to try and avoid it?? and is > > > there a way > > > > to resume the job from the moment it crashed? > > > > > > > > Thank you. > > > > > > > > On 10/2/07, John H. Lee < jl...@ar... > > <mailto:jl...@ar...> > > > <mailto:jl...@ar... <mailto:jl...@ar...>>> > wrote: > > > > > > > > > > The idea is that for each of the N sets of ~500 ARCs, > > > you'll have one > > > > > index and one segment. That way, you can distribute > the > > > index-segment pairs > > > > > across multiple disks or hosts. 
> > > > > /search/indexes/indexA/ > > > > > /search/indexes/indexB/ > > > > > ... > > > > > /search/segments/segmentA/ > > > > > /search/segments/segmentB/ > > > > > ... > > > > > > > > > > and point searcher.dir at /search. The webapp will > then > > > search all indexes > > > > > under /search/indexes. Alternatively, you can merge > > all of > > > the indexes as > > > > > Stack pointed out. > > > > > > > > > > Hope this helps. > > > > > > > > > > -J > > > > > > > > > > > > > > > > > > > > On Oct 2, 2007, at 5:09 AM, Ignacio Garcia wrote: > > > > > > > > > > Hello, > > > > > > > > > > I tried separating the list of ARCs on smaller sets > > of ~500 > > > ARCs. > > > > > > > > > > The first batch run to completion without problems, > > > however, the second > > > > > batch failed because I was using the same output > > directory > > > as I used for the > > > > > first one. > > > > > > > > > > Why can't I use the same output directory??? Wouldn't > it > > > make sense to > > > > > have all the info the same place, so I can access > > > everything at a time? > > > > > > > > > > How do I divide the collection in smaller portions > > and then > > > combine > > > > > everything on a single index? If I just keep > everything > > > separated I would > > > > > loose a lot of time looking in different indexes and > > > configuring the web-app > > > > > to be able to look everywhere. > > > > > > > > > > On 9/28/07, Ignacio Garcia <igc...@gm... > > <mailto:igc...@gm...> > > > <mailto:igc...@gm... > > <mailto:igc...@gm...>>> wrote: > > > > > > > > > > > > Michael, I do not know if it failed on the same > > record... > > > > > > > > > > > > the first time it failed I assumed that increasing > > the > > > -Xmx parameters > > > > > > would solve it, since the OOME has happened before > > when > > > indexing with > > > > > > Wayback. > > > > > > > > > > > > I will try to narrow it as much as I can if it > > fails again. > > > > > > > > > > > > > > > > > > On 9/27/07, Michael Stack < st...@du... > > <mailto:st...@du...> > > > <mailto:st...@du... <mailto:st...@du...>>> > wrote: > > > > > > > > > > > > > > What John says and then > > > > > > > > > > > > > > + The OOME exception stack trace might tell us > > something. > > > > > > > + Is the OOME always in same place processing same > > > record? If so, > > > > > > > take > > > > > > > a look at it in the ARC. > > > > > > > > > > > > > > St.Ack > > > > > > > > > > > > > > John H. Lee wrote: > > > > > > > > Hi Ignacio. > > > > > > > > > > > > > > > > It would be helpful if you posted the following > > > information: > > > > > > > > - Are you using standalone or mapreduce? > > > > > > > > - If mapreduce, what are your mapred.map.tasksand > > > > > > > > mapred.reduce.tasks properties set to? > > > > > > > > - If mapreduce, how many slaves do you have > > and how > > > much memory do > > > > > > > > they have? > > > > > > > > - How many ARCs are you trying to index? > > > > > > > > - Did the map reach 100% completion before the > > > failure occurred? 
> > > > > > > > > > > > > > > > Some things you may want to try: > > > > > > > > - Set both -Xmx and -Xmx to the maximum > > available on > > > your systems > > > > > > > > - Increase one or both of mapred.map.tasks and > > > mapred.reduce.tasks, > > > > > > > > depending where the failure occurred > > > > > > > > - Break your job up into smaller chunks of > > say, 1000 > > > or 5000 ARCs > > > > > > > > > > > > > > > > -J > > > > > > > > > > > > > > > > On Sep 27, 2007, at 10:47 AM, Ignacio Garcia > > wrote: > > > > > > > > > > > > > > > > > > > > > > > >> Hello, > > > > > > > >> > > > > > > > >> I've been doing some testing with nutchwax and > I > > > have never had any > > > > > > > >> major problems. > > > > > > > >> However, right now I am trying to index a > > collection > > > that is over > > > > > > > >> 100 Gb big, and for some reason the indexing is > > > crashing while it > > > > > > > >> tries to populate 'crawldb' > > > > > > > >> > > > > > > > >> The job will run fine at the beginning > > importing the > > > information > > > > > > > >> from the ARCs and creating the "segments" > > section. > > > > > > > >> > > > > > > > >> The error I get is an outOfMemory error when > the > > > system is > > > > > > > >> processing each of the part.xx in the segments > > > previously created. > > > > > > > >> > > > > > > > >> I tried increasing the following setting on the > > > hadoop-default.xml > > > > > > > >> config file: mapred.child.java.opts to 1GB, > > but it > > > still failed in > > > > > > > >> the same part. > > > > > > > >> > > > > > > > >> Is there any way to reduce the amount of > > memory used > > > by nutchwax/ > > > > > > > >> hadoop to make the process more efficient and > be > > > able to index such > > > > > > > >> a collection? > > > > > > > >> > > > > > > > >> Thank you. > > > > > > > >> > > > > > > > > > > > > > ---------------------------------------------------------------------- > > > > > > > > > > >> --- > > > > > > > >> This SF.net email is sponsored by: Microsoft > > > > > > > >> Defy all challenges. Microsoft(R) Visual > > Studio 2005. > > > > > > > >> > > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > > > > > > > >> _______________________________________________ > > > > > > > >> Archive-access-discuss mailing list > > > > > > > >> Arc...@li... > > <mailto:Arc...@li...> > > > <mailto: Arc...@li... > > <mailto:Arc...@li...>> > > > > > > > >> > > > > > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > <https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------------- > > > > > > > > > > > This SF.net email is sponsored by: Microsoft > > > > > > > > Defy all challenges. Microsoft(R) Visual > > Studio 2005. > > > > > > > > > > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > > > > > > > > _______________________________________________ > > > > > > > > Archive-access-discuss mailing list > > > > > > > > Arc...@li... > > <mailto:Arc...@li...> > > > <mailto:Arc...@li... 
> > <mailto:Arc...@li...>> > > > > > > > > > > > > > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > > > > <https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > <https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Harvard University Library > > > Powered by Open WebMail > > > > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > > ------------------------------------------------------------------------- > > > > > This SF.net email is sponsored by: Splunk Inc. > > > Still grepping through log files to find problems? Stop. > > > Now Search log events and configuration files using AJAX and a > > browser. > > > Download your FREE copy of Splunk now >> http://get.splunk.com/ > > > > > > ------------------------------------------------------------------------ > > > > > > _______________________________________________ > > > Archive-access-discuss mailing list > > > Arc...@li... > > <mailto:Arc...@li...> > > > > > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > <https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > > > > > > > > > > |
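One thing worth checking before comparing raw sizes: the 122 GB figure above includes the segments as well as the Lucene indexes, and in a NutchWax output directory those live in separate subdirectories, so measuring which part dominates shows whether it is really the index that is large or the stored segment content. A minimal sketch, assuming the Hadoop 0.x-era shell and the output layout mentioned in this thread (output/indexes, output/segments); exact paths may differ:

```sh
# Per-directory sizes of the NutchWax output in DFS
bin/hadoop dfs -du output

# Drill into the largest contributor, e.g. the segments
bin/hadoop dfs -du output/segments
```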
|
From: Erik H. <eri...@uc...> - 2007-10-26 18:16:25
|
Hello all. I’ve been informed that my previous messages did get through. My copies seem to have been caught in some sort of black hole, since I only received the last & I don’t see any on the archive. In any case, apologies for the spam. best, Erik Hetzner ;; Erik Hetzner, California Digital Library ;; gnupg key id: 1024D/01DB07E3 |
|
From: Erik H. <eri...@uc...> - 2007-10-26 17:48:48
|
[Trying a 3rd time with 0 attachments. ] [Sent this yesterday, but I think the attachments blocked it.] Hi all. I have been experimenting a bit using the WebRunner program from Mozilla [1] to work as a ‘player’ for archived web content available through a wayback proxy server. This allows a user to view archived web content, & _only_ archived web content, through a special stand alone browser. I think that this might prove interesting, as it does not require configuration of a proxy but provides a user with all the benefits of proxy browsing. Additionally, the act of starting a separate browser may give the user a sense of being in a different place than the live web. The webapp bundle attached connects to the Eprint Network archive-it collection; it could of course be attached to any collection served through a proxy server. To use this, you’ll need to download WebRunner from [2]. I have made the webapp file available (for a limited time) at <http://gales.cdlib.org/~egh/wayback.webapp> Run the wayback.webapp file with Webrunner, either as described in [3], or, on Linux: > webrunner -webapp path/to/wayback.webapp Unfortunately, due to a bug in the current version of Webrunner, it will not properly the first time. Quit the program. Next, you need to run the webapp from the cache. On linux (from the webrunner dir): > webrunner -webapp ep...@ga... On the Mac (from the command line, in the directory where you save WebRunner.app): > ./WebRunner.app/Contents/MacOS/xulrunner -webapp ep...@ga... I am not sure how this can be done on Windows. Presumably in a similar way? Once you have run the program a second time, you will be presented with a standard wayback interface. A good starting point is: http://www.osti.gov/eprints/urls/eprints-index.html You’ll notice that you are running in proxy mode; all the urls look normal. You’ll also notice a sidebar, showing the IA FAQ. This is useless in this case, but could be used to view a timeline or metadata about a url without changing the HTML of the archived page. There are obviously many improvements which would be nice; I haven’t figured out how to make the location bar editable, for instance. Hoping somebody finds this useful. best, Erik Hetzner 1. <http://wiki.mozilla.org/WebRunner> 2. <http://wiki.mozilla.org/WebRunner#Latest_version> 3. <http://wiki.mozilla.org/WebRunner#Installer> |
|
From: Erik H. <eri...@uc...> - 2007-10-25 18:37:51
|
[Sent this yesterday, but I think the attachments blocked it.]

Hi all. I have been experimenting a bit with using the WebRunner program from Mozilla [1] as a ‘player’ for archived web content available through a wayback proxy server. This allows a user to view archived web content, & _only_ archived web content, through a special stand-alone browser. I think that this might prove interesting, as it does not require configuration of a proxy but provides a user with all the benefits of proxy browsing. Additionally, the act of starting a separate browser may give the user a sense of being in a different place than the live web. The webapp bundle attached connects to the Eprint Network archive-it collection; it could of course be attached to any collection served through a proxy server.

To use this, you’ll need to download WebRunner from [2]. I have attached the webapp.ini & webapp.js files necessary to build this webapp (instructions available at [1]) & have made the webapp itself available (for a limited time) at <http://gales.cdlib.org/~egh/wayback.webapp>

Run the wayback.webapp file with WebRunner, either as described in [3], or, on Linux:

> webrunner -webapp path/to/wayback.webapp

Unfortunately, due to a bug in the current version of WebRunner, it will not work properly the first time. Quit the program. Next, you need to run the webapp from the cache. On Linux (from the webrunner dir):

> webrunner -webapp ep...@ga...

On the Mac (from the command line, in the directory where you saved WebRunner.app):

> ./WebRunner.app/Contents/MacOS/xulrunner -webapp ep...@ga...

I am not sure how this can be done on Windows. Presumably in a similar way? Once you have run the program a second time, you will be presented with a standard wayback interface. A good starting point is: http://www.osti.gov/eprints/urls/eprints-index.html

You’ll notice that you are running in proxy mode; all the urls look normal. You’ll also notice a sidebar, showing the IA FAQ. This is useless in this case, but could be used to view a timeline or metadata about a url without changing the HTML of the archived page. There are obviously many improvements which would be nice; I haven’t figured out how to make the location bar editable, for instance. Hoping somebody finds this useful.

best, Erik Hetzner

1. <http://wiki.mozilla.org/WebRunner>
2. <http://wiki.mozilla.org/WebRunner#Latest_version>
3. <http://wiki.mozilla.org/WebRunner#Installer> |
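For anyone who would rather rebuild a bundle like this than download the pre-built wayback.webapp, a hypothetical sketch of a webapp.ini follows. The section and key names are assumptions based on the WebRunner documentation referenced as [1] above, and the values are placeholders rather than a copy of Erik's actual file, so treat it only as a starting point:

```ini
; webapp.ini (hypothetical sketch; key names assumed from the WebRunner docs, values are placeholders)
[Parameters]
id=wayback@example.org
name=Wayback Proxy Player
uri=http://wayback-proxy.example.org/
status=yes
location=no
sidebar=yes
navigation=no
```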
|
From: alexis a. <alx...@ya...> - 2007-10-25 03:38:58
|
Hi,

Using the distributed mode of NutchWax, we were able to cut our indexing time by as much as 80%. However, we noticed that the index/arcfiles size ratio was bigger than that of the old NutchWax (<20%). Is this normal? Furthermore, in the old NutchWax we were able to reduce the disk usage by deleting some of the temp folders. Are there folders that we can delete in the new NutchWax output folder? Below are our index stats:

Total Arcfiles: 12005
Total Arcfiles Size: 664 GB
Index Size: 172 GB (does not include merged index)
Index/Arcfiles: 26%

Best Regards, Alexis |
|
From: Erik H. <eri...@uc...> - 2007-10-25 01:01:21
|
Hi all. I have been experimenting a bit using the WebRunner program from Mozilla [1] to work as a ‘player’ for archived web content available through a wayback proxy server. This allows a user to view archived web content, & _only_ archived web content, through a special stand alone browser. I think that this might prove interesting, as it does not require configuration of a proxy but provides a user with all the benefits of proxy browsing. Additionally, the act of starting a separate browser may give the user a sense of being in a different place than the live web. The webapp bundle attached connects to the Eprint Network archive-it collection; it could of course be attached to any collection served through a proxy server. To use this, you’ll need to download WebRunner from [2]. Run the attached wayback.webapp file with Webrunner, either as described in [3], or, on Linux: > webrunner -webapp path/to/wayback.webapp Unfortunately, due to a bug in the current version of Webrunner, it will not properly the first time. Quit the program. Next, you need to run the webapp from the cache. On linux (from the webrunner dir): > webrunner -webapp ep...@ga... On the Mac (from the command line, in the directory where you save WebRunner.app): > ./WebRunner.app/Contents/MacOS/xulrunner -webapp ep...@ga... I am not sure how this can be done on Windows. Presumably in a similar way? Once you have run the program a second time, you will be presented with a standard wayback interface. A good starting point is: http://www.osti.gov/eprints/urls/eprints-index.html You’ll notice that you are running in proxy mode; all the urls look normal. You’ll also notice a sidebar, showing the IA FAQ. This is useless in this case, but could be used to view a timeline or metadata about a url without changing the HTML of the archived page. There are obviously many improvements which would be nice; I haven’t figured out how to make the location bar editable, for instance. Hoping somebody finds this useful. best, Erik Hetzner 1. <http://wiki.mozilla.org/WebRunner> 2. <http://wiki.mozilla.org/WebRunner#Latest_version> 3. <http://wiki.mozilla.org/WebRunner#Installer> |
|
From: Chris V. <cv...@gm...> - 2007-10-23 22:06:40
|
Hi, I'd like to add extra metadata to indexes produced by NutchWax. The goal is to perform searches against this metadata and full text at the same time. My initial idea is to update documents similarly to suggested practices for updating documents in Lucene indexes: retrieve documents based on search term(s), delete documents from index, add new fields to documents, and then add documents back to index. I am able to follow this strategy using the Lucene 2.0 classes IndexSearcher, IndexReader and IndexWriter (or IndexModifier). After the index documents have been updated, I can query against the new metadata using the IndexSearcher class without any problem. I can also use Luke to view the contents of the index and verify that the metadata has been added to the documents. The problem is that once the Index* classes are done updating the index documents, the NutchWax webapp is unable to locate those documents (even after a restart). My question is what is the best way to add fields to NutchWax index documents? Are there any Nutch or NutchWax classes I should use instead of the Lucene Index* classes (I didn't see any likely candidates in either project)? Is it possible I am leaving out some important steps when using the Lucene Index* classes? Any help is appreciated, Chris |
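For concreteness, here is a minimal sketch of the delete-and-re-add cycle described above, using the Lucene 2.0-era API the message mentions (IndexSearcher/IndexReader/IndexWriter). The field names (`url`, `collection_note`) and the index path are hypothetical placeholders rather than NutchWax's actual schema, and this only illustrates the Lucene-level mechanics; it does not answer whether the NutchWax webapp will accept an index modified this way.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class AddMetadataSketch {
  public static void main(String[] args) throws Exception {
    String indexDir = "/search/indexes/indexA";   // hypothetical index location
    Term key = new Term("url", "http://www.example.com/page.html");

    // 1. Fetch the stored fields of the matching document(s).
    IndexSearcher searcher = new IndexSearcher(indexDir);
    Hits hits = searcher.search(new TermQuery(key));
    Document[] updated = new Document[hits.length()];
    for (int i = 0; i < hits.length(); i++) {
      Document doc = hits.doc(i);
      // Add the extra metadata field to be searched alongside full text.
      doc.add(new Field("collection_note", "election-2004",
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
      updated[i] = doc;
    }
    searcher.close();

    // 2. Delete the old copies by the same key term.
    IndexReader reader = IndexReader.open(indexDir);
    reader.deleteDocuments(key);
    reader.close();

    // 3. Re-add the updated documents and optimize.
    IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
    for (int i = 0; i < updated.length; i++) {
      writer.addDocument(updated[i]);
    }
    writer.optimize();
    writer.close();
  }
}
```

One caveat worth noting: hits.doc(i) returns only the stored fields of a document, so any field that was indexed but not stored (for example, tokenized full text) cannot be recovered this way and would be missing from the re-added document, which by itself can make previously matching queries fail.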
|
From: Brad T. <br...@ar...> - 2007-10-18 23:41:16
|
This maintenance release fixes a bug which prevented AccessPoints from working properly when the webapp was deployed to a non-ROOT context. |
|
From: Michael S. <st...@du...> - 2007-10-17 16:32:48
|
Ignacio Garcia wrote: > Hello Michael, > > Where can I find the tasktracker log?? Is it under hadoop? nutchwax? > or in a temp location? Its at $HADOOP_LOGS_DIR. Default location is $HADOOP_HOME/logs. > Also, I tried using JConsole to track the memory management on the > process, but unfortunately the hadoop process does not have the > "management agent" activated, so it cannot be tracked by JConsole. > Is there any way to activate it using java options? > There is. Add these system properties to HADOOP_OPTS and to the child jvm args: com.sun.management.jmxremote.authenticate=false com.sun.management.jmxremote.ssl=false com.sun.management.jmxremote.port=/portNum/ See http://java.sun.com/j2se/1.5.0/docs/guide/management/agent.html for general overview. You are probably running multiple JVMs on the one machine -- for instance, a tasktracker + its children at a minimum -- so be careful setting ports so they do not clash (and ensure only one child per tasktracker otherwise when second starts, it will complain port is already in use). You could also enable verbose garbage collection if you want to watch JVM flailing. Run 'java -X' and look for the loggc command. But before you try any of the above, check the tasktracker and child logs. That it happens close to startup would seem to point at some basic config. issue (hopefully). Otherwise, I'd suspect a massive or corrupted record in segments. Good luck Ignacio, St.Ack > I will use the environment variables Michael pointed me to, to try to > increase the Perm Gem size that way. > > Thank you. > > On 10/15/07, *Michael Stack* <st...@du... > <mailto:st...@du...>> wrote: > > Ignacio Garcia wrote: > > Hello Andrea, > > > > I tried increasing the PermGem size, but it still failed with > the same > > error... > > > > I modified the following settings on "hadoop-default.xml ": > > > > <name>mapred.child.java.opts</name> > > <value>-Xmx2048m -Xms1024m -XX:PermSize=256m > > -XX:MaxPermSize=512m</value> > > > > That is the only place I could find where I could include Java > Opts... > > Should I increase it even more or is this property ignored when > doing > > the indexing? > > The OOME looks to be in the startup of the update task. The error.txt > log you pasted was from the command-line. Have you tried looking > in the > remote tasktracker log? It might have more info on where the OOME is > happening. > > The above setting is for each child process run by each of the > tasktrackers of your cluster. The child process does the > heavy-lifting > so I'm guessing its where you are seeing the OOME'ing. > > Regards how to set the memory for tasktrackers, etc., the notes here > still apply I believe > http://archive-access.sourceforge.net/projects/nutch/faq.html#env > (Do a > search for the referred-to environment variables). > > St.Ack > > > > > > Any help would be greatly appreciated. Thank you. > > > > On 10/5/07, *Ignacio Garcia* <igc...@gm... > <mailto:igc...@gm...> > > <mailto:igc...@gm... <mailto:igc...@gm...>> > wrote: > > > > I will try increasing the PermGem space as shown in the > reference > > you provided. > > However, in my case the process is not acting as a webapp, so it > > does not related completely to the information displayed in the > > article. > > > > Do you think that shutting down every java application and just > > running the nutchwax job would have any benefits in this case? 
> > Since I cannot control the number of class loaders created (I'm > > just running the code, I did not modify it in any way), I do > not > > have any control over this problem. > > > > Thank you for the pointers. > > > > > > On 10/5/07, *Andrea Goethals* < an...@hu... > <mailto:an...@hu...> > > <mailto:an...@hu... > <mailto:an...@hu...>>> wrote: > > > > On Fri, 5 Oct 2007 13:11:28 -0400, Ignacio Garcia wrote > > > That might work, but it is not the way that I would > like to > > use Nutchwax. > > > > > > If I am forced to divide up one of my small collections > > (~100Gb), I don't > > > want to even think how many partitions the big > collections > > are going > > > to require. Which means, time wasted partitioning, > starting > > several > > > jobs, merging the created indexes and more... > > > > > > I even tried increasing the heap size to 4Gb, the max > size of > > RAM in > > > my system, and that did not work. > > > > > > I have attached the last lines of the output provided by > > Nutchwax, > > > to see if you can point me to a possible solution to this > > problem. > > > > Your output shows that the error is > > java.lang.OutOfMemoryError : PermGen space > > > > Is that always the case? If so I don't think that increasing > > the heap size is > > going to help. This page explains the PermGen space well: > > > http://blogs.sun.com/fkieviet/entry/classloader_leaks_the_dreaded_java > > > > Andrea > > > > > > > > Also... is there any way to know if it crashed on a > particular > > > record / arc file or action to try and avoid it?? and is > > there a way > > > to resume the job from the moment it crashed? > > > > > > Thank you. > > > > > > On 10/2/07, John H. Lee < jl...@ar... > <mailto:jl...@ar...> > > <mailto:jl...@ar... <mailto:jl...@ar...>>> wrote: > > > > > > > > The idea is that for each of the N sets of ~500 ARCs, > > you'll have one > > > > index and one segment. That way, you can distribute the > > index-segment pairs > > > > across multiple disks or hosts. > > > > /search/indexes/indexA/ > > > > /search/indexes/indexB/ > > > > ... > > > > /search/segments/segmentA/ > > > > /search/segments/segmentB/ > > > > ... > > > > > > > > and point searcher.dir at /search. The webapp will then > > search all indexes > > > > under /search/indexes. Alternatively, you can merge > all of > > the indexes as > > > > Stack pointed out. > > > > > > > > Hope this helps. > > > > > > > > -J > > > > > > > > > > > > > > > > On Oct 2, 2007, at 5:09 AM, Ignacio Garcia wrote: > > > > > > > > Hello, > > > > > > > > I tried separating the list of ARCs on smaller sets > of ~500 > > ARCs. > > > > > > > > The first batch run to completion without problems, > > however, the second > > > > batch failed because I was using the same output > directory > > as I used for the > > > > first one. > > > > > > > > Why can't I use the same output directory??? Wouldn't it > > make sense to > > > > have all the info the same place, so I can access > > everything at a time? > > > > > > > > How do I divide the collection in smaller portions > and then > > combine > > > > everything on a single index? If I just keep everything > > separated I would > > > > loose a lot of time looking in different indexes and > > configuring the web-app > > > > to be able to look everywhere. > > > > > > > > On 9/28/07, Ignacio Garcia <igc...@gm... > <mailto:igc...@gm...> > > <mailto:igc...@gm... > <mailto:igc...@gm...>>> wrote: > > > > > > > > > > Michael, I do not know if it failed on the same > record... 
> > > > > > > > > > the first time it failed I assumed that increasing > the > > -Xmx parameters > > > > > would solve it, since the OOME has happened before > when > > indexing with > > > > > Wayback. > > > > > > > > > > I will try to narrow it as much as I can if it > fails again. > > > > > > > > > > > > > > > On 9/27/07, Michael Stack < st...@du... > <mailto:st...@du...> > > <mailto:st...@du... <mailto:st...@du...>>> wrote: > > > > > > > > > > > > What John says and then > > > > > > > > > > > > + The OOME exception stack trace might tell us > something. > > > > > > + Is the OOME always in same place processing same > > record? If so, > > > > > > take > > > > > > a look at it in the ARC. > > > > > > > > > > > > St.Ack > > > > > > > > > > > > John H. Lee wrote: > > > > > > > Hi Ignacio. > > > > > > > > > > > > > > It would be helpful if you posted the following > > information: > > > > > > > - Are you using standalone or mapreduce? > > > > > > > - If mapreduce, what are your mapred.map.tasks and > > > > > > > mapred.reduce.tasks properties set to? > > > > > > > - If mapreduce, how many slaves do you have > and how > > much memory do > > > > > > > they have? > > > > > > > - How many ARCs are you trying to index? > > > > > > > - Did the map reach 100% completion before the > > failure occurred? > > > > > > > > > > > > > > Some things you may want to try: > > > > > > > - Set both -Xmx and -Xmx to the maximum > available on > > your systems > > > > > > > - Increase one or both of mapred.map.tasks and > > mapred.reduce.tasks, > > > > > > > depending where the failure occurred > > > > > > > - Break your job up into smaller chunks of > say, 1000 > > or 5000 ARCs > > > > > > > > > > > > > > -J > > > > > > > > > > > > > > On Sep 27, 2007, at 10:47 AM, Ignacio Garcia > wrote: > > > > > > > > > > > > > > > > > > > > >> Hello, > > > > > > >> > > > > > > >> I've been doing some testing with nutchwax and I > > have never had any > > > > > > >> major problems. > > > > > > >> However, right now I am trying to index a > collection > > that is over > > > > > > >> 100 Gb big, and for some reason the indexing is > > crashing while it > > > > > > >> tries to populate 'crawldb' > > > > > > >> > > > > > > >> The job will run fine at the beginning > importing the > > information > > > > > > >> from the ARCs and creating the "segments" > section. > > > > > > >> > > > > > > >> The error I get is an outOfMemory error when the > > system is > > > > > > >> processing each of the part.xx in the segments > > previously created. > > > > > > >> > > > > > > >> I tried increasing the following setting on the > > hadoop-default.xml > > > > > > >> config file: mapred.child.java.opts to 1GB, > but it > > still failed in > > > > > > >> the same part. > > > > > > >> > > > > > > >> Is there any way to reduce the amount of > memory used > > by nutchwax/ > > > > > > >> hadoop to make the process more efficient and be > > able to index such > > > > > > >> a collection? > > > > > > >> > > > > > > >> Thank you. > > > > > > >> > > > > > > > > > ---------------------------------------------------------------------- > > > > > > > > >> --- > > > > > > >> This SF.net email is sponsored by: Microsoft > > > > > > >> Defy all challenges. Microsoft(R) Visual > Studio 2005. > > > > > > >> > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > > > > > > >> _______________________________________________ > > > > > > >> Archive-access-discuss mailing list > > > > > > >> Arc...@li... > <mailto:Arc...@li...> > > <mailto: Arc...@li... 
> <mailto:Arc...@li...>> > > > > > > >> > > > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > <https://lists.sourceforge.net/lists/listinfo/archive-access-discuss> > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------------- > > > > > > > > > This SF.net email is sponsored by: Microsoft > > > > > > > Defy all challenges. Microsoft(R) Visual > Studio 2005. > > > > > > > > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > > > > > > > _______________________________________________ > > > > > > > Archive-access-discuss mailing list > > > > > > > Arc...@li... > <mailto:Arc...@li...> > > <mailto:Arc...@li... > <mailto:Arc...@li...>> > > > > > > > > > > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > > <https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > <https://lists.sourceforge.net/lists/listinfo/archive-access-discuss>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > Harvard University Library > > Powered by Open WebMail > > > > > > > > > ------------------------------------------------------------------------ > > > > > ------------------------------------------------------------------------- > > > This SF.net email is sponsored by: Splunk Inc. > > Still grepping through log files to find problems? Stop. > > Now Search log events and configuration files using AJAX and a > browser. > > Download your FREE copy of Splunk now >> http://get.splunk.com/ > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > Archive-access-discuss mailing list > > Arc...@li... > <mailto:Arc...@li...> > > > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > <https://lists.sourceforge.net/lists/listinfo/archive-access-discuss> > > > > |
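For reference, the JMX system properties listed above can be applied through the same HADOOP_OPTS and mapred.child.java.opts settings discussed elsewhere in this thread. A minimal sketch, assuming a hadoop-env.sh / hadoop-site.xml setup of that era; the port numbers are arbitrary placeholders, and, as Michael notes, the daemon and child ports must not clash (a fixed child port also only works with one child per tasktracker):

```sh
# conf/hadoop-env.sh -- enable JMX on the Hadoop daemons (tasktracker, jobtracker, ...)
export HADOOP_OPTS="$HADOOP_OPTS \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.port=10101"
```

```xml
<!-- conf/hadoop-site.xml -- enable JMX on the child task JVMs, on a different port -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m
         -Dcom.sun.management.jmxremote.authenticate=false
         -Dcom.sun.management.jmxremote.ssl=false
         -Dcom.sun.management.jmxremote.port=10102</value>
</property>
```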
|
From: Ignacio G. <igc...@gm...> - 2007-10-17 14:09:48
|
Hello Michael, Where can I find the tasktracker log?? Is it under hadoop? nutchwax? or in a temp location? Also, I tried using JConsole to track the memory management on the process, but unfortunately the hadoop process does not have the "management agent" activated, so it cannot be tracked by JConsole. Is there any way to activate it using java options? I will use the environment variables Michael pointed me to, to try to increase the Perm Gem size that way. Thank you. On 10/15/07, Michael Stack <st...@du...> wrote: > > Ignacio Garcia wrote: > > Hello Andrea, > > > > I tried increasing the PermGem size, but it still failed with the same > > error... > > > > I modified the following settings on "hadoop-default.xml": > > > > <name>mapred.child.java.opts</name> > > <value>-Xmx2048m -Xms1024m -XX:PermSize=256m > > -XX:MaxPermSize=512m</value> > > > > That is the only place I could find where I could include Java Opts... > > Should I increase it even more or is this property ignored when doing > > the indexing? > > The OOME looks to be in the startup of the update task. The error.txt > log you pasted was from the command-line. Have you tried looking in the > remote tasktracker log? It might have more info on where the OOME is > happening. > > The above setting is for each child process run by each of the > tasktrackers of your cluster. The child process does the heavy-lifting > so I'm guessing its where you are seeing the OOME'ing. > > Regards how to set the memory for tasktrackers, etc., the notes here > still apply I believe > http://archive-access.sourceforge.net/projects/nutch/faq.html#env (Do a > search for the referred-to environment variables). > > St.Ack > > > > > > Any help would be greatly appreciated. Thank you. > > > > On 10/5/07, *Ignacio Garcia* <igc...@gm... > > <mailto:igc...@gm...> > wrote: > > > > I will try increasing the PermGem space as shown in the reference > > you provided. > > However, in my case the process is not acting as a webapp, so it > > does not related completely to the information displayed in the > > article. > > > > Do you think that shutting down every java application and just > > running the nutchwax job would have any benefits in this case? > > Since I cannot control the number of class loaders created (I'm > > just running the code, I did not modify it in any way), I do not > > have any control over this problem. > > > > Thank you for the pointers. > > > > > > On 10/5/07, *Andrea Goethals* < an...@hu... > > <mailto:an...@hu...>> wrote: > > > > On Fri, 5 Oct 2007 13:11:28 -0400, Ignacio Garcia wrote > > > That might work, but it is not the way that I would like to > > use Nutchwax. > > > > > > If I am forced to divide up one of my small collections > > (~100Gb), I don't > > > want to even think how many partitions the big collections > > are going > > > to require. Which means, time wasted partitioning, starting > > several > > > jobs, merging the created indexes and more... > > > > > > I even tried increasing the heap size to 4Gb, the max size of > > RAM in > > > my system, and that did not work. > > > > > > I have attached the last lines of the output provided by > > Nutchwax, > > > to see if you can point me to a possible solution to this > > problem. > > > > Your output shows that the error is > > java.lang.OutOfMemoryError: PermGen space > > > > Is that always the case? If so I don't think that increasing > > the heap size is > > going to help. 
|
|
From: Brad T. <br...@ar...> - 2007-10-15 23:06:32
|
Wayback is an open-source Java implementation of the Internet Archive's Wayback Machine service. The 1.0.0 release includes Spring IoC configuration, access control and authorization, and simplified extension of the Replay User Interface, and the project now builds with Maven 2. For detailed features and changes, please see the Wayback project site at http://archive-access.sourceforge.net/projects/wayback/. Yours, Internet Archive Webteam |
|
From: Michael S. <st...@du...> - 2007-10-15 18:38:13
|
Ignacio Garcia wrote: > Hello Andrea, > > I tried increasing the PermGen size, but it still failed with the same > error... > > I modified the following settings in "hadoop-default.xml": > > <name>mapred.child.java.opts</name> > <value>-Xmx2048m -Xms1024m -XX:PermSize=256m > -XX:MaxPermSize=512m</value> > > That is the only place I could find where I could include Java opts... > Should I increase it even more, or is this property ignored when doing > the indexing? The OOME looks to be in the startup of the update task. The error.txt log you pasted was from the command line. Have you tried looking in the remote tasktracker log? It might have more information on where the OOME is happening. The above setting applies to each child process run by each of the tasktrackers in your cluster. The child process does the heavy lifting, so I'm guessing that is where you are seeing the OOME. Regarding how to set the memory for the tasktrackers, etc., I believe the notes here still apply: http://archive-access.sourceforge.net/projects/nutch/faq.html#env (search for the referred-to environment variables). St.Ack |
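For readers following the thread, here is a minimal sketch of the distinction Michael draws, assuming a Hadoop setup of this era in which site-specific overrides live in conf/hadoop-site.xml; the property name and the -XX:MaxPermSize flag come from the thread, while the exact values are illustrative rather than a recommendation:

<!-- conf/hadoop-site.xml (sketch): raises heap and PermGen for the child
     task JVMs only. The parent JVM that launches the job (the LocalJobRunner
     or a tasktracker) is not affected by this property and would need its
     own -XX:MaxPermSize flag, e.g. via the environment variables referred to
     in the FAQ entry Michael links above. -->
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m -XX:PermSize=128m -XX:MaxPermSize=256m</value>
  </property>
</configuration>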
|
From: Andrea G. <an...@hu...> - 2007-10-15 14:21:45
|
Ignacio, I don't have much indexing experience, so I'm afraid I can't be of much help with this. I was only pointing out that increasing the heap size wasn't going to help the PermGen problem. You could try watching the memory use with jconsole while the job is running, just to verify that the PermGen pool really is the only memory pool causing problems - jconsole shows how close each pool is to its maximum. It will also tell you what the maximum PermGen size is, so you can verify that you did increase it with the Java command-line switch. Sorry I can't be of more help with this, Andrea On 10/15/07, Ignacio Garcia <igc...@gm...> wrote: > Hello Andrea, > > I tried increasing the PermGen size, but it still failed with the same > error... > > I modified the following settings in "hadoop-default.xml": > > <name>mapred.child.java.opts</name> > <value>-Xmx2048m -Xms1024m -XX:PermSize=256m -XX:MaxPermSize=512m</value> > > That is the only place I could find where I could include Java opts... > Should I increase it even more, or is this property ignored when doing the > indexing? > > Any help would be greatly appreciated. Thank you. |
|
From: Ignacio G. <igc...@gm...> - 2007-10-15 12:14:17
|
Hello Andrea, I tried increasing the PermGen size, but it still failed with the same error... I modified the following settings in "hadoop-default.xml": <name>mapred.child.java.opts</name> <value>-Xmx2048m -Xms1024m -XX:PermSize=256m -XX:MaxPermSize=512m</value> That is the only place I could find where I could include Java opts... Should I increase it even more, or is this property ignored when doing the indexing? Any help would be greatly appreciated. Thank you. |
|
From: Chris V. <cv...@gm...> - 2007-10-09 19:47:44
|
Hi Brad, Thanks for the response. Unfortunately, you won't be able to access our servers due to IP restriction. I tried your suggestion of pointing to nutch/opensearch (and nutchwax/opensearch) without success. There were no errors produced, but it didn't provide the functionality I am looking for either. I am currently implementing full text search and retrieval using a combination of NutchWax (and its index) for search and Wayback (with a separate CDX index) for retrieval. This works fine. I was hoping for a single index solution, but it sounds like you are using the same technique. If you learn anything new from the NutchWax team, please pass it on. Thanks, Chris On 9/26/07, Brad Tofel <br...@ar...> wrote: > > Hi Chris, > > I can't access your nutch service, so am unable to provide very detailed > assistance. One quick thing to test is changing: > > http://chaz.hul.harvard.edu:10622/xmlquery > > to > > http://chaz.hul.harvard.edu:10622/nutch/opensearch > > > As far as which components should be doing what -- NutchWax and Wayback > have drifted a little bit from the point when they were integrated so > that Wayback could utilize a NutchWax index as the it's ResourceIndex. > Performance issues with the NutchWax index motivated us to: > > 1) build a Wayback installation with it's own index, either CDX or BDB > 2) modify seach.jsp as you've done already so links generated by > NutchWax search result pages point into the wayback installation. > > I'm working with John Lee, who is currently running the NutchWax > project, to get a better answer on how this will work going forward. > > Brad > > Chris Vicary wrote: > > Hi, > > > > I am attempting to render nutchwax full text search results using the > > open-source wayback machine. I have installed hadoop, nutchwax (0.10.0) > and > > wayback (0.8.0) - wayback and nutchwax are deployed in the same tomcat. > > Creating and searching full-text indexes of arc files using nutchwax > works > > fine. Unfortunately, I have been unsuccessful in rendering the result > > resources. I attempted to follow the instructions for Wayback-NutchWAX > at > > http://archive-access.sourceforge.net/projects/nutch/wayback.html, but > the > > instructions seem to be based on an older version of wayback, and the > some > > changes specified for the wayback's web.xml do not apply to the newest > > wayback version. > > > > The errors encountered depend on the configuration values I use, so > here's a > > rundown of the properties: > > > > hadoop-site.xml: > > > > searcher.dir points to a local nutchwax "outputs" directory > (/tmp/outputs) > > wax.host points to the host and port of the tomcat installation, it does > not > > include wayback context information (just host:port, > > chaz.hul.harvard.edu:10622) > > > > search.jsp: > > > > made the change: > > > > < String archiveCollection = > > detail.getValue("collection"); > > --- > > > >> String archiveCollection = "wayback"; // detail.getValue > ("collection"); > >> > > > > > > > > wayback/WEB-INF/web.xml: > > > > The changes required for web.xml are to "[disable] wayback indexing of > > ARCS, [comment] out the PipeLineFilter, and [enable] the Remove-Nutch > > ResourceIndex option". > > > > The Local-ARC ResourceStore option is enabled, and all others are > disabled. > > resourcestore.autoindex is set to 0, and all physical paths have been > > checked for accuracy. > > > > I was unable to find any reference to PipeLineFilter, so there was no > need > > to comment it out. 
> > > > I enabled the Remote-Nutch ResourceIndex option, and disabled all other > > ResourceIndex options. The Remote-Nutch option values are: > > > > <context-param> > > <param-name>resourceindex.classname</param-name> > > <param-value> > org.archive.wayback.resourceindex.NutchResourceIndex > > </param-value> > > <description>Class that implements ResourceIndex for this > > Wayback</description> > > </context-param> > > > > <context-param> > > <param-name>resourceindex.baseurl</param-name> > > <param-value>>http://chaz.hul.harvard.edu:10622/nutchwax > > </param-value> > > <description>absolute URL to Nutch server</description> > > </context-param> > > > > <context-param> > > <param-name>maxresults</param-name> > > <param-value>1000</param-value> > > <description> > > Maximum number of results to return from the > ResourceIndex. > > </description> > > </context-param> > > > > > > With the current setup, I can perform a full-text query using nutchwax > and > > the result links seem to be of the correct form: > > http://[host]:[port]/wayback/[date]/[uri]. But when I click on a link, I > get > > the error: > > Index not available > > > > *Unexpected SAX: White spaces are required between publicId and > systemId.* > > > > * > > *in catalina.out, the stack trace is: > > > > [Fatal Error] > > > ?query=date%3A19960101000000-20070919221459+exacturl%3Ahttp%3A%2F%2Fwww.aandw.net%2Fandrea%2Fhowtos%2Fsde_n > > > otes.txt&sort=date&reverse=true&hitsPerPage=10&start=0&dedupField=site&hitsPerDup=10&hitsPerSite=10:1:63:White > > spaces ar > > e required between publicId and systemId. > > org.xml.sax.SAXParseException: White spaces are required between > publicId > > and systemId. > > at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse( > > DOMParser.java:264) > > 2007-09-19 18:14:59,244 INFO WaxDateQueryFilter - Found range date: > > 19960101000000, 20070919221459 > > at > com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse > > (DocumentBuilderImpl.java:292) > > at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java > :146) > > at > > org.archive.wayback.resourceindex.NutchResourceIndex.getHttpDocument( > > NutchResourceIndex.java:348) > > at org.archive.wayback.resourceindex.NutchResourceIndex.query( > > NutchResourceIndex.java:140) > > at org.archive.wayback.replay.ReplayServlet.doGet( > ReplayServlet.java > > :122) > > ... > > > > if I set the resourceindex.baseurl property closer to the original value > > like this: > > > > <context-param> > > <param-name>resourceindex.baseurl</param-name> > > <param-value>http://chaz.hul.harvard.edu:10622/xmlquery > > </param-value> > > <description>absolute URL to Nutch server</description> > > </context-param> > > > > when I click on a result link, I get this error: > > Index not available * > > > http://chaz.hul.harvard.edu:10622/xmlquery?query=date%3A19960101000000-20070919222516+exacturl%3Ahttp. > .. 
> > * > > > > and the stack trace looks like this: > > > > INFO: initialized > > org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter > > java.io.FileNotFoundException: > > > http://chaz.hul.harvard.edu:10622/xmlquery?query=date%3A19960101000000-20070919222516+exac > > > turl%3Ahttp%3A%2F%2Fwww.aandw.net%2Fandrea%2Fhowtos%2Fsde_notes.txt&sort=date&reverse=true&hitsPerPage=10&start=0&dedupFi > > eld=site&hitsPerDup=10&hitsPerSite=10 > > at sun.net.www.protocol.http.HttpURLConnection.getInputStream( > > HttpURLConnection.java:1147) > > at > > > com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity > ( > > XMLEntityManager.java:973) > > at > > > com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion > > (XMLVersionDetector.java:184) > > at > > com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse( > > XML11Configuration.java:798) > > at > > com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse( > > XML11Configuration.java:764) > > at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse( > > XMLParser.java:148) > > at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse( > > DOMParser.java:250) > > at > com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse > > (DocumentBuilderImpl.java:292) > > at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java > :146) > > at > > org.archive.wayback.resourceindex.NutchResourceIndex.getHttpDocument( > > NutchResourceIndex.java:348) > > at org.archive.wayback.resourceindex.NutchResourceIndex.query( > > NutchResourceIndex.java:140) > > ... > > > > It seems like I have not configured the Remote-Nutch ResourceIndex > > properties correctly, but I don't have much to go on to try to correct > it. > > Or perhaps I am not using nutchwax and wayback in the correct roles? > > > > Any help with this is greatly appreciated. > > > > Thanks, > > > > Chris > > > > > > ------------------------------------------------------------------------ > > > > > ------------------------------------------------------------------------- > > This SF.net email is sponsored by: Microsoft > > Defy all challenges. Microsoft(R) Visual Studio 2005. > > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > Archive-access-discuss mailing list > > Arc...@li... > > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > > > |
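The following is a sketch of the web.xml wiring that Brad's suggestion implies, shown here only for comparison with the configurations quoted above; the param names, class name, host and port all come from the quoted message, while the /nutchwax/opensearch path is an assumption about where the NutchWax OpenSearch servlet is mounted in this deployment, and Chris reports above that this combination still did not provide the functionality he was looking for:

<!-- Remote-Nutch ResourceIndex (sketch): same resourceindex.classname as in
     the quoted config, but with resourceindex.baseurl pointed at the NutchWax
     OpenSearch servlet instead of /xmlquery. -->
<context-param>
  <param-name>resourceindex.classname</param-name>
  <param-value>org.archive.wayback.resourceindex.NutchResourceIndex</param-value>
</context-param>
<context-param>
  <param-name>resourceindex.baseurl</param-name>
  <param-value>http://chaz.hul.harvard.edu:10622/nutchwax/opensearch</param-value>
</context-param>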
|
From: Ignacio G. <igc...@gm...> - 2007-10-05 18:01:10
|
I will try increasing the PermGen space as shown in the reference you provided. However, in my case the process is not running as a webapp, so the information in the article does not apply completely. Do you think that shutting down every other Java application and just running the NutchWax job would have any benefit in this case? Since I cannot control the number of class loaders created (I'm just running the code, I did not modify it in any way), I do not have any control over this problem. Thank you for the pointers. |
|
From: Andrea G. <an...@hu...> - 2007-10-05 17:31:25
|
On Fri, 5 Oct 2007 13:11:28 -0400, Ignacio Garcia wrote > That might work, but it is not the way that I would like to use Nutchwax. > > If I am forced to divide up one of my small collections (~100Gb), I don't > want to even think how many partitions the big collections are going > to require. Which means, time wasted partitioning, starting several > jobs, merging the created indexes and more... > > I even tried increasing the heap size to 4Gb, the max size of RAM in > my system, and that did not work. > > I have attached the last lines of the output provided by Nutchwax, > to see if you can point me to a possible solution to this problem. Your output shows that the error is java.lang.OutOfMemoryError: PermGen space Is that always the case? If so I don't think that increasing the heap size is going to help. This page explains the PermGen space well: http://blogs.sun.com/fkieviet/entry/classloader_leaks_the_dreaded_java Andrea > > Also... is there any way to know if it crashed on a particular > record / arc file or action to try and avoid it?? and is there a way > to resume the job from the moment it crashed? > > Thank you. > > On 10/2/07, John H. Lee <jl...@ar...> wrote: > > > > The idea is that for each of the N sets of ~500 ARCs, you'll have one > > index and one segment. That way, you can distribute the index-segment pairs > > across multiple disks or hosts. > > /search/indexes/indexA/ > > /search/indexes/indexB/ > > ... > > /search/segments/segmentA/ > > /search/segments/segmentB/ > > ... > > > > and point searcher.dir at /search. The webapp will then search all indexes > > under /search/indexes. Alternatively, you can merge all of the indexes as > > Stack pointed out. > > > > Hope this helps. > > > > -J > > > > > > > > On Oct 2, 2007, at 5:09 AM, Ignacio Garcia wrote: > > > > Hello, > > > > I tried separating the list of ARCs on smaller sets of ~500 ARCs. > > > > The first batch run to completion without problems, however, the second > > batch failed because I was using the same output directory as I used for the > > first one. > > > > Why can't I use the same output directory??? Wouldn't it make sense to > > have all the info the same place, so I can access everything at a time? > > > > How do I divide the collection in smaller portions and then combine > > everything on a single index? If I just keep everything separated I would > > loose a lot of time looking in different indexes and configuring the web-app > > to be able to look everywhere. > > > > On 9/28/07, Ignacio Garcia <igc...@gm...> wrote: > > > > > > Michael, I do not know if it failed on the same record... > > > > > > the first time it failed I assumed that increasing the -Xmx parameters > > > would solve it, since the OOME has happened before when indexing with > > > Wayback. > > > > > > I will try to narrow it as much as I can if it fails again. > > > > > > > > > On 9/27/07, Michael Stack < st...@du...> wrote: > > > > > > > > What John says and then > > > > > > > > + The OOME exception stack trace might tell us something. > > > > + Is the OOME always in same place processing same record? If so, > > > > take > > > > a look at it in the ARC. > > > > > > > > St.Ack > > > > > > > > John H. Lee wrote: > > > > > Hi Ignacio. > > > > > > > > > > It would be helpful if you posted the following information: > > > > > - Are you using standalone or mapreduce? > > > > > - If mapreduce, what are your mapred.map.tasks and > > > > > mapred.reduce.tasks properties set to? 
> > > > > - If mapreduce, how many slaves do you have and how much memory do > > > > > they have? > > > > > - How many ARCs are you trying to index? > > > > > - Did the map reach 100% completion before the failure occurred? > > > > > > > > > > Some things you may want to try: > > > > > - Set both -Xmx and -Xmx to the maximum available on your systems > > > > > - Increase one or both of mapred.map.tasks and mapred.reduce.tasks, > > > > > depending where the failure occurred > > > > > - Break your job up into smaller chunks of say, 1000 or 5000 ARCs > > > > > > > > > > -J > > > > > > > > > > On Sep 27, 2007, at 10:47 AM, Ignacio Garcia wrote: > > > > > > > > > > > > > > >> Hello, > > > > >> > > > > >> I've been doing some testing with nutchwax and I have never had any > > > > >> major problems. > > > > >> However, right now I am trying to index a collection that is over > > > > >> 100 Gb big, and for some reason the indexing is crashing while it > > > > >> tries to populate 'crawldb' > > > > >> > > > > >> The job will run fine at the beginning importing the information > > > > >> from the ARCs and creating the "segments" section. > > > > >> > > > > >> The error I get is an outOfMemory error when the system is > > > > >> processing each of the part.xx in the segments previously created. > > > > >> > > > > >> I tried increasing the following setting on the hadoop-default.xml > > > > >> config file: mapred.child.java.opts to 1GB, but it still failed in > > > > >> the same part. > > > > >> > > > > >> Is there any way to reduce the amount of memory used by nutchwax/ > > > > >> hadoop to make the process more efficient and be able to index such > > > > >> a collection? > > > > >> > > > > >> Thank you. > > > > >> > > > > ---------------------------------------------------------------------- > > > > >> --- > > > > >> This SF.net email is sponsored by: Microsoft > > > > >> Defy all challenges. Microsoft(R) Visual Studio 2005. > > > > >> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > > > > >> _______________________________________________ > > > > >> Archive-access-discuss mailing list > > > > >> Arc...@li... > > > > >> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > > > > > > > >> > > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------------- > > > > > This SF.net email is sponsored by: Microsoft > > > > > Defy all challenges. Microsoft(R) Visual Studio 2005. > > > > > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > > > > > _______________________________________________ > > > > > Archive-access-discuss mailing list > > > > > Arc...@li... > > > > > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > > > > > > > > > > > > > > > > > > > -- Harvard University Library Powered by Open WebMail |
|
From: Ignacio G. <igc...@gm...> - 2007-10-05 17:11:30
|
[Attached Hadoop job log, decoded from base64. Repeated one-per-second "INFO mapred.LocalJobRunner: /nutchwax_indexes/indexes/segments/20071002090329-el2000/crawl_parse/part-00000:59793997824+33554432" progress lines and the Nutch plugin-registration listing are omitted; the failure at the end of the log reads:]

07/10/05 13:20:11 INFO mapred.JobClient:  map 100% reduce 0%
07/10/05 13:20:19 WARN mapred.LocalJobRunner: job_k0vw3i
java.lang.OutOfMemoryError: PermGen space
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
        at org.archive.access.nutch.NutchwaxCrawlDb.update(NutchwaxCrawlDb.java:104)
        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:62)
        at org.archive.access.nutch.Nutchwax.doUpdate(Nutchwax.java:201)
        at org.archive.access.nutch.Nutchwax.doUpdate(Nutchwax.java:174)
        at org.archive.access.nutch.Nutchwax.doAll(Nutchwax.java:153)
        at org.archive.access.nutch.Nutchwax.doJob(Nutchwax.java:389)
        at org.archive.access.nutch.Nutchwax.main(Nutchwax.java:674)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:149) |
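A note on the error itself: the decoded log above ends in java.lang.OutOfMemoryError: PermGen space, which a larger -Xmx heap alone does not address, because the permanent generation is sized separately on Sun JVMs. The sketch below shows the two places such a flag is usually applied; the file names and values are illustrative assumptions for a Hadoop 0.x layout, and since the trace shows LocalJobRunner (local mode, tasks run inside the client JVM), the HADOOP_OPTS setting is the one that would matter for this run.

# conf/hadoop-env.sh -- options for the client JVM; local-runner jobs execute
# their map/reduce tasks in this JVM, so this is where the PermGen flag counts:
export HADOOP_OPTS="-Xmx1024m -XX:MaxPermSize=256m"

<!-- conf/hadoop-site.xml -- only affects separate child JVMs on a real
     (non-local) cluster; values are examples, not recommendations -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m -XX:MaxPermSize=256m</value>
</property>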
|
From: John H. L. <jl...@ar...> - 2007-10-02 17:47:51
|
The idea is that for each of the N sets of ~500 ARCs, you'll have one index and one segment. That way, you can distribute the index-segment pairs across multiple disks or hosts:

/search/indexes/indexA/
/search/indexes/indexB/
...
/search/segments/segmentA/
/search/segments/segmentB/
...

Point searcher.dir at /search and the webapp will then search all indexes under /search/indexes. Alternatively, you can merge all of the indexes, as Stack pointed out.

Hope this helps.

-J

On Oct 2, 2007, at 5:09 AM, Ignacio Garcia wrote:
> Why can't I use the same output directory? Wouldn't it make sense
> to have all the info in the same place, so I can access everything at once?
>
> How do I divide the collection into smaller portions and then combine
> everything in a single index? If I just keep everything separated I
> would lose a lot of time looking in different indexes and configuring
> the web-app to be able to look everywhere. |
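A concrete rendering of the layout John describes above. The searcher.dir property he names is the standard Nutch search property; the /search paths and the exact config file (nutch-site.xml, often under WEB-INF/classes of the deployed search webapp) are assumptions for illustration, so check nutch-default.xml in the NutchWAX build in use.

/search/
  indexes/
    indexA/
    indexB/
  segments/
    segmentA/
    segmentB/

<!-- nutch-site.xml for the search webapp: point it at the common root;
     it will pick up every per-batch index under /search/indexes -->
<property>
  <name>searcher.dir</name>
  <value>/search</value>
</property>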
|
From: Michael S. <st...@du...> - 2007-10-02 15:46:56
|
IIRC, there are explicit checks to prevent overwriting any extant content in the specified output directory.

Use the merge command to aggregate multiple indices (pass a '--help' parameter for usage). You could also put off merging and just configure the webapp to look in multiple indices.

St.Ack

Ignacio Garcia wrote:
> Why can't I use the same output directory? Wouldn't it make sense
> to have all the info in the same place, so I can access everything at once?
>
> How do I divide the collection into smaller portions and then combine
> everything in a single index? |
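For reference, the merge job Stack mentions is driven through the NutchWAX jar on Hadoop. The jar name and argument shape below are assumptions from a typical install, so rely on the tool's own usage output rather than this sketch.

# Print usage for the merge job (per Stack's suggestion to pass '--help'):
$HADOOP_HOME/bin/hadoop jar nutchwax.jar merge --help

# Hypothetical shape of the call once the arguments are confirmed:
# $HADOOP_HOME/bin/hadoop jar nutchwax.jar merge <output-dir> <index-dir> [<index-dir> ...]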
|
From: Michael S. <st...@du...> - 2007-10-02 15:34:27
|
See here for discussion of ARCReader and other such tools: http://crawler.archive.org/articles/developer_manual/arcs.html

St.Ack

Loren Gordon wrote:
> Are there any tools for extracting the content of an ARC file and
> writing it to disk so applications that can't read ARC files can mine
> some of the data? |
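For Loren's use case below (dumping ARC record content to plain files), here is a rough Java sketch against the Heritrix ARC-reading classes the linked manual covers. The class and method names (ARCReaderFactory.get, the record-as-InputStream read loop, iterator generics) are from memory and may differ between versions, so treat this as a starting point and check the API docs for the jar you compile against.

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Iterator;
import org.archive.io.ArchiveRecord;
import org.archive.io.arc.ARCReader;
import org.archive.io.arc.ARCReaderFactory;

// Sketch: walk an ARC file and write each record's bytes to a numbered file.
public class ArcDump {
  public static void main(String[] args) throws Exception {
    ARCReader reader = ARCReaderFactory.get(new File(args[0]));
    int i = 0;
    for (Iterator<ArchiveRecord> it = reader.iterator(); it.hasNext();) {
      ArchiveRecord record = it.next();
      OutputStream out = new FileOutputStream("record-" + (i++) + ".dat");
      byte[] buf = new byte[4096];
      int n;
      // An ArchiveRecord reads as an InputStream positioned at the record body.
      while ((n = record.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      out.close();
      record.close();
    }
    reader.close();
  }
}

Compile and run it against whichever archive/heritrix jar in the Wayback or NutchWAX distribution provides the org.archive.io packages.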
|
From: Loren G. <ma...@lo...> - 2007-10-02 14:07:48
|
Hello,

Are there any tools for extracting the content of an ARC file and writing it to disk so applications that can't read ARC files can mine some of the data? Alexa's av_tools seemed to have some potential, but I couldn't find any kind of download link for them.

Thanks,
Loren |
|
From: Ignacio G. <igc...@gm...> - 2007-10-02 12:10:08
|
Hello,

I tried separating the list of ARCs into smaller sets of ~500 ARCs.

The first batch ran to completion without problems; however, the second batch failed because I was using the same output directory as I used for the first one.

Why can't I use the same output directory? Wouldn't it make sense to have all the info in the same place, so I can access everything at once?

How do I divide the collection into smaller portions and then combine everything in a single index? If I just keep everything separated I would lose a lot of time looking in different indexes and configuring the web-app to be able to look everywhere.

On 9/28/07, Ignacio Garcia <igc...@gm...> wrote:
>
> Michael, I do not know if it failed on the same record...
>
> The first time it failed I assumed that increasing the -Xmx parameters
> would solve it, since the OOME has happened before when indexing with
> Wayback.
>
> I will try to narrow it down as much as I can if it fails again.
>
> On 9/27/07, Michael Stack <st...@du...> wrote:
> >
> > What John says and then
> >
> > + The OOME exception stack trace might tell us something.
> > + Is the OOME always in the same place processing the same record?
> > If so, take a look at it in the ARC.
> >
> > St.Ack
> >
> > John H. Lee wrote:
> > > Hi Ignacio.
> > >
> > > It would be helpful if you posted the following information:
> > > - Are you using standalone or mapreduce?
> > > - If mapreduce, what are your mapred.map.tasks and
> > > mapred.reduce.tasks properties set to?
> > > - If mapreduce, how many slaves do you have and how much memory do
> > > they have?
> > > - How many ARCs are you trying to index?
> > > - Did the map reach 100% completion before the failure occurred?
> > >
> > > Some things you may want to try:
> > > - Set both -Xms and -Xmx to the maximum available on your systems
> > > - Increase one or both of mapred.map.tasks and mapred.reduce.tasks,
> > > depending where the failure occurred
> > > - Break your job up into smaller chunks of, say, 1000 or 5000 ARCs
> > >
> > > -J
> > >
> > > On Sep 27, 2007, at 10:47 AM, Ignacio Garcia wrote:
> > >
> > >> Hello,
> > >>
> > >> I've been doing some testing with nutchwax and I have never had any
> > >> major problems. However, right now I am trying to index a collection
> > >> that is over 100 GB, and for some reason the indexing is crashing
> > >> while it tries to populate 'crawldb'.
> > >>
> > >> The job will run fine at the beginning, importing the information
> > >> from the ARCs and creating the "segments" section.
> > >>
> > >> The error I get is an outOfMemory error when the system is
> > >> processing each of the part.xx in the segments previously created.
> > >>
> > >> I tried increasing the following setting in the hadoop-default.xml
> > >> config file: mapred.child.java.opts to 1 GB, but it still failed in
> > >> the same part.
> > >>
> > >> Is there any way to reduce the amount of memory used by nutchwax/
> > >> hadoop to make the process more efficient and be able to index such
> > >> a collection?
> > >>
> > >> Thank you. |
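Since John's checklist quoted above turns on a couple of Hadoop job properties, here is a hedged hadoop-site.xml fragment naming them. The property names are standard Hadoop 0.x settings that override hadoop-default.xml; the values are purely illustrative placeholders, not recommendations, and they only influence jobs run on an actual map/reduce cluster rather than the local runner.

<!-- hadoop-site.xml: split the work across more tasks; tune to the cluster -->
<property>
  <name>mapred.map.tasks</name>
  <value>8</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>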