At the beginning I was reading all the records before passing them to a loop;
then I wrapped the request in a while tag to limit the number of rows the SQL
query returns. But I still get out-of-memory errors. The latest one is the following:
Exception in thread "main" java.lang.OutOfMemoryError: PermGen space
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Unknown Source)
at java.lang.Class.getConstructor0(Unknown Source)
at java.lang.Class.newInstance0(Unknown Source)
at java.lang.Class.newInstance(Unknown Source)
at org.codehaus.groovy.runtime.InvokerHelper.createScript(InvokerHelper.java:421)
at groovy.lang.GroovyShell.parse(GroovyShell.java:525)
at groovy.lang.GroovyShell.parse(GroovyShell.java:505)
at groovy.lang.GroovyShell.evaluate(GroovyShell.java:483)
at groovy.lang.GroovyShell.evaluate(GroovyShell.java:459)
at org.webharvest.runtime.scripting.GroovyScriptEngine.eval(Unknown Source)
at org.webharvest.runtime.templaters.BaseTemplater.execute(Unknown Source)
at org.webharvest.runtime.processors.CaseProcessor.execute(Unknown Source)
at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
at org.webharvest.runtime.processors.CaseProcessor.execute(Unknown Source)
at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
at org.webharvest.runtime.processors.EmptyProcessor.execute(Unknown Source)
at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
at org.webharvest.runtime.processors.LoopProcessor.execute(Unknown Source)
at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
at org.webharvest.runtime.processors.LoopProcessor.execute(Unknown Source)
at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
at org.webharvest.runtime.processors.WhileProcessor.execute(Unknown Source)
at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
Is there anything more I could do to free the memory after writing each
record?
Regards
I thought that the problem came from the database connection: as I open a new
connection each time I write a record, the memory problem occurs after a while
(about 900 records).
So I changed the output method to an XML file. I query the database once to
get the records, then I pass them to a loop. But the result is another memory
problem!
Exception in thread "main" java.lang.OutOfMemoryError: PermGen space
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(Unknown Source)
at java.security.SecureClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.access$100(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4395)
at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1556)
at org.webharvest.runtime.Scraper.releaseDBConnections(Unknown Source)
at org.webharvest.runtime.Scraper.execute(Unknown Source)
at org.webharvest.runtime.Scraper.execute(Unknown Source)
at CommandLine.main(Unknown Source)
Any ideas?
I would need to profile the scraper to say why this is happening. It could be
normal: your JVM may really need a little more PermGen space to do its job.
Or it could be a memory leak, which may or may not already be fixed in 2.1.
Can you post your scraper XML here (ideally with all the irrelevant code
removed)?
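One way to check whether the permanent generation is really what fills up is to print the JVM's non-heap memory pools while the job runs. The sketch below is only an illustration: the class name NonHeapReport is made up, and the pool names it prints depend on the JVM (HotSpot up to Java 7 reports a pool whose name contains "Perm Gen"; Java 8 and later replaced PermGen with "Metaspace"). On those older HotSpot JVMs the PermGen ceiling can be raised with the -XX:MaxPermSize option, for example -XX:MaxPermSize=256m.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Web-Harvest.
final class NonHeapReport {

    /** Returns one line per non-heap memory pool: "<pool name>: used=<bytes> max=<bytes>". */
    static List<String> describeNonHeapPools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.NON_HEAP) {
                continue; // heap pools are not relevant to a PermGen error
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() is -1 when the JVM defines no maximum for the pool
            lines.add(pool.getName() + ": used=" + usage.getUsed() + " max=" + usage.getMax());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeNonHeapPools()) {
            System.out.println(line);
        }
    }
}
```

If the "used" figure of the PermGen pool keeps climbing as more records are processed, that points to a leak rather than to a limit that is simply too small.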
Speaking of object leakage, is there any automated testing that can help
uncover these types of issues in WH? And could the Maven build perhaps run SCA
tools such as PMD and FindBugs? Below is a POM fragment that includes some
common SCA tools, which you can easily add to the WH POM.
Anonymous
-
2011-11-20
Here is the code.
I have adapted it so that the process terminates, by passing in a variable
from a batch file. Originally, though, I was requesting all the records
(about 500), and the final result is about 16000 new records in the Access DB.
<?xml version="1.0" encoding="UTF-8"?>
<config charset="ISO-8859-1" scriptlang="groovy">
  <case>
    <if condition='${sys.isVariableDefined("limit")}'>
      <var-def name="rec">
        <database connection="jdbc:mysql://localhost:3306/rawdata" jdbcclass="com.mysql.jdbc.Driver" username="root" password="">
          <template>
            SELECT Param1, Param2, html
            FROM `extract`
            WHERE (`Method`='post') and (length(`Param3`)=0) and (`Param1` like '%2011')
            LIMIT ${(limit.toInt()-1)*5},5
          </template>
        </database>
      </var-def>
    </if>
  </case>
  <!-- start of processing -->
  <loop item="sem" empty="true">
    <list><var name="rec"/></list>
    <body>
      <!-- Read/initialize the variables -->
      <empty>
        <var-def name="anneeMont"><template>${sem.get('Param1').toString().substring(sem.get('Param1').toString().indexOf('=')+1,sem.get('Param1').toString().size())}</template></var-def>
        <var-def name="Chenusem"><template>${sem.get('Param2').toString().substring(sem.get('Param2').toString().indexOf('=')+1,sem.get('Param2').toString().size())}</template></var-def>
        <var-def name="xmlDoc"><html-to-xml><template>${sem.get('html')}</template></html-to-xml></var-def>
      </empty>
      <!-- Extract the data from the HTML -->
      <empty>
        <var-def name="DateInfo"><xpath expression="normalize-space(substring-after(//span[@class='nb11p' and contains(.,'la date du')] , 'du :'))"><var name="xmlDoc"/></xpath></var-def>
      </empty>
      <var-def name="nbMont"><xpath expression="normalize-space(substring-before(//td[(contains(.,'juments trouv') or contains(.,'jument trouv')) and @class='nbb11p'] , 'juments'))"><var name="xmlDoc"/></xpath></var-def>
      <!-- Process the list of Monts -->
      <loop item="Mont" empty="true">
        <list><xpath expression="//table[@class='nB11p']//td[@class='nB11p' and @width='95%']"><var name="xmlDoc"/></xpath></list>
        <body>
          <!-- Because of a bug in how XPath evaluates this expression, I do it with tokenize instead: substring-after(substring-before(substring-after(//span[@class='nc']/a/@href , '?') , '&') , '=') -->
          <empty>
            <var-def name="ChenuYeg">
              <empty>
                <var-def name="vChenu">
                  <empty>
                    <var-def name="queryString">
                      <tokenize delimiters="&amp;"><xpath expression="substring-after(/td/span[@class='nc']/a/@href , '?')"><var name="Mont"/></xpath></tokenize>
                    </var-def>
                  </empty>
                  <tokenize delimiters="="><template>${queryString.get(0).toString()}</template></tokenize>
                </var-def>
              </empty>
              <tokenize delimiters="="><template>${vChenu.get(2).toString()}</template></tokenize>
            </var-def>
          </empty>
          <empty>
            <var-def name="le"><xpath expression="normalize-space(substring-before(//span[@class='nbb11px' and contains (.,'Mont')]/parent::*/text()[5], ' en'))"><var name="Mont"/></xpath></var-def>
            <case>
              <if condition="${le.toString().size()!=0}">
                <!-- Le ... -->
                <var-def name="DuLe"><template>${le.toString()+'/'+anneeMont.toString()}</template></var-def>
                <var-def name="au"/>
                <var-def name="type"><xpath expression="normalize-space(substring-after(//span[@class='nbb11px' and contains (.,'Mont')]/parent::*/text()[5], ' en'))"><var name="Mont"/></xpath></var-def>
              </if>
              <else>
                <!-- Du -->
                <var-def name="DuLe">
                  <empty>
                    <var-def name="du"><xpath expression="normalize-space(//span[@class='nbb11px' and contains (.,'Mont')]/parent::*/text()[5])"><var name="Mont"/></xpath></var-def>
                  </empty>
                  <template>${du.toString()+'/'+anneeMont.toString()}</template>
                </var-def>
                <var-def name="au">
                  <empty>
                    <var-def name="tmp"><xpath expression="normalize-space(substring-before(//span[@class='nbb11px' and contains (.,'Mont')]/parent::*/text()[6] ,' en'))"><var name="Mont"/></xpath></var-def>
                  </empty>
                  <template>${tmp.toString()+'/'+anneeMont.toString()}</template>
                </var-def>
                <var-def name="type"><xpath expression="normalize-space(substring-after(//span[@class='nbb11px' and contains (.,'Mont')]/parent::*/text()[6], ' en'))"><var name="Mont"/></xpath></var-def>
              </else>
            </case>
          </empty>
          <!-- not registered -->
          <empty>
            <var-def name="lnpot"><xpath expression="normalize-space(//text()[contains(.,'pot')][1])"><var name="Mont"/></xpath></var-def>
            <case>
              <if condition="${lnpot.toString().size()!=0}">
                <empty>
                  <var-def name="Nompot"><xpath expression="normalize-space(substring-before(//text()[contains(.,'pot')][1] ,'est né'))"><var name="Mont"/></xpath></var-def>
                  <var-def name="Chenupot"/>
                  <var-def name="DteNais"><xpath expression="substring(normalize-space(substring-after(//text()[contains(.,'pot')][1],' le')),1,10)"><var name="Mont"/></xpath></var-def>
                </empty>
              </if>
              <else>
                <empty>
                  <var-def name="Nompot"><xpath expression="normalize-space(//td[@class='nB11p' and @width='88%']//span[@class='nc']/a)"><var name="Mont"/></xpath></var-def>
                </empty>
                <case>
                  <if condition="${Nompot.toString().size()!=0}">
                    <empty>
                      <var-def name="Chenupot">
                        <empty>
                          <var-def name="vChenu">
                            <empty>
                              <var-def name="queryString">
                                <tokenize delimiters="&amp;"><xpath expression="substring-after(//td[@class='nB11p' and @width='88%']//span[@class='nc']/a/@href , '?')"><var name="Mont"/></xpath></tokenize>
                              </var-def>
                            </empty>
                            <tokenize delimiters="="><template>${queryString.get(0).toString()}</template></tokenize>
                          </var-def>
                        </empty>
                        <tokenize delimiters="="><template>${vChenu.get(2).toString()}</template></tokenize>
                      </var-def>
                    </empty>
                    <empty>
                      <var-def name="DteNais"><xpath expression="normalize-space(substring-after(//td[@class='nB11p' and @width='88%']/text()[position()=2] ,'(e) le'))"><var name="Mont"/></xpath></var-def>
                    </empty>
                  </if>
                  <else>
                    <empty>
                      <var-def name="Chenupot"/>
                      <var-def name="DteNais"/>
                    </empty>
                  </else>
                </case>
              </else>
            </case>
          </empty>
          <empty>
            <var-def name="mort"><xpath expression="normalize-space(substring-after(//text()[contains(.,'mort le')],'mort le'))"><var name="Mont"/></xpath></var-def>
            <case>
              <if condition="${mort.toString().size()!=0}">
                <var-def name="mort"><template>${mort.toString()+'/'+anneeMont.toString()}</template></var-def>
              </if>
              <else>
                <var-def name="mort"/>
              </else>
            </case>
          </empty>
          <empty>
            <database connection="jdbc:odbc:Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:/Users/Eric/Documents/ajeter.accdb" jdbcclass="sun.jdbc.odbc.JdbcOdbcDriver">
              <template>
                insert into Monts (dateInfo, Chenusem, ChenuYeg, dule, au, type, Nompot, Chenupot, Dte_naissance, Dte_deces, nbMont)
                values (
                <db-param><var name="DateInfo"/></db-param>,
                <db-param><var name="Chenusem"/></db-param>,
                <db-param><var name="ChenuYeg"/></db-param>,
                <db-param><var name="DuLe"/></db-param>,
                <db-param><var name="au"/></db-param>,
                <db-param><var name="type"/></db-param>,
                <db-param><var name="Nompot"/></db-param>,
                <db-param><var name="Chenupot"/></db-param>,
                <db-param><var name="DteNais"/></db-param>,
                <db-param><var name="mort"/></db-param>,
                <db-param><var name="nbMont"/></db-param>
                )
              </template>
            </database>
            <!-- look for a twin -->
            <var-def name="lnJumeau"><xpath expression="normalize-space(//text()[contains(.,'Jumeau')][1])"><var name="Mont"/></xpath></var-def>
            <case>
              <if condition="${lnJumeau.toString().size()!=0}">
                <var-def name="Nompot"><xpath expression="normalize-space(//text()[contains(.,'pot')][2])"><var name="Mont"/></xpath></var-def>
                <database connection="jdbc:odbc:Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:/ajeter.accdb" jdbcclass="sun.jdbc.odbc.JdbcOdbcDriver">
                  <template>
                    insert into Monts (dateInfo, Chenusem, ChenuYeg, dule, au, type, Nompot, Chenupot, Dte_naissance, Dte_deces, nbMont)
                    values (
                    <db-param><var name="DateInfo"/></db-param>,
                    <db-param><var name="Chenusem"/></db-param>,
                    <db-param><var name="ChenuYeg"/></db-param>,
                    <db-param><var name="DuLe"/></db-param>,
                    <db-param><var name="au"/></db-param>,
                    <db-param><var name="type"/></db-param>,
                    <db-param><var name="Nompot"/></db-param>,
                    <db-param><var name="Chenupot"/></db-param>,
                    <db-param><var name="DteNais"/></db-param>,
                    <db-param><var name="mort"/></db-param>,
                    <db-param><var name="nbMont"/></db-param>
                    )
                  </template>
                </database>
              </if>
            </case>
          </empty>
        </body>
      </loop>
    </body>
  </loop>
</config>
I used a BeanShell script to convert the parsed data into a
List<Map<String,Object>>, which the script returns. The following code expects
a WH list called calls, reduced with XPath to the elements I need:
Then just read it in with your Java program and write it out to the database
(note, as mentioned in the forums, that the <database> tag has issues):
ScraperConfiguration config = new ScraperConfiguration(
        properties.getProperty("wh.scgov.active.calls.script"));
Scraper scraper = new Scraper(config, properties.getProperty("wh.config.path"));
scraper.addVariableToContext("userAgent", properties.getProperty("wh.user.agent"));
scraper.setDebug(false);
long startTime = System.currentTimeMillis();
scraper.execute();
log.info(String.format("Scraper.execute elapsed time: %d",
        System.currentTimeMillis() - startTime));
Variable listVar = (Variable) scraper.getContext().getVar("callList");
// Make sure list returned from scraper is not null
assertNotNull("Variable returned from scraper cannot be null", listVar);
List<Variable> list = listVar.toList();
Map map = null;
Timestamp timestamp = null;
// Get DBUtils implementation of Access interface
DbAccess db = DbUtilsSingleton.getInstance();
// Get connection from BaseTest
final Connection connection = getDataSource().getConnection();
for (Variable var : list) {
    // Get wrapped object (Map in this case)
    map = (Map) var.getWrappedObject();
    log.debug(map.toString());
    // Convert java.util.Date to java.sql.Timestamp
    timestamp = new Timestamp(((Date) map.get("event_time")).getTime());
    // Insert disp_calls_temp record based on Map data
    db.update(connection, getSqlMap().get("insert.disp.calls.temp"),
            new Object[]{
                Integer.parseInt(map.get("entity_id").toString()),
                map.get("event_num"),
                timestamp,
                map.get("description"),
                map.get("location_main"),
                map.get("location_alt1"),
                map.get("location_alt2"),
                map.get("location_alt3"),
                map.get("case_num")});
}
In addition to some issues with the <database> processor (which could be
fixed easily), I also see an architectural one that makes it impossible to
process bulk data with WH loops. The problem is the lack of iterator support:
there is currently no way to traverse a sequence in a streaming fashion
without first creating a collection.
In the example, when the SELECT is executed, the entire result set is
traversed and all the rows are stored in memory as a list, which is then
passed to the <list> processor as its result value. So all the data has to be
held in memory before the loop can start processing it.
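The difference between the two traversal styles can be sketched in plain Java, independent of Web-Harvest. Everything in this sketch is hypothetical: rowCursor stands in for a database cursor, and the two process methods only count rows. The eager variant mirrors what is described above (all rows copied into a list before the loop starts), while the streaming variant holds a reference to one row at a time.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration; none of these names exist in Web-Harvest.
final class StreamingRows {

    /** Stand-in for a database cursor: produces rowCount rows lazily, one per call to next(). */
    static Iterator<String> rowCursor(final int rowCount) {
        return new Iterator<String>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < rowCount;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return "row-" + next++;
            }
        };
    }

    /** Eager: every row is copied into a list first, so memory use grows with the row count. */
    static int processEagerly(Iterator<String> cursor) {
        List<String> allRows = new ArrayList<>();
        while (cursor.hasNext()) {
            allRows.add(cursor.next());
        }
        int processed = 0;
        for (String row : allRows) {
            processed += handle(row);
        }
        return processed;
    }

    /** Streaming: each row is handled as it arrives and can be garbage-collected afterwards. */
    static int processStreaming(Iterator<String> cursor) {
        int processed = 0;
        while (cursor.hasNext()) {
            processed += handle(cursor.next());
        }
        return processed;
    }

    /** Placeholder for the per-row work; here it just counts non-empty rows. */
    private static int handle(String row) {
        return row.isEmpty() ? 0 : 1;
    }

    public static void main(String[] args) {
        System.out.println("eager:     " + processEagerly(rowCursor(1000)));
        System.out.println("streaming: " + processStreaming(rowCursor(1000)));
    }
}
```

Both variants process the same rows; only the peak memory differs, which is why the row count matters for the eager one.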
I have already done some work on this issue by introducing IteratorVariable to
WH, but the work is not finished yet. For now it only works for the <file> tag
(to traverse the list of files in all subdirectories). I need to come up with
a more generic solution that would work for any WH processor.
Anonymous
-
2011-11-23
I am a bit confused by this discussion...
heysteveo answered in the 5th post with something that seems unrelated to the
initial question I submitted,
and wajda79's answers seem to follow that subject...
wajda79: have you had a look at the script?
Yes, the topic degraded into "can you make variables better". I told you what
works for me in 2.1 trunk: offload the database work to a BeanShell script or
a calling Java class. If you use a Java class as a proxy object you can
persist the data using JDBC, Spring, CSV, etc. and not be tied to a specific
technology. From a separation-of-concerns perspective it probably makes sense
anyway not to parse directly into a database.
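As a rough sketch of that proxy idea, the calling Java code can depend on a small interface and swap the persistence behind it. The names RecordSink, CsvSink and toCsv are invented for this illustration, and the column names are borrowed from the insert code earlier in the thread; a JDBC-backed sink would implement the same interface.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these types exist in Web-Harvest.
final class RecordSinkDemo {

    /** Where scraped records go; the scraping code does not need to know the technology. */
    interface RecordSink {
        void write(Map<String, Object> record);
    }

    /** One possible sink: writes each record as a quoted CSV line with a fixed column order. */
    static final class CsvSink implements RecordSink {
        private final Writer out;
        private final List<String> columns;

        CsvSink(Writer out, List<String> columns) {
            this.out = out;
            this.columns = columns;
        }

        @Override
        public void write(Map<String, Object> record) {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < columns.size(); i++) {
                if (i > 0) {
                    line.append(',');
                }
                Object value = record.get(columns.get(i));
                line.append(quote(value == null ? "" : value.toString()));
            }
            try {
                out.write(line.append('\n').toString());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        /** Wraps a value in double quotes and doubles any embedded quotes. */
        private static String quote(String value) {
            return '"' + value.replace("\"", "\"\"") + '"';
        }
    }

    /** Convenience for the example: sends all records through a CsvSink and returns the text. */
    static String toCsv(List<Map<String, Object>> records, List<String> columns) {
        StringWriter buffer = new StringWriter();
        RecordSink sink = new CsvSink(buffer, columns);
        for (Map<String, Object> record : records) {
            sink.write(record);
        }
        return buffer.toString();
    }
}
```

With this split, the loop over the scraper's result list only calls sink.write(map), and switching from CSV to a database means providing another RecordSink implementation.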
As I said, I can think of two things that might cause the out-of-memory
error: (a) unclosed JDBC resources in the <database> processor and (b) the
bulk data you are loading into the rec variable.
Steven suggested a solution (a very reasonable one, btw) for case (a), while
case (b) is trickier to work around, as it is an architectural limitation of
WH, and that is what iterable variables are being introduced for in WH 2.1.
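For case (a), when the writes are done from your own Java code, try-with-resources (Java 7 and later) guarantees that every handle is closed even if a write fails. The sketch below does not use JDBC at all: TrackedHandle is a made-up stand-in for a Connection or Statement that simply counts how many handles are still open, so the effect can be observed without a database.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; TrackedHandle stands in for a JDBC Connection or Statement.
final class ClosedResources {

    /** Counts how many handles are currently open. */
    static final class TrackedHandle implements AutoCloseable {
        static final AtomicInteger OPEN = new AtomicInteger();

        TrackedHandle() {
            OPEN.incrementAndGet();
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    /**
     * Simulates writing recordCount records, opening two handles per record.
     * Returns the number of handles still open afterwards.
     */
    static int writeAll(int recordCount, boolean failOnLast) {
        for (int i = 0; i < recordCount; i++) {
            try (TrackedHandle connection = new TrackedHandle();
                 TrackedHandle statement = new TrackedHandle()) {
                if (failOnLast && i == recordCount - 1) {
                    throw new IllegalStateException("simulated insert failure");
                }
            } catch (IllegalStateException e) {
                // Both handles were already closed, in reverse order, before this block runs.
            }
        }
        return TrackedHandle.OPEN.get();
    }

    public static void main(String[] args) {
        System.out.println("handles left open: " + writeAll(900, true));
    }
}
```

With real JDBC objects the same try block shape applies, since Connection, Statement and ResultSet all implement AutoCloseable.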
I am running version 2.0 in console mode.
I'm reading one database to write to another: I read an XML record and extract
data from it into another table.
The script is full of <empty> tags like
or
etc...
Reading your latest answers, and especially this thread
https://sourceforge.net/projects/web-harvest/forums/forum/591299/topic/4793833
I thought I had found the reason.
Check http://javarevisited.blogspot.com/2011/09/javalangoutofmemoryerror-permgen-space.html
Cool! Will take a look at it.