I am using HTML Parser v1.3 for parsing HTML. For some reason, it's ignoring the <input> tags inside a table. Is there something I am missing?
The program, sample file and output are below:
<pre>
Java Program:
import org.htmlparser.tags.*;
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.*;
import org.htmlparser.scanners.*;
import java.util.*;

public class HTMLParserbugtest {
    private Parser parser;

    public static void main(String[] args) {
        if (args.length > 0) {
            HTMLParserbugtest parsetest = new HTMLParserbugtest();
            parsetest.initializeParser(args[0]);
        } else {
            System.out.println("Give me something to parse");
        }
    }

    // Initialize the HTML parser
    private void initializeParser(String location) {
        try {
            this.parser = new Parser(location, null);
            this.parser.registerScanners();
            this.extractInputTags();
        } catch (ParserException pe) {
            pe.printStackTrace(System.err);
        }
    }

    // Extract from the HTML what we want
    private void extractInputTags() {
        try {
            for (NodeIterator nodes = this.parser.elements(); nodes.hasMoreNodes();) {
                Node node = nodes.nextNode();
                if (node instanceof FormTag) {
                    FormTag formTag = (FormTag) node;
                    String formName = formTag.getFormName();
                    String formAction = formTag.getFormLocation();
                    // If the form doesn't have a name, assume one
                    if ((formName == null) || (formName.equals(""))) {
                        formName = "MYFORMWITHNONAMEONIT";
                    }
                    System.out.println("Parsing form " + formName);
                    boolean firstpage = false;
                    NodeList inputtags = formTag.getFormInputs();
                    NodeIterator inputnodes = inputtags.elements();
                    while (inputnodes.hasMoreNodes()) {
                        InputTag inputtag = (InputTag) inputnodes.nextNode();
                        Hashtable mytable = inputtag.getAttributes();
                        String name = (String) mytable.get("NAME");
                        String value = (String) mytable.get("VALUE");
                        String type = (String) mytable.get("TYPE");
                        System.out.println("Name: " + name + " Value: " + value);
                        // Constant-first comparison avoids a NullPointerException
                        // when the tag has no TYPE attribute
                        if ("hidden".equalsIgnoreCase(type)) {
                            if (value != null) {
                                System.out.println("HIDDEN Name: " + name + " Value: " + value);
                            }
                        }
                    }
                }
                if (node instanceof LinkTag) {
                    LinkTag linkTag = (LinkTag) node;
                    if (linkTag.isHTTPLikeLink()) {
                        String linktext = linkTag.getLinkText();
                        String link = linkTag.getLink();
                        System.out.println("Link text: " + linktext + " Link: " + link);
                    }
                }
            }
        } catch (ParserException pe) {
            pe.printStackTrace(System.err);
        }
    }
}
===================================
HTML File:
<html>
<body>
<form action="/cgi-bin/test.pl" method="post">
<table><tr><td>
<INPUT type=hidden NAME="test1" VALUE="insidetable">
</td></tr>
</table>
<INPUT type=hidden NAME="Test2" VALUE="outsidetable">
<INPUT type=hidden name="a" value="b">
</form>
</body>
</html>
==================================
Output:
Parsing form MYFORMWITHNONAMEONIT
Name: Test2 Value: outsidetable
HIDDEN Name: Test2 Value: outsidetable
Name: a Value: b
HIDDEN Name: a Value: b
</pre>
Looks like a bug, although the library code looks correct.
The input tags come from a recursive examination of all the children:
this.formInputList = compositeTagData.getChildren().searchFor(InputTag.class, true);
That second argument of 'true' says recursive.
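To illustrate what that flag is meant to do, here is a minimal, self-contained sketch of a recursive versus non-recursive child search. It is a toy model, not the library's implementation: ToyNode, ToyInput, searchFor and sampleForm are hypothetical names invented for this example, and the tree is built by hand to mirror the sample HTML file rather than produced by the parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model; none of these classes come from HTML Parser.
class RecursiveSearchSketch {

    // A node that may contain child nodes, like a composite tag.
    static class ToyNode {
        final String label;
        final List<ToyNode> children = new ArrayList<>();

        ToyNode(String label) { this.label = label; }

        ToyNode add(ToyNode child) {
            children.add(child);
            return this;
        }
    }

    // Stand-in for an <input> tag.
    static class ToyInput extends ToyNode {
        ToyInput(String name) { super(name); }
    }

    // Collects descendants of 'parent' that are instances of 'type'.
    // Only direct children are examined unless 'recursive' is true.
    static List<ToyNode> searchFor(ToyNode parent, Class<? extends ToyNode> type, boolean recursive) {
        List<ToyNode> found = new ArrayList<>();
        for (ToyNode child : parent.children) {
            if (type.isInstance(child)) {
                found.add(child);
            }
            if (recursive) {
                found.addAll(searchFor(child, type, true));
            }
        }
        return found;
    }

    // Hand-built tree with the same shape as the sample file:
    // form -> table -> tr -> td -> input, plus two inputs directly under the form.
    static ToyNode sampleForm() {
        ToyNode cell = new ToyNode("td").add(new ToyInput("test1"));
        ToyNode table = new ToyNode("table").add(new ToyNode("tr").add(cell));
        return new ToyNode("form")
                .add(table)
                .add(new ToyInput("Test2"))
                .add(new ToyInput("a"));
    }

    public static void main(String[] args) {
        ToyNode form = sampleForm();
        System.out.println(searchFor(form, ToyInput.class, false).size()); // prints 2
        System.out.println(searchFor(form, ToyInput.class, true).size());  // prints 3
    }
}
```

In this toy model the non-recursive search misses the input nested in the table, which matches the symptom in the reported output. That does not establish the cause in the real parser, since the library call does pass 'true'.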
In most anomalous cases like this, I find that the table is not correctly formed and a row or column has consumed too little or too much, but this example looks too simple for that to have gone wrong.
As a workaround, try using just a bald "Node [] list = parser.extractAllNodesThatAre(InputTag.class);".
In any case it looks like you should file a bug report.