I have a lot of HTML documents and need to filter JavaScript and links out of them (maybe more!).
Does anybody have an example of how to do this?
Q2: The parser is used to parse a URL or a local HTML document. I would like to know whether I can give the parser an HTML string and define a filter, so that I get back HTML code that does not contain the tags I have filtered out.
Thanks :)
How can I filter this HTML code:
<html>
<head>
<title>test</title>
</head>
<body>
<p><a href="http://www.google.de">http://www.google.de</a></p>
<p>text</p>
<p>text</p>
<table border="1" width="100%">
<tr>
<td width="50%">table</td>
<td width="50%">table</td>
</tr>
</table>
<form method="POST" action="--WEBBOT-SELF--">
<p><input type="button" value="Button" name="B3"></p>
</form>
<p> the text</p>
</body>
</html>
into this HTML code:
<p>text</p>
<p>text</p>
<table border="1" width="100%">
<tr>
<td width="50%">table</td>
<td width="50%">table</td>
</tr>
</table>
<p> </p>
<p> the text</p>
-----------------------------------------
That means: parse the HTML page, remove the html tag, the title tag, the form tags and the link tags, and return what is between <body> and </body>.
You might try the NodeVisitor pattern. Create a class that extends NodeVisitor (extends, not implements, since NodeVisitor is a class) and overrides some of the methods:
class MyVisitor extends NodeVisitor
{
    // true while the parser is between <BODY> and </BODY>
    boolean inbody = false;

    public void visitTag (Tag tag)
    {
        // inside the body, print every start tag except links
        if (inbody && !tag.getTagName().equals("A"))
            System.out.println (tag.toHtml ());
        if (tag.getTagName().equals("BODY"))
            inbody = true;
    }

    public void visitEndTag (Tag tag)
    {
        // leave the body first, so </BODY> itself is not printed
        if (tag.getTagName().equals("BODY"))
            inbody = false;
        // skip </A> as well, to match the start tags skipped above
        if (inbody && !tag.getTagName().equals("A"))
            System.out.println (tag.toHtml ());
    }
}
Then run through all the nodes with:
parser.visitAllNodesWith(new MyVisitor ());
Thank you for the help. That works, but it does not return the text along with the tags I need! I am getting back just the tags.
I am getting this:
<p></p>
<p></p>
<table border="1" width="100%">
<tr>
<td width="50%"></td>
<td width="50%"></td>
</tr>
</table>
<p></p>
<p> </p>
It should be this:
<p>text</p>
<p>text</p>
<table border="1" width="100%">
<tr>
<td width="50%">table</td>
<td width="50%">table</td>
</tr>
</table>
<p> </p>
<p> the text</p>
Add this method to the visitor:
public void visitStringNode (StringNode stringNode)
{
    // print the text between tags, but only inside the body
    if (inbody)
        System.out.println (stringNode.toHtml ());
}
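On the second question (handing over an HTML string and getting the filtered HTML back as a string), here is a rough, self-contained sketch of the same three callbacks collecting into a StringBuilder instead of printing. Note that it does not use the HTML Parser library at all: a small regular expression stands in for the parser so the example runs on its own, and the names BodyFilterSketch, filter and DROPPED are made up for illustration. It is not a real HTML parser (comments, scripts and stray "<" characters are not handled), so treat it only as an outline of the idea.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in, NOT the HTML Parser library: a regex splits the input
// into start tags, end tags and text, and calls visitor-style methods for each.
final class BodyFilterSketch {

    // Tag names to drop (assumed list, adjust as needed).
    private static final Set<String> DROPPED =
            new HashSet<>(Arrays.asList("A", "FORM", "INPUT"));

    // Either a tag (group 1 = "/" for end tags, group 2 = tag name) or a run of text.
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|[^<]+");

    private final StringBuilder out = new StringBuilder();
    private boolean inBody = false;

    void visitTag(String name, String html) {
        // Inside the body, keep every start tag that is not on the drop list.
        if (inBody && !DROPPED.contains(name)) out.append(html);
        if (name.equals("BODY")) inBody = true;
    }

    void visitEndTag(String name, String html) {
        // Leave the body first, so </body> itself is not copied.
        if (name.equals("BODY")) inBody = false;
        if (inBody && !DROPPED.contains(name)) out.append(html);
    }

    void visitStringNode(String text) {
        // Text is kept only while inside the body.
        if (inBody) out.append(text);
    }

    // Takes an HTML string and returns the filtered body content as a string.
    static String filter(String html) {
        BodyFilterSketch visitor = new BodyFilterSketch();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(2) == null) {
                visitor.visitStringNode(m.group());
            } else {
                String name = m.group(2).toUpperCase(Locale.ROOT);
                if (m.group(1).isEmpty()) {
                    visitor.visitTag(name, m.group());
                } else {
                    visitor.visitEndTag(name, m.group());
                }
            }
        }
        return visitor.out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>test</title></head>"
                + "<body><p>text</p><form><p><input type=\"button\"></p></form></body></html>";
        System.out.println(filter(html));
    }
}
```

This version only drops the listed tags themselves; text inside a dropped element, such as the visible text of a link, is still copied. With the real library, the same StringBuilder idea should carry over to the MyVisitor class above by appending tag.toHtml () and stringNode.toHtml () instead of printing them.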