I would like to remove HTML tags from a text, but I need to keep any combination of comparison operators (e.g. “1 < 2 < 3” or “Monday < Tuesday and Friday > Wednesday”). In other words, I need to keep unknown HTML tags.
So how can I differentiate nodes that look like tags but are not HTML tags?
The parser itself will take care of this.
I am not satisfied with the parser's output. Let's have a look at this example:
import org.htmlparser.Node;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.nodes.TextNode;

public class parseTags {
    public static void main(String[] args) throws Exception {
        String myHtml = "<span>blabla</span><BR>and other blabla<H1>Big</H1><font face=\"arial\" color=\"RED\" size=\"2\"><b>font and so on</b></font> plus <mytag>and now, 2<3, but <hidden text> and also <Why can I see this ?> more <<<<<<<(7) and >>>>>>>>>(9)";
        // Collect only the text nodes; everything the lexer treats as a tag is dropped.
        StringBuilder textDescription = new StringBuilder();
        Lexer lex = new Lexer(myHtml);
        Node nono = lex.nextNode();
        while (nono != null) {
            if (nono instanceof TextNode) {
                textDescription.append(nono.getText());
            }
            nono = lex.nextNode();
        }
        System.out.println(textDescription);
    }
}
This returns:
"blablaand other blablaBigfont and so on plus and now, 2<3, but and also more <<<<<<<(7) and >>>>>>>>>(9)"
So, "<mytag>", "<hidden text>" and "<Why can I see this ?>" are missing.
What am I doing wrong?
Dear norb,
My solutions may not be efficient, but here are two that may help you.
1) Register all the tags that should be recognized as HTML.
The Javadoc describes how to create your own custom tags, e.g.:
import org.htmlparser.tags.CompositeTag;

public class MyFontTag extends CompositeTag {
    /**
     * The set of names handled by this tag.
     */
    private static final String[] mIds = new String[] {"FONT", "H1", "SPAN", "BR", "B"};

    /**
     * The set of tag names that cause this tag to finish.
     */
    private static final String[] mEndTagEnders = new String[] {"BODY", "HTML", "TABLE", "TD", "TR", "FONT"};

    /**
     * Create a new tag that uses the default composite scanner.
     */
    public MyFontTag() {
        setThisScanner(mDefaultCompositeScanner);
    }

    /**
     * Return the set of end tag names that cause this tag to finish.
     */
    public String[] getEndTagEnders() {
        return mEndTagEnders;
    }

    /**
     * Return the set of tag names that cause this tag to finish.
     */
    public String[] getEnders() {
        return mEndTagEnders;
    }

    /**
     * Return the set of names handled by this tag.
     * @return The names to be matched that create tags of this type.
     */
    public String[] getIds() {
        return mIds;
    }
}
Now, in your program:
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.tags.CompositeTag;
import org.htmlparser.util.NodeIterator;

public class parseTags {
    public static void main(String[] args) throws Exception {
        String myHtml = "<span>blabla</span><BR>and other blabla<H1>Big</H1><font face=\"arial\" color=\"RED\" size=\"2\"><b>font and so on</b></font> plus <mytag>and now, 2<3, but <hidden text> and also <Why can I see this ?> more <<<<<<<(7) and >>>>>>>>>(9)";
        Lexer lex = new Lexer(myHtml);
        Parser parser = new Parser(lex);
        PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
        factory.registerTag(new MyFontTag());
        parser.setNodeFactory(factory);
        for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
            Node node = e.nextNode();
            if (node instanceof CompositeTag) {
                // Composite tag (e.g. one handled by MyFontTag): keep only its text content.
                System.out.println(node.toPlainTextString());
            } else {
                // Text or any other node: print it unchanged.
                System.out.println(node.toHtml());
            }
        }
    }
}
2) Create a table of known HTML tag names and check each tag's name against it, since the problem is caused by the parser's relaxed handling of tags.
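To illustrate option 2, here is a minimal sketch of the lookup-table idea. Note that it does not use the HTML Parser API at all: it swaps in java.util.regex from the standard library so that it can be run on its own. The class name KnownTagStripper, the method stripKnownTags and the very short KNOWN_TAGS table are all made up for this example, and a real table would need many more names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KnownTagStripper {
    // Hypothetical, deliberately short table of known tag names (lower case).
    private static final Set<String> KNOWN_TAGS = new HashSet<>(Arrays.asList(
            "span", "br", "h1", "font", "b", "i", "p", "div", "a"));

    // Matches anything shaped like "<name ...>" or "</name ...>" and captures the name.
    // A '<' that is not directly followed by a letter never matches.
    private static final Pattern TAG_LIKE =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)\\b[^<>]*>");

    // Removes tag-shaped fragments whose name is in the table; keeps everything else.
    static String stripKnownTags(String input) {
        Matcher m = TAG_LIKE.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            String name = m.group(1).toLowerCase(Locale.ROOT);
            if (!KNOWN_TAGS.contains(name)) {
                out.append(m.group()); // unknown name: keep the fragment verbatim
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripKnownTags("<b>1 < 2 < 3</b> and <mytag>"));
    }
}
```

With this table, "<mytag>", "<hidden text>" and "<Why can I see this ?>" survive because their names are not listed, and "2<3" survives because '<' is followed by a digit. One limitation to keep in mind: a comparison written without spaces whose right-hand side happens to be a known tag name, such as "a<b and c>d", would still be mistaken for a tag and removed.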
Ok, that's what I was afraid of. Thanks a lot.