I'd like to match items before or after a specific filter match, e.g., the next three lines after a StringFilter match.
So, the HTML code is:
<P><CENTER>Loads are calculated from raw telemetry data and are approximate.</CENTER>
<CENTER>The displayed values are NOT official PJM Loads.</CENTER>
<BR><BR><BR>
<P><CENTER><H2>Current PJM Transmission Limits</H2></CENTER>
<P>Contingency WYLIERID500 KV WYLIERID TRAN 5 (IROL)
<P>Monitor WYLIERID500 KV WYLIERID TRAN 7 (IROL) -> Redispatch
</BODY>
</HTML>
I want to extract everything (without the HTML markup) after "Current PJM Transmission Limits". The text that follows may change, but the "Current PJM Transmission Limits" title will always be the same.
Any help would be appreciated!
Yes, just use the children of the parent of the node you have.
Some methods were recently added to AbstractNode (which TextNode inherits from) to handle this: getPreviousSibling() and getNextSibling().
These are only available in the latest Integration Build.
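To illustrate the sibling idea without depending on a particular build, here is a minimal standalone sketch. SimpleNode and SiblingWalk are hypothetical stand-ins written for this example, not org.htmlparser classes. The point is only that a node's next sibling is looked up through its parent's child list, and that a node which is the only child of its parent has no next sibling.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in tree node for illustration only; NOT an org.htmlparser class.
class SimpleNode {
    final String text;
    SimpleNode parent;
    final List<SimpleNode> children = new ArrayList<>();

    SimpleNode(String text) {
        this.text = text;
    }

    // Appends a child and returns this node so calls can be chained.
    SimpleNode add(SimpleNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    // The entry after this node in the parent's child list, or null.
    SimpleNode nextSibling() {
        if (parent == null) return null;
        int i = parent.children.indexOf(this);
        return (i + 1 < parent.children.size()) ? parent.children.get(i + 1) : null;
    }

    // The entry before this node in the parent's child list, or null.
    SimpleNode previousSibling() {
        if (parent == null) return null;
        int i = parent.children.indexOf(this);
        return (i > 0) ? parent.children.get(i - 1) : null;
    }
}

class SiblingWalk {
    // Collects the text of up to 'limit' siblings that follow 'start'.
    static List<String> following(SimpleNode start, int limit) {
        List<String> out = new ArrayList<>();
        for (SimpleNode n = start.nextSibling();
             n != null && out.size() < limit;
             n = n.nextSibling()) {
            out.add(n.text);
        }
        return out;
    }
}
```

In this stand-in tree, a title text that sits alone inside its heading has a null next sibling, so the walk would start from the heading (the parent) instead, for example SiblingWalk.following(title.parent, 3) for the next three items.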
Can you please supply an example snippet as a starter? It would be easier to understand.
Here's where I'm stuck...
----
try
{
    // URLTunnelReader is not shown in this thread
    URLTunnelReader in = new URLTunnelReader();
    InputStream inStream = in.GetSecureConnection(args[0]);
    Page pg = new Page(inStream, "ISO-8859-1");
    Lexer lex = new Lexer(pg);
    Parser parser = new Parser(lex);
    StringFilter filter = new StringFilter();
    filter.setCaseSensitive(false);
    filter.setPattern("Current PJM Transmission Limits");
    NodeList list = parser.parse(filter);
    // stuck here: how to get the nodes that follow the match?
}
catch (Exception e)
{
    e.printStackTrace();
}
----
With this code:
----
try
{
    URLTunnelReader in = new URLTunnelReader();
    InputStream inStream = in.GetSecureConnection(args[0]);
    Page pg = new Page(inStream, "ISO-8859-1");
    Lexer lex = new Lexer(pg);
    Parser parser = new Parser(lex);
    StringFilter filter = new StringFilter();
    filter.setCaseSensitive(false);
    filter.setPattern("Current PJM Transmission Limits");
    NodeList list = parser.parse(filter);
    Node node = list.elementAt(0);
    Node peernode = node.getNextSibling();
    System.out.println(peernode.getText());
}
----
and this html page:
====
</TABLE>
</CENTER>
<P><CENTER>Loads are calculated from raw telemetry data and are approximate.</CENTER>
<CENTER>The displayed values are NOT official PJM Loads.</CENTER>
<BR><BR><BR>
<P><CENTER><H2>Current PJM Transmission Limits</H2></CENTER>
<P>Contingency LINE 500 KV MTSTORM-PRUNTYTO
<P>Reacinf-ctg BED-BLA -> Redispatch
</BODY>
</HTML>
====
I get a NullPointerException...
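Dear zerodrift,
You can create a new filter:
import org.htmlparser.Node;
import org.htmlparser.NodeFilter;

// Accepts every node visited after the first node that the wrapped filter matches.
public class AfterFilter implements NodeFilter {
    private final NodeFilter filter;
    private boolean ret; // becomes true once the wrapped filter has matched

    public AfterFilter(NodeFilter nFilter) {
        filter = nFilter;
        ret = false;
    }

    public boolean accept(Node node) {
        if (!ret && filter.accept(node)) {
            ret = true;
            return false; // the matching node itself is not accepted
        }
        return ret;
    }
}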
Your code then becomes:
----
try
{
    URLTunnelReader in = new URLTunnelReader();
    InputStream inStream = in.GetSecureConnection(args[0]);
    Page pg = new Page(inStream, "ISO-8859-1");
    Lexer lex = new Lexer(pg);
    Parser parser = new Parser(lex);
    NodeFilter filter = new AfterFilter(new StringFilter("Current PJM Transmission Limits"));
    NodeList list = parser.extractAllNodesThatMatch(filter);
    // print the plain text of every node accepted after the match
    for (int i = 0; i < list.size(); i++) {
        System.out.println(list.elementAt(i).toPlainTextString());
    }
}
catch (Exception e)
{
    e.printStackTrace();
}
----
Similarly, you can write a "before" filter.
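A "before" variant can be sketched the same way. The version below is a standalone illustration that uses java.util.function.Predicate instead of NodeFilter, so it runs without the parser. BeforePredicate and BeforeDemo are hypothetical names, and the list of strings stands in for nodes visited in document order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Accepts items until the wrapped predicate first matches, then rejects
// the matching item and everything after it. Illustration only.
class BeforePredicate<T> implements Predicate<T> {
    private final Predicate<T> inner;
    private boolean seen = false; // true once the wrapped predicate has matched

    BeforePredicate(Predicate<T> inner) {
        this.inner = inner;
    }

    @Override
    public boolean test(T item) {
        if (!seen && inner.test(item)) {
            seen = true;
        }
        return !seen;
    }
}

class BeforeDemo {
    // Applies the predicate to each item in order and keeps the accepted ones.
    static List<String> keep(List<String> items, Predicate<String> p) {
        List<String> out = new ArrayList<>();
        for (String s : items) {
            if (p.test(s)) {
                out.add(s);
            }
        }
        return out;
    }
}
```

Like AfterFilter, this is stateful: it assumes items are visited in document order and that a fresh instance is used for each pass. On a real tree walked parent-first, containers that enclose the match are visited before it and would be accepted too, so some extra filtering may be needed there.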
Thanks Siddhu,
But this still isn't 100% clear. Do you have a sample using getNextSibling()?
I'm looking for something more intuitive... :-)
Thanks,
This one seems to work...
try
{
    URLTunnelReader in = new URLTunnelReader();
    InputStream inStream = in.GetSecureConnection(args[0]);
    Page pg = new Page(inStream, "ISO-8859-1");
    Lexer lex = new Lexer(pg);
    Parser parser = new Parser(lex);
    NodeFilter filter = new StringFilter("Current PJM Transmission Limits");
    NodeList list = parser.parse(filter);
    // start from the parent of the matched text node, then walk its following siblings
    Node parentnode = list.elementAt(0).getParent();
    System.out.println(parentnode.toPlainTextString());
    while (parentnode.getNextSibling() != null) {
        parentnode = parentnode.getNextSibling();
        System.out.println(parentnode.toPlainTextString());
    }
}
catch (Exception e)
{
    e.printStackTrace();
}
Dear zerodrift,
There is no problem with traversing the tree if it gets you the data you want.
Sorry, I don't have code for a similar problem at hand right now.