HTML Parser / Discussion / Help: extracting stuff before/after a filter match

zerodrift - 2006-02-03

I'd like to match items before or after a specific filter match, e.g., the next three lines after a StringFilter match.

So, the HTML code is:

<P><CENTER>Loads are calculated from raw telemetry data and are approximate.</CENTER>
<CENTER>The displayed values are NOT official PJM Loads.</CENTER>

<BR><BR><BR>

<P><CENTER><H2>Current PJM Transmission Limits</H2></CENTER>
<P>Contingency WYLIERID500 KV WYLIERID TRAN 5 (IROL)
<P>Monitor WYLIERID500 KV WYLIERID TRAN 7 (IROL) -> Redispatch

</BODY>
</HTML>

and I want to extract everything (without the HTML markup) after "Current PJM Transmission Limits". The text you see now may not always be the same but the "Current PJM Transmission Limits" title will always be the same.

Any help please would be appreciated!!

- Derrick Oswald - 2006-02-05
  
  Yes, just use the children of the parent of the node you have.
  
  Some methods were recently added on AbstractNode (which TextNode inherits from) to handle this...
  
  getPreviousSibling() and getNextSibling()
  
  These are only available in the latest Integration Build.
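  
  As a rough illustration of sibling navigation (not the library's own code): the sketch below uses a made-up SimpleNode class as a stand-in for a parsed node, so every name in it is an assumption for the example. It only shows how getNextSibling() and getPreviousSibling() can be derived from the parent's child list, and how to collect everything that follows a given node.
  
  ```java
  import java.util.ArrayList;
  import java.util.List;

  // Hypothetical stand-in for a parsed node; this is NOT the library's AbstractNode.
  class SimpleNode {
      final String text;
      SimpleNode parent;
      final List<SimpleNode> children = new ArrayList<>();

      SimpleNode(String text) { this.text = text; }

      // Attach a child and return this node so calls can be chained.
      SimpleNode add(SimpleNode child) {
          child.parent = this;
          children.add(child);
          return this;
      }

      // Next child of the same parent, or null if this node is last or has no parent.
      SimpleNode getNextSibling() {
          if (parent == null) return null;
          int i = parent.children.indexOf(this);
          return (i + 1 < parent.children.size()) ? parent.children.get(i + 1) : null;
      }

      // Previous child of the same parent, or null if this node is first or has no parent.
      SimpleNode getPreviousSibling() {
          if (parent == null) return null;
          int i = parent.children.indexOf(this);
          return (i > 0) ? parent.children.get(i - 1) : null;
      }
  }

  class SiblingDemo {
      // Collect the text of every sibling that follows the given node.
      static List<String> textAfter(SimpleNode node) {
          List<String> out = new ArrayList<>();
          for (SimpleNode n = node.getNextSibling(); n != null; n = n.getNextSibling()) {
              out.add(n.text);
          }
          return out;
      }

      public static void main(String[] args) {
          SimpleNode body = new SimpleNode("BODY");
          SimpleNode heading = new SimpleNode("Current PJM Transmission Limits");
          body.add(new SimpleNode("Loads are approximate."))
              .add(heading)
              .add(new SimpleNode("Contingency ..."))
              .add(new SimpleNode("Monitor ..."));
          System.out.println(textAfter(heading));
      }
  }
  ```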
  
  - zerodrift - 2006-02-05
    
    Can you please supply an example snippet as a starter?
    
    It would be easier to understand.
    
    Here's where I'm stuck...
    
    ----
    try
            {
                URLTunnelReader in = new URLTunnelReader();
                InputStream inStream = in.GetSecureConnection(args[0]);
                Page pg = new Page(inStream, "ISO-8859-1");
                Lexer lex = new Lexer(pg);
                Parser parser = new Parser(lex);
                StringFilter filter = new StringFilter();
                filter.setCaseSensitive(false);
                filter.setPattern("Current PJM Transmission Limits");
    
                NodeList list = parser.parse(filter);
            }
    
  - zerodrift - 2006-02-05
    
    with this code:
    
    ----
    try
            {
                URLTunnelReader in = new URLTunnelReader();
                InputStream inStream = in.GetSecureConnection(args[0]);
                Page pg = new Page(inStream, "ISO-8859-1");
                Lexer lex = new Lexer(pg);
                Parser parser = new Parser(lex);
                StringFilter filter = new StringFilter();
                filter.setCaseSensitive(false);
                filter.setPattern("Current PJM Transmission Limits");
                NodeList list = parser.parse(filter);
    
                Node node = list.elementAt(0);
    
                Node peernode = node.getNextSibling();
    
                System.out.println(peernode.getText());
    
            }
    
    ----
    
    and this html page:
    
    ====
    
    </TABLE>
    </CENTER>
    
    <P><CENTER>Loads are calculated from raw telemetry data and are approximate.</CENTER>
    <CENTER>The displayed values are NOT official PJM Loads.</CENTER>
    
    <BR><BR><BR>
    
    <P><CENTER><H2>Current PJM Transmission Limits</H2></CENTER>
    <P>Contingency LINE    500 KV MTSTORM-PRUNTYTO
    <P>Reacinf-ctg BED-BLA -> Redispatch
    
    </BODY>
    </HTML>
    
    ====
    
    I get a Null pointer exception...
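    
    One possible reason, assuming the parser nests the title text inside the H2 tag: the node the filter returns would then be the only child of its parent, so getNextSibling() returns null and the getText() call on it throws. The sketch below reproduces that situation with a made-up TreeNode class (a stand-in, not a library class) and shows one way around it: climb to the nearest ancestor that does have a next sibling.
    
    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical stand-in for a parsed node; not a class from the library.
    class TreeNode {
        final String text;
        TreeNode parent;
        final List<TreeNode> children = new ArrayList<>();

        TreeNode(String text) { this.text = text; }

        // Attach a child and return this node so calls can be chained.
        TreeNode add(TreeNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        // Next child of the same parent, or null if there is none.
        TreeNode getNextSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return (i + 1 < parent.children.size()) ? parent.children.get(i + 1) : null;
        }
    }

    class ClimbDemo {
        // Walk up from the match until some node on the way has a next sibling.
        // Returns null when nothing follows the match anywhere in the tree.
        static TreeNode nextSiblingOrAncestorSibling(TreeNode node) {
            for (TreeNode n = node; n != null; n = n.parent) {
                TreeNode next = n.getNextSibling();
                if (next != null) return next;
            }
            return null;
        }

        public static void main(String[] args) {
            TreeNode title = new TreeNode("Current PJM Transmission Limits");
            TreeNode h2 = new TreeNode("H2").add(title);
            new TreeNode("BODY").add(h2).add(new TreeNode("Contingency LINE ..."));

            // The title is the only child of H2, so it has no sibling of its own.
            TreeNode direct = title.getNextSibling();
            System.out.println(direct == null ? "(no direct sibling)" : direct.text);

            // Always check for null before dereferencing the result.
            TreeNode next = nextSiblingOrAncestorSibling(title);
            System.out.println(next == null ? "(nothing follows)" : next.text);
        }
    }
    ```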
    
- sidhu - 2006-02-06
  
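  Dear zerodrift,
  you can create a new filter:
  
  import org.htmlparser.Node;
  import org.htmlparser.NodeFilter;
  
  // Rejects nodes up to and including the first match of the wrapped filter,
  // then accepts every node that follows.
  public class AfterFilter implements NodeFilter {
      private final NodeFilter filter;
      private boolean ret;   // true once the wrapped filter has matched
  
      public AfterFilter(NodeFilter nFilter) {
          filter = nFilter;
          ret = false;
      }
  
      public boolean accept(Node node) {
          if (!ret && filter.accept(node)) {
              ret = true;
              return false;   // skip the matching node itself
          }
          return ret;
      }
  }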
  
  Your code then becomes:
  ----
  try
  {
      URLTunnelReader in = new URLTunnelReader();
      InputStream inStream = in.GetSecureConnection(args[0]);
      Page pg = new Page(inStream, "ISO-8859-1");
      Lexer lex = new Lexer(pg);
      Parser parser = new Parser(lex);
      NodeFilter filter = new AfterFilter(new StringFilter("Current PJM Transmission Limits"));
      NodeList list = parser.extractAllNodesThatMatch(filter);
      for (int i = 0; i < list.size(); i++) {
          System.out.println(list.elementAt(i).toPlainTextString());
      }
  }
  catch (Exception e)
  {
      e.printStackTrace();
  }
  
- sidhu - 2006-02-06
  
  Similarly, you can write a before filter.
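  
  A before filter can follow the same pattern with the flag logic inverted: accept everything until the wrapped test first matches, then reject the rest. The sketch below is only a stand-in that uses java.util.function.Predicate on plain strings instead of the library's NodeFilter; the names BeforePredicate and BeforeDemo are invented for the example.
  
  ```java
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;
  import java.util.function.Predicate;

  // Accepts items until the wrapped test first matches; the matching item
  // and everything after it are rejected. Generic stand-in, not a NodeFilter.
  class BeforePredicate<T> implements Predicate<T> {
      private final Predicate<T> marker;
      private boolean seen = false;   // true once the marker has matched

      BeforePredicate(Predicate<T> marker) { this.marker = marker; }

      @Override
      public boolean test(T item) {
          if (!seen && marker.test(item)) {
              seen = true;
          }
          return !seen;
      }
  }

  class BeforeDemo {
      // Return the lines that come before the first line containing the title.
      static List<String> linesBefore(List<String> lines, String title) {
          Predicate<String> before = new BeforePredicate<String>(s -> s.contains(title));
          List<String> out = new ArrayList<>();
          for (String line : lines) {
              if (before.test(line)) out.add(line);
          }
          return out;
      }

      public static void main(String[] args) {
          List<String> lines = Arrays.asList(
                  "Loads are calculated from raw telemetry data and are approximate.",
                  "Current PJM Transmission Limits",
                  "Contingency ...");
          System.out.println(linesBefore(lines, "Current PJM Transmission Limits"));
      }
  }
  ```
  
  Like AfterFilter, this keeps state between calls, so a fresh instance is needed for each pass over the data.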
  
  - zerodrift - 2006-02-06
    
    Thanks sidhu,
    
    But this still isn't 100% clear. Do you have a sample that uses getNextSibling()?
    
    I'm looking for something more intuitive... :-)
    
    Thanks,
    
    - zerodrift - 2006-02-06
      
      This one seems to work...
      
              try
              {
                  URLTunnelReader in = new URLTunnelReader();
                  InputStream inStream = in.GetSecureConnection(args[0]);
                  Page pg = new Page(inStream, "ISO-8859-1");
                  Lexer lex = new Lexer(pg);
                  Parser parser = new Parser(lex);
                  NodeFilter filter = new StringFilter("Current PJM Transmission Limits");
                  NodeList list = parser.parse(filter);
      
                  Node parentnode = list.elementAt(0).getParent();
                  System.out.println(parentnode.toPlainTextString());
      
                  while (parentnode.getNextSibling() != null) {
                      parentnode = parentnode.getNextSibling();
                      System.out.println(parentnode.toPlainTextString());
                  }
      
      
              }
      
- sidhu - 2006-02-07
  
  Dear zerodrift,
  there is no problem with traversing the tree if it gets you the data you want.
  Sorry, I don't have code for a similar problem at present.
  