
Regexp-result/template question

Help
pete
2006-10-31
2012-09-04
  • pete

    pete - 2006-10-31

    hello

    Great package. Thank you.

    I need to pre-process the output from <html-to-xml> with a regular expression to fix a missing ">" on a <TBODY> tag.

    Is it possible to do this with the regex functionality (or alternative)?

    It is unclear to me how to use the regexp-result/template to accomplish this.

    Could someone please explain this in more detail?

    Thanks,

    -Pete

    <file action="write" path="replacedTbody.txt">
        <regexp replace="true">
            <regexp-pattern>(&lt;tbody)([^&gt;])</regexp-pattern>
            <regexp-source>
                <text>
                    <html-to-xml>
                        <http url="${url_start}" />
                    </html-to-xml>
                </text>
            </regexp-source>
            <regexp-result>
                <template>${_1}&gt;${_2}</template>
            </regexp-result>
        </regexp>
    </file>

     
    • Vladimir Nikic

      Vladimir Nikic - 2006-10-31

      As far as I can see, your example works fine; it does the job.
      html-to-xml itself cannot recover a missing < or >.
      Regular expressions are probably the best choice for this kind of text processing.

      When replace="true" is specified in the regexp processor, only the text matching the pattern (in your case, <tbody followed by any character other than >) is replaced with the value from regexp-result; the rest is copied from regexp-source as is.

      I have tried a data example with your config and it works fine.
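
To illustrate the replace behaviour described above outside Web-Harvest, here is a minimal sketch using plain java.util.regex. It assumes the regexp processor behaves like Matcher.replaceAll, with $1 and $2 standing in for ${_1} and ${_2}; the class name TbodyFix and the sample inputs are made up for illustration.

```java
import java.util.regex.Pattern;

public class TbodyFix {
    // Same pattern as in the config above, with the XML entities decoded.
    private static final Pattern BROKEN_TBODY = Pattern.compile("(<tbody)([^>])");

    // Replaces only the matched text; everything else is copied through unchanged.
    static String fix(String html) {
        // $1 and $2 are assumed to correspond to ${_1} and ${_2} in the template.
        return BROKEN_TBODY.matcher(html).replaceAll("$1>$2");
    }

    public static void main(String[] args) {
        // A <tbody tag that lost its closing ">" gets repaired.
        System.out.println(fix("<table><tbody\n<tr><td>x</td></tr></tbody></table>"));
        // An already well-formed <tbody> is left alone.
        System.out.println(fix("<table><tbody><tr><td>x</td></tr></tbody></table>"));
    }
}
```

One caveat with this pattern: it also fires on a well-formed tag with attributes, turning <tbody id="a"> into <tbody> id="a">, so it is only safe when the tbody tags in the input carry no attributes.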

       
    • pete

      pete - 2006-11-01

      Thanks for the quick response!

      Upon closer examination, it looked like I needed to use JTidy to preprocess the HTML because it could better resolve the <TBODY tag. I ended up adding a simple step that calls JTidy before the TagSoup call. I could not get the HTTP client to be happy with JTidy output (even with xmlout=true), hence the inline JTidy call prior to the TagSoup call. It is probably not very efficient, but I am learning! The following trivial code needs some care and feeding, but maybe it will help someone with a similar problem.

      package org.webharvest.runtime.html;

      import java.io.*;

      import org.webharvest.exception.ParserException;
      import org.webharvest.utils.XMLWriter;
      import org.xml.sax.InputSource;
      import org.xml.sax.SAXException;

      //import org.w3c.tidy.Tidy

      public class JTidyProcessor implements IXHtmlProcessor {

          public String execute(String content) {
              try {
                  System.out.println("JTidyProcessor implements IXHtmlProcessor ");

                  // Step 1: run the raw content through JTidy.
                  org.w3c.tidy.Tidy tidy = new org.w3c.tidy.Tidy();
                  tidy.setIndentContent(true);
                  tidy.setXmlOut(false);
                  // tidy.setInputStreamName();

                  java.io.InputStream tidyIn = null;

                  // in = new java.io.ByteArrayInputStream(content.getBytes("UTF-8"));
                  tidyIn = new java.io.ByteArrayInputStream(content.getBytes());

                  System.out.println("JTidyProcessor implements IXHtmlProcessor: past inputstream ");

                  java.io.OutputStream tidyOut = new ByteArrayOutputStream();
                  // java.io.OutputStream out = new
                  // FileOutputStream("C:\\wh\\work\\tidyout.xml");
                  System.out.println("JTidyProcessor implements IXHtmlProcessor: past outputstream ");
                  tidy.parse(tidyIn, tidyOut);
                  System.out.println("JTidyProcessor implements IXHtmlProcessor: past parse ");
                  tidyOut.flush();
                  System.out.println("JTidyProcessor implements IXHtmlProcessor: past flush ");
                  //System.out.println(tidyOut.toString());

                  // Step 2: feed the JTidy output to TagSoup, as the stock processor does.
                  System.out.println("JTidyProcessor call tagsoup functionality ");
                  org.ccil.cowan.tagsoup.Parser parser = new org.ccil.cowan.tagsoup.Parser();
                  parser.setFeature("http://xml.org/sax/features/namespaces", false);
                  parser.setFeature(
                          "http://www.ccil.org/~cowan/tagsoup/features/default-attributes",
                          false);

                  StringWriter stringWriter = new java.io.StringWriter();
                  XMLWriter writer = new XMLWriter(stringWriter);
                  parser.setContentHandler(writer);
                  StringReader reader = new StringReader(tidyOut.toString());

                  InputSource in = new InputSource(reader);
                  parser.parse(in);
                  stringWriter.flush();

                  return stringWriter.toString();
              } catch (IOException e) {
                  throw new ParserException(e);
              } catch (SAXException e) {
                  throw new ParserException(e);
              }
          }

      }

       
      • Vladimir Nikic

        Vladimir Nikic - 2006-11-01

        HTML cleaning is the weakest part of Web-Harvest. I'm really not happy with TagSoup; it has some very serious weaknesses. In some cases it simply doesn't work well. For example, if there are DIV elements inside A, TagSoup pushes them out, so the original structure is changed. Furthermore, in some cases it produces strange sequences of formatting tags (B, EM, I...).
        I have tried some others, like JTidy and NekoHtml. None of them is satisfactory. JTidy has problems with scripts and, as I saw on its site, it is not maintained any more. The last version is from 2001!
        So I've chosen TagSoup as the least bad solution, but I'm seriously thinking about writing a new html-to-xhtml utility.

         
    • pete

      pete - 2006-11-03

      Is there any way to use a defined variable for the regexp-pattern element? I don't know if there is a mechanism to preprocess the config file so that the variable reference is not interpreted as the actual regular expression pattern.

      for example

      <var-def name="my_reg_exp_var">
      <template>&lt;td&gt;${passedParameterValue.toString()}&lt;/td&gt;</template>
      </var-def>

      <regexp-pattern>${my_reg_exp_var}</regexp-pattern>

      <regexp-pattern>&lt;td&gt;This text is known to caller and could be a parameter &lt;/td&gt;</regexp-pattern>

      <regexp-pattern>${my_reg_exp_var}</regexp-pattern>

      Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal repetition near index 0
      ${my_reg_exp_var} Note I changed the name of the var here to match my example above
      ^
      thanks,
      -Pete

       
    • pete

      pete - 2006-11-03

      DUH : <regexp-pattern><template>${my_reg_exp_var}</template></regexp-pattern>

      solves my problem!
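
The reason the bare ${my_reg_exp_var} fails appears to be that, without a template element, the text reaches the regex engine as is, where $ is an anchor and { opens a repetition count. A small sketch with plain java.util.regex shows the difference. Here buildPattern is a hypothetical stand-in for what the template produces, and Pattern.quote is an extra precaution (not part of the original config) for parameter text that may contain regex metacharacters.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PatternVariableDemo {
    // Returns true if the string is a syntactically valid Java regex.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Hypothetical stand-in for the evaluated template: wraps the caller's
    // text in <td>...</td> and quotes it so it is matched literally.
    static String buildPattern(String cellText) {
        return "<td>" + Pattern.quote(cellText) + "</td>";
    }

    public static void main(String[] args) {
        // The unevaluated variable reference is not a valid regex.
        System.out.println(compiles("${my_reg_exp_var}"));
        // The evaluated value is, and it matches the intended cell.
        String regex = buildPattern("a+b (c)");
        System.out.println(Pattern.matches(regex, "<td>a+b (c)</td>"));
    }
}
```

If the parameter is itself meant to be a regular expression rather than literal text, the quoting step should be dropped.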

       
