Thread: [Htmlparser-user] HTML parser for HTML translation
From: Aminudin K. <ami...@mi...> - 2003-01-24 12:14:35
|
Hi guys, I'm very new to this forum. Hello everybody .... :) I'm looking for tools/libraries that can be used as an HTML parser. I found this HTMLParser on SourceForge and I hope it can help me develop an HTML translation module. What I want to do is parse HTML code, translate the content, and then put the translated text/content back into the original HTML structure. Is this HTML parser suitable for this kind of task? |
From: Somik R. <so...@ya...> - 2003-01-24 18:03:37
|
> What I want to do is to parse HTML code and
> translate the content and
> the put the translated text/content back into the
> original HTML structure.
>
> Does this HTML parser suitable of doing this kind of
> task ?

By translating content, I guess you mean translation of meaningful text data (not tags). That is easily possible. You can look at the StringExtractor example (org.htmlparser.parserapplications) or the StringFindingVisitor (org.htmlparser.visitors).

The simplest approach is to write your own visitor - StringTranslatingVisitor, that runs through the entire html, and wherever it finds strings, these are translated as per your wishes.

Here is a sample program :

import org.htmlparser.HTMLRemarkNode;
import org.htmlparser.HTMLStringNode;
import org.htmlparser.tags.HTMLEndTag;
import org.htmlparser.tags.HTMLTag;

public class StringTranslatingVisitor extends HTMLVisitor {
    StringBuffer htmlData = new StringBuffer();

    public void visitStringNode(HTMLStringNode stringNode) {
        String yourStuff="";
        // Perform modifications here.
        // finally, add to htmlData
        htmlData.append(yourStuff);
    }

    public void visitEndTag(HTMLEndTag endTag) {
        htmlData.append(endTag.toHTML());
    }

    public void visitTag(HTMLTag tag) {
        htmlData.append(tag.toHTML());
    }

    public String getHtml() {
        return htmlData.toString();
    }

    public void visitRemarkNode(HTMLRemarkNode remarkNode) {
        htmlData.append(remarkNode.toHTML());
    }
}

To use this, create your parser -

HTMLParser parser = new HTMLParser("http://someurl.com");
parser.registerScanners();
StringTranslatingVisitor visitor = new StringTranslatingVisitor();
parser.visitAllNodesWith(visitor);
System.out.println(visitor.getHTML());

Regards,
Somik

__________________________________________________
Do you Yahoo!?
Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com |
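The visitor idea described above (translate the text, pass tags through untouched, rebuild the HTML in order) can be tried out without the library. What follows is only a minimal sketch under stated assumptions: Node, NodeVisitor, TextNode, MarkupNode, TranslatingVisitor and TranslationSketch are hypothetical stand-ins invented for this illustration, not htmlparser classes, and upper-casing stands in for a real translation step.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// NOTE: every type in this file is a hypothetical stand-in for illustration;
// none of them is part of the htmlparser library.
class TranslationSketch {
    // Visit the nodes in document order and return the rebuilt HTML.
    static String translate(List<Node> nodes, UnaryOperator<String> translator) {
        TranslatingVisitor visitor = new TranslatingVisitor(translator);
        for (Node node : nodes) {
            node.accept(visitor);
        }
        return visitor.getHtml();
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
                new MarkupNode("<p>"), new TextNode("hello"), new MarkupNode("</p>"));
        // Upper-casing is a placeholder for the real translation.
        System.out.println(translate(nodes, s -> s.toUpperCase(Locale.ROOT)));
        // prints <p>HELLO</p>
    }
}

interface NodeVisitor {
    void visitText(String text);
    void visitMarkup(String rawHtml);
}

interface Node {
    void accept(NodeVisitor visitor);
}

// Plain text between tags: the only thing that gets translated.
final class TextNode implements Node {
    private final String text;
    TextNode(String text) { this.text = text; }
    public void accept(NodeVisitor visitor) { visitor.visitText(text); }
}

// Tags, end tags and comments: copied to the output unchanged.
final class MarkupNode implements Node {
    private final String rawHtml;
    MarkupNode(String rawHtml) { this.rawHtml = rawHtml; }
    public void accept(NodeVisitor visitor) { visitor.visitMarkup(rawHtml); }
}

final class TranslatingVisitor implements NodeVisitor {
    private final StringBuilder html = new StringBuilder();
    private final UnaryOperator<String> translator;
    TranslatingVisitor(UnaryOperator<String> translator) { this.translator = translator; }
    public void visitText(String text) { html.append(translator.apply(text)); }
    public void visitMarkup(String rawHtml) { html.append(rawHtml); }
    String getHtml() { return html.toString(); }
}
```

Because markup is appended exactly as it was read, the original structure survives and only the text between tags changes.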
From: Aminudin K. <ami...@mi...> - 2003-01-30 08:00:51
|
Hi, thank you for giving a sample program.

I tried to compile the program, but javac couldn't find the HTMLVisitor class. There are some other errors too. The errors and the code are below.

Errors :

StringTranslatingVisitor.java:1: cannot resolve symbol
symbol  : class visitors
location: package htmlparser
import org.htmlparser.visitors;
                     ^
StringTranslatingVisitor.java:9: cannot resolve symbol
symbol  : class HTMLVisitor
location: class StringTranslatingVisitor
public class StringTranslatingVisitor extends HTMLVisitor{
                                              ^
StringTranslatingVisitor.java:39: cannot resolve symbol
symbol  : method visitAllNodesWith (StringTranslatingVisitor)
location: class org.htmlparser.HTMLParser
        parser.visitAllNodesWith(visitor);
              ^
StringTranslatingVisitor.java:40: cannot resolve symbol
symbol  : method getHTML ()
location: class StringTranslatingVisitor
        System.out.println(visitor.getHTML());

Code :

import org.htmlparser.HTMLParser;
import org.htmlparser.HTMLRemarkNode;
import org.htmlparser.HTMLStringNode;
import org.htmlparser.tags.HTMLEndTag;
import org.htmlparser.tags.HTMLTag;

public class StringTranslatingVisitor extends HTMLVisitor{
    StringBuffer htmlData = new StringBuffer();

    public void visitStringNode(HTMLStringNode stringNode) {
        String yourStuff="TextToBeTranslated";
        // Perform modifications here.
        // finally, add to htmlData
        htmlData.append(yourStuff);
    }

    public void visitEndTag(HTMLEndTag endTag) {
        htmlData.append(endTag.toHTML());
    }

    public void visitTag(HTMLTag tag) {
        htmlData.append(tag.toHTML());
    }

    public String getHtml() {
        return htmlData.toString();
    }

    public void visitRemarkNode(HTMLRemarkNode remarkNode) {
        htmlData.append(remarkNode.toHTML());
    }

    public static void main(String args[]){
        HTMLParser parser = new HTMLParser("http://www.yahoo.com");
        parser.registerScanners();
        StringTranslatingVisitor visitor = new StringTranslatingVisitor();
        parser.visitAllNodesWith(visitor);
        System.out.println(visitor.getHTML());
    }
}
|
From: Somik R. <so...@ya...> - 2003-01-30 17:44:46
|
You might have an older version of the parser. Make sure you have the latest integration release, 1.3-20030125.

Regards,
Somik |
From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-01-31 12:26:54
|
Hi there, I might want to use this for my project, which is to extract Java quizzes that are either in a <form> or in plain HTML. Can this package be used to extract the related content? And how can I use the get method, given that Java quizzes with a <form> require the user to POST answers for the quiz before I can obtain the relevant answers? I want to harvest both the questions and the answers. Here are some samples, if you don't mind taking a look :)

http://developer.java.sun.com/developer/quizzes/jbasic1-1
http://wwww.angelfire.com/or/abhilash/main.html

I really hope you can help me with this matter, thank you :) |
From: Somik R. <so...@ya...> - 2003-01-31 17:59:20
|
I couldn't see either of the URLs you sent. But what you want to do is possible with the parser. Sending POST requests is a new feature in 1.3 (get the latest integration release). From the testcases, here's a sample (showing creation of the parser):

<code>
// Open the connection and configure it for a POST request.
url = new URL ("http://www.canadapost.ca/tools/pcl/bin/cp_search_response-e.asp");
connection = (HttpURLConnection)url.openConnection ();
connection.setRequestMethod ("POST");
connection.setRequestProperty ("Referer", "http://www.canadapost.ca/tools/pcl/bin/default-e.asp");
connection.setDoOutput (true);
connection.setDoInput (true);
connection.setUseCaches (false);

// Build the form body as name=value pairs joined by "&".
buffer = new StringBuffer (1024);
buffer.append ("app_language=");
buffer.append ("english");
buffer.append ("&");
buffer.append ("app_response_start_row_number=");
buffer.append ("1");
buffer.append ("&");
buffer.append ("app_response_rows_max=");
buffer.append ("9");
buffer.append ("&");
buffer.append ("app_source=");
buffer.append ("quick");
buffer.append ("&");
buffer.append ("query_source=");
buffer.append ("q");
buffer.append ("&");
buffer.append ("name=");
buffer.append ("&");
buffer.append ("postal_code=");
buffer.append ("&");
buffer.append ("directory_area_name=");
buffer.append ("&");
buffer.append ("delivery_mode=");
buffer.append ("&");
buffer.append ("Suffix=");
buffer.append ("&");
buffer.append ("street_direction=");
buffer.append ("&");
buffer.append ("installation_type=");
buffer.append ("&");
buffer.append ("delivery_number=");
buffer.append ("&");
buffer.append ("installation_name=");
buffer.append ("&");
buffer.append ("unit_numbere=");
buffer.append ("&");
buffer.append ("app_state=");
buffer.append ("production");
buffer.append ("&");
buffer.append ("street_number=");
buffer.append (number);
buffer.append ("&");
buffer.append ("street_name=");
buffer.append (street);
buffer.append ("&");
buffer.append ("street_type=");
buffer.append (type);
buffer.append ("&");
buffer.append ("test=");
buffer.append ("&");
buffer.append ("city=");
buffer.append (city);
buffer.append ("&");
buffer.append ("prov=");
buffer.append (province);
buffer.append ("&");
buffer.append ("Search=");

// Send the body, then hand the connection to the parser.
out = new PrintWriter (connection.getOutputStream ());
out.print (buffer);
out.close ();
parser = new HTMLParser (connection);
</code>

Regards,
Somik |
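The request body in the sample above is a sequence of name=value pairs joined by "&", with the values appended as-is. As a hedged sketch (not taken from the htmlparser test cases), the same kind of body can be built from a map using only the standard library, with java.net.URLEncoder escaping spaces and reserved characters. The FormBody class and the field values "Main Street" and "Ottawa" are assumptions made for illustration; the field names are borrowed from the sample.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of htmlparser.
class FormBody {
    // Join the fields as name=value pairs separated by '&',
    // URL-encoding both names and values.
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(enc(field.getKey())).append('=').append(enc(field.getValue()));
        }
        return body.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("app_language", "english");
        fields.put("street_name", "Main Street"); // illustrative value
        fields.put("city", "Ottawa");             // illustrative value
        System.out.println(encode(fields));
        // prints app_language=english&street_name=Main+Street&city=Ottawa
    }
}
```

The resulting string could then be written to connection.getOutputStream() in the same way the sample writes its StringBuffer.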
From: Aminudin K. <ami...@mi...> - 2003-02-06 07:13:42
|
Hi,

Currently I am testing HTMLParser for my HTML translation engine. FYI, I am using the latest integration module, Version 1.3 dated 3 February, 2003.

I had a problem using htmlparser.jar: the compiler couldn't find HTMLVisitor (I mean org.htmlparser.visitors), but it could find HTMLParser. Does this mean that HTMLVisitor is not included in the pre-compiled binary that comes with the integration release? If recompiling is the answer, then I have to learn Ant.

Thanks for the support :)

--------------------Error---------------------------
htmlTrans.java:10: cannot resolve symbol
symbol  : class visitors
location: package htmlparser
import org.htmlparser.visitors;
                     ^
htmlTrans.java:17: cannot resolve symbol
symbol  : class TextExtractingVisitor
location: class htmlTrans
            TextExtractingVisitor visitor = new TextExtractingVisitor();
            ^
htmlTrans.java:17: cannot resolve symbol
symbol  : class TextExtractingVisitor
location: class htmlTrans
            TextExtractingVisitor visitor = new TextExtractingVisitor();
                                                ^
3 errors
----------------------------------------------------------------------------------------

Below is the code:

import java.util.*;
import java.io.*;

import org.htmlparser.HTMLParser;
import org.htmlparser.HTMLRemarkNode;
import org.htmlparser.HTMLStringNode;
import org.htmlparser.tags.HTMLEndTag;
import org.htmlparser.tags.HTMLTag;
import org.htmlparser.util.HTMLParserException;
import org.htmlparser.visitors;

public class htmlTrans {
    public static void main(String args[]){
        try {
            HTMLParser parser = new HTMLParser("http://www.yahoo.com");
            TextExtractingVisitor visitor = new TextExtractingVisitor();
            parser.visitAllNodesWith(visitor);
        }catch (HTMLParserException e){
            System.out.println("Error");
        }
    }
}
|
From: Somik R. <so...@ya...> - 2003-02-07 05:41:50
|
Aminudin Khalid writes:
> Currently I am testing HTMLParser for my HTML translation engine. FYI, I
> am using the latest integration module , Version 1.3 dated 3 February, 2003.
>
> I had problem when using htmlparser.jar , it couldn't find
> HTMLVisitor(I mean org.htmlparser.visitors) but it could find
> HTMLParser. Does this means that HTMLVisitor is not included in the
> pre-compiled binary that comes along with the integration release ?

It is - I just cross-checked, HTMLVisitor is very much a part of the release. Pls verify again. (It is in lib/htmlparser.jar)

Regards,
Somik |
From: Aminudin K. <ami...@mi...> - 2003-02-07 09:04:12
|
Hi,

You're right, HTMLVisitor does exist in htmlparser.jar. Many strange things happened during compilation, but I've managed to reduce the number of errors.

Could you guys help me analyze what is wrong in the following code? In HTMLParser there is a method called visitAllNodesWith(visitor). The argument's type is HTMLVisitor. However, the following class passes a StringTranslatingVisitor, which extends HTMLVisitor, as the argument. javac keeps complaining about this.

Your help is appreciated. Thanks

p/s : Notice that I've commented out "import org.htmlparser.visitors". I couldn't compile if I included this line. (Any reason/idea?) FYI, my development platform is Linux.

--------------- Error ------------------------------------------
StringTranslatingVisitor.java:45: visitAllNodesWith(org.htmlparser.visitors.HTMLVisitor) in org.htmlparser.HTMLParser cannot be applied to (StringTranslatingVisitor)
            parser.visitAllNodesWith(visitor);
                  ^
1 error
-----------------------------------------------------------------

-----------------------------------JAVA Code -------------------------
import org.htmlparser.HTMLParser;
import org.htmlparser.HTMLRemarkNode;
import org.htmlparser.HTMLStringNode;
import org.htmlparser.tags.HTMLEndTag;
import org.htmlparser.tags.HTMLTag;
import org.htmlparser.util.HTMLParserException;
//import org.htmlparser.visitors;

public class StringTranslatingVisitor extends HTMLVisitor{
    StringBuffer htmlData = new StringBuffer();

    public void visitStringNode(HTMLStringNode stringNode) {
        String yourStuff="htmlTrans";
        // Perform modifications here.
        // finally, add to htmlData
        htmlData.append(yourStuff);
    }

    public void visitEndTag(HTMLEndTag endTag) {
        htmlData.append(endTag.toHTML());
    }

    public void visitTag(HTMLTag tag) {
        htmlData.append(tag.toHTML());
    }

    public String getHtml() {
        return htmlData.toString();
    }

    public void visitRemarkNode(HTMLRemarkNode remarkNode) {
        htmlData.append(remarkNode.toHTML());
    }

    public static void main(String args[]){
        try{
            HTMLParser parser = new HTMLParser("http://www.yahoo.com");
            parser.registerScanners();
            StringTranslatingVisitor visitor = new StringTranslatingVisitor();
            parser.visitAllNodesWith(visitor);
        }catch (HTMLParserException e){
            System.out.println("error :) ");
        }
    }
}
---------------------------------------------------------------------------
|
From: Somik R. <so...@ya...> - 2003-02-07 19:25:27
|
--- Aminudin Khalid <ami...@mi...> wrote:
> Could u guys help me analyzing what is wrong in the
> following codes. In
> HTMParser there is a method called
> *visitAllNodesWith(visitor)* . The
> argument's type is *HTMLVisitor. *However, the
> following class use
> StringTranslatingVisitor which extends HTMLVisitor
> as an argument. JAVAC
> keeps complaining me about this.
>
> Your help is appreciated. Thanks
>
> p/s : Notice that I've commented out "import
> org.htmlparser.visitors".

You have to import like this :

import org.htmlparser.visitors.*;

or

import org.htmlparser.visitors.HTMLVisitor;

Regards,
Somik |
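The rule behind the two import forms above is a general Java one: an import must either name a single type or end in ".*" to pull in every public type of one package, and a bare package name is rejected by the compiler. Here is a small sketch of the same rule using java.util, so that it compiles without htmlparser on the classpath. The ImportSketch class and its visitorNames method are hypothetical names used only for this illustration.

```java
// import java.util;          // does not compile: "java.util" names a package, not a type
import java.util.ArrayList;   // single-type import: names exactly one class
import java.util.*;           // on-demand import: every public type in java.util

// Hypothetical class for illustration only.
class ImportSketch {
    static List<String> visitorNames() {
        // List resolves through the on-demand import, ArrayList through either one.
        List<String> names = new ArrayList<>();
        names.add("HTMLVisitor");
        return names;
    }

    public static void main(String[] args) {
        System.out.println(visitorNames());
        // prints [HTMLVisitor]
    }
}
```

Note that an on-demand import covers only the named package itself, not its subpackages, so importing org.htmlparser.* would not make the classes in org.htmlparser.visitors visible.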
From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-02-08 16:52:22
|
Hi there, I know this may sound stupid, but I want a program that, whenever it reads an <li> or <l0> tag, adds something such as "1" to the elements in the HTMLNode vector and increments it each time it sees the tag. Another question: how can I use the Translate class to translate all the codes such as &nbsp; into "." and that kind of stuff? |