Re: [Htmlparser-developer] Tags
From: Derrick O. <der...@gm...> - 2010-09-11 20:09:24
Only composite tags are nested...
See http://htmlparser.sourceforge.net/faq.html#composite
So you would need to create a tag class derived from CompositeTag and add it
to the node factory, as outlined.

On Fri, Sep 10, 2010 at 8:14 PM, Elliot Huntington <ell...@gm...> wrote:

> When I reviewed the output of the program a little closer I realized that
> although the HtmlParser did recognize the "thisIsAMadeUpTag" as a tag, it
> did not properly nest the tag's content as children nodes.
>
> Is this expected because the tag is not a valid html tag or is this a bug?
>
> Maybe this is what you meant in your original email Enrique when you asked
> which tags are "analyzed" by the html parser?
>
> Here is the output from running the program. Notice that all the valid html
> tags are nested one level deeper than their parent tag. The "thisIsAMadeUpTag"
> tag's "should be" children are not nested one level deeper. Is this a bug or
> a feature?
>
> Tag (1[1,0],7[1,6]): html
> Txt (7[1,6],9[2,1]): \n\t
> Tag (9[2,1],15[2,7]): head
> Txt (15[2,7],18[3,2]): \n\t\t
> Tag (18[3,2],25[3,9]): title
> Txt (25[3,9],44[3,28]): Html Parser Example
> End (44[3,28],52[3,36]): /title
> Txt (52[3,36],54[4,1]): \n\t
> End (54[4,1],61[4,8]): /head
> Txt (61[4,8],63[5,1]): \n\t
> Tag (63[5,1],69[5,7]): body
> Txt (69[5,7],72[6,2]): \n\t\t
> Tag (72[6,2],75[6,5]): p
> Txt (75[6,5],81[6,11]): Hello
> Tag (81[6,11],87[6,17]): span
> Txt (87[6,17],92[6,22]): World
> End (92[6,22],99[6,29]): /span
> Txt (99[6,29],100[6,30]): !
> End (100[6,30],104[6,34]): /p
> Txt (104[6,34],107[7,2]): \n\t\t
> Tag (107[7,2],110[7,5]): p
> Tag (110[7,5],159[7,54]): thisIsAMadeUpTag name="don't try this at home!"
> Txt (159[7,54],195[7,90]): but html parser still understands it
> End (195[7,90],214[7,109]): /thisIsAMadeUpTag
> End (214[7,109],218[7,113]): /p
> Txt (218[7,113],220[8,1]): \n\t
> End (220[8,1],227[8,8]): /body
> Txt (227[8,8],228[9,0]): \n
> End (228[9,0],235[9,7]): /html
>
> On Fri, Sep 10, 2010 at 11:53 AM, Elliot Huntington <ell...@gm...> wrote:
>
>> I don't know exactly what you mean by "analyzes." But I think the answer
>> to your question is all of them.
>>
>> Here is an example that might help you get started. You'll want to make
>> sure you understand the various interfaces provided in the API (i.e. Node,
>> NodeFilter, etc.).
>>
>> import org.htmlparser.Parser;
>> import org.htmlparser.filters.NodeClassFilter;
>> import org.htmlparser.lexer.Lexer;
>> import org.htmlparser.lexer.Page;
>> import org.htmlparser.tags.Html;
>> import org.htmlparser.util.NodeList;
>> import org.htmlparser.util.ParserException;
>>
>> public class Example {
>>     public static void main(String... params) {
>>         // Parser parser = getParser(getHtml(), "UTF-8");
>>         Parser parser = getParser(getHtml());
>>
>>         try {
>>             NodeList list = parser.extractAllNodesThatMatch(
>>                     new NodeClassFilter(Html.class));
>>             for (int i = 0; i < list.size(); i++) {
>>                 Html html = (Html) list.elementAt(i);
>>                 System.out.println(html.toString());
>>             }
>>         } catch (ParserException e) {
>>             e.printStackTrace();
>>         }
>>     }
>>
>>     private static Parser getParser(String html, String charset) {
>>         return new Parser(new Lexer(new Page(html, charset)));
>>     }
>>
>>     private static Parser getParser(String html) {
>>         Parser parser = new Parser();
>>         try {
>>             parser.setInputHTML(html);
>>         } catch (ParserException e) {
>>             e.printStackTrace();
>>         }
>>         return parser;
>>     }
>>
>>     private static String getHtml() {
>>         return new StringBuilder()
>>             .append("\n<html>")
>>             .append("\n\t<head>")
>>             .append("\n\t\t<title>Html Parser Example</title>")
>>             .append("\n\t</head>")
>>             .append("\n\t<body>")
>>             .append("\n\t\t<p>Hello <span>World</span>!</p>")
>>             .append("\n\t\t<thisIsAMadeUpTag name=\"don't try this at home!\">but html parser still understands it</thisIsAMadeUpTag>")
>>             .append("\n\t</body>")
>>             .append("\n</html>")
>>             .toString();
>>     }
>> }
>>
>> On Fri, Sep 10, 2010 at 4:27 AM, Enrique Estelles <kik...@gm...> wrote:
>>
>>> Hello,
>>>
>>> can anybody tell me which html tags HtmlParser analyzes in order to
>>> extract text from a web page???
>>>
>>> Thank you!!!
>>>
>>> _______________________________________________
>>> Htmlparser-developer mailing list
>>> Htm...@li...
>>> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
>>
>> --
>> Elliot
>
> --
> Elliot
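
The rule stated in the reply at the top of this thread (only composite tags get children) can be illustrated with a small standalone Java model. This is only a sketch and not HTMLParser's implementation: `CompositeNestingSketch`, `render`, and the simplified token format are hypothetical names invented for illustration, and the set of names passed in merely stands in for the tag classes that have been added to the node factory.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Set;

// Toy model (assumption, not library code): a tag opens a nested scope only
// if its name is in the "composite" set, mimicking registered composite tags.
class CompositeNestingSketch {

    /** Renders a flat token stream as an outline indented by nesting depth. */
    static String render(List<String> tokens, Set<String> compositeNames) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // currently open composite tags
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1).toLowerCase();
                // An end tag closes a scope only if it matches the innermost open composite tag.
                if (!open.isEmpty() && open.peek().equals(name)) {
                    open.pop();
                }
                line(out, open.size(), "End: /" + name);
            } else if (token.startsWith("<")) {
                String name = token.substring(1, token.length() - 1).toLowerCase();
                line(out, open.size(), "Tag: " + name);
                // Unknown (non-composite) tags are emitted but never collect children.
                if (compositeNames.contains(name)) {
                    open.push(name);
                }
            } else {
                line(out, open.size(), "Txt: " + token);
            }
        }
        return out.toString();
    }

    private static void line(StringBuilder out, int depth, String text) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(text).append('\n');
    }

    public static void main(String[] args) {
        List<String> tokens = List.of(
                "<p>", "<thisIsAMadeUpTag>",
                "but html parser still understands it",
                "</thisIsAMadeUpTag>", "</p>");
        // Made-up tag not registered: its text stays at the same depth as the tag.
        System.out.println(render(tokens, Set.of("p")));
        // Made-up tag registered as composite: its text nests one level deeper.
        System.out.println(render(tokens, Set.of("p", "thisisamadeuptag")));
    }
}
```

With only `p` in the set, the made-up tag and its text print at the same depth, which matches the behavior Elliot reported. Adding the made-up name to the set pushes the text one level deeper, which is the effect a CompositeTag subclass added to the node factory is meant to have in the real parser.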