Re: [Htmlparser-developer] Tags
From: Elliot H. <ell...@gm...> - 2010-09-10 18:14:36
When I reviewed the output of the program more closely, I realized that
although HtmlParser did recognize "thisIsAMadeUpTag" as a tag, it
did not nest the tag's content as child nodes.
Is this expected because the tag is not a valid HTML tag, or is this a bug?
Maybe this is what you meant in your original email, Enrique, when you asked
which tags are "analyzed" by the HTML parser?
Here is the output from running the program. Notice that the children of every
valid HTML tag are nested one level deeper than their parent tag, while the
would-be children of "thisIsAMadeUpTag" are not. Is this a bug or a feature?
Tag (1[1,0],7[1,6]): html
Txt (7[1,6],9[2,1]): \n\t
Tag (9[2,1],15[2,7]): head
Txt (15[2,7],18[3,2]): \n\t\t
Tag (18[3,2],25[3,9]): title
Txt (25[3,9],44[3,28]): Html Parser Example
End (44[3,28],52[3,36]): /title
Txt (52[3,36],54[4,1]): \n\t
End (54[4,1],61[4,8]): /head
Txt (61[4,8],63[5,1]): \n\t
Tag (63[5,1],69[5,7]): body
Txt (69[5,7],72[6,2]): \n\t\t
Tag (72[6,2],75[6,5]): p
Txt (75[6,5],81[6,11]): Hello
Tag (81[6,11],87[6,17]): span
Txt (87[6,17],92[6,22]): World
End (92[6,22],99[6,29]): /span
Txt (99[6,29],100[6,30]): !
End (100[6,30],104[6,34]): /p
Txt (104[6,34],107[7,2]): \n\t\t
Tag (107[7,2],110[7,5]): p
Tag (110[7,5],159[7,54]): thisIsAMadeUpTag name="don't try this at home!"
Txt (159[7,54],195[7,90]): but html parser still understands it
End (195[7,90],214[7,109]): /thisIsAMadeUpTag
End (214[7,109],218[7,113]): /p
Txt (218[7,113],220[8,1]): \n\t
End (220[8,1],227[8,8]): /body
Txt (227[8,8],228[9,0]): \n
End (228[9,0],235[9,7]): /html
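
One way to picture the behavior above is a toy model in plain Java. This is only a sketch under an assumption, not HtmlParser's actual implementation: it assumes the parser nests children only under tags it knows to be containers and treats any other tag, along with its end tag, as a flat sibling. The class name `ToyNesting`, the `COMPOSITE` set, and the `Tag:`/`Txt:`/`End:` labels are all hypothetical and chosen just for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model only: nests children under "known" container tags and
// leaves unknown tags flat. Not the library's real algorithm.
class ToyNesting {
    // Hypothetical set of known container tags (assumption, not the library's list).
    static final Set<String> COMPOSITE = new HashSet<>(
            Arrays.asList("html", "head", "title", "body", "p", "span"));

    // Matches a start tag, an end tag, or a run of text.
    static final Pattern TOKEN =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|[^<]+");

    static String render(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // currently open known containers
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String name = m.group(2);
            if (name == null) { // text run
                String text = m.group().trim();
                if (!text.isEmpty()) {
                    line(out, open.size(), "Txt: " + text);
                }
                continue;
            }
            name = name.toLowerCase();
            boolean isEnd = !m.group(1).isEmpty();
            if (!isEnd) {
                line(out, open.size(), "Tag: " + name);
                // Only known containers push a new nesting level.
                if (COMPOSITE.contains(name)) {
                    open.push(name);
                }
            } else {
                // The end tag is printed at the children's depth.
                line(out, open.size(), "End: /" + name);
                // Only an end tag matching the innermost container closes it.
                if (name.equals(open.peek())) {
                    open.pop();
                }
            }
        }
        return out.toString();
    }

    // Appends one output line indented by two spaces per nesting level.
    private static void line(StringBuilder out, int depth, String text) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(text).append('\n');
    }

    public static void main(String[] args) {
        System.out.print(render(
                "<p>Hello <span>World</span>!</p>"
                + "<p><thisIsAMadeUpTag name=\"x\">text</thisIsAMadeUpTag></p>"));
    }
}
```

In this model the text inside `span` is printed one level deeper than the `span` tag itself, while the text inside `thisIsAMadeUpTag` stays at the same depth as the made-up tag, which mirrors the difference described above. If the real library works along these lines, the flat output would be expected behavior for unregistered tags rather than a bug, but that is a guess and not something this sketch can confirm.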
On Fri, Sep 10, 2010 at 11:53 AM, Elliot Huntington <ell...@gm...> wrote:
> I don't know exactly what you mean by "analyzes." But I think the answer to
> your question is all of them.
>
> Here is an example that might help you get started. You'll want to make
> sure you understand the various interfaces provided in the API (e.g., Node,
> NodeFilter).
>
> import org.htmlparser.Parser;
> import org.htmlparser.filters.NodeClassFilter;
> import org.htmlparser.lexer.Lexer;
> import org.htmlparser.lexer.Page;
> import org.htmlparser.tags.Html;
> import org.htmlparser.util.NodeList;
> import org.htmlparser.util.ParserException;
>
> public class Example {
>     public static void main(String... params) {
>         // Alternative: Parser parser = getParser(getHtml(), "UTF-8");
>         Parser parser = getParser(getHtml());
>
>         try {
>             // Collect every <html> node and print its subtree.
>             NodeList list = parser.extractAllNodesThatMatch(
>                     new NodeClassFilter(Html.class));
>             for (int i = 0; i < list.size(); i++) {
>                 Html html = (Html) list.elementAt(i);
>                 System.out.println(html.toString());
>             }
>         } catch (ParserException e) {
>             e.printStackTrace();
>         }
>     }
>
>     // Builds a parser over an in-memory page with an explicit charset.
>     private static Parser getParser(String html, String charset) {
>         return new Parser(new Lexer(new Page(html, charset)));
>     }
>
>     // Builds a parser from an HTML string using the default settings.
>     private static Parser getParser(String html) {
>         Parser parser = new Parser();
>         try {
>             parser.setInputHTML(html);
>         } catch (ParserException e) {
>             e.printStackTrace();
>         }
>         return parser;
>     }
>
>     // Sample document containing one tag that is not valid HTML.
>     private static String getHtml() {
>         return new StringBuilder()
>                 .append("\n<html>")
>                 .append("\n\t<head>")
>                 .append("\n\t\t<title>Html Parser Example</title>")
>                 .append("\n\t</head>")
>                 .append("\n\t<body>")
>                 .append("\n\t\t<p>Hello <span>World</span>!</p>")
>                 .append("\n\t\t<thisIsAMadeUpTag name=\"don't try this at home!\">"
>                         + "but html parser still understands it</thisIsAMadeUpTag>")
>                 .append("\n\t</body>")
>                 .append("\n</html>")
>                 .toString();
>     }
> }
>
>
>
>
> On Fri, Sep 10, 2010 at 4:27 AM, Enrique Estelles <kik...@gm...> wrote:
>
>> Hello,
>>
>> can anybody tell me which html tags HtmlParser analyzes in order to
>> extract text from a web page???
>>
>> Thank you!!!
>>
>
--
Elliot