Re: [Htmlparser-developer] Tags

Only composite tags are nested...
See http://htmlparser.sourceforge.net/faq.html#composite
So you would need to create a tag class derived from CompositeTag and add
it to the node factory, as outlined.
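
A minimal sketch of what that could look like, assuming the library's CompositeTag, PrototypicalNodeFactory.registerTag() and Parser.setNodeFactory() behave as the FAQ describes. The names MadeUpTag and MadeUpTagExample are hypothetical, made up for this illustration, and the htmlparser jar has to be on the classpath:

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.tags.CompositeTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class MadeUpTagExample {

    /** Hypothetical composite tag for the made-up element. */
    public static class MadeUpTag extends CompositeTag {
        // Tag names this class answers to; the factory looks names up in upper case.
        private static final String[] IDS = { "THISISAMADEUPTAG" };

        @Override
        public String[] getIds() {
            return IDS;
        }
    }

    /** Parses the given HTML with MadeUpTag registered alongside the default tags. */
    public static NodeList parse(String html) throws ParserException {
        Parser parser = new Parser();
        parser.setInputHTML(html);

        // The default factory already knows the standard tags; add ours to it.
        PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
        factory.registerTag(new MadeUpTag());
        parser.setNodeFactory(factory);

        return parser.parse(null); // null filter: return all top-level nodes
    }

    public static void main(String[] args) throws ParserException {
        NodeList nodes = parse(
            "<thisIsAMadeUpTag name=\"demo\">some text</thisIsAMadeUpTag>");
        Node first = nodes.elementAt(0);
        System.out.println(first.getClass().getName());

        // With the tag registered as composite, its content should show up as children.
        NodeList children = first.getChildren();
        System.out.println(children == null ? 0 : children.size());
    }
}
```

Without the registerTag() call, the made-up element is still recognized as a tag, but its content is left as sibling nodes rather than children, which matches the output quoted below.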

On Fri, Sep 10, 2010 at 8:14 PM, Elliot Huntington <
ell...@gm...> wrote:

> When I reviewed the output of the program a little closer I realized that
> although the HtmlParser did recognize the "thisIsAMadeUpTag" as a tag, it
> did not properly nest the tag's content as child nodes.
>
> Is this expected because the tag is not a valid html tag or is this a bug?
>
> Maybe this is what you meant in your original email Enrique when you asked
> which tags are "analyzed" by the html parser?
>
> Here is the output from running the program. Notice that the contents of
> all the valid html tags are nested one level deeper than their parent tag.
> The "thisIsAMadeUpTag" tag's "should be" children are not nested one level
> deeper. Is this a bug or a feature?
>
> Tag (1[1,0],7[1,6]): html
>   Txt (7[1,6],9[2,1]): \n\t
>   Tag (9[2,1],15[2,7]): head
>     Txt (15[2,7],18[3,2]): \n\t\t
>     Tag (18[3,2],25[3,9]): title
>       Txt (25[3,9],44[3,28]): Html Parser Example
>       End (44[3,28],52[3,36]): /title
>     Txt (52[3,36],54[4,1]): \n\t
>     End (54[4,1],61[4,8]): /head
>   Txt (61[4,8],63[5,1]): \n\t
>   Tag (63[5,1],69[5,7]): body
>     Txt (69[5,7],72[6,2]): \n\t\t
>     Tag (72[6,2],75[6,5]): p
>       Txt (75[6,5],81[6,11]): Hello
>       Tag (81[6,11],87[6,17]): span
>         Txt (87[6,17],92[6,22]): World
>         End (92[6,22],99[6,29]): /span
>       Txt (99[6,29],100[6,30]): !
>       End (100[6,30],104[6,34]): /p
>     Txt (104[6,34],107[7,2]): \n\t\t
>     Tag (107[7,2],110[7,5]): p
>       Tag (110[7,5],159[7,54]): thisIsAMadeUpTag name="don't try this at home!"
>       Txt (159[7,54],195[7,90]): but html parser still understands it
>       End (195[7,90],214[7,109]): /thisIsAMadeUpTag
>       End (214[7,109],218[7,113]): /p
>     Txt (218[7,113],220[8,1]): \n\t
>     End (220[8,1],227[8,8]): /body
>   Txt (227[8,8],228[9,0]): \n
>   End (228[9,0],235[9,7]): /html
>
>
>
>
> On Fri, Sep 10, 2010 at 11:53 AM, Elliot Huntington <
> ell...@gm...> wrote:
>
>> I don't know exactly what you mean by "analyzes." But I think the answer
>> to your question is all of them.
>>
>> Here is an example that might help you get started. You'll want to make
>> sure you understand the various interfaces provided in the API (e.g. Node,
>> NodeFilter).
>>
>> import org.htmlparser.Parser;
>> import org.htmlparser.filters.NodeClassFilter;
>> import org.htmlparser.lexer.Lexer;
>> import org.htmlparser.lexer.Page;
>> import org.htmlparser.tags.Html;
>> import org.htmlparser.util.NodeList;
>> import org.htmlparser.util.ParserException;
>>
>> public class Example {
>>     public static void main(String... params) {
>> //        Parser parser = getParser(getHtml(), "UTF-8");
>>         Parser parser = getParser(getHtml());
>>
>>         try {
>>             NodeList list = parser.extractAllNodesThatMatch(
>>                     new NodeClassFilter(Html.class));
>>             for(int i = 0; i < list.size(); i++) {
>>                 Html html = (Html) list.elementAt(i);
>>                 System.out.println(html.toString());
>>             }
>>         } catch(ParserException e) {
>>             e.printStackTrace();
>>         }
>>
>>     }
>>
>>     private static Parser getParser(String html, String charset) {
>>         return new Parser(new Lexer(new Page(html, charset)));
>>     }
>>
>>     private static Parser getParser(String html) {
>>         Parser parser = new Parser();
>>         try {
>>             parser.setInputHTML(html);
>>         } catch(ParserException e) {
>>             e.printStackTrace();
>>         }
>>         return parser;
>>     }
>>
>>     private static String getHtml() {
>>         return new StringBuilder()
>>             .append("\n<html>")
>>             .append("\n\t<head>")
>>             .append("\n\t\t<title>Html Parser Example</title>")
>>             .append("\n\t</head>")
>>             .append("\n\t<body>")
>>             .append("\n\t\t<p>Hello <span>World</span>!</p>")
>>             .append("\n\t\t<thisIsAMadeUpTag name=\"don't try this at home!\">")
>>             .append("but html parser still understands it</thisIsAMadeUpTag>")
>>             .append("\n\t</body>")
>>             .append("\n</html>")
>>             .toString();
>>     }
>> }
>>
>>
>>
>>
>> On Fri, Sep 10, 2010 at 4:27 AM, Enrique Estelles <kik...@gm...
>> > wrote:
>>
>>> Hello,
>>>
>>> can anybody tell me which html tags HtmlParser analyzes in order to
>>> extract text from a web page???
>>>
>>> Thank you!!!
>>>
>>>
>>> _______________________________________________
>>> Htmlparser-developer mailing list
>>> Htm...@li...
>>> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
>>>
>>>
>>
>>
>> --
>> Elliot
>>
>
>
>
> --
> Elliot
>
>
>
>