htmlparser-user Mailing List for HTML Parser (Page 12)

Brought to you by: derrickoswald

htmlparser-user — The user mailing list for users of the htmlparser library


Re: [Htmlparser-user] Need Help

From: Derrick O. <der...@gm...> - 2011-01-13 12:56:10

You could have a look at the Thumbelina example that does that at the lexer
level.
On Jan 13, 2011 7:39 AM, "mani kandan" <maj...@gm...> wrote:
> Hi,
>
> I am new to the HTML parser. I am trying to use the parser to grab all the
> images referenced in the page. Can you please show me how to do that?
>
> --
> ALWAYS KEEP SMILING
>
> FOR U EVER,
>
> G.MANIKANDAN

[Htmlparser-user] Need Help

From: mani k. <maj...@gm...> - 2011-01-13 06:38:44

Hi,

I am new to the HTML parser. I am trying to use the parser to grab all the
images referenced in the page. Can you please show me how to do that?

-- 
ALWAYS KEEP SMILING

 FOR U EVER,

     G.MANIKANDAN

Re: [Htmlparser-user] XML parsing and NULL children

From: Derrick O. <der...@gm...> - 2010-12-15 18:44:01

If you want these tags to contain children you have to do more work and make
them into composite tags:
see http://htmlparser.sourceforge.net/faq.html#composite
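
Since try.xml is well-formed XML, one alternative to registering composite tags is to read it with the JDK's own DOM parser, where child elements are available without extra setup. This is a swapped-in technique, not the HTMLParser approach from the FAQ. The class name FieldChildren and the inline sample are assumptions made for illustration, and the DOCTYPE line is left out so that no DTD lookup is attempted.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList; // the DOM NodeList, not org.htmlparser.util.NodeList
import org.xml.sax.InputSource;

// Hypothetical helper: lists the child element names of every <tagName> element.
class FieldChildren {

    static List<String> childNames(String xml, String tagName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            NodeList matches = doc.getElementsByTagName(tagName);
            for (int i = 0; i < matches.getLength(); i++) {
                NodeList children = matches.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    // Skip text nodes; keep only real child elements.
                    if (child instanceof Element) {
                        names.add(child.getNodeName());
                    }
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<site>"
                + "<field id=\"image\" type=\"attribute\">"
                + "<caption>a caption</caption><attribute>src</attribute>"
                + "</field>"
                + "<field id=\"description\" type=\"text\">"
                + "<caption>text text</caption>"
                + "</field>"
                + "</site>";
        System.out.println(childNames(xml, "field"));
    }
}
```

DOM element names are case-sensitive, so "field" has to be passed exactly as it is written in the file.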

On Wed, Dec 15, 2010 at 11:58 AM, Francesco Fontana <
fra...@gm...> wrote:

> Thank you for the answer. I've tried it, but nothing changes.
> The program finds all the right nodes (with both upper and lower
> case), but for every one of them node.getChildren() == null.
>
> > Try upper case tag names.
> > NodeFilter singleFieldFilter=new TagNameFilter("FIELD");
> > NodeFilter multipleFieldFilter=new TagNameFilter("FIELD_LIST");
>
>
> Thanks a lot. Any more suggestions?
>
> Francesco
>
>

Re: [Htmlparser-user] XML parsing and NULL children

From: Francesco F. <fra...@gm...> - 2010-12-15 10:58:40

Thank you for the answer. I've tried it, but nothing changes.
The program finds all the right nodes (with both upper and lower
case), but for every one of them node.getChildren() == null.

> Try upper case tag names.
> NodeFilter singleFieldFilter=new TagNameFilter("FIELD");
> NodeFilter multipleFieldFilter=new TagNameFilter("FIELD_LIST");


Thanks a lot. Any more suggestions?

Francesco

Re: [Htmlparser-user] XML parsing and NULL children

From: Derrick O. <der...@gm...> - 2010-12-14 17:55:06

Try upper case tag names.
        NodeFilter singleFieldFilter=new TagNameFilter("FIELD");
        NodeFilter multipleFieldFilter=new TagNameFilter("FIELD_LIST");


On Tue, Dec 14, 2010 at 11:18 AM, Francesco Fontana <
fra...@gm...> wrote:

> Hi,
> I'm trying to parse an XML file, but I receive the message "WARNING: URL
> [filename] does not contain text".
> When I put a watch on a filter, every node found has its id and attributes,
> but its children are still null.
>
> The code is really simple, and the XML is well formed and has a DTD. Does
> anyone know what I'm doing wrong?
> Thank you very much,
> Francesco
>
> ------------------------- java code --------------------------
>      public FilterSet setBaseValues(String siteXMLName) throws ParserException {
>         NodeFilter singleFieldFilter=new TagNameFilter("field");
>         NodeFilter multipleFieldFilter=new TagNameFilter("field_list");
>         NodeList singleFieldList=new NodeList();
>         NodeList multipleFieldList=new NodeList();
>
>         Parser parser=new Parser("./"+siteXMLName);
>         for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
>             Node node = e.nextNode();
>             node.collectInto(singleFieldList, singleFieldFilter);
>             node.collectInto(multipleFieldList, multipleFieldFilter);
>         }
> }
> ------------------------------------------------------------------
>
> ------------------------- try.xml content ----------------------
> <?xml version="1.0"?>
> <!DOCTYPE site SYSTEM "hsh.dtd">
> <site>
>     <field id="image" type="attribute">
>         <caption>a caption</caption>
>         <attribute>src</attribute>
>     </field>
>     <field id="description" type="text">
>         <caption>text text</caption>
>     </field>
>     <field_list list="parent_name" type="text">
>         <caption>another caption</caption>
>         <names>
>             <name id="field1">Field 1</name>
>             <name id="field2">Field 2</name>
>         </names>
>     </field_list>
> </site>
> ----------------------------------------------------------------
>
>

[Htmlparser-user] XML parsing and NULL children

From: Francesco F. <fra...@gm...> - 2010-12-14 10:18:58

Hi,
I'm trying to parse an XML file, but I receive the message "WARNING: URL
[filename] does not contain text".
When I put a watch on a filter, every node found has its id and attributes,
but its children are still null.

The code is really simple, and the XML is well formed and has a DTD. Does
anyone know what I'm doing wrong?
Thank you very much,
Francesco

------------------------- java code --------------------------
     public FilterSet setBaseValues(String siteXMLName) throws ParserException {
        NodeFilter singleFieldFilter=new TagNameFilter("field");
        NodeFilter multipleFieldFilter=new TagNameFilter("field_list");
        NodeList singleFieldList=new NodeList();
        NodeList multipleFieldList=new NodeList();

        Parser parser=new Parser("./"+siteXMLName);
        for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
            Node node = e.nextNode();
            node.collectInto(singleFieldList, singleFieldFilter);
            node.collectInto(multipleFieldList, multipleFieldFilter);
        }
}
------------------------------------------------------------------

------------------------- try.xml content ----------------------
<?xml version="1.0"?>
<!DOCTYPE site SYSTEM "hsh.dtd">
<site>
    <field id="image" type="attribute">
        <caption>a caption</caption>
        <attribute>src</attribute>
    </field>
    <field id="description" type="text">
        <caption>text text</caption>
    </field>
    <field_list list="parent_name" type="text">
        <caption>another caption</caption>
        <names>
            <name id="field1">Field 1</name>
            <name id="field2">Field 2</name>
        </names>
    </field_list>
</site>
----------------------------------------------------------------

Re: [Htmlparser-user] Excluding some tags

From: Derrick O. <der...@gm...> - 2010-11-17 17:35:31

That's not valid HTML. You'll want to turn strict script scanning off then.

On Wed, Nov 17, 2010 at 7:44 AM, Manish Kashyap <ma...@we...>wrote:

> Thanks for the revert Derrick. So, here's the real problem -
> I do want to retain the script tag. At the same time, I want to override
> all the links in the page. The parser doesn't play nice. Consider the
> scenario underneath for an html
>
> <script>
>   document.write("<a href='/jslink'>JS Link</a>")
> </script>
> <a href="/somelink">Some link</a>
>
>
> To me the string literal inside script tag above is not a link at all.
> However, when I try to fetch all the <a> using the parser it would give me
> both of the above. Is there a way to not get the <a>s which are not in the
> <script> tag?
>
> Thanks
> Manish
>

Re: [Htmlparser-user] Excluding some tags

From: Manish K. <ma...@we...> - 2010-11-17 06:46:34

Sorry, I am modifying my question; please ignore the previous one.

Is there a way to get the <a>s which are not in the <script> tag?

Thanks,
Manish

On Wed, Nov 17, 2010 at 12:14 PM, Manish Kashyap <ma...@we...>wrote:

> Thanks for the revert Derrick. So, here's the real problem -
> I do want to retain the script tag. At the same time, I want to override
> all the links in the page. The parser doesn't play nice. Consider the
> scenario underneath for an html
>
> <script>
>   document.write("<a href='/jslink'>JS Link</a>")
> </script>
> <a href="/somelink">Some link</a>
>
>
> To me the string literal inside script tag above is not a link at all.
> However, when I try to fetch all the <a> using the parser it would give me
> both of the above. Is there a way to not get the <a>s which are not in the
> <script> tag?
>
> Thanks
> Manish
>
>

Re: [Htmlparser-user] Excluding some tags

From: Manish K. <ma...@we...> - 2010-11-17 06:44:40

Thanks for the reply, Derrick. So, here's the real problem:
I do want to retain the script tag. At the same time, I want to override all
the links in the page. The parser doesn't play nice. Consider the following
HTML:

<script>
  document.write("<a href='/jslink'>JS Link</a>")
</script>
<a href="/somelink">Some link</a>

To me the string literal inside script tag above is not a link at all.
However, when I try to fetch all the <a> using the parser it would give me
both of the above. Is there a way to not get the <a>s which are not in the
<script> tag?

Thanks
Manish

On Tue, Nov 16, 2010 at 11:39 PM, Derrick Oswald
<der...@gm...>wrote:

> Although the filter is correct, the tag enclosing the <script> tag is
> accepted, and with it its child tags, including the <script> tag.
> Maybe a way to do it is to override the ScriptTag class with MyScriptTag so
> that it prints nothing in the toHtml () call.
> Add the overridden class to the PrototypicalNodeFactory as described
> here: http://htmlparser.sourceforge.net/faq.html#composite, and then get
> all tags and print the whole thing with
> System.out.println (this.parser.parse(null).toHtml ());
>

Re: [Htmlparser-user] Excluding some tags

From: Derrick O. <der...@gm...> - 2010-11-16 18:09:36

Although the filter is correct, the tag enclosing the <script> tag is
accepted, and with it its child tags, including the <script> tag.
Maybe a way to do it is to override the ScriptTag class with MyScriptTag so
that it prints nothing in the toHtml () call.
Add the overridden class to the PrototypicalNodeFactory as described here:
http://htmlparser.sourceforge.net/faq.html#composite, and then get all tags
and print the whole thing with
System.out.println (this.parser.parse(null).toHtml ());
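
The effect described above can be reproduced with a toy tree. The sketch below does not use HTMLParser classes: ToyNode, collect and toHtml are hypothetical stand-ins. They only assume that a filter is tested against each node individually, while printing a node also prints all of its children.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Toy stand-in for a parsed tag; not an HTMLParser class.
class ToyNode {
    final String name;
    final List<ToyNode> children;

    ToyNode(String name, ToyNode... children) {
        this.name = name;
        this.children = Arrays.asList(children);
    }

    // Collect this node and every descendant that the filter accepts.
    void collect(List<ToyNode> out, Predicate<ToyNode> filter) {
        if (filter.test(this)) {
            out.add(this);
        }
        for (ToyNode child : children) {
            child.collect(out, filter);
        }
    }

    // Printing a node always prints its children, accepted or not.
    String toHtml() {
        StringBuilder sb = new StringBuilder("<" + name + ">");
        for (ToyNode child : children) {
            sb.append(child.toHtml());
        }
        return sb.append("</" + name + ">").toString();
    }

    public static void main(String[] args) {
        ToyNode page = new ToyNode("body", new ToyNode("script"), new ToyNode("a"));
        Predicate<ToyNode> notScript = n -> !n.name.equalsIgnoreCase("script");

        List<ToyNode> accepted = new ArrayList<>();
        page.collect(accepted, notScript);

        // The <script> node itself is rejected, yet it still appears
        // inside the markup of its accepted parent.
        for (ToyNode n : accepted) {
            System.out.println(n.toHtml());
        }
    }
}
```

In this model the only way to keep the script out of the printed result is to change what the script node prints, which is the idea behind overriding toHtml () in a MyScriptTag.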

On Tue, Nov 16, 2010 at 8:19 AM, Manish Kashyap <ma...@we...>wrote:

> This is indeed a newbie question. I could not find a workaround to exclude
> some tags (<script> in my case) while parsing.
>
> I tried using the NotFilter as shown below, but it didn't work: I still got
> all the <script> tags in my NodeList.
>
>> NotFilter noScriptFilter = new NotFilter();
>> noScriptFilter.setPredicate(new NodeFilter(){
>>   public boolean accept(Node currNode){
>>     if(currNode instanceof TagNode){
>>       if(((TagNode)currNode).getRawTagName().equalsIgnoreCase("script")){
>>          return true;
>>       }
>>     }
>>     return false;
>>   }
>> });
>> NodeList allNodes = this.parser.parse(noScriptFilter);
>>
>
> Would appreciate it if someone could guide me through this.
>
> Thanks
> Manish
>
>

[Htmlparser-user] Excluding some tags

From: Manish K. <ma...@we...> - 2010-11-16 07:48:10

This is indeed a newbie question. I could not find a workaround to exclude
some tags (<script> in my case) while parsing.

I tried using the NotFilter as shown below, but it didn't work: I still got
all the <script> tags in my NodeList.

> NotFilter noScriptFilter = new NotFilter();
> noScriptFilter.setPredicate(new NodeFilter(){
>   public boolean accept(Node currNode){
>     if(currNode instanceof TagNode){
>       if(((TagNode)currNode).getRawTagName().equalsIgnoreCase("script")){
>          return true;
>       }
>     }
>     return false;
>   }
> });
> NodeList allNodes = this.parser.parse(noScriptFilter);
>

Would appreciate it if someone could guide me through this.

Thanks
Manish

Re: [Htmlparser-user] Best way to extract all the links from a HTML page

From: Derrick O. <der...@gm...> - 2010-10-13 05:33:41

If you set the document base href on the page (see how BaseHrefTag handles
it in doSemanticAction, basically page.setBaseUrl (base)), then the links
you get back can be 'canonized', as you call it, by using the page's
getAbsoluteURL (String link, boolean strict) method.
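
What resolving a link against a base URL amounts to can be shown with the JDK alone. This sketch uses java.net.URI.resolve as a stand-in, not HTMLParser's Page class; the class name LinkResolver and the example base URL are assumptions made for illustration.

```java
import java.net.URI;

// Hypothetical helper that resolves an href against the page's base URL.
class LinkResolver {

    static String toAbsolute(String baseUrl, String href) {
        // URI.resolve applies the standard relative-reference resolution rules.
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.foodsite.com/home/index.html";
        // Root-relative href: replaces the whole path of the base.
        System.out.println(toAbsolute(base, "/food/fruits/2"));
        // Plain relative href: resolved against the /home/ directory.
        System.out.println(toAbsolute(base, "fruits/2"));
    }
}
```

Note that a root-relative href such as "/food/fruits/2" replaces the whole path, so with this base it resolves to http://www.foodsite.com/food/fruits/2 rather than to something under /home/.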

On Tue, Oct 12, 2010 at 10:50 PM, Santiago Basulto <
san...@gm...> wrote:

> Hello people.
>
> I'm starting with HTMLParser. It seems a great library. I've been doing
> some benchmarking and it runs really fast.
>
> Now I'm trying to improve it a little bit.
>
> In my software, I use something like this to extract all links:
>
> public class LinkVisitor extends NodeVisitor {
>        private Set<String> links = new HashSet<String>(100);
>        public LinkVisitor(){
>        }
>        public void visitTag(Tag tag) {
>                String name = tag.getTagName();
>                if ("a".equalsIgnoreCase(name)){
>                        String hrefValue = tag.getAttribute("href");
>                        links.add(hrefValue);
>                }
>        }
>        public Set<String> getLinks(){
>                return this.links;
>        }
>
> }
>
> But, reading a little bit, I found other classes that may help, though I
> don't know how to use them. Can anyone help me out?
>
> The idea is to extract all the links from a String (that contains an
> HTML page already read from a URLConnection). Is there any way to
> "canonize" them? I mean, if the href says "/food/fruits/2", convert it
> to "http://www.foodsite.com/home/fruits/2"?
>
>
> Thanks a lot!
>
> --
> Santiago Basulto.-
>
>

Re: [Htmlparser-user] Best way to extract all the links from a HTML page

From: Stanislav O. <orl...@gm...> - 2010-10-12 21:13:38

Hi
You may try to use filters (org.htmlparser.filters). In this way you'll
get all link tags from the page:

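import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.util.SimpleNodeIterator;

// parserMain.getParser(...) is presumably the poster's own helper that
// returns a Parser for the given URL.
Parser parser = parserMain.getParser(parseURL);

// Start with an empty list so the loop below is safe if parsing fails.
NodeList links = new NodeList();
try {
    links = parser.parse(new TagNameFilter("a"));
} catch (ParserException ex) {
    logger.error(null, ex);
}

// NodeList.elements() returns a SimpleNodeIterator over the matches.
for (SimpleNodeIterator sni = links.elements(); sni.hasMoreNodes();) {
    Node node = sni.nextNode();
    if (node instanceof LinkTag) {
        LinkTag lt = (LinkTag) node;
        // link text - lt.getLinkText()
        // link href  - lt.getLink()
    }
}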



On Tue, 2010-10-12 at 17:50 -0300, Santiago Basulto wrote:

> Hello people.
> 
> I'm starting with HTMLParser. It seems a great library. I've been doing
> some benchmarking and it runs really fast.
>
> Now I'm trying to improve it a little bit.
>
> In my software, I use something like this to extract all links:
>
> public class LinkVisitor extends NodeVisitor {
>     private Set<String> links = new HashSet<String>(100);
>     public LinkVisitor(){
>     }
>     public void visitTag(Tag tag) {
>         String name = tag.getTagName();
>         if ("a".equalsIgnoreCase(name)){
>             String hrefValue = tag.getAttribute("href");
>             links.add(hrefValue);
>         }
>     }
>     public Set<String> getLinks(){
>         return this.links;
>     }
>
> }
>
> But, reading a little bit, I found other classes that may help, though I
> don't know how to use them. Can anyone help me out?
>
> The idea is to extract all the links from a String (that contains an
> HTML page already read from a URLConnection). Is there any way to
> "canonize" them? I mean, if the href says "/food/fruits/2", convert it
> to "http://www.foodsite.com/home/fruits/2"?
> 
> 
> Thanks a lot!
>

[Htmlparser-user] Best way to extract all the links from a HTML page

From: Santiago B. <san...@gm...> - 2010-10-12 20:50:40

Hello people.

I'm starting with HTMLParser. It seems a great library. I've been doing
some benchmarking and it runs really fast.

Now I'm trying to improve it a little bit.

In my software, I use something like this to extract all links:

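import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Tag;
import org.htmlparser.visitors.NodeVisitor;

public class LinkVisitor extends NodeVisitor {
    // Distinct href values seen so far.
    private final Set<String> links = new HashSet<String>(100);

    public LinkVisitor() {
    }

    @Override
    public void visitTag(Tag tag) {
        String name = tag.getTagName();
        if ("a".equalsIgnoreCase(name)) {
            String hrefValue = tag.getAttribute("href");
            // Anchors without an href attribute return null; skip them.
            if (hrefValue != null) {
                links.add(hrefValue);
            }
        }
    }

    // Returns the href values collected while visiting.
    public Set<String> getLinks() {
        return this.links;
    }
}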

But, reading a little bit, I found other classes that may help, though I
don't know how to use them. Can anyone help me out?

The idea is to extract all the links from a String (that contains an
HTML page already read from a URLConnection). Is there any way to
"canonize" them? I mean, if the href says "/food/fruits/2", convert it
to "http://www.foodsite.com/home/fruits/2"?


Thanks a lot!

-- 
Santiago Basulto.-

[Htmlparser-user] Hey htmlparser-user 80% OFF. Lifeguard

From: TopPfizer's P. <htm...@li...> - 2010-08-30 10:48:45

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
<title>Newsletter</title>
</head>
<body>
<table width="620" cellpadding="0" cellspacing="0" align="center">
	<tr>
		<td>
		<div style="text-align: center">
			<font size="1" face="arial"><a href="http://as.portfinger.ru/?election=Ff7b614C252">View as a web
			page</a></font><br><br>
			<a href="http://on.portfinger.ru/?Law=27C91524D4">
			<img alt="Unable to view this image? Click here" src="http://Gazette.portfinger.ru/at.gif" style="border-width: 0px"></a><br>
			<a href="http://writers.Europe.com/that/parallel.php?Nigeria=3Dc8FFAeD3">
			<img alt="" src="http://his.flag.com/Dentistry/picture.gif" style="border-width: 0px"></a><br>
			<a href="http://the.or.com/was/English.php?industrial=620aD72ef63">
			<img alt="" src="http://aircraft.related.com/Cuban/The.gif" style="border-width: 0px"></a><br>
			<a href="http://December.is.com/of/the.php?uninhabited=4A7fA3c34E1">
			<img alt="" src="http://same.in.com/scavenged/for.gif" style="border-width: 0px"></a><br>
			<a href="http://the.the.com/political/homes.php?median=575CF17a1A">
			<img alt="" src="http://Saxon.overall.com/by/ed.gif" style="border-width: 0px"></a><br>
			<a href="http://for.Germanic.com/use/Great.php?in=F4bFd7783f9">
			<img alt="" src="http://being.MSA.com/an/customs.gif" style="border-width: 0px"></a><br>
			<span style="color: #EEE2E2; font-size: xx-small; font-family: Arial, Helvetica, sans-serif">
			The latter include fish-canning and meat-processing plants in the northern regions, as well as about 25 factories in the Mogadishu area, which manufacture pasta, mineral water, confections, plastic bags, fabric, hides and skins, detergent and soap, aluminum, foam mattresses and pillows, fishing boats, carry out packaging, and stone processing.<br>
			Bureau of Democracy, Human Rights, and Labor (2006-09-15).<br>
			Alfred Knopf retired in 1972, becoming chairman emeritus of the firm until his death in 1984.<br>
			The war was the largest and most destructive in human history, with 60million dead across the world.<br>
			<img alt="" src="http://George.students.com/May/laws.gif" style="border-width: 0px">
			Luce and His Empire (1972), outdated popular history.<br>
			Film industry has largely been based in and around Hollywood, California.<br>
			Displaying available languages on a multilingual website or software.<br>
			<img alt="" src="http://Channel.David.com/which/for.gif" style="border-width: 0px">
			Guillaume du Bellay, writer and general.<br>
			The current Constitution of Florida was ratified on November 5, 1968.<br>
			Census Bureau, Population Division.<br>
			President of the Executive Council.<br>
			Journalistic accounts and televised footage of the daily deprivation and indignities suffered by southern blacks, and of segregationist violence and harassment of civil rights workers and marchers, produced a wave of sympathetic public opinion that convinced the majority of Americans that the Civil Rights Movement was the most important issue in American politics in the early 1960s.<br>
			<img alt="" src="http://The.In.com/to/flags.gif" style="border-width: 0px">
			It requires a cadmium atom to capture sufficient neutrons and then undergo Beta decay.<br>
			King believed that organized, nonviolent protest against the system of southern segregation known as Jim Crow laws would lead to extensive media coverage of the struggle for black equality and voting rights.<br>
			New York Film Critics Circle Award for Best Actress.<br>
			Had children under the age of 18 living with them, 36.<br>
			</span>
		</div>
		<hr></td>
	</tr>
	<tr>
		<td><font size="1" face="arial">This e-mail message was sent to:
		htm...@li... <p>
		<a href="http://were.portfinger.ru/?her=cE0896f323A">Unsubsribe</a></p>
	<p>
	(c) 2007 of Lawrence A <a href="http://Eight.portfinger.ru/?considered=90ae2ECfa83">Privacy Statement</a>.<br>
	All rights reserved.</p>
	</font>
 </body>
</HTML>

Re: [Htmlparser-user] How to parser HTML string

From: Derrick O. <der...@gm...> - 2010-08-08 17:13:56

The Parser constructors that take a string [Parser (String resource) and
Parser (String resource, ParserFeedback feedback)] check whether the string
starts with an angle bracket ('<'). If it does, the string is assumed to be
HTML; otherwise it is assumed to be some sort of URL.

If you already have a parser, you can call setResource and pass it the HTML,
since this is the same mechanism the constructors use:

    /**
     * Set the html, a url, or a file.
     * @param resource The resource to use.
     * @exception IllegalArgumentException if <code>resource</code> is <code>null</code>.
     * @exception ParserException if a problem occurs in connecting.
     */
    public void setResource (String resource)
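
The check described above can be sketched in plain Java. This is only an illustration of the "starts with an angle bracket" rule, not the library's source: the class name ResourceKindSketch and the method looksLikeHtml are hypothetical, and skipping leading whitespace before the test is an assumption added here.

```java
final class ResourceKindSketch {

    /**
     * Returns true when the first non-whitespace character is '<',
     * i.e. when the string would be treated as HTML rather than a URL.
     * Skipping leading whitespace is an assumption of this sketch.
     */
    static boolean looksLikeHtml(String resource) {
        if (resource == null) {
            throw new IllegalArgumentException("resource cannot be null");
        }
        for (int i = 0; i < resource.length(); i++) {
            char ch = resource.charAt(i);
            if (!Character.isWhitespace(ch)) {
                return ch == '<';
            }
        }
        return false; // empty or all-whitespace input: not HTML in this sketch
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello</p></body></html>";
        String url = "http://htmlparser.sourceforge.net/";
        System.out.println(looksLikeHtml(html));
        System.out.println(looksLikeHtml(url));
    }
}
```

If the rule works as described, a fragment that begins with plain text (for example "Hello <b>world</b>") would be taken for a URL, so it may be safer to wrap such a string in a tag before passing it to setResource.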

On Sun, Aug 1, 2010 at 8:57 AM, Mohammad Waqar <waq...@gm...> wrote:

> How can I parse an HTML string stored in a variable?
>
> Vakar

[Htmlparser-user] How to parser HTML string

From: Mohammad W. <waq...@gm...> - 2010-08-01 06:57:36

How can I parse an HTML string stored in a variable?

Vakar

[Htmlparser-user] ConnectionTimeout after reconnect, caching?

From: Johann H. <h.h...@ic...> - 2010-07-28 12:47:26

Hello community,
I am writing a website parser with htmlparser and I think it's a great
library.
My problem is that the website I'm parsing shows me a captcha after a
certain number of crawls.
As a workaround I wrote a redial routine that reconnects my router to get
a new IP.
That works quite well, but my JVM seems to cache DNS.
I read this post http://forum.vis.ethz.ch/showthread.php?t=13457 and
applied everything suggested there,
but I still can't continue parsing after a reconnect; I get a
ConnectionTimeoutException from htmlparser.
It seems there might still be some kind of cache.
Could anybody tell me how I can get a new instance of Parser to
connect after a reconnect?

Thank you.
Hans.
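
For reference, the JVM-level settings usually involved in this kind of caching can be sketched as follows. This is a guess at the cause, not a confirmed fix: it assumes the library goes through the standard java.net connection classes, and that stale DNS entries or pooled keep-alive sockets are what survive the reconnect. The class name DnsCacheSketch and the method disableJvmCaches are hypothetical; the property names are standard Java networking properties.

```java
final class DnsCacheSketch {

    /**
     * Turns off JVM-level DNS caching and HTTP keep-alive.
     * This should run before the first network access, because the
     * JVM may read these properties only once.
     */
    static void disableJvmCaches() {
        // Do not cache successful DNS lookups.
        java.security.Security.setProperty("networkaddress.cache.ttl", "0");
        // Do not cache failed DNS lookups either.
        java.security.Security.setProperty("networkaddress.cache.negative.ttl", "0");
        // Ask HttpURLConnection not to reuse pooled sockets, which may
        // still be bound to the connection that existed before the redial.
        System.setProperty("http.keepAlive", "false");
    }

    public static void main(String[] args) {
        disableJvmCaches();
        System.out.println(java.security.Security.getProperty("networkaddress.cache.ttl"));
        System.out.println(System.getProperty("http.keepAlive"));
    }
}
```

Calling disableJvmCaches() once at program start, before the first Parser is created, is the intended usage. Whether it resolves the timeout depends on where the stale state actually lives.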

Re: [Htmlparser-user] HTML parser parsing script incorrectly

From: Derrick O. <der...@gm...> - 2010-07-08 04:38:36

Did you set STRICT to false? See:

http://htmlparser.sourceforge.net/javadoc/org/htmlparser/scanners/ScriptScanner.html
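
The flag referred to here is, as far as I know, the public static STRICT field on ScriptScanner, so the change would be a single assignment (ScriptScanner.STRICT = false;) made before parsing. To show why the setting matters, here is a plain-Java sketch of the two ways a scanner might look for the end of a script. It is an illustration only, not the library's implementation; the class name ScriptEndSketch and both method names are hypothetical.

```java
final class ScriptEndSketch {

    /** Strict rule: the script ends at the first "</" followed by a letter, quoted or not. */
    static int strictEnd(String script) {
        for (int i = 0; i + 2 < script.length(); i++) {
            if (script.charAt(i) == '<' && script.charAt(i + 1) == '/'
                    && Character.isLetter(script.charAt(i + 2))) {
                return i;
            }
        }
        return -1;
    }

    /** Lenient rule: the same test, but "</" inside a quoted string literal is ignored. */
    static int lenientEnd(String script) {
        char quote = 0; // 0 means "not inside a string literal"
        for (int i = 0; i < script.length(); i++) {
            char c = script.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == quote) {
                    quote = 0; // closing quote
                }
            } else if (c == '"' || c == '\'') {
                quote = c; // opening quote
            } else if (c == '<' && i + 2 < script.length()
                    && script.charAt(i + 1) == '/'
                    && Character.isLetter(script.charAt(i + 2))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String s = "document.write(\"<a href=\\\"/x\\\">Sign Off</a>\");</script>";
        System.out.println("strict stops at:  " + s.substring(strictEnd(s)));
        System.out.println("lenient stops at: " + s.substring(lenientEnd(s)));
    }
}
```

With the strict rule the scan stops at the quoted closing anchor tag inside document.write, which matches the symptom reported below. The lenient rule runs through to the real closing script tag.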



On Wed, Jul 7, 2010 at 9:48 PM, Niket Arora <nik...@ex...> wrote:

> I'm parsing the page
> http://www.healthline.com/search?q1=how+to+improve+prostate+blood+levels
> using the htmlparser API, and content from inside a script tag ends up in
> some other tag. The reason is that HTML tags appear in a string inside the
> JavaScript and are not escaped, so the htmlparser API closes on those tags.
>
> ================================================================================================================================================================================================
>
> <div id="myHealthlineHeader">
>         <script>
>               if(isLoggedIn()) {
>                 document.write("<a href=\"/action/LogOutServlet\">Sign Off</a> | <a rel=\"nofollow\" href=\"/myhealthline/account_overview.jsp\">My Healthline</a> | Welcome, <strong>" + getNickname() + "</strong>");
>                 document.getElementById("myHealthlineHeader").className = "hl_state_top_signed_in";
>               } else {
>
>                 document.write("<div style=\"float:right;text-align:right;padding:0 5px 0 0;\">&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;<a class=\"underlineless\" rel=\"nofollow\" href=\"/yourfeedback.jsp?url=%2Fsearch%3Fq1%3Dhow%2Bto%2Bimprove%2Bprostate%2Bblood%2Blevels\">Feedback</a></div>");
>                 document.write("<div style=\"float:right\"><a class=\"underlineless\" rel=\"nofollow\" href=\"/signin.jsp\">Sign in</a>&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;<a class=\"underlineless\" rel=\"nofollow\" href=\"/registration.jsp\">Join Now</a>&nbsp;</div>")
>                 document.getElementById("myHealthlineHeader").className = "hl_state_top";
>               }
>         </script>
> </div>
>
> ================================================================================================================================================================================================
>
> Is there any way to fix this issue?
>
> Regards
> Niket

[Htmlparser-user] HTML parser parsing script incorrectly

From: Niket A. <nik...@ex...> - 2010-07-07 20:07:05

I'm parsing the page http://www.healthline.com/search?q1=how+to+improve+prostate+blood+levels using the htmlparser API, and content from inside a script tag ends up in some other tag. The reason is that HTML tags appear in a string inside the JavaScript and are not escaped, so the htmlparser API closes on those tags.


================================================================================================================================================================================================

<div id="myHealthlineHeader">
        <script>
              if(isLoggedIn()) {
                document.write("<a href=\"/action/LogOutServlet\">Sign Off</a> | <a rel=\"nofollow\" href=\"/myhealthline/account_overview.jsp\">My Healthline</a> | Welcome, <strong>" + getNickname() + "</strong>");
                document.getElementById("myHealthlineHeader").className = "hl_state_top_signed_in";
              } else {

                document.write("<div style=\"float:right;text-align:right;padding:0 5px 0 0;\">&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;<a class=\"underlineless\" rel=\"nofollow\" href=\"/yourfeedback.jsp?url=%2Fsearch%3Fq1%3Dhow%2Bto%2Bimprove%2Bprostate%2Bblood%2Blevels\">Feedback</a></div>");
                document.write("<div style=\"float:right\"><a class=\"underlineless\" rel=\"nofollow\" href=\"/signin.jsp\">Sign in</a>&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;<a class=\"underlineless\" rel=\"nofollow\" href=\"/registration.jsp\">Join Now</a>&nbsp;</div>")
                document.getElementById("myHealthlineHeader").className = "hl_state_top";
              }
        </script>
</div>

================================================================================================================================================================================================

Is there any way to fix this issue?

Regards
Niket

[Htmlparser-user] Extract HTML Body and output as (X)HTML standards

From: Oliver S. <oli...@gm...> - 2010-07-05 16:31:02

Hi,

I need to read arbitrary HTML (HTML 4 Transitional, XHTML 1.0 Strict, ...), extract the body as a fragment, and output it again in another standard (XHTML).

Reading the file is simple enough:

		Parser p = new Parser(resource);
		NodeFilter f = new NodeClassFilter(BodyTag.class);
		NodeList listOfBodies = p.extractAllNodesThatMatch(f);
		Node firstBody = listOfBodies.elementAt(0);
		NodeList bodyChildren = firstBody.getChildren();
		System.out.println(bodyChildren.toHtml());

From this, how can I output either valid HTML 4.0 code or valid XHTML 1.0 code?

Best regards
Oliver

790 messages have been excluded from this view by a project administrator.
