htmlparser-developer Mailing List for HTML Parser (Page 32)

Brought to you by: derrickoswald

htmlparser-developer — The developer mailing list of the htmlparser project


[Htmlparser-developer] Dynamic page parsing bug fixed (ready for release 1.1)

From: Somik R. <so...@ya...> - 2002-04-05 07:14:43

Hi Folks,
    The dynamic page parsing bug is fixed, and as far as I've tested, I
am able to correctly parse pages like
http://search.yahoo.com/bin/search?p=dogs
which Mats had posted earlier.

    We are now ready for release 1.1. I'd be grateful for some help
in testing the parser, to see if there are any showstopper bugs for
this release. (Get the latest code from CVS)

Regards,
Somik

Re: [Htmlparser-developer] Re: Htmlparser-developer digest, Vol 1 #38 - 1 msg

From: Somik R. <so...@ya...> - 2002-04-05 03:11:21

> I have used parser available in JDK.
> If u say I can send u example.
Yes Asgher, pls go ahead.

Regards,
Somik


_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com

[Htmlparser-developer] Bug fixed in HTMLTag - report from Raj Sharma

From: Somik R. <so...@ya...> - 2002-04-05 03:04:56

Hi Folks,
    An important bug has been pointed out by Raj Sharma, which would
halt the parser if a page contained a link spread over two lines. This
was a bug in HTMLTag, and I was able to find it quickly, thanks to the
refactoring done earlier with the help of Arnaud.
    Also - HTMLLinkScanner and HTMLImageScanner have some small changes
in connection with the fix.
    Please get the latest code from CVS.

Regards,
Somik
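
[Editor's sketch] The message above does not show the actual change made in HTMLTag. As a rough illustration of the general idea only (keep consuming input lines until the tag's closing '>' has been seen), something like the following could work. The class TagJoiner and its method are hypothetical names, not part of HTMLParser, and the sketch assumes the first line begins with the tag.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical helper, not part of HTMLParser: joins a tag that is split
// across several input lines.
final class TagJoiner {

    /**
     * Consumes lines until one complete tag ("<" ... ">") has been seen,
     * joining continuation lines with a single space. Assumes the first
     * line starts with the tag. Returns null if the input ends first.
     */
    static String readTag(Iterator<String> lines) {
        StringBuilder tag = new StringBuilder();
        while (lines.hasNext()) {
            if (tag.length() > 0) {
                tag.append(' ');              // the line break becomes a space
            }
            tag.append(lines.next().trim());
            int end = tag.indexOf(">");
            if (end >= 0) {
                return tag.substring(0, end + 1);
            }
        }
        return null;                          // tag was never closed
    }

    public static void main(String[] args) {
        Iterator<String> lines = Arrays.asList(
                "<A", "HREF=\"http://www.somelink.com\">SomeText</A>").iterator();
        // prints: <A HREF="http://www.somelink.com">
        System.out.println(readTag(lines));
    }
}
```

This deliberately ignores a '>' that appears inside a quoted attribute value; a real parser would need to track quotes as well.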

[Htmlparser-developer] Re: [Htmlparser-user] extracting links

From: Somik R. <so...@ya...> - 2002-04-04 15:51:32

>How come when you use the parser on most sites to extract links it works
>fine but when you use it on search engine i.e.
>http://search.yahoo.com/bin/search?p=dogs which is a page with search
>results for dogs, it does not work?

Ah - this is a known bug. It doesn't work because the parser is not capable of
handling dynamic pages. This is actually not a difficult bug to fix. Version
1.10 of HTMLParser (the next release - coming soon) will contain this and
other fixes.

So you will have to wait till this weekend, or make the fix yourself - the
bug probably lies in HTMLParser.java itself, in the way a page extension is
handled.

Regards
Somik
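
[Editor's sketch] The message only says the bug "probably lies" in how a page extension is handled; the real code is not shown. Purely as an illustration of how an extension-based check can reject a dynamic URL, and of one more tolerant alternative, consider the following. PageTypeCheck and both method names are hypothetical.

```java
import java.net.URI;

// Hypothetical illustration, not HTMLParser code.
final class PageTypeCheck {

    /** Naive check: accepts only locations ending in .html or .htm. */
    static boolean looksLikeHtmlByExtension(String url) {
        String lower = url.toLowerCase();
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }

    /**
     * More tolerant check: any http or https URL is a candidate, and the
     * decision about its content is left until after it has been fetched.
     */
    static boolean isHttpResource(String url) {
        String scheme = URI.create(url).getScheme();   // null for relative URLs
        return "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
    }

    public static void main(String[] args) {
        String dynamic = "http://search.yahoo.com/bin/search?p=dogs";
        System.out.println(looksLikeHtmlByExtension(dynamic)); // false
        System.out.println(isHttpResource(dynamic));           // true
    }
}
```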



Re: [Htmlparser-developer] Parser of JDK

From: Somik R. <so...@ya...> - 2002-04-04 02:22:15

Hi Asgher,
> I have used parser available in JDK.
> If u say I can send u example.

    Yes, pls go ahead. I don't have much time till the weekend, and an example
would really help me get up to speed.
Regards,
Somik



[Htmlparser-developer] Re: Htmlparser-developer digest, Vol 1 #38 - 1 msg

From: Asgher A. <as...@lw...> - 2002-04-02 05:02:54

I have used parser available in JDK.
If u say I can send u example.

Asgher Ali
e-mail: as...@lw...


---------------------------------------------
Lahore Wide Web        "The Intranet Company"

http://www.lww.org/

[Htmlparser-developer] Re: [Htmlparser-user] Swing integration

From: Somik R. <so...@ya...> - 2002-04-01 15:22:00

Hi Craig
    Wow! That's a great question.
    Actually, I doubt if I could replace Sun Microsystems' code with mine. I
don't think Java is that open (or is it?)
However, we could think of writing our own adapter for the html parser that
might plug in in some way...
     I have never used Sun's html parser (If I had, I might not have started
this project).
     I will need to study Sun's parser before I can answer your question.
    But there do seem to be some interesting possibilities.

Regards
Somik
----- Original Message -----
From: "Craig Raw" <cr...@qu...>
To: <htm...@li...>
Sent: Monday, April 01, 2002 10:20 PM
Subject: [Htmlparser-user] Swing integration


> Has the HTML Parser been integrated into Swing's HTMLEditorKit to
> provide a better implementation of JEditorPane's HTML viewing
> capabilities? HTML Parser would need to replace
> javax.swing.text.html.parser.Parser, which is currently somewhat buggy.
> Anyone tried this?
>
> -craig

[Htmlparser-developer] Fixed major bug

From: Somik R. <so...@ya...> - 2002-03-31 09:19:42

Hi Folks,
     A major bug fix has been done. I had previously reported that the
parser crashes when encountering very dirty html of the form:
<A HREF="http://www.somelink.com">SomeText<A>

Instead of the end tag, a begin tag is put in by mistake, and the parser
promptly crashes. This called for a modification in the evaluate()
method, as the current scanners don't have more than local info
about the parsing process. But now, I've introduced a parameter which
takes in the scanner. So, if a tag was being parsed, and in the process
of the parsing, another tag starts being parsed, then the second tag
will now know that a scanner process is already running.

This enables the HTMLLinkScanner to come to the conclusion that its
current parsing activity is of a dirty html tag, and hence take the
appropriate action (flag the scanner into a dirty mode, and return an
HTMLEndTag - which is expected by the previous scanner).

This solves this bug - and finally we can handle some really crazy
pages...
This fix and some others, along with some additions (META and TITLE)
will make it to release 1.1 (coming soon). Currently, the latest code is
available through CVS.

In case any of you have written your own scanners - you will need to
modify the evaluate method signature to be compatible with the new
HTMLTagScanner.

Regards,
Somik
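
[Editor's sketch] The new evaluate() signature is not quoted in the message, so the following only illustrates the idea it describes: a scanner is told which scanner, if any, is already running, and a link scanner that is invoked while a link scan is in progress treats the stray begin tag as an end tag. All names here (DirtyLinkDemo, Scanner, LinkScanner, the string results) are hypothetical, and flagging the already-running scanner as dirty is an assumption about which scanner gets flagged.

```java
// All names are hypothetical; this is not the HTMLParser API.
final class DirtyLinkDemo {

    interface Scanner {
        /** previous is the scanner already in progress, or null at top level. */
        String evaluate(String tag, Scanner previous);
    }

    static final class LinkScanner implements Scanner {
        boolean dirty = false;

        @Override
        public String evaluate(String tag, Scanner previous) {
            // A second <A ...> arriving while a link scan is already running
            // is treated as a mistyped </A>.
            if (previous instanceof LinkScanner) {
                ((LinkScanner) previous).dirty = true;  // remember the page was dirty
                return "END_TAG";
            }
            return "LINK_START";
        }
    }

    public static void main(String[] args) {
        LinkScanner scanner = new LinkScanner();
        System.out.println(scanner.evaluate("<A HREF=\"x\">", null));  // LINK_START
        System.out.println(scanner.evaluate("<A>", scanner));          // END_TAG
        System.out.println(scanner.dirty);                             // true
    }
}
```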

[Htmlparser-developer] New Features

From: Somik R. <so...@ya...> - 2002-03-24 05:51:01

Dear Users,
    Thanks for using HTMLParser. HTMLParser is getting some new
features, namely,
[1] HTMLMetaTag scanner
[2] Support for pages that are not ".html" - I am planning to bring dynamic
pages under the purview of the parser as well, though I might need a bit
of help for this.

I wanted to have some feedback from the user community - what are the
features that you would really like to see added to the parser (or are you
quite happy with the parser as is?)

Regards,
Somik

[Htmlparser-developer] Issue with dirty parsing

From: Somik R. <so...@ya...> - 2002-03-24 05:48:24

Hi Folks,
    I am encountering a really strange scenario - try to create a link
like this in a web page -
<A HREF="...">something<A>

i.e. instead of putting a close tag </A>, put an open tag. I find that
Internet Explorer renders it just fine. Now if IE renders it, then
perhaps we ought to support it in HTML Parser. However, it's not so easy -

check out the latest source from CVS - I have put in a testcase for this
situation which is failing (in HTMLLinkScannerTest -
com.kizna.html.scannersTests)

The problem is in HTMLReader.find() - which goes into a sort of
recursion - when it finds <A ...> the first time, the scanner asks it to
find the remaining tags. Now if the second A is encountered, it will try
to keep parsing till the end tag is encountered, which won't happen. Now,
I need a clean elegant way of telling the reader not to expand in
exceptional situations like this one.

I can of course do it with some flags - but before I do it - I was
wondering if anyone has insights on this problem - and if anyone thinks
we should not support this dirty html even if IE does.

Regards,
Somik
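
[Editor's sketch] One way to keep a link scan from running forever, shown here only as an illustration and not as what HTMLReader.find() does, is to give the scan three stopping conditions instead of one: the real end tag, a new <A> begin tag, or the end of the input. LinkTextScan and its methods are hypothetical names, and the input is assumed to start with the opening link tag.

```java
// Hypothetical illustration, not HTMLParser code.
final class LinkTextScan {

    /**
     * Returns the text that follows the opening <A ...> tag at the start of
     * html, stopping at </A>, at another <A ...>, or at the end of input.
     */
    static String linkText(String html) {
        int start = html.indexOf('>') + 1;          // just past the opening tag
        for (int i = start; i < html.length(); i++) {
            if (isTagAt(html, i, "</a") || isTagAt(html, i, "<a")) {
                return html.substring(start, i);
            }
        }
        return html.substring(start);               // no terminator found
    }

    /** True if prefix occurs at i (ignoring case) and is a whole tag name. */
    private static boolean isTagAt(String html, int i, String prefix) {
        if (!html.regionMatches(true, i, prefix, 0, prefix.length())) {
            return false;
        }
        int next = i + prefix.length();
        if (next >= html.length()) {
            return true;
        }
        char c = html.charAt(next);
        return c == '>' || Character.isWhitespace(c);  // excludes <abbr>, <address>
    }

    public static void main(String[] args) {
        // prints: something
        System.out.println(linkText("<A HREF=\"x.html\">something<A>"));
    }
}
```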

[Htmlparser-developer] Release 1.04 is out

From: Somik R. <so...@ya...> - 2002-03-22 16:40:58

Hi Folks,
    Release 1.04 is out. It has the following bug fixes:
[1] Parsing JSP tags which had tags within inverted commas was causing
problems.
[2] A link with no link url would cause the parser to crash with a null
pointer exception.

The above bugs were reported by Gordon Deudney and Robert Kausch.

More test cases added.

Regards,
Somik
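
[Editor's sketch] Fix [1] is not described in detail. Assuming the problem was a '>' inside a quoted attribute value being mistaken for the end of the enclosing tag, a quote-aware search for the tag end might look like this. QuoteAwareTagEnd and findTagEnd are hypothetical names, not HTMLParser API.

```java
// Hypothetical illustration, not HTMLParser code.
final class QuoteAwareTagEnd {

    /**
     * Returns the index of the '>' that closes the tag beginning at from,
     * skipping any '>' inside single or double quotes; -1 if there is none.
     */
    static int findTagEnd(String html, int from) {
        char quote = 0;                       // 0 means "not inside quotes"
        for (int i = from; i < html.length(); i++) {
            char c = html.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;                // closing quote
                }
            } else if (c == '"' || c == '\'') {
                quote = c;                    // opening quote
            } else if (c == '>') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String s = "<input value=\"<%= x %>\" name=\"a\">rest";
        System.out.println(s.indexOf('>'));      // naive search stops inside the quotes
        System.out.println(findTagEnd(s, 0));    // quote-aware search finds the real end
    }
}
```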

Re: [Htmlparser-developer] HTMLParser Sample App

From: Somik R. <so...@ya...> - 2002-03-12 08:52:14

Attachments: TableParser.java

Hi Don,
    It would be appreciated if you could post usage questions to the
htmlparser-user mailing list (link is at http://htmlparser.sourceforge.net).
    To your query - the code you posted seems rather complex for a not so
complex task :)

    Here's how you would do it in HTML Parser (in the attached code). The
code I have given is the shortcut way. There is a way to get much shorter
code than what I am providing you, but that requires getting into the design
docs of the parser - and writing a Table Scanner. Then your code could
become something like this:

HTMLParser parser = new HTMLParser("http://www.nba.com");
HTMLNode node;
int tableCount = 0;
for (Enumeration e = parser.elements(); e.hasMoreElements();) {
    node = (HTMLNode) e.nextElement();
    if (node instanceof HTMLTableNode) {
        tableCount++;
        if (tableCount == 4) {
            HTMLTableNode tableNode = (HTMLTableNode) node;
            tableNode.print();
        }
    }
}

Regards,
Somik


[Htmlparser-developer] HTMLParser Sample App

From: Don T. <dta...@e-...> - 2002-03-11 16:37:08

Hi,
	I am attempting to grab the content of a certain table on any website. For
instance, I'd like to get all of the text, tags, comments, etc. contained in
the 4th table I run across. I've been able to do this successfully using
the HTMLEditorKit in Swing, but it has a few bugs.

Would your HTML Parser be useful for this scenario, and if so, could you
give me some guidance on how to start?

Thanks,
	Don


Here's my code that goes and gets the contents of the 4th table at nba.com

import java.io.*;
import java.net.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

/**
 * This small demo program shows how to use the
 * HTMLEditorKit.Parser and its implementing class
 * ParserDelegator in the Swing system.
 */

public class HtmlParseDemo2 {
    public static void main(String [] args) {
        Reader r;
        String host = "";
        String spec = "http://www.nba.com";
       long endTime;
       long endTime2;
       long startTime = System.currentTimeMillis();
       	String snippet = "";


        try {
            if (spec.indexOf("://") > 0) {
                URL u = new URL(spec);
                host = u.getHost();
                Object content = u.getContent();

                if (content instanceof InputStream) {

                    r = new InputStreamReader((InputStream)content);
                }
                else if (content instanceof Reader) {
                    r = (Reader)content;
                }
                else {
                    throw new Exception("Bad URL content type.");
                }
            }
            else {
                r = new FileReader(spec);
            }

			endTime = System.currentTimeMillis();
            System.out.println("Time to complete connection: " + (endTime -
startTime));

            HTMLEditorKit.Parser parser;
            System.out.println("About to parse " + spec);
            parser = new ParserDelegator();

            HTMLParseLister2 snippetCallback = new HTMLParseLister2(host);

            //Parse Away!
            parser.parse(r, snippetCallback, true);
            r.close();


            endTime2 = System.currentTimeMillis();
            System.out.println("Time to complete: " + (endTime2 -
startTime));
        }
        catch (Exception e) {
            System.err.println("Error: " + e);
            e.printStackTrace(System.err);
        }
    }
}

/**
 * HTML parsing proceeds by calling a callback for
 * each and every piece of the HTML document.  This
 * simple callback class simply prints an indented
 * structural listing of the HTML data.
 */
class HTMLParseLister2 extends HTMLEditorKit.ParserCallback
{



   int indentSize = 0;
   int tableNum = 0;
    String atts;
    String tabNum;
    String endTable;
    String tableLevel;
    Stack tableStack = new Stack();
   boolean finished = false;
    HTML.Tag selectedTag = HTML.Tag.TABLE;
    String selectedTable = Integer.toString(4);
   boolean inImportantTag = false;
   StringBuffer snippetString = new StringBuffer();



   private String host;



   public HTMLParseLister2(String host) {
    this.host = host;
    }

    public String  getSnippet() {
		return snippetString.toString();
	}

    protected void indent() {
        indentSize += 4;
    }

    protected void unIndent() {
        indentSize -= 4; if (indentSize < 0) indentSize = 0;
    }

    protected void pIndent() {
        for(int i = 0; i < indentSize; i++) System.out.print(" ");
    }

    public void handleText(char[] data, int pos) {
       if (!tableStack.empty() && !finished)
       		{
       	tableLevel = (String)tableStack.peek();
        	if (Integer.parseInt(tableLevel) >=
(Integer.parseInt(selectedTable)))
       			{
        		//pIndent();
        		String str = new String(data);
       			System.out.println(str);
        		}
       		}

       if (inImportantTag)
    		{
    		String str = new String(data);
        	System.out.println(str);
    		}
    }

	// ********************************************************
    public void handleComment(char[] data, int pos) {

    	if (!tableStack.empty() && !finished)
    		{
    		tableLevel = (String)tableStack.peek();
        	if (Integer.parseInt(tableLevel) >=
(Integer.parseInt(selectedTable)))
    			{
        		//pIndent();
        		String str = new String(data);
        		//System.out.println("<!--" + str + "-->");
        		//indent();
        		//pIndent();
    			}
    		}

    	if (inImportantTag)
    		{
    		String str = new String(data);
        	System.out.println("<!--" + str + "-->");
    		}

    }
	// ********************************************************

	// ********************************************************
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
    	// Is this tag one of the few that we want to list outside the chosen component
    	if (t == HTML.Tag.STYLE || t == HTML.Tag.LINK)
    		{
    		atts = listAttributes(a);
    		inImportantTag = true;
    		System.out.print("<" + t.toString() + " " + atts + ">");
    		return;
    		}

       if (t == selectedTag && !finished)
       		{

     		//pIndent();
     		tableNum++;
        	tabNum = Integer.toString(tableNum);
        	tableStack.push(tabNum);
        	atts = listAttributes(a);
        	tableLevel = (String)tableStack.peek();
        	if (Integer.parseInt(tableLevel) >=
(Integer.parseInt(selectedTable)))
        		{
        			//System.out.println("<Table#" + tableLevel + ">");

        		}
       		}

       if (!tableStack.empty() && !finished) {
       		tableLevel = (String)tableStack.peek();
       	if (Integer.parseInt(tableLevel) >=
(Integer.parseInt(selectedTable)))
       			{
       			atts = listAttributes(a);
        		System.out.println("<" + t.toString() + " " + atts + ">");
        		}
       }
    }
    // ********************************************************


	// ********************************************************
    public void handleEndTag(HTML.Tag t, int pos) {
    	if (inImportantTag)
    		{
    		inImportantTag = false;
    		System.out.println("</" + t.toString() + ">");
    		}

    	if (!tableStack.empty() && !finished)
    		{
       		if (t == selectedTag)
       			{
        			//unIndent();
        			//pIndent();
        		tableLevel = (String)tableStack.peek();
       		if (Integer.parseInt(tableLevel) >=
(Integer.parseInt(selectedTable))){
        			System.out.println("</" + t.toString() + ">");
        			}
        		if (tableStack.peek().equals(selectedTable))
        			finished = true;
        		endTable = (String) tableStack.pop();
        		}
    		}
       if (!tableStack.empty() && !finished) {
       	tableLevel = (String)tableStack.peek();
       	if (Integer.parseInt(tableLevel) >=
(Integer.parseInt(selectedTable)) && t != selectedTag) {
       			//pIndent();
        		System.out.println("</" + t.toString() + ">");
        		//pIndent();
        	}
       }
    }
	// ********************************************************



	// ********************************************************
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos)
{




    	if (t == HTML.Tag.LINK && !finished)
    		{
    		atts = listAttributes(a);
    		System.out.println("<" + t.toString() + " " + atts + ">");
    		}

    	if (!tableStack.empty() && !finished)
    		{


    		atts = listAttributes(a);
    		if(a.getAttribute(HTML.Attribute.ENDTAG) != null)
    			{
    			handleEndTag(t, pos);
    			return;
    			}
    		//if (tableStack.peek() == selectedTable)
        		//pIndent();

        	tableLevel = (String)tableStack.peek();
        	if (Integer.parseInt(tableLevel) >=
(Integer.parseInt(selectedTable)))
        		System.out.println("<" + t.toString() + " " + atts + ">");
    		}
    }
	// ********************************************************




	// ********************************************************
	private String listAttributes(AttributeSet attributes) {
    	Enumeration e = attributes.getAttributeNames();
    	String attString = "";

    	while (e.hasMoreElements()) {
      		Object name = e.nextElement();
      		Object value = attributes.getAttribute(name);

      		if (name.toString().equals("href") || name.toString().equals("src")
|| name.toString().equals("action"))
      			{
      			if (value.toString().charAt(0) == '/')
      				value = host + value;
      			}
      		attString = attString + name + "=\"" + value + "\" ";

    	}
    	return attString;
  	}
	// ********************************************************

	// ********************************************************
    public void handleError(String errorMsg, int pos){
        //System.out.println("Parsing error: " + errorMsg + " at " + pos);
    }
}

[Htmlparser-developer] HTMLParser 1.03 is out

From: Somik R. <so...@ya...> - 2002-03-04 14:28:27

HTMLParser 1.03 has been released. It contains a bug fix in
HTMLRemarkNode which was causing the parser to crash on pages with
remarks going over one line. A test case for the bug has been added in
HTMLRemarkNodeTest.

The release also contains the design documentation in the zip. Thanks to
Serge Kruppa for pointing out the bug.

Regards
Somik
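
For readers wondering what "remarks spanning more than one line" involves when a page is read line by line, here is a rough sketch. It is not the HTMLRemarkNode code; the class RemarkScan and its method are hypothetical and only illustrate carrying an open remark over from one line to the next.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RemarkScan {
    /** Collects the text of every <!-- ... --> remark, including multi-line ones. */
    static List<String> remarks(List<String> lines) {
        List<String> found = new ArrayList<>();
        StringBuilder current = null; // non-null while a remark is still open
        for (String line : lines) {
            int pos = 0;
            while (pos < line.length()) {
                if (current == null) {
                    int open = line.indexOf("<!--", pos);
                    if (open < 0) {
                        break; // no further remark starts on this line
                    }
                    current = new StringBuilder();
                    pos = open + 4;
                } else {
                    int close = line.indexOf("-->", pos);
                    if (close < 0) {
                        // Remark continues on the next line: keep the rest of this one.
                        current.append(line, pos, line.length());
                        pos = line.length();
                    } else {
                        current.append(line, pos, close);
                        found.add(current.toString());
                        current = null;
                        pos = close + 3;
                    }
                }
            }
            if (current != null) {
                current.append('\n'); // preserve the line break inside the remark
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList("<p><!-- first", "second --></p>", "<!--x-->");
        System.out.println(remarks(page));
    }
}
```

The point of the sketch is the `current` buffer: a scanner that assumes the closing "-->" is on the same line as the opening "<!--" has nothing to fall back on when it is not.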

Re: [Htmlparser-developer] HTMLParser 1.02

From: Somik R. <so...@ya...> - 2002-01-18 23:55:06

> What is the Parse.jar file in htmlparser.jar?

Ah, I was wondering why the size was so large. Thanks for pointing it out.

> I would like htmlparser.jar to be renamed HTMLParser.jar,
> to match the name of the application.
>
> I happened to refer to it with capital letters in my application.
> It is easy for me to make this change, but someone else who
> does the same might not notice the difference.

Well, class naming conventions are different from jar naming conventions.
I thought keeping everything in lowercase was simpler.

> Today I replaced my modified version 0.98
> with the official version 1.02, and after I resolved some
> incompatibilities (mainly the BufferedReader thing)
> it seemed to work as it should.

Great!
Any suggestions on where we go from here? It really bothers me that the
parser does not show up on Google when I search for "html parser java".
How do we go about giving it more visibility?

Cheers,
Somik



[Htmlparser-developer] HTMLParser 1.02

From: Kaarle K. <kaa...@ik...> - 2002-01-18 20:24:23

hi,

What is the Parse.jar file in htmlparser.jar?

I would like htmlparser.jar to be renamed HTMLParser.jar,
to match the name of the application.

I happened to refer to it with capital letters in my application.
It is easy for me to make this change, but someone else who
does the same might not notice the difference.

Today I replaced my modified version 0.98
with the official version 1.02, and after I resolved some
incompatibilities (mainly the BufferedReader thing)
it seemed to work as it should.

Kaarle

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844

[Htmlparser-developer] Design Documentation added, Website overhauled

From: Somik R. <so...@ya...> - 2002-01-16 14:09:40

Hi Folks,
    Check http://htmlparser.sourceforge.net for a totally new look.
Design documentation with sample programs has been added.
    Feedback is welcome.

Regards,
Somik

[Htmlparser-developer] Maintenance release v1.02

From: Somik R. <so...@ya...> - 2002-01-09 16:36:33

Hi Folks,
    Another bug was detected in HTMLStyleScanner, and has been
immediately fixed. v1.02 has been released with this fix, and another
one - which allows scanning of Finnish pages to proceed properly.

Regards,
Somik

Re: [Htmlparser-developer] htmlparser 1.0 (Issue with mtv3 is that of internationalization)

From: Somik R. <so...@ya...> - 2002-01-09 11:50:17

Dear Kaarle,
    Thank you very much! You are quite right, I forgot I was using
Shift-JIS for Japanese encoding support, and SJIS is a Microsoft-specific
standard, not Unicode. If I use a Unicode encoding, it should be
fine. I will try with UTF-8, and will need your help to co-ordinate some
more tests.
    Meanwhile this style thing is proving to be a headache. I just got a
report that it's crashing on Google. Need to add more test cases.

Regards,
Somik

----- Original Message -----
  From: Kaarle Kaila
  To: Somik Raha
  Sent: Wednesday, January 09, 2002 2:40 AM
  Subject: Re: [Htmlparser-developer] htmlparser 1.0 (Issue with mtv3 is that of internationalization)


  At 22:37 8.1.2002 +0530, Somik Raha wrote:

    Hi Kaarle,
        I found the reason for the last problem - the site http://www.mtv3.fi
    has a link in Finnish. That link is not being interpreted correctly by the
    parser. The link is:
    <a href="/ks/ks_20020701b.shtml">Palveluun pääset tästä</a>


  hi Somik,

  HTMLParser reads lines from the net. It opens the connection with
  the command

  reader = new HTMLReader(new BufferedReader(new InputStreamReader(uc.getInputStream(),"SJIS")),resourceLocn);

  I don't know what SJIS stands for. The Java API does not list it,
  but lists ISO-8859-1 among others.
  Check the InputStreamReader constructor. With ISO-8859-1 it does not
  hang like it did with SJIS!
  SJIS seems to make everything 7-bit ASCII.

  reader = new HTMLReader(new BufferedReader(new InputStreamReader(uc.getInputStream(),"ISO-8859-1")),resourceLocn);

  With this setting at least Finnish characters come out correctly.
  I also downloaded from CVS the two files you had changed,
  and I could read www.mtv3.fi. It even reads my web page (rather strange
  output though).

  In Japan I would expect internationalization to be an issue?
  Wouldn't Unicode be required there?

  regards
  Kaarle


    What's happening is that the last < is being corrupted. I haven't faced a
    problem with internationalization till now - and I am kind of stuck with
    this one. Maybe you'd be in a better position to solve it than me. I will
    make the release with the other bug fixed, and I'd be grateful if you can
    proceed from there.

    Regards,
    Somik


  ---------------------------------------------
  Kaarle Kaila
  http://www.iki.fi/kaila
  mailto:kaa...@ik...
  tel: +358 50 3725844
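
The effect discussed in this thread, where the charset name handed to InputStreamReader decides how the bytes of a page turn into characters, can be reproduced without any network access. The sketch below is only an illustration: the class CharsetDemo and its method are made up, and the byte array stands in for the word "pääset" from the mtv3 link as it would arrive in ISO-8859-1. SJIS is left out because that charset is not guaranteed to be available in every Java runtime.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class CharsetDemo {
    /** Reads the first line of the stream, decoding the bytes with the named charset. */
    static String firstLine(InputStream in, String charsetName) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charsetName));
        try {
            return reader.readLine();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // "pääset" as ISO-8859-1 bytes: each 'ä' is the single byte 0xE4.
        byte[] latin1 = {'p', (byte) 0xE4, (byte) 0xE4, 's', 'e', 't'};

        String right = firstLine(new ByteArrayInputStream(latin1), "ISO-8859-1");
        String wrong = firstLine(new ByteArrayInputStream(latin1), "UTF-8");

        System.out.println("ISO-8859-1 decodes correctly: " + right.equals("p\u00E4\u00E4set"));
        System.out.println("UTF-8 gives the same text:    " + wrong.equals(right));
    }
}
```

The same bytes only read back as Finnish text when the reader is told the encoding the server actually used, which is why a hard-coded charset works for some sites and corrupts others.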

[Htmlparser-developer] Maintenance v1.01 released

From: Somik R. <so...@ya...> - 2002-01-08 17:35:21

Hi Folks,
    An important bug fix has been done. The parser was crashing on style
tags - this has been fixed.
Regards,
Somik

Re: [Htmlparser-developer] htmlparser 1.0

From: Somik R. <so...@ya...> - 2002-01-08 15:46:16

Hi Kaarle,
    To answer your basic question - the crawler will crawl through a URL (like
websnake and similar robot crawlers). It will pick up links and visit those
links, and so on recursively, down to the depth you define.
    The bugs you see are not because of the crawler code, but because of some
parser bugs. The scanner bugs came in when I tried to fix the case where the
style tags are in one big line with other stuff. Obviously, there were not
enough test cases.
    Thankfully, you are htmlparser's best tester :)
    Your site and http://www.yle.fi are working fine now. mtv3 is giving the
weird out-of-memory exception and I am now fixing that. As soon as that's done,
maintenance release 1.01 will be out.

Cheers,
Somik

----- Original Message -----
From: "Kaarle Kaila" <kaa...@ik...>
To: <htm...@li...>
Sent: Tuesday, January 08, 2002 3:34 AM
Subject: [Htmlparser-developer] htmlparser 1.0


> I tried the example applications using the bat-files
> with htmlparser 1.0, without much success.
>
> 1)
> runCrawler http://www.google.com 1
> This gives a list of links on the abovementioned page, I assume.
>
> 2) (Finnish broadcasting company)
> runCrawler http://www.yle.fi 1
> This throws
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 27
>
> 3) (Finnish commercial TV station)
> runCrawler http://www.mtv3.fi 1
> This throws
> Exception in thread "main" java.lang.OutOfMemoryError
>          <<no stack trace available>>
>
> 4) my own simple homepage
>
> After a rather long time throws:
> Crawling to http://www.microsoft.com/ContentRedirect.asp?prd=iis&sbp=&pver=5.0&pid=&ID=404&cat=web&os=&over=&hrd=&Opt1=&Opt2=&Opt3= crawlDepth = 0
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 23
>          at java.lang.String.substring(Unknown Source)
> ........
> I don't think I have such Microsoft links on my page. Probably something
> to do with activeisp.com, which provides me with disk space??
>
> Similar result from my software page at www.kk-software.fi
> --------------------
> As a result of these experiments I did not understand what the Robot tries
> to do??
>
> Any explanations for this?
> regards
> Kaarle
>
> ---------------------------------------------
> Kaarle Kaila
> http://www.iki.fi/kaila
> mailto:kaa...@ik...
> tel: +358 50 3725844
>
>
>
> _______________________________________________
> Htmlparser-developer mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer



Re: [Htmlparser-developer] htmlparser 1.0

From: Somik R. <so...@ya...> - 2002-01-08 15:16:32

Hi Kaarle,
    Thanks for pointing this out.
    It's not a bug in the crawler, but in the parser itself - in
HTMLStyleScanner...
    I am trying to fix it ASAP.
Regards,
Somik
----- Original Message -----
From: "Kaarle Kaila" <kaa...@ik...>
To: <htm...@li...>
Sent: Tuesday, January 08, 2002 3:34 AM
Subject: [Htmlparser-developer] htmlparser 1.0


> I tried the example applications using the bat-files
> with htmlparser 1.0, without much success.
>
> 1)
> runCrawler http://www.google.com 1
> This gives a list of links on the abovementioned page, I assume.
>
> 2) (Finnish broadcasting company)
> runCrawler http://www.yle.fi 1
> This throws
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 27
>
> 3) (Finnish commercial TV station)
> runCrawler http://www.mtv3.fi 1
> This throws
> Exception in thread "main" java.lang.OutOfMemoryError
>          <<no stack trace available>>
>
> 4) my own simple homepage
>
> After a rather long time throws:
> Crawling to http://www.microsoft.com/ContentRedirect.asp?prd=iis&sbp=&pver=5.0&pid=&ID=404&cat=web&os=&over=&hrd=&Opt1=&Opt2=&Opt3= crawlDepth = 0
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 23
>          at java.lang.String.substring(Unknown Source)
> ........
> I don't think I have such Microsoft links on my page. Probably something
> to do with activeisp.com, which provides me with disk space??
>
> Similar result from my software page at www.kk-software.fi
> --------------------
> As a result of these experiments I did not understand what the Robot tries
> to do??
>
> Any explanations for this?
> regards
> Kaarle
>
> ---------------------------------------------
> Kaarle Kaila
> http://www.iki.fi/kaila
> mailto:kaa...@ik...
> tel: +358 50 3725844
>
>
>



[Htmlparser-developer] htmlparser 1.0

From: Kaarle K. <kaa...@ik...> - 2002-01-07 22:06:18

I tried the example applications using the bat-files
with htmlparser 1.0, without much success.

1)
runCrawler http://www.google.com 1
This gives a list of links on the abovementioned page, I assume.

2) (Finnish broadcasting company)
runCrawler http://www.yle.fi 1
This throws
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 27

3) (Finnish commercial TV station)
runCrawler http://www.mtv3.fi 1
This throws
Exception in thread "main" java.lang.OutOfMemoryError
         <<no stack trace available>>

4) my own simple homepage

After a rather long time throws:
Crawling to http://www.microsoft.com/ContentRedirect.asp?prd=iis&sbp=&pver=5.0&pid=&ID=404&cat=web&os=&over=&hrd=&Opt1=&Opt2=&Opt3= crawlDepth = 0
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 23
         at java.lang.String.substring(Unknown Source)
........
I don't think I have such Microsoft links on my page. Probably something
to do with activeisp.com, which provides me with disk space??

Similar result from my software page at www.kk-software.fi
--------------------
As a result of these experiments I did not understand what the Robot tries
to do??

Any explanations for this?
regards
Kaarle

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844

[Htmlparser-developer] Zip file corrupted, fixed now

From: Somik R. <so...@ya...> - 2002-01-05 17:11:17

Hi Folks,
    Sorry about that, the zip file that was uploaded seemed to be
corrupted. It's fixed, and you should be able to download it now.

Regards,
Somik

14 messages has been excluded from this view by a project administrator.
