htmlparser-developer Mailing List for HTML Parser (Page 17)

Brought to you by: derrickoswald

htmlparser-developer — The developer mailing list of the htmlparser project

Re: [Htmlparser-developer] character encoding (for Derrick)

From: Somik R. <so...@ya...> - 2003-03-30 23:25:06

> It's most likely that users
> don't know the character set though, either from the HTTP or HTML
> header, so automatic handling by default is best.

I agree.

> I noted that four tests are failing after last week's integration:

>     testStringBeanListener

This one has been failing for a while.

> I'm not sure about the others, but the bean listener test is failing
> because the handling of tables has changed.  The test is to show that
> extracted text contains the link URLs when the links property is set to
> true. It would now have to dig into the table tags to find the link.
> I'm looking at collectInto() but can't see how to collect string and
> link tags so the links can be inserted in context into the text (where
> they are found).  I'm also wrestling with the issue of handling
> <pre></pre>, since collectInto() doesn't seem to be able to give that
> kind of information. I guess collectInto() is too blunt a tool.

If you're trying to collect strings AND links and keep them in context,
your best bet is to write your own visitor.
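
A minimal sketch of the visitor suggestion, not the htmlparser API: the types below (SketchNode, SketchVisitor, TextNode, LinkNode, ContainerNode) are hypothetical stand-ins invented for illustration. It shows one way a custom visitor could emit each link URL at the point where the link occurs, even inside table cells, while tracking <pre> nesting so whitespace is kept verbatim there.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical node model for illustration only; these are NOT htmlparser classes.
interface SketchNode { void accept(SketchVisitor visitor); }

interface SketchVisitor {
    void visitText(String text);
    void visitLink(String href, List<SketchNode> children);
    void visitContainer(String tagName, List<SketchNode> children);
}

final class TextNode implements SketchNode {
    final String text;
    TextNode(String text) { this.text = text; }
    public void accept(SketchVisitor visitor) { visitor.visitText(text); }
}

final class LinkNode implements SketchNode {
    final String href;
    final List<SketchNode> children;
    LinkNode(String href, SketchNode... children) {
        this.href = href;
        this.children = Arrays.asList(children);
    }
    public void accept(SketchVisitor visitor) { visitor.visitLink(href, children); }
}

final class ContainerNode implements SketchNode {
    final String tagName;
    final List<SketchNode> children;
    ContainerNode(String tagName, SketchNode... children) {
        this.tagName = tagName;
        this.children = Arrays.asList(children);
    }
    public void accept(SketchVisitor visitor) { visitor.visitContainer(tagName, children); }
}

// Collects text in document order and appends <url> right after each link's text.
final class TextWithLinksVisitor implements SketchVisitor {
    private final StringBuilder out = new StringBuilder();
    private int preDepth = 0; // greater than zero while inside <pre>

    public void visitText(String text) {
        // Inside <pre> keep whitespace as-is; elsewhere collapse runs of whitespace.
        out.append(preDepth > 0 ? text : text.replaceAll("\\s+", " "));
    }

    public void visitLink(String href, List<SketchNode> children) {
        for (SketchNode child : children) child.accept(this);
        out.append('<').append(href).append('>');
    }

    public void visitContainer(String tagName, List<SketchNode> children) {
        boolean pre = "pre".equalsIgnoreCase(tagName);
        if (pre) preDepth++;
        for (SketchNode child : children) child.accept(this); // descends into tables too
        if (pre) preDepth--;
    }

    static String extract(SketchNode root) {
        TextWithLinksVisitor visitor = new TextWithLinksVisitor();
        root.accept(visitor);
        return visitor.out.toString();
    }
}

final class VisitorSketch {
    static SketchNode sample() {
        return new ContainerNode("table",
            new ContainerNode("td",
                new TextNode("See   the "),
                new LinkNode("http://example.com/", new TextNode("home page"))),
            new ContainerNode("pre", new TextNode("a  b")));
    }

    public static void main(String[] args) {
        System.out.println(TextWithLinksVisitor.extract(sample()));
    }
}
```

Because the visitor walks children itself, the link inside the table cell is found without any special table handling, which is the part collectInto() could not express.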

>     testThreadSafety

Thanks for reporting this - on my end this one's passing. I had left one
last variable in TagParser, and I thought it would affect thread safety.
So I rigged up that test, but surprisingly it passed every time on my
end. Can you send me the failure message? I might need to rework
TagParser again.


>     testScriptCodeExtraction
>     testScriptCodeExtractionWithMultipleQuotes

You can ignore these two - they actually demonstrate a bug which I have
no clue about, and I think there's little we can do about it. From my
earlier integration release mail (last week):

Thanks are also due to Huang-Chun Yu for uncovering a serious bug with the
script scanning mechanism. The parser can currently handle script tags like:

<script>
<!--
    code here
-->
</script>

But when the tags are like:

<script>
    code here
</script>

the parser is unable to identify the code and treats it like regular tags.
Such pages are quite widespread and ought to be supported. I was curious if
anyone has ideas on solving this - given the existing design - fresh ideas
often lead to a better perspective.


Regards,
Somik

Re: [Htmlparser-developer] character encoding (for Derrick)

From: Derrick O. <Der...@ro...> - 2003-03-30 21:19:53

Somik,

The capability to setEncoding() is already there, so API users who know 
the character set can set it before parsing (caveat: there is no test 
case for this, so it may not work).  This can happen before or after the 
connection is opened, but in the latter case will cause an input stream 
reset. In the former case the setting will be overwritten by the 
incoming HTTP and HTML header values if they are there and differ from 
what's set. One possible enhancement would be to not allow the headers 
to override the character set if it's been set via the API, which 
assumes the user knows what they are doing. It's most likely that users 
don't know the character set though, either from the HTTP or HTML 
header, so automatic handling by default is best.  
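
The precedence question above can be sketched as follows. This is an illustration under stated assumptions, not the parser's code: CharsetChoiceSketch, charsetOf and choose are hypothetical names, the apiWins flag models the proposed enhancement, and preferring the HTTP header over the HTML META value is an assumption (the thread does not say which of the two wins).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetChoiceSketch {
    // Matches charset=xyz, charset="xyz" or charset='xyz' inside a Content-Type value.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"'>]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset parameter of a Content-Type value, or null if absent. */
    static String charsetOf(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /**
     * Picks the character set to decode with.
     * apiCharset      - value set by the API user, or null
     * apiWins         - true models the proposed "headers may not override" behaviour
     * httpContentType - HTTP Content-Type header value, or null
     * metaContentType - content attribute of the HTML META tag, or null
     * fallback        - default when nothing else is known
     */
    static String choose(String apiCharset, boolean apiWins,
                         String httpContentType, String metaContentType, String fallback) {
        if (apiWins && apiCharset != null) return apiCharset;
        String fromHttp = charsetOf(httpContentType);   // assumption: HTTP beats META
        if (fromHttp != null) return fromHttp;
        String fromMeta = charsetOf(metaContentType);
        if (fromMeta != null) return fromMeta;
        return apiCharset != null ? apiCharset : fallback;
    }

    public static void main(String[] args) {
        System.out.println(choose("Shift_JIS", false,
            "text/html; charset=ISO-8859-1", "text/html; charset=UTF-8", "ISO-8859-1"));
        System.out.println(choose("Shift_JIS", true,
            "text/html; charset=ISO-8859-1", "text/html; charset=UTF-8", "ISO-8859-1"));
    }
}
```

With apiWins false the headers override the API value whenever they are present, which matches the current behaviour described above; with apiWins true the user's setting is final.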

I noted that four tests are failing after last week's integration:
    testScriptCodeExtraction
    testScriptCodeExtractionWithMultipleQuotes
    testStringBeanListener
    testThreadSafety

I'm not sure about the others, but the bean listener test is failing 
because the handling of tables has changed.  The test is to show that 
extracted text contains the link URLs when the links property is set to 
true. It would now have to dig into the table tags to find the link. 
 I'm looking at collectInto() but can't see how to collect string and 
link tags so the links can be inserted in context into the text (where 
they are found).  I'm also wrestling with the issue of handling 
<pre></pre>, since collectInto() doesn't seem to be able to give that 
kind of information. I guess collectInto() is too blunt a tool.

Derrick

[Htmlparser-developer] character encoding (for Derrick)

From: Somik R. <so...@ya...> - 2003-03-30 06:25:29

Hi Derrick,
    Continuing our earlier discussion, I've had an idea - instead of
re-establishing an input stream, suppose we assume that the parser can
be initialized with a character set - and we use that.
    We could have both strategies in there.

    By the way, quite a few steps are failing - I'm guessing that you're
actively working on those - let me know if there are any issues if I make an
integration release this week (in case you don't finish).

Regards,
Somik

Re: [Htmlparser-developer] RE: [Htmlparser-user] Integration Release 1.3-20030323 is out

From: Somik R. <so...@ya...> - 2003-03-27 06:30:52

Hi Marc,
    I will look into it soon - I am in a conference right now, but should be
on it this weekend. Meanwhile you are free to analyze it, and tell me
anything that you find.

Regards,
Somik

[Htmlparser-developer] RE: [Htmlparser-user] Integration Release 1.3-20030323 is out

From: Marc N. <ma...@ke...> - 2003-03-25 01:43:15

By the way, I've entered the OOM exception as a bug (#709152), along with a
simple program that reproduces it.

Marc

-----Original Message-----
From: Marc Novakowski
Sent: Monday, March 24, 2003 3:23 PM
To: htm...@li...
Subject: RE: [Htmlparser-user] Integration Release 1.3-20030323 is out

Somik,

Thanks for fixing 702614!  Unfortunately I can't seem to get the latest
build to work.  It's throwing an OOM exception in my own code when using the
NodeIterator returned by parser.elements().  I'm looking into this to make
sure I'm not doing something stupid in my code.  However, the library seems
to be acting differently than previous releases even out-of-the-box.  For
example, the following used to return a list of the links on Yahoo (in the
0302 release):

java -jar ./htmlparser.jar http://www.yahoo.com -l

In the 0323 release, however, it returns nothing.

Marc

[Htmlparser-developer] Integration Release 1.3-20030323 is out

From: Somik R. <so...@ya...> - 2003-03-24 01:22:12

Hi Folks,
    This week's integration release has two important fixes :

Integration build 1.3 - 20030323
--------------------------------
[1] Fixed bug 702547 - single quotes parsed more robustly now
[2] Fixed bug 702614 - empty tags handled correctly now. Tag now has a
method isEmptyXmlTag().

#2 refers to tags like <tag/>.
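
For readers unfamiliar with the term, the kind of check involved might look like the sketch below. It is a guess at the idea, not the actual implementation of isEmptyXmlTag(): looksLikeEmptyXmlTag is a hypothetical helper that works on raw tag text.

```java
final class EmptyTagSketch {
    /**
     * True when the raw tag text is an opening tag that closes itself,
     * e.g. "<tag/>" or "<img src='a.png' />". End tags such as "</tag>" are excluded.
     */
    static boolean looksLikeEmptyXmlTag(String rawTag) {
        String t = rawTag.trim();
        return t.length() > 3          // shortest possible form is "<a/>"
            && t.startsWith("<")
            && !t.startsWith("</")
            && t.endsWith("/>");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmptyXmlTag("<tag/>"));  // self-closing
        System.out.println(looksLikeEmptyXmlTag("<tag>"));   // ordinary open tag
    }
}
```

One known ambiguity of a purely textual check: an unquoted attribute value that ends in a slash, such as <a href=http://example.com/>, also ends with "/>" and would be misread as an empty tag.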

Thanks to Joe Robbins for a fine bug report that helped in putting in the
fix for #1 faster. Thanks also to Marc Novakowski for the other report.

Thanks are also due to Huang-Chun Yu for uncovering a serious bug with the
script scanning mechanism. The parser can currently handle script tags like:

<script>
<!--
    code here
-->
</script>

But when the tags are like:
<script>
    code here
</script>

the parser is unable to identify the code and treats it like regular tags.
Such pages are quite widespread and ought to be supported. I was curious if
anyone has ideas on solving this - given the existing design - fresh ideas
often lead to a better perspective. If you have some ideas, feel free to
join the developer list
(http://lists.sourceforge.net/lists/listinfo/htmlparser-developer) and post.
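
One possible direction, sketched outside the existing design: once a script start tag is seen, stop tokenizing and take everything up to the next </script as raw text, whether or not it is wrapped in a comment. ScriptScanSketch, indexOfIgnoreCase and scriptBody are hypothetical names, and the sketch works on a plain string rather than on the parser's reader.

```java
final class ScriptScanSketch {
    /** Case-insensitive indexOf that never changes string lengths. */
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    /**
     * Returns the raw text between the first script start tag and its end tag,
     * or null when the page has no script tag. Nothing inside the body is
     * interpreted as a tag, so code like "if (a<b)" survives intact.
     */
    static String scriptBody(String html) {
        int open = indexOfIgnoreCase(html, "<script", 0);
        if (open < 0) return null;
        int endOfOpenTag = html.indexOf('>', open);
        if (endOfOpenTag < 0) return null;
        int bodyStart = endOfOpenTag + 1;
        int close = indexOfIgnoreCase(html, "</script", bodyStart);
        if (close < 0) close = html.length(); // unterminated script: take the rest
        return html.substring(bodyStart, close);
    }

    public static void main(String[] args) {
        System.out.println(scriptBody("<script>if (a<b) x();</script><p>text</p>"));
    }
}
```

Limitations of the sketch: a '>' inside an attribute value of the script tag would end the start tag too early, and "<script" also matches longer tag names that begin with the same letters.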

Regards,
Somik

[Htmlparser-developer] Future Directions (Derrick Oswald, Dave Knipp, James Crowley)

From: Somik R. <so...@ya...> - 2003-03-16 21:50:44

Hi Folks,
    Thanks are due to Derrick Oswald and Dhaval Udani for their support on
the mailing lists, and to Josh Kerievsky for having shown the
refactoring direction which is making a world of difference to the
project.

    Derrick -> It will be really nice to have some docs about your
contribution - could you add a section to the Wiki?
    Also, one test seems to be failing in testStringBeanListener(). I
couldn't figure it out, so I was wondering if you could look into it?

    Dave Knipp & others -> I have checked in a module called
WikiCapturer. This project uses the parser and converts a standard wiki
to static html. If you're interested, you could take over this module
and make it a product in its own right - to handle php and modwiki (or
any other). It would be a useful thing to have - perhaps with a GUI.

    James Crowley -> Thanks for the offer of the J# version and C#
version. We can make a release of the former as soon as you are ready.
You could take over the J# section of the htmlparser project.

Regards,
Somik
(PS: James, Dave, I am not sure if you folks are on the developer
mailing list, let me know if you are, and I won't cc you explicitly)

[Htmlparser-developer] Major Milestone: Integration Release 1.3-20030316 is out

From: Somik R. <so...@ya...> - 2003-03-16 21:36:46

Hi Folks,
    This is a major milestone release. A massive refactoring has been
completed (took two weeks) - which has brought all the robust error handling
cases into CompositeTagScanner. This means, all tags that have children will
be able to do error correction uniformly. Form tag (and table tags too)
should be robust.
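
To illustrate what uniform error correction for tags with children can mean, here is a small sketch. It is not the CompositeTagScanner code and the real heuristics are not described in this mail: CompositeScanSketch and its methods are hypothetical, tokens are plain strings, and the only two recovery rules assumed are "end of input closes the tag" and "a second opening tag of the same name closes the first".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CompositeScanSketch {
    /** Outcome of scanning one composite tag. */
    static final class Scan {
        final List<String> children = new ArrayList<String>();
        int next;           // index of the first token not consumed
        boolean corrected;  // true when the end tag had to be assumed
    }

    /** True for "<name>", "<name ...>" and "<name/>", ignoring case. */
    static boolean isOpenTag(String token, String tagName) {
        String prefix = "<" + tagName;
        if (!token.regionMatches(true, 0, prefix, 0, prefix.length())) return false;
        if (token.length() == prefix.length()) return false;
        char c = token.charAt(prefix.length());
        return c == '>' || c == '/' || Character.isWhitespace(c);
    }

    /** Collects children of the tag opened at openIndex, repairing a missing end tag. */
    static Scan scan(List<String> tokens, int openIndex, String tagName) {
        String endTag = "</" + tagName + ">";
        Scan result = new Scan();
        int i = openIndex + 1;
        for (; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (token.equalsIgnoreCase(endTag)) {   // well-formed case
                result.next = i + 1;
                return result;
            }
            if (isOpenTag(token, tagName)) break;   // same tag reopened: assume end tag was forgotten
            result.children.add(token);
        }
        result.corrected = true;                    // reached here via reopen or end of input
        result.next = i;
        return result;
    }

    /** Convenience wrapper; tokens[0] is expected to be the opening tag. */
    static String describe(String tagName, String... tokens) {
        Scan s = scan(Arrays.asList(tokens), 0, tagName);
        return "children=" + s.children + " corrected=" + s.corrected + " next=" + s.next;
    }

    public static void main(String[] args) {
        System.out.println(describe("form", "<form>", "<input>", "<form action='x'>", "text", "</form>"));
    }
}
```

Keeping the recovery in one shared scan routine is what lets every tag with children (links, forms, tables) benefit from the same rules instead of each scanner carrying its own copy.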

    Table tags are not yet in the standard set of scanners (you still need
to add them manually). They should make the cut next week.
    We have a new method - registerDomScanners() in Parser - that allows you
to build html dom objects.

    Interesting fact, as a result of the refactorings, the LOC of the
scanners package has reduced from 1553 to 1355 (I was surprised at the
digits).

    Documentation has been updated - we've started putting up answers by our
list members to common questions. Please feel free to update the Wiki and
improve it. No login is required.

    From the change log:

Integration build 1.3 - 20030316
--------------------------------
[1] Added method finishedParsing() to NodeVisitor
[2] LinkScanner uses CompositeTagScanner.scan()
[3] BulletScanner added
[4] FormScanner uses CompositeTagScanner.scan()
[5] AppletScanner uses CompositeTagScanner.scan()

    We highly recommend an upgrade to this version.

Regards,
Somik

Re: [Htmlparser-developer] Webase test and Form tag scanner?

From: Mr L. MA <law...@ya...> - 2003-03-09 23:08:54

If you have an FTP site, I can upload exception pages
to it daily.

Ling Ma

[Htmlparser-developer] javascript tag parsing error and string indexout of bound error

From: Mr L. MA <law...@ya...> - 2003-03-09 23:07:32

Attachments: markets_morning_call.html markets_europe_ftse100.html

Can someone look at what goes wrong while parsing these two HTML
pages?

The parser throws exceptions.

Ling Ma

Re: [Htmlparser-developer] HTMLParser License requirements in a commercial app

From: Somik R. <so...@ya...> - 2003-03-08 05:12:12

Hi Richard,

> Could someone clarify the licensing situation / fulfilment requirements
> of HTMLParser with regard to its inclusion as part of an otherwise
> closed-source commercial app.

Thanks for bringing up this question. The parser is licensed under LGPL.
This means applications that USE it don't have to be open-source. But here
are two restrictions that apply:
[1] Any modifications made to the library itself must be kept open-source or
made available.
[2] Your app source code does not live with the parser source code, but the
object code does. That means people should either be able to reverse
engineer your product so as to be able to remove the parser library and put
a newer version in (gasp!), or you simply provide an external linkage to the
parser, whereby folks can swap out the current version with a later version
(the idea is to let them have the benefit of the open-source library). That
reverse engineering stuff is actually a cryptic interpretation of the
clause - applicable only if you want to provide a single executable in your
application (it can be bypassed, but I don't want to further complicate the
interpretation for you - let me know if this is the case and I can advise
you accordingly).

By the way, if you are not distributing your application, and only using it
internally, none of the above applies. Let me know if that answers your
question.

Regards,
Somik
********************************************
  Somik Raha
  Extreme Programmer and Coach
  Industrial Logic, Inc.
  so...@in...
  http://industriallogic.com
  Voice : 510-540-8336
  Fax   : 510-540-8936
********************************************
Periodic reassessment means looking at things which are taken for granted,
things which seem beyond doubt.
Periodic reassessment means challenging all assumptions. It is not a matter
of reassessing something because there is a need to reassess it;
there may be no need at all. It is a matter of reassessing something simply
because it is there and has not been assessed for a long time.
It is a deliberate and quite unjustified attempt to look at things in a new
way.

--- Edward De Bono in Lateral Thinking, Chapter 5, The Use of Lateral
Thinking

[Htmlparser-developer] HTMLParser License requirements in a commercial app

From: Richard W. <ri...@ri...> - 2003-03-07 10:18:17

Hi,

Could someone clarify the licensing situation / fulfilment requirements
of HTMLParser with regard to its inclusion as part of an otherwise
closed-source commercial app.

Richard.

Re: [Htmlparser-developer] Webase test and Form tag scanner?

From: Somik R. <so...@ya...> - 2003-03-07 03:11:28

> One problem I had with FormTag.toString() method is
> that form tag should be treated as body tag since any
> other tags could be nested in it.
>
> The ultimate htmlparser test would be webase
> collection from stanford.

What you could really do to speed up our testing is to provide us with URLs
that cause breaks - and keep filing lots of bug reports. That would be a
great help.

> Is there a way even with readelements=null I can still
> get the rest nodes?

This usually means the parser has reached the end of the page without
finding a matching end tag. It is usually a fatal error. But this week, I am
planning to improve robustness - systemwide. It would be good to have some
nice bug reports before I start, though.

Regards,
Somik

[Htmlparser-developer] Webase test and Form tag scanner?

From: Mr L. MA <law...@ya...> - 2003-03-06 17:31:08

One problem I had with the FormTag.toString() method is
that the form tag should be treated like a body tag, since any
other tags could be nested in it.

The ultimate htmlparser test would be the webase
collection from Stanford.

What I did was download a website with an offline
browser (such as webstripper).

Running StringExtractor on the local collection gives
many ParserExceptions.

Sometimes with JTidy I can get lucky on some pages
before applying HTMLParser, sometimes not.

My focus is to use HTMLParser for text extraction, so
I ran into "dirty" pages on which HTMLParser gives errors.

Is there a way, even with readelements=null, that I can still
get the rest of the nodes?

Ling Ma

Re: [Htmlparser-developer] FW: Open Source Research

From: Somik R. <so...@ya...> - 2003-03-06 15:24:56

I got it some time back and did fill in the form. Can't say if it's
authentic...

Regards,
Somik

----- Original Message -----
From: <dha...@or...>
To: <htm...@li...>
Sent: Thursday, March 06, 2003 4:34 AM
Subject: [Htmlparser-developer] FW: Open Source Research


> Has anyone else got a mail like this? Is it authentic?
>
> Regards,
>
> Dhaval Udani
> Senior Analyst
> M-Line, QPEG
> OrbiTech Solutions Ltd.
> +91-22-28290019 Extn. 1457
>
>
>
> -----Original Message-----
> From: cantamessa [mailto:can...@us...]
> Sent: Monday, February 17, 2003 10:42 PM
> To: dhavaludani
> Cc: cantamessa
> Subject: Open Source Research
>
>
> Dear Sourceforge developer,
>
> The Department of  Manufacturing and Economics of the
> Politecnico di Torino, Italy, is running a research project on
> Open Source Software. Within the project we aim to identify
> key success factors in the management of open source
> projects. The project HTML Parser  you are cooperating with
> has been selected from www.sourceforge.net to be part of a
> sample of 100 successful projects to be analyzed, and we
> therefore kindly ask you to give us a few minutes of your time
> in order to fill in the attached questionnaire
> https://lepshare.aigest.it/quest/encuesta.asp
>
> We ensure that the data thus gathered will be kept with the
> utmost confidentiality, will be analyzed with statistical
> techniques and results will be presented only in aggregate
> form. If you wish, we will be happy to send you a copy of the
> report with the results of our project.
>
> For the sake of security we will ask you to fill in a secret
> code, yours is 3343
>
> For further information, feel free to contact us at our e-mail
> address os...@le... .
> Please, respond in the next few days.
> I thank you for your help and remain
>
> Yours Sincerely
>
>
> Prof. Ing. Marco Cantamessa
> mar...@po...
> Dipartimento di Sistemi di Produzione ed Economia
> dell'Azienda
> Politecnico di Torino
> Corso Duca degli Abruzzi 24 - I 10129 Torino (Italy)
> tel. +39-0115647223, fax +39-0115647299
>
>
>


Re: [Htmlparser-developer] Form tag should not be composite tag?

From: Somik R. <so...@ya...> - 2003-03-06 15:05:15

Thanks very much for the sample page. My to do list for this week :
[1] Refactor correction logic in the link scanner to the composite scanner,
so that it becomes available for all composite tags. That will solve the
problem you mention.

[2] Work on Dhaval's suggestion - I have some ideas about switching off
testcases that require the internet.
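
One common way to switch off internet-dependent test cases is a system property checked at the start of each such test. The sketch below only shows that pattern and is not what was implemented in the project: OnlineTestSwitchSketch, runIfOnline and the property name sketch.tests.online are all hypothetical.

```java
final class OnlineTestSwitchSketch {
    // Hypothetical property name; pass -Dsketch.tests.online=true to enable online tests.
    static final String PROPERTY = "sketch.tests.online";

    /** Online tests are off unless the property is explicitly set to "true". */
    static boolean onlineTestsEnabled() {
        return Boolean.parseBoolean(System.getProperty(PROPERTY, "false"));
    }

    /** Runs the body only when online tests are enabled and reports what happened. */
    static String runIfOnline(String testName, Runnable body) {
        if (!onlineTestsEnabled()) return "SKIPPED " + testName;
        body.run();
        return "RAN " + testName;
    }

    public static void main(String[] args) {
        String outcome = runIfOnline("testLiveLinks", () -> {
            // A real test would open a live URL here.
        });
        System.out.println(outcome);
    }
}
```

Defaulting to "off" means a build on a machine without network access passes without anyone having to remember a flag.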

Regards,
Somik

[Htmlparser-developer] FW: Open Source Research

From: <dha...@or...> - 2003-03-06 12:34:50

To: dha...@us...
Subject: Open Source Research
From: Marco Cantamessa <can...@us...>
Message-Id: <E18...@sc...>
Date: Mon, 17 Feb 2003 09:12:21 -0800

[Htmlparser-developer] Form tag should not be composite tag?

From: Mr L. MA <law...@ya...> - 2003-03-06 06:34:42

Hi all:
Do you guys think form tag should not be composite
tag?
or else it cannot process page like:

http://money.cnn.com/services/glossary/a.html

which misses one form end tag.

Ling Ma


Re: [Htmlparser-developer] Previous integration releases

From: Somik R. <so...@ya...> - 2003-03-04 14:45:13

Let me know the version, and I'll make it available for you.

Regards,
Somik
----- Original Message ----- 
From: <dha...@or...>
To: <htm...@li...>
Sent: Tuesday, March 04, 2003 3:36 AM
Subject: [Htmlparser-developer] Previous integration releases


> Hi,
> 
> My product is using a very old version of HTMLParser. I am not allowed
> to distribute its jar file, hence I ask people to come to the website and
> download the appropriate version, which in my case is a particular
> integration build of 1.2. However, the problem is that I can't find it.
> Can someone tell me how to locate a particular integration build
> release?
> 
> Regards,
> 
> Dhaval Udani
> Senior Analyst
> M-Line, QPEG
> OrbiTech Solutions Ltd.
> +91-22-28290019 Extn. 1457
> 
> 
>

[Htmlparser-developer] Previous integration releases

From: <dha...@or...> - 2003-03-04 11:41:04

Attachments: BDY.RTF

Hi,

My product is using a very old version of HTMLParser. I am not allowed
to distribute its jar file, hence I ask people to come to the website and
download the appropriate version, which in my case is a particular
integration build of 1.2. However, the problem is that I can't find it.
Can someone tell me how to locate a particular integration build
release?

Regards,

Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-28290019 Extn. 1457

[Htmlparser-developer] Integration Release 1.3-20030302 is out

From: Somik R. <so...@ya...> - 2003-03-03 03:52:39

Hi Folks,
    In this week's release, the change log is :

Integration build 1.3 - 20030302
--------------------------------
[1] Fixed bug in LinkScanner
[2] Cleaned up StringNode interface
[3] Cleaned up RemarkNode interface
[4] Refactored Parser, created ParserHelper

Regards,
Somik

[Htmlparser-developer] Re: [Htmlparser-user] Node.collectInto()

From: Somik R. <so...@ya...> - 2003-03-03 00:09:27

Joe Lin wrote:
> Another question regarding the collectInto(NodeList
> collectionList, java.lang.String filter) method: I
> could not seem to find the filter constants for
> the different Node types. Can anyone point me to where
> these are?

After moving to the class parameters, this method has become redundant.
We're planning to take it out. You're better off using the other techniques
(the other collectInto or TagFindingVisitor).

> BTW, I think HTMLParser is great software. I have
> been looking for a Java HTML parser high and low.
> HTMLParser has the best architecture and user API
> I have seen. I especially like that it is in a sense a
> streaming parser. This means performance and optimal
> memory usage for me.

Thanks for the kind words. We've got a diverse and talented set of people
who've been making contributions over a period of time. Kind words always
help inspire us to serve the community better.

Regards,
Somik

[Htmlparser-developer] Re: [Htmlparser-user] Integration Release 1.3-20030223 is out (API changes)

From: Somik R. <so...@ya...> - 2003-02-24 18:12:00

I was trying to integrate the changes of the latest
parser with some existing projects at work - and of
course, I had to modify the code to use the new API.

I had some suggestions - as I know many of you will be
facing the same issue. I use Eclipse, and I hope most
of you use a decent IDE that supports refactoring. Get
the parser into your IDE, and let all your other
project code refer to it (that's how it is set up in my
IDE). Then, rename Parser to HTMLParser using your
refactoring tool. Rename it back to Parser, and all
your existing code will automatically get fixed. Do
this for some other classes like HTMLNode/Node, etc.,
and within minutes it should be done.
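
For anyone without a refactoring IDE, the same renames can be approximated with a whole-word search and replace over the source text. The following is only a minimal sketch, not part of the parser distribution: the class name ApiRenamer and its rename() method are hypothetical, and the name table contains only the renames mentioned in this thread (the full list for the release may be longer).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not shipped with the parser.
public class ApiRenamer {
    // Old-to-new class names as mentioned in this thread; extend as needed.
    private static final Map<String, String> RENAMES = new LinkedHashMap<>();
    static {
        RENAMES.put("HTMLEnumeration", "NodeIterator");
        RENAMES.put("SimpleEnumeration", "SimpleNodeIterator");
        RENAMES.put("HTMLVisitor", "NodeVisitor");
        RENAMES.put("HTMLParser", "Parser");
        RENAMES.put("HTMLNode", "Node");
    }

    // Replaces each old name where it appears as a whole word.
    public static String rename(String source) {
        String result = source;
        for (Map.Entry<String, String> entry : RENAMES.entrySet()) {
            // \b (word boundary) leaves longer identifiers such as
            // MyHTMLParserWrapper untouched.
            String wholeWord = "\\b" + Pattern.quote(entry.getKey()) + "\\b";
            result = result.replaceAll(wholeWord,
                    Matcher.quoteReplacement(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints: Parser p; NodeIterator e;
        System.out.println(rename("HTMLParser p; HTMLEnumeration e;"));
    }
}
```

A textual replace like this also rewrites matches inside comments and string literals and knows nothing about imports or packages, so the IDE refactoring described above remains the safer route where it is available.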

Regards,
Somik

--- Somik Raha <so...@ya...> wrote:
> Hi Folks,
>     This week's release is out. I've finally taken
> heed of all the feedback
> I had been receiving about the terrible naming
> convention, and have removed
> "HTML" from all class names. In addition,
> HTMLEnumeration is now
> NodeIterator and SimpleEnumeration is
> SimpleNodeIterator. HTMLParser is just
> Parser.
> 
>     This is a big step, so to make it easy for
> everyone, there have been no
> major bug fixes that will require you to upgrade
> right away. I apologize in
> advance for inconvenience caused - I hope you don't
> curse me too much for
> having to modify your programs. I had the option of
> doing it in stages, and
> forcing you to modify some small thing in every
> release, or get it over with
> in one sweep. I chose the latter because there were too
> many changes and
> suffering over a long period of time didn't make
> sense. Hopefully, once you
> have migrated to the new names, you will appreciate
> not having to type
> "HTML" each time.
> 
>     The BodyScanner contributed by Dhaval Udani is
> finally in (Dhaval -
> sorry for the delay).
>     The interesting part is that the documentation
> accompanying the package
> is now the latest one on the site - it has been
> ripped off a Php Wiki. I am
> thinking that the ripping program might be useful
> for those who wish to
> provide wiki content as offline documentation (any
> feedback on this is
> welcome).
> 
>     From the change log :
> Integration build 1.3 - 20030223
> --------------------------------
> [1] Modification of documentation packaging
> - the new documentation is actually produced
> by a tiny program that converts wiki pages
> into documentation (works with PhpWiki)
> [2] Inclusion of BodyScanner, BodyTag
> [3] HTMLVisitor is now NodeVisitor - and has an
> extra param to
> visit itself
> [4] HTMLParser is now Parser. No class has HTML
> prefix anymore.
> [5] HTMLEnumeration is now NodeIterator,
> SimpleEnumeration is
> SimpleNodeIterator
> 
> Regards,
> Somik
> 
> 
> 
>

[Htmlparser-developer] Integration Release 1.3-20030223 is out (API changes)

From: Somik R. <so...@ya...> - 2003-02-24 06:15:43

Hi Folks,
    This week's release is out. I've finally taken heed of all the feedback
I had been receiving about the terrible naming convention, and have removed
"HTML" from all class names. In addition, HTMLEnumeration is now
NodeIterator and SimpleEnumeration is SimpleNodeIterator. HTMLParser is just
Parser.

    This is a big step, so to make it easy for everyone, there have been no
major bug fixes that will require you to upgrade right away. I apologize in
advance for inconvenience caused - I hope you don't curse me too much for
having to modify your programs. I had the option of doing it in stages, and
forcing you to modify some small thing in every release, or get it over with
in one sweep. I chose the latter because there were too many changes and
suffering over a long period of time didn't make sense. Hopefully, once you
have migrated to the new names, you will appreciate not having to type
"HTML" each time.

    The BodyScanner contributed by Dhaval Udani is finally in (Dhaval -
sorry for the delay).
    The interesting part is that the documentation accompanying the package
is now the latest one on the site - it has been ripped off a Php Wiki. I am
thinking that the ripping program might be useful for those who wish to
provide wiki content as offline documentation (any feedback on this is
welcome).

    From the change log :
Integration build 1.3 - 20030223
--------------------------------
[1] Modification of documentation packaging
- the new documentation is actually produced
by a tiny program that converts wiki pages
into documentation (works with PhpWiki)
[2] Inclusion of BodyScanner, BodyTag
[3] HTMLVisitor is now NodeVisitor - and has an extra param to
visit itself
[4] HTMLParser is now Parser. No class has HTML prefix anymore.
[5] HTMLEnumeration is now NodeIterator, SimpleEnumeration is
SimpleNodeIterator

Regards,
Somik

Re: [Htmlparser-developer] Extract links...

From: Derrick O. <Der...@ro...> - 2003-02-16 17:29:40

JJ, Somik,

I looked at it briefly, and saw that the fetch is returning 403 - access 
prohibited.
In the past, when I've experienced this, there is usually some header 
field on the connection that needs to be set, like Accept-Language, 
Referer or User-Agent.
I don't think this can be solved in a general way.  I believe it needs 
to be specified differently for different servers and queries.
See testPost() in HTMLParserTest.java for how to set header fields on 
the connection.
Some experimentation will be required.
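
As a rough illustration of setting such header fields, here is a minimal sketch using only the standard java.net classes. It is not the code from testPost(); the class name HeaderFetch, the open() helper and the header values are placeholders assumed for the example, and how the prepared connection is handed to the parser is not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not part of the parser.
public class HeaderFetch {
    // Prepares (but does not yet send) a request with browser-like headers.
    public static HttpURLConnection open(String address) {
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(address).toURL().openConnection();
            // Placeholder values; which fields a given server insists on
            // has to be found by experiment.
            connection.setRequestProperty("User-Agent",
                    "Mozilla/5.0 (compatible; ExampleFetcher/1.0)");
            connection.setRequestProperty("Accept-Language", "en");
            connection.setRequestProperty("Referer", "http://www.example.com/");
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        String address = args.length > 0 ? args[0] : "http://www.example.com/";
        HttpURLConnection connection = open(address);
        // The request is actually sent here; a 403 suggests the server
        // still rejects the header combination.
        System.out.println(connection.getResponseCode());
        connection.disconnect();
    }
}
```

Header fields must be set before the connection is used; once the response code or input stream has been requested, setRequestProperty() throws IllegalStateException.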

Derrick

Somik Raha wrote:

>>Question: How can I extract the links from a page such as:
>>
>>http://www.google.com/search?q=universe
>
>Don't know why this is happening. Derrick?
>
>Regards,
>Somik

14 messages has been excluded from this view by a project administrator.
