htmlparser-cvs Mailing List for HTML Parser (Page 20)
Brought to you by:
derrickoswald
From: Derrick O. <der...@us...> - 2004-03-18 04:13:47
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24099/src/org/htmlparser

Modified Files:
    Parser.java
Log Message:
Deprecate LinkProcessor. Functionality moved to Page.

Index: Parser.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/Parser.java,v
retrieving revision 1.89
retrieving revision 1.90
diff -C2 -d -r1.89 -r1.90
*** Parser.java 14 Mar 2004 16:31:40 -0000 1.89
--- Parser.java 18 Mar 2004 04:04:07 -0000 1.90
***************
*** 44,48 ****
  import org.htmlparser.util.DefaultParserFeedback;
  import org.htmlparser.util.IteratorImpl;
- import org.htmlparser.util.LinkProcessor;
  import org.htmlparser.util.NodeIterator;
  import org.htmlparser.util.NodeList;
--- 44,47 ----
***************
*** 607,610 ****
--- 606,641 ----

      /**
+      * Turn spaces into %20.
+      * @param url The url containing spaces.
+      * @return The URL with spaces as %20 sequences.
+      */
+     public static String fixSpaces (String url)
+     {
+         int index;
+         int length;
+         char ch;
+         StringBuffer returnURL;
+
+         index = url.indexOf (' ');
+         if (-1 != index)
+         {
+             length = url.length ();
+             returnURL = new StringBuffer (length * 3);
+             returnURL.append (url.substring (0, index));
+             for (int i = index; i < length; i++)
+             {
+                 ch = url.charAt (i);
+                 if (ch==' ')
+                     returnURL.append ("%20");
+                 else
+                     returnURL.append (ch);
+             }
+             url = returnURL.toString ();
+         }
+
+         return (url);
+     }
+
+     /**
       * Opens a connection based on a given string.
       * The string is either a file, in which case <code>file://localhost</code>
***************
*** 628,632 ****
          try
          {
!             url = new URL (LinkProcessor.fixSpaces (string));
              ret = openConnection (url, feedback);
          }
--- 659,663 ----
          try
          {
!             url = new URL (fixSpaces (string));
              ret = openConnection (url, feedback);
          }
***************
*** 642,646 ****
                  buffer.append ("/");
                  buffer.append (resource);
!                 url = new URL (LinkProcessor.fixSpaces (buffer.toString ()));
                  ret = openConnection (url, feedback);
                  if (null != feedback)
--- 673,677 ----
                  buffer.append ("/");
                  buffer.append (resource);
!                 url = new URL (fixSpaces (buffer.toString ()));
                  ret = openConnection (url, feedback);
                  if (null != feedback)
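The behaviour of the fixSpaces() method added to Parser.java in this commit can be tried in isolation. What follows is a minimal standalone sketch, not the committed code: the class name FixSpacesSketch is hypothetical, and StringBuilder is swapped in for the StringBuffer used in the diff. Only literal space characters are rewritten; no other URL escaping is attempted.

```java
// Hypothetical standalone class; only the method body mirrors the diff above.
final class FixSpacesSketch {

    /** Replace every space in the URL with the escape sequence %20. */
    static String fixSpaces(String url) {
        int index = url.indexOf(' ');
        if (index == -1) {
            return url; // nothing to do, hand back the original string
        }
        // Worst case every remaining character is a space and grows to three characters.
        StringBuilder fixed = new StringBuilder(url.length() * 3);
        fixed.append(url, 0, index); // copy the space-free prefix unchanged
        for (int i = index; i < url.length(); i++) {
            char ch = url.charAt(i);
            if (ch == ' ') {
                fixed.append("%20");
            } else {
                fixed.append(ch);
            }
        }
        return fixed.toString();
    }

    public static void main(String[] args) {
        // Same URL as the testFixSpaces case in the ParserTest.java commit.
        // prints http://htmlparser.sourceforge.net/test/This%20is%20a%20Test%20Page.html
        System.out.println(fixSpaces(
            "http://htmlparser.sourceforge.net/test/This is a Test Page.html"));
    }
}
```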
From: Derrick O. <der...@us...> - 2004-03-18 04:13:47
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24099/src/org/htmlparser/tests

Modified Files:
    ParserTest.java
Log Message:
Deprecate LinkProcessor. Functionality moved to Page.

Index: ParserTest.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/ParserTest.java,v
retrieving revision 1.57
retrieving revision 1.58
diff -C2 -d -r1.57 -r1.58
*** ParserTest.java 29 Feb 2004 12:52:21 -0000 1.57
--- ParserTest.java 18 Mar 2004 04:04:08 -0000 1.58
***************
*** 956,958 ****
--- 956,965 ----
          assertTrue ("toString wrong", rem.toString ().endsWith (newtext));
      }
+
+     public void testFixSpaces () throws ParserException
+     {
+         String url = "http://htmlparser.sourceforge.net/test/This is a Test Page.html";
+         parser = new Parser (url);
+         assertEquals("Expected","http://htmlparser.sourceforge.net/test/This%20is%20a%20Test%20Page.html", parser.getURL ());
+     }
  }
From: Derrick O. <der...@us...> - 2004-03-18 04:13:47
Update of /cvsroot/htmlparser/htmlparser
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24099

Modified Files:
    build.xml
Log Message:
Deprecate LinkProcessor. Functionality moved to Page.

Index: build.xml
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/build.xml,v
retrieving revision 1.61
retrieving revision 1.62
diff -C2 -d -r1.61 -r1.62
*** build.xml 14 Mar 2004 20:31:38 -0000 1.61
--- build.xml 18 Mar 2004 04:04:07 -0000 1.62
***************
*** 235,239 ****
          <include name="org/htmlparser/util/SimpleNodeIterator.class"/>
          <include name="org/htmlparser/util/SpecialHashtable.class"/>
-         <include name="org/htmlparser/util/LinkProcessor.class"/>
          <include name="org/htmlparser/util/EncodingChangeException.class"/>
          <include name="org/htmlparser/util/sort/**/*.class"/>
--- 235,238 ----
From: Derrick O. <der...@us...> - 2004-03-18 04:13:47
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24099/src/org/htmlparser/tags

Modified Files:
    BaseHrefTag.java FormTag.java FrameTag.java ImageTag.java
    LinkTag.java
Log Message:
Deprecate LinkProcessor. Functionality moved to Page.

Index: ImageTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/ImageTag.java,v
retrieving revision 1.43
retrieving revision 1.44
diff -C2 -d -r1.43 -r1.44
*** ImageTag.java 29 Feb 2004 12:52:21 -0000 1.43
--- ImageTag.java 18 Mar 2004 04:04:08 -0000 1.44
***************
*** 126,130 ****
                          {
                              // missing equals sign
!                             ret = string.substring (3);
                              state = 0; // go back to searching for SRC
                              // because, maybe we found SRCXXX
--- 126,137 ----
                          {
                              // missing equals sign
!                             string = string.substring (3);
!                             // remove any double quotes from around string
!                             if (string.startsWith ("\"") && string.endsWith ("\"") && (1 < string.length ()))
!                                 string = string.substring (1, string.length () - 1);
!                             // remove any single quote from around string
!                             if (string.startsWith ("'") && string.endsWith ("'") && (1 < string.length ()))
!                                 string = string.substring (1, string.length () - 1);
!                             ret = string;
                              state = 0; // go back to searching for SRC
                              // because, maybe we found SRCXXX
***************
*** 178,182 ****
          if (null == imageURL)
              if (null != getPage ())
!                 imageURL = getPage ().getLinkProcessor ().extract (extractImageLocn (), getPage().getUrl ());
          return (imageURL);
      }
--- 185,190 ----
          if (null == imageURL)
              if (null != getPage ())
!                 imageURL = getPage ().getAbsoluteURL (extractImageLocn ());
!
          return (imageURL);
      }

Index: FrameTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/FrameTag.java,v
retrieving revision 1.35
retrieving revision 1.36
diff -C2 -d -r1.35 -r1.36
*** FrameTag.java 14 Jan 2004 02:53:46 -0000 1.35
--- FrameTag.java 18 Mar 2004 04:04:08 -0000 1.36
***************
*** 59,69 ****
      public String getFrameLocation ()
      {
!         String src;

!         src = getAttribute ("SRC");
!         if (null == src)
!             return "";
!         else
!             return (getPage ().getLinkProcessor ().extract (src, getPage ().getUrl ()));
      }

--- 59,71 ----
      public String getFrameLocation ()
      {
!         String ret;

!         ret = getAttribute ("SRC");
!         if (null == ret)
!             ret = "";
!         else if (null != getPage ())
!             ret = getPage ().getAbsoluteURL (ret);
!
!         return (ret);
      }


Index: LinkTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/LinkTag.java,v
retrieving revision 1.48
retrieving revision 1.49
diff -C2 -d -r1.48 -r1.49
*** LinkTag.java 29 Feb 2004 12:52:21 -0000 1.48
--- LinkTag.java 18 Mar 2004 04:04:08 -0000 1.49
***************
*** 318,332 ****
      /**
       * Extract the link from the HREF attribute.
!      * The URL of the actual html page is also provided.
       */
      public String extractLink ()
      {
!         String relativeLink = getAttribute ("HREF");
!         if (relativeLink!=null)
          {
!             relativeLink = ParserUtils.removeChars(relativeLink,'\n');
!             relativeLink = ParserUtils.removeChars(relativeLink,'\r');
          }
!         return (getPage ().getLinkProcessor ().extract (relativeLink, getPage ().getUrl ()));
      }
  }
--- 318,338 ----
      /**
       * Extract the link from the HREF attribute.
!      * @return The URL from the HREF attibute. This is absolute if the tag has
!      * a valid page.
       */
      public String extractLink ()
      {
!         String ret;
!
!         ret = getAttribute ("HREF");
!         if (null != ret)
          {
!             ret = ParserUtils.removeChars (ret,'\n');
!             ret = ParserUtils.removeChars (ret,'\r');
          }
!         if (null != getPage ())
!             ret = getPage ().getAbsoluteURL (ret);
!
!         return (ret);
      }
  }

Index: FormTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/FormTag.java,v
retrieving revision 1.47
retrieving revision 1.48
diff -C2 -d -r1.47 -r1.48
*** FormTag.java 24 Jan 2004 23:57:52 -0000 1.47
--- FormTag.java 18 Mar 2004 04:04:08 -0000 1.48
***************
*** 115,119 ****
          if (null == mFormLocation)
              // ... is it true that without an ACTION the default is to send it back to the same page?
!             mFormLocation = extractFormLocn (getPage ().getUrl ());

          return (mFormLocation);
--- 115,119 ----
          if (null == mFormLocation)
              // ... is it true that without an ACTION the default is to send it back to the same page?
!             mFormLocation = extractFormLocn ();

          return (mFormLocation);
***************
*** 215,227 ****
       * @param url URL of web page being parsed.
       */
!     public String extractFormLocn(String url)// throws ParserException
      {
!         String formURL;

!         formURL = getAttribute("ACTION");
!         if (null == formURL)
!             return "";
!         else
!             return (getPage ().getLinkProcessor ().extract (formURL, url));
      }
  }
--- 215,229 ----
       * @param url URL of web page being parsed.
       */
!     public String extractFormLocn ()
      {
!         String ret;

!         ret = getAttribute("ACTION");
!         if (null == ret)
!             ret = "";
!         else if (null != getPage ())
!             ret = getPage ().getAbsoluteURL (ret);
!
!         return (ret);
      }
  }

Index: BaseHrefTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/BaseHrefTag.java,v
retrieving revision 1.37
retrieving revision 1.38
diff -C2 -d -r1.37 -r1.38
*** BaseHrefTag.java 14 Jan 2004 02:53:46 -0000 1.37
--- BaseHrefTag.java 18 Mar 2004 04:04:08 -0000 1.38
***************
*** 89,93 ****
              if (null != page)
              {
!                 page.getLinkProcessor ().setBaseUrl (getBaseUrl ());
              }
          }
--- 89,93 ----
              if (null != page)
              {
!                 page.setBaseUrl (getBaseUrl ());
              }
          }
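The ImageTag.java hunk in this commit inlines two quote-stripping checks of the same shape as LinkProcessor.stripQuotes(). The snippet below is a small sketch of that check on its own, assuming a non-null input; the class name QuoteStripSketch is hypothetical and is not part of the commit.

```java
// Hypothetical helper class illustrating the quote-stripping checks from the ImageTag diff.
final class QuoteStripSketch {

    /**
     * Remove one pair of surrounding double quotes, then one pair of
     * surrounding single quotes, when the string is longer than one character.
     */
    static String stripQuotes(String string) {
        // The length guard keeps a lone quote character from being treated as a pair.
        if (string.startsWith("\"") && string.endsWith("\"") && 1 < string.length()) {
            string = string.substring(1, string.length() - 1);
        }
        if (string.startsWith("'") && string.endsWith("'") && 1 < string.length()) {
            string = string.substring(1, string.length() - 1);
        }
        return string;
    }

    public static void main(String[] args) {
        System.out.println(stripQuotes("\"images/logo.gif\"")); // prints images/logo.gif
        System.out.println(stripQuotes("'images/logo.gif'"));   // prints images/logo.gif
        System.out.println(stripQuotes("images/logo.gif"));     // prints images/logo.gif
    }
}
```

Because the two checks run in sequence, a value wrapped in double quotes around single quotes loses both layers, which matches the order of the statements in the diff.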
From: Derrick O. <der...@us...> - 2004-03-18 04:13:47
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24099/src/org/htmlparser/lexer Modified Files: Page.java Log Message: Deprecate LinkProcessor. Functionality moved to Page. Index: Page.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Page.java,v retrieving revision 1.33 retrieving revision 1.34 diff -C2 -d -r1.33 -r1.34 *** Page.java 31 Jan 2004 20:51:01 -0000 1.33 --- Page.java 18 Mar 2004 04:04:07 -0000 1.34 *************** *** 36,39 **** --- 36,40 ---- import java.lang.reflect.InvocationTargetException; import java.lang.reflect.Method; + import java.net.MalformedURLException; import java.net.URL; import java.net.URLConnection; *************** *** 41,45 **** import org.htmlparser.util.EncodingChangeException; - import org.htmlparser.util.LinkProcessor; import org.htmlparser.util.ParserException; --- 42,45 ---- *************** *** 75,78 **** --- 75,83 ---- /** + * The base URL for this page. + */ + protected String mBaseUrl; + + /** * The source of characters. */ *************** *** 90,99 **** /** - * The processor of relative links on this page. - * Holds any overridden base HREF. - */ - protected LinkProcessor mProcessor; - - /** * Messages for page not there (404). */ --- 95,98 ---- *************** *** 136,140 **** throw new IllegalArgumentException ("connection cannot be null"); setConnection (connection); ! mProcessor = null; } --- 135,139 ---- throw new IllegalArgumentException ("connection cannot be null"); setConnection (connection); ! mBaseUrl = null; } *************** *** 158,162 **** mConnection = null; mUrl = null; ! mProcessor = null; } --- 157,161 ---- mConnection = null; mUrl = null; ! mBaseUrl = null; } *************** *** 180,184 **** mConnection = null; mUrl = null; ! mProcessor = null; } --- 179,183 ---- mConnection = null; mUrl = null; ! 
mBaseUrl = null; } *************** *** 397,400 **** --- 396,417 ---- /** + * Gets the baseUrl. + * @return The base URL for this page, or <code>null</code> if not set. + */ + public String getBaseUrl () + { + return (mBaseUrl); + } + + /** + * Sets the baseUrl. + * @param url The base url for this page. + */ + public void setBaseUrl (String url) + { + mBaseUrl = url; + } + + /** * Get the source this page is reading from. */ *************** *** 720,741 **** /** ! * Get the link processor associated with this page. ! * @return The link processor that has the base HREF. */ ! public LinkProcessor getLinkProcessor () { ! if (null == mProcessor) ! mProcessor = new LinkProcessor (); ! ! return (mProcessor); } /** ! * Set the link processor associated with this page. ! * @param processor The new link processor for this page. */ ! public void setLinkProcessor (LinkProcessor processor) { ! mProcessor = processor; } --- 737,824 ---- /** ! * Build a URL from the link and base provided. ! * @param link The (relative) URI. ! * @param base The base URL of the page, either from the <BASE> tag ! * or, if none, the URL the page is being fetched from. ! * @return An absolute URL. */ ! public URL constructUrl (String link, String base) ! throws MalformedURLException { ! String path; ! boolean modified; ! boolean absolute; ! int index; ! URL url; // constructed URL combining relative link and base ! ! url = new URL (new URL (base), link); ! path = url.getFile (); ! modified = false; ! absolute = link.startsWith ("/"); ! if (!absolute) ! { // we prefer to fix incorrect relative links ! // this doesn't fix them all, just the ones at the start ! while (path.startsWith ("/.")) ! { ! if (path.startsWith ("/../")) ! { ! path = path.substring (3); ! modified = true; ! } ! else if (path.startsWith ("/./") || path.startsWith("/.")) ! { ! path = path.substring (2); ! modified = true; ! } ! else ! break; ! } ! } ! // fix backslashes ! while (-1 != (index = path.indexOf ("/\\"))) ! { ! 
path = path.substring (0, index + 1) + path.substring (index + 2); ! modified = true; ! } ! if (modified) ! url = new URL (url, path); ! ! return (url); } /** ! * Create an absolute URL from a relative link. ! * @param link The reslative portion of a URL. ! * @return The fully qualified URL or the original link if it was absolute ! * already or a failure occured. */ ! public String getAbsoluteURL (String link) { ! String base; ! URL url; ! String ret; ! ! if ((null == link) || ("".equals (link))) ! ret = ""; ! else ! try ! { ! base = getBaseUrl (); ! if (null == base) ! base = getUrl (); ! if (null == base) ! ret = link; ! else ! { ! url = constructUrl (link, base); ! ret = url.toExternalForm (); ! } ! } ! catch (MalformedURLException murle) ! { ! ret = link; ! } ! ! return (ret); } *************** *** 914,1022 **** } } - - // /** - // * The default charset. - // * This should be <code>ISO-8859-1</code>, - // * see RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt?number=2616) section 3.7.1 - // * Another alias is "8859_1". - // */ - // protected static final String DEFAULT_CHARSET = "ISO-8859-1"; - // - // /** - // * Trigger for charset detection. - // */ - // protected static final String CHARSET_STRING = "charset"; - // - // - // /** - // * Try and extract the character set from the HTTP header. - // * @param connection The connection with the charset info. - // * @return The character set name to use for this HTML page. - // */ - // protected String getCharacterSet (URLConnection connection) - // { - // final String field = "Content-Type"; - // - // String string; - // String ret; - // - // ret = DEFAULT_CHARSET; - // string = connection.getHeaderField (field); - // if (null != string) - // ret = getCharset (string); - // - // return (ret); - // } - // - // /** - // * Get a CharacterSet name corresponding to a charset parameter. 
- // * @param content A text line of the form: - // * <pre> - // * text/html; charset=Shift_JIS - // * </pre> - // * which is applicable both to the HTTP header field Content-Type and - // * the meta tag http-equiv="Content-Type". - // * Note this method also handles non-compliant quoted charset directives such as: - // * <pre> - // * text/html; charset="UTF-8" - // * </pre> - // * and - // * <pre> - // * text/html; charset='UTF-8' - // * </pre> - // * @return The character set name to use when reading the input stream. - // * For JDKs that have the Charset class this is qualified by passing - // * the name to findCharset() to render it into canonical form. - // * If the charset parameter is not found in the given string, the default - // * character set is returned. - // * @see ParserHelper#findCharset - // * @see #DEFAULT_CHARSET - // */ - // protected String getCharset(String content) - // { - // int index; - // String ret; - // - // ret = DEFAULT_CHARSET; - // if (null != content) - // { - // index = content.indexOf(CHARSET_STRING); - // - // if (index != -1) - // { - // content = content.substring(index + CHARSET_STRING.length()).trim(); - // if (content.startsWith("=")) - // { - // content = content.substring(1).trim(); - // index = content.indexOf(";"); - // if (index != -1) - // content = content.substring(0, index); - // - // //remove any double quotes from around charset string - // if (content.startsWith ("\"") && content.endsWith ("\"") && (1 < content.length ())) - // content = content.substring (1, content.length () - 1); - // - // //remove any single quote from around charset string - // if (content.startsWith ("'") && content.endsWith ("'") && (1 < content.length ())) - // content = content.substring (1, content.length () - 1); - // - // ret = ParserHelper.findCharset(content, ret); - // // Charset names are not case-sensitive; - // // that is, case is always ignored when comparing charset names. 
- // if (!ret.equalsIgnoreCase(content)) - // { - // feedback.info ( - // "detected charset \"" - // + content - // + "\", using \"" - // + ret - // + "\""); - // } - // } - // } - // } - // - // return (ret); - // } - // - --- 997,998 ---- |
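The fallback order in the new Page.getAbsoluteURL() (empty link, then the BASE HREF, then the page URL, then the raw link on failure) may be easier to follow outside the diff. The following is a simplified sketch under stated assumptions, not the committed method: the class AbsoluteUrlSketch and its two fields are hypothetical stand-ins for Page, and the extra path clean-up done by constructUrl() (leading "/../" segments and backslashes) is deliberately left out. It relies on the two-argument java.net.URL constructor, as the diff does; recent JDKs mark that constructor as deprecated.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical stand-in for Page; holds only the two strings the lookup needs.
final class AbsoluteUrlSketch {
    private final String baseUrl; // value of <BASE HREF>, or null if the page has none
    private final String pageUrl; // URL the page was fetched from, or null if unknown

    AbsoluteUrlSketch(String baseUrl, String pageUrl) {
        this.baseUrl = baseUrl;
        this.pageUrl = pageUrl;
    }

    /** Resolve a possibly relative link; fall back to the link itself on failure. */
    @SuppressWarnings("deprecation") // URL constructors are deprecated on newer JDKs
    String getAbsoluteURL(String link) {
        if (link == null || link.isEmpty()) {
            return "";
        }
        // Prefer an explicit base, otherwise use the page's own URL.
        String base = (baseUrl != null) ? baseUrl : pageUrl;
        if (base == null) {
            return link; // nothing to resolve against
        }
        try {
            return new URL(new URL(base), link).toExternalForm();
        } catch (MalformedURLException e) {
            return link; // same fallback as in the diff: return the original link
        }
    }

    public static void main(String[] args) {
        AbsoluteUrlSketch page = new AbsoluteUrlSketch("http://a/b/c/d;p?q", null);
        System.out.println(page.getAbsoluteURL("g"));    // prints http://a/b/c/g
        System.out.println(page.getAbsoluteURL("../g")); // prints http://a/b/g
    }
}
```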
From: Derrick O. <der...@us...> - 2004-03-18 04:13:46
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24099/src/org/htmlparser/tests/lexerTests Modified Files: PageTests.java Log Message: Deprecate LinkProcessor. Functionality moved to Page. Index: PageTests.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests/PageTests.java,v retrieving revision 1.16 retrieving revision 1.17 diff -C2 -d -r1.16 -r1.17 *** PageTests.java 14 Jan 2004 02:53:47 -0000 1.16 --- PageTests.java 18 Mar 2004 04:04:08 -0000 1.17 *************** *** 51,54 **** --- 51,69 ---- /** + * Base URI for absolute URL tests. + */ + static final String BASEURI = "http://a/b/c/d;p?q"; + + /** + * Page for absolute URL tests. + */ + public static Page mPage; + static + { + mPage = new Page (); + mPage.setBaseUrl (BASEURI); + } + + /** * Test the third level page class. */ *************** *** 120,122 **** --- 135,415 ---- } } + + // + // Tests from Appendix C Examples of Resolving Relative URI References + // RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax + // T. Berners-Lee et al. + // http://www.ietf.org/rfc/rfc2396.txt + + // Within an object with a well-defined base URI of + // http://a/b/c/d;p?q + // the relative URI would be resolved as follows: + + // C.1. Normal Examples + // g:h = g:h + // g = http://a/b/c/g + // ./g = http://a/b/c/g + // g/ = http://a/b/c/g/ + // /g = http://a/g + // //g = http://g + // ?y = http://a/b/c/?y + // g?y = http://a/b/c/g?y + // #s = (current document)#s + // g#s = http://a/b/c/g#s + // g?y#s = http://a/b/c/g?y#s + // ;x = http://a/b/c/;x + // g;x = http://a/b/c/g;x + // g;x?y#s = http://a/b/c/g;x?y#s + // . = http://a/b/c/ + // ./ = http://a/b/c/ + // .. = http://a/b/ + // ../ = http://a/b/ + // ../g = http://a/b/g + // ../.. 
= http://a/ + // ../../ = http://a/ + // ../../g = http://a/g + + public void test1 () throws ParserException + { + assertEquals ("test1 failed", "https:h", mPage.getAbsoluteURL ("https:h")); + } + public void test2 () throws ParserException + { + assertEquals ("test2 failed", "http://a/b/c/g", mPage.getAbsoluteURL ("g")); + } + public void test3 () throws ParserException + { + assertEquals ("test3 failed", "http://a/b/c/g", mPage.getAbsoluteURL ("./g")); + } + public void test4 () throws ParserException + { + assertEquals ("test4 failed", "http://a/b/c/g/", mPage.getAbsoluteURL ("g/")); + } + public void test5 () throws ParserException + { + assertEquals ("test5 failed", "http://a/g", mPage.getAbsoluteURL ("/g")); + } + public void test6 () throws ParserException + { + assertEquals ("test6 failed", "http://g", mPage.getAbsoluteURL ("//g")); + } + public void test7 () throws ParserException + { + assertEquals ("test7 failed", "http://a/b/c/?y", mPage.getAbsoluteURL ("?y")); + } + public void test8 () throws ParserException + { + assertEquals ("test8 failed", "http://a/b/c/g?y", mPage.getAbsoluteURL ("g?y")); + } + public void test9 () throws ParserException + { + assertEquals ("test9 failed", "https:h", mPage.getAbsoluteURL ("https:h")); + } + public void test10 () throws ParserException + { + assertEquals ("test10 failed", "https:h", mPage.getAbsoluteURL ("https:h")); + } + // #s = (current document)#s + public void test11 () throws ParserException + { + assertEquals ("test11 failed", "http://a/b/c/g#s", mPage.getAbsoluteURL ("g#s")); + } + public void test12 () throws ParserException + { + assertEquals ("test12 failed", "http://a/b/c/g?y#s", mPage.getAbsoluteURL ("g?y#s")); + } + public void test13 () throws ParserException + { + assertEquals ("test13 failed", "http://a/b/c/;x", mPage.getAbsoluteURL (";x")); + } + public void test14 () throws ParserException + { + assertEquals ("test14 failed", "http://a/b/c/g;x", mPage.getAbsoluteURL ("g;x")); + } + public void 
test15 () throws ParserException + { + assertEquals ("test15 failed", "http://a/b/c/g;x?y#s", mPage.getAbsoluteURL ("g;x?y#s")); + } + public void test16 () throws ParserException + { + assertEquals ("test16 failed", "http://a/b/c/", mPage.getAbsoluteURL (".")); + } + public void test17 () throws ParserException + { + assertEquals ("test17 failed", "http://a/b/c/", mPage.getAbsoluteURL ("./")); + } + public void test18 () throws ParserException + { + assertEquals ("test18 failed", "http://a/b/", mPage.getAbsoluteURL ("..")); + } + public void test19 () throws ParserException + { + assertEquals ("test19 failed", "http://a/b/", mPage.getAbsoluteURL ("../")); + } + public void test20 () throws ParserException + { + assertEquals ("test20 failed", "http://a/b/g", mPage.getAbsoluteURL ("../g")); + } + public void test21 () throws ParserException + { + assertEquals ("test21 failed", "http://a/", mPage.getAbsoluteURL ("../..")); + } + public void test22 () throws ParserException + { + assertEquals ("test22 failed", "http://a/g", mPage.getAbsoluteURL ("../../g")); + } + + // C.2. Abnormal Examples + // Although the following abnormal examples are unlikely to occur in + // normal practice, all URI parsers should be capable of resolving them + // consistently. Each example uses the same base as above. + // + // An empty reference refers to the start of the current document. + // + // <> = (current document) + // + // Parsers must be careful in handling the case where there are more + // relative path ".." segments than there are hierarchical levels in the + // base URI's path. Note that the ".." syntax cannot be used to change + // the authority component of a URI. 
+ // + // ../../../g = http://a/../g + // ../../../../g = http://a/../../g + // + // In practice, some implementations strip leading relative symbolic + // elements (".", "..") after applying a relative URI calculation, based + // on the theory that compensating for obvious author errors is better + // than allowing the request to fail. Thus, the above two references + // will be interpreted as "http://a/g" by some implementations. + // + // Similarly, parsers must avoid treating "." and ".." as special when + // they are not complete components of a relative path. + // + // /./g = http://a/./g + // /../g = http://a/../g + // g. = http://a/b/c/g. + // .g = http://a/b/c/.g + // g.. = http://a/b/c/g.. + // ..g = http://a/b/c/..g + // + // Less likely are cases where the relative URI uses unnecessary or + // nonsensical forms of the "." and ".." complete path segments. + // + // ./../g = http://a/b/g + // ./g/. = http://a/b/c/g/ + // g/./h = http://a/b/c/g/h + // g/../h = http://a/b/c/h + // g;x=1/./y = http://a/b/c/g;x=1/y + // g;x=1/../y = http://a/b/c/y + // + // All client applications remove the query component from the base URI + // before resolving relative URI. However, some applications fail to + // separate the reference's query and/or fragment components from a + // relative path before merging it with the base path. This error is + // rarely noticed, since typical usage of a fragment never includes the + // hierarchy ("/") character, and the query component is not normally + // used within relative references. + // + // g?y/./x = http://a/b/c/g?y/./x + // g?y/../x = http://a/b/c/g?y/../x + // g#s/./x = http://a/b/c/g#s/./x + // g#s/../x = http://a/b/c/g#s/../x + // + // Some parsers allow the scheme name to be present in a relative URI if + // it is the same as the base URI scheme. This is considered to be a + // loophole in prior specifications of partial URI [RFC1630]. Its use + // should be avoided. 
+ // + // http:g = http:g ; for validating parsers + // | http://a/b/c/g ; for backwards compatibility + + // public void test23 () throws HTMLParserException + // { + // assertEquals ("test23 failed", "http://a/../g", mPage.getAbsoluteURL ("../../../g")); + // } + // public void test24 () throws HTMLParserException + // { + // assertEquals ("test24 failed", "http://a/../../g", mPage.getAbsoluteURL ("../../../../g")); + // } + public void test23 () throws ParserException + { + assertEquals ("test23 failed", "http://a/g", mPage.getAbsoluteURL ("../../../g")); + } + public void test24 () throws ParserException + { + assertEquals ("test24 failed", "http://a/g", mPage.getAbsoluteURL ("../../../../g")); + } + public void test25 () throws ParserException + { + assertEquals ("test25 failed", "http://a/./g", mPage.getAbsoluteURL ("/./g")); + } + public void test26 () throws ParserException + { + assertEquals ("test26 failed", "http://a/../g", mPage.getAbsoluteURL ("/../g")); + } + public void test27 () throws ParserException + { + assertEquals ("test27 failed", "http://a/b/c/g.", mPage.getAbsoluteURL ("g.")); + } + public void test28 () throws ParserException + { + assertEquals ("test28 failed", "http://a/b/c/.g", mPage.getAbsoluteURL (".g")); + } + public void test29 () throws ParserException + { + assertEquals ("test29 failed", "http://a/b/c/g..", mPage.getAbsoluteURL ("g..")); + } + public void test30 () throws ParserException + { + assertEquals ("test30 failed", "http://a/b/c/..g", mPage.getAbsoluteURL ("..g")); + } + public void test31 () throws ParserException + { + assertEquals ("test31 failed", "http://a/b/g", mPage.getAbsoluteURL ("./../g")); + } + public void test32 () throws ParserException + { + assertEquals ("test32 failed", "http://a/b/c/g/", mPage.getAbsoluteURL ("./g/.")); + } + public void test33 () throws ParserException + { + assertEquals ("test33 failed", "http://a/b/c/g/h", mPage.getAbsoluteURL ("g/./h")); + } + public void test34 () throws 
ParserException + { + assertEquals ("test34 failed", "http://a/b/c/h", mPage.getAbsoluteURL ("g/../h")); + } + public void test35 () throws ParserException + { + assertEquals ("test35 failed", "http://a/b/c/g;x=1/y", mPage.getAbsoluteURL ("g;x=1/./y")); + } + public void test36 () throws ParserException + { + assertEquals ("test36 failed", "http://a/b/c/y", mPage.getAbsoluteURL ("g;x=1/../y")); + } + public void test37 () throws ParserException + { + assertEquals ("test37 failed", "http://a/b/c/g?y/./x", mPage.getAbsoluteURL ("g?y/./x")); + } + public void test38 () throws ParserException + { + assertEquals ("test38 failed", "http://a/b/c/g?y/../x", mPage.getAbsoluteURL ("g?y/../x")); + } + public void test39 () throws ParserException + { + assertEquals ("test39 failed", "http://a/b/c/g#s/./x", mPage.getAbsoluteURL ("g#s/./x")); + } + public void test40 () throws ParserException + { + assertEquals ("test40 failed", "http://a/b/c/g#s/../x", mPage.getAbsoluteURL ("g#s/../x")); + } + // public void test41 () throws HTMLParserException + // { + // assertEquals ("test41 failed", "http:g", mPage.getAbsoluteURL ("http:g")); + // } + public void test41 () throws ParserException + { + assertEquals ("test41 failed", "http://a/b/c/g", mPage.getAbsoluteURL ("http:g")); + } + } \ No newline at end of file |
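A handful of the "normal" RFC 2396 Appendix C cases listed in the PageTests.java comments can be cross-checked without the parser at all. The sketch below uses java.net.URI.resolve() from the JDK, which is a different mechanism from Page.getAbsoluteURL() and is swapped in here purely for comparison; the class name Rfc2396Sketch is hypothetical. The abnormal cases are omitted on purpose, since the tests above show the parser choosing its own behaviour for several of them.

```java
import java.net.URI;

// Hypothetical cross-check of a few RFC 2396 C.1 examples using the JDK's URI class.
final class Rfc2396Sketch {
    // Same base URI as the BASEURI constant in PageTests.java.
    static final URI BASE = URI.create("http://a/b/c/d;p?q");

    /** Resolve a relative reference against BASE and return it as a string. */
    static String resolve(String reference) {
        return BASE.resolve(reference).toString();
    }

    public static void main(String[] args) {
        // Relative references taken from the C.1 table quoted in the test comments.
        String[] references = { "g", "./g", "g/", "/g", "g?y", "../g", "../../g" };
        for (String reference : references) {
            System.out.println(reference + " = " + resolve(reference));
        }
    }
}
```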
From: Derrick O. <der...@us...> - 2004-03-18 04:13:46
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24099/src/org/htmlparser/util

Modified Files:
    LinkProcessor.java
Log Message:
Deprecate LinkProcessor. Functionality moved to Page.

Index: LinkProcessor.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util/LinkProcessor.java,v
retrieving revision 1.33
retrieving revision 1.34
diff -C2 -d -r1.33 -r1.34
*** LinkProcessor.java 2 Jan 2004 16:24:58 -0000 1.33
--- LinkProcessor.java 18 Mar 2004 04:04:08 -0000 1.34
***************
*** 33,36 ****
--- 33,37 ----
  /**
   * Processor class for links, is present basically as a utility class.
+  * @deprecated Use a Page object instead.
   */
  public class LinkProcessor
***************
*** 57,60 ****
--- 58,62 ----
       * @param base The base URL unless overridden by the current baseURL property.
       * @return The fully qualified URL or the original link if a failure occured.
+      * @deprecated Use Page.getAbsoluteURL() instead.
       */
      public String extract (String link, String base)
***************
*** 91,99 ****
      public String stripQuotes (String string)
      {
!         //remove any double quotes from around charset string
          if (string.startsWith ("\"") && string.endsWith ("\"") && (1 < string.length ()))
              string = string.substring (1, string.length () - 1);

!         //remove any single quote from around charset string
          if (string.startsWith ("'") && string.endsWith ("'") && (1 < string.length ()))
              string = string.substring (1, string.length () - 1);
--- 93,101 ----
      public String stripQuotes (String string)
      {
!         // remove any double quotes from around string
          if (string.startsWith ("\"") && string.endsWith ("\"") && (1 < string.length ()))
              string = string.substring (1, string.length () - 1);

!         // remove any single quote from around string
          if (string.startsWith ("'") && string.endsWith ("'") && (1 < string.length ()))
              string = string.substring (1, string.length () - 1);
***************
*** 101,105 ****
          return (string);
      }
!
      public URL constructUrl(String link, String base)
          throws MalformedURLException {
--- 103,110 ----
          return (string);
      }
!
!     /**
!      * @deprecated Use Page.constructUrl() instead.
!      */
      public URL constructUrl(String link, String base)
          throws MalformedURLException {
***************
*** 140,143 ****
--- 145,149 ----
       * @param url The url containing spaces.
       * @return The URL with spaces as %20 sequences.
+      * @deprecated Use Parser.fixSpaces() instead.
       */
      public static String fixSpaces (String url)
***************
*** 208,211 ****
--- 214,220 ----
      }

+     /**
+      * @deprecated Removing the last slash from a URL is a bad idea.
+      */
      public static String removeLastSlash(String baseUrl) {
          if(baseUrl.charAt(baseUrl.length()-1)=='/')
From: Derrick O. <der...@us...> - 2004-03-18 04:13:46
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24099/src/org/htmlparser/tests/tagTests Modified Files: BaseHrefTagTest.java Log Message: Deprecate LinkProcessor. Functionality moved to Page. Index: BaseHrefTagTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/BaseHrefTagTest.java,v retrieving revision 1.39 retrieving revision 1.40 diff -C2 -d -r1.39 -r1.40 *** BaseHrefTagTest.java 14 Jan 2004 02:53:47 -0000 1.39 --- BaseHrefTagTest.java 18 Mar 2004 04:04:08 -0000 1.40 *************** *** 33,37 **** import org.htmlparser.tags.TitleTag; import org.htmlparser.tests.ParserTestCase; - import org.htmlparser.util.LinkProcessor; import org.htmlparser.util.ParserException; --- 33,36 ---- *************** *** 53,65 **** } - public void testRemoveLastSlash() { - String url1 = "http://www.yahoo.com/"; - String url2 = "http://www.google.com"; - String modifiedUrl1 = LinkProcessor.removeLastSlash(url1); - String modifiedUrl2 = LinkProcessor.removeLastSlash(url2); - assertEquals("Url1","http://www.yahoo.com",modifiedUrl1); - assertEquals("Url2","http://www.google.com",modifiedUrl2); - } - public void testScan() throws ParserException{ createParser("<html><head><TITLE>test page</TITLE><BASE HREF=\"http://www.abc.com/\"><a href=\"home.cfm\">Home</a>...</html>","http://www.google.com/test/index.html"); --- 52,55 ---- |
From: Derrick O. <der...@us...> - 2004-03-18 04:13:46
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/utilTests In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24099/src/org/htmlparser/tests/utilTests Modified Files: AllTests.java Removed Files: HTMLLinkProcessorTest.java Log Message: Deprecate LinkProcessor. Functionality moved to Page. --- HTMLLinkProcessorTest.java DELETED --- Index: AllTests.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/utilTests/AllTests.java,v retrieving revision 1.54 retrieving revision 1.55 diff -C2 -d -r1.54 -r1.55 *** AllTests.java 2 Jan 2004 16:24:57 -0000 1.54 --- AllTests.java 18 Mar 2004 04:04:08 -0000 1.55 *************** *** 63,67 **** suite.addTestSuite(BeanTest.class); suite.addTestSuite(CharacterTranslationTest.class); - suite.addTestSuite(HTMLLinkProcessorTest.class); suite.addTestSuite(HTMLParserUtilsTest.class); suite.addTestSuite(NodeListTest.class); --- 63,66 ---- |
From: Derrick O. <der...@us...> - 2004-03-15 23:00:13
|
Update of /cvsroot/htmlparser/htmlparser/src/doc-files In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv28448 Modified Files: building.html Log Message: Update build instruction problem identified by sarsie. Index: building.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/doc-files/building.html,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** building.html 16 Dec 2003 02:29:56 -0000 1.1 --- building.html 15 Mar 2004 22:50:55 -0000 1.2 *************** *** 67,72 **** href="http://eclipse.org/">Eclipse</a>. Mount the org directory where the HTML Parser was installed along with the ! <code>junit.jar</code> file from the <code>lib</code> directory. "Build All" ! should work. <H2>CVS</H2> The most recent files are only available via CVS: --- 67,73 ---- href="http://eclipse.org/">Eclipse</a>. Mount the org directory where the HTML Parser was installed along with the ! <code>junit.jar</code> file from the <code>lib</code> directory, and the ! <code>tools.jar</code> file from the java JDK lib directory ! <code>[where java is installed]/lib/tools.jar<code>. "Build All" should work. <H2>CVS</H2> The most recent files are only available via CVS: |
From: <der...@us...> - 2004-03-14 20:40:39
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv11828/src/org/htmlparser/lexer/nodes Modified Files: Attribute.java TagNode.java Log Message: Remove requirement for Translate.class to be in htmllexer.jar. Index: Attribute.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/Attribute.java,v retrieving revision 1.18 retrieving revision 1.19 diff -C2 -d -r1.18 -r1.19 *** Attribute.java 9 Feb 2004 02:09:44 -0000 1.18 --- Attribute.java 14 Mar 2004 20:31:38 -0000 1.19 *************** *** 580,584 **** // references, so convert all double quotes into &quot; quote = '"'; ! ref = Translate.encode (quote); // JDK 1.4: value = value.replaceAll ("\"", ref); buffer = new StringBuffer (value.length() * 5); --- 580,584 ---- // references, so convert all double quotes into &quot; quote = '"'; ! ref = "&quot;"; // Translate.encode (quote); // JDK 1.4: value = value.replaceAll ("\"", ref); buffer = new StringBuffer (value.length() * 5); *************** *** 586,590 **** { ch = value.charAt (i); ! if ('"' == ch) buffer.append (ref); else --- 586,590 ---- { ch = value.charAt (i); ! if (quote == ch) buffer.append (ref); else Index: TagNode.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/TagNode.java,v retrieving revision 1.31 retrieving revision 1.32 diff -C2 -d -r1.31 -r1.32 *** TagNode.java 28 Feb 2004 15:52:43 -0000 1.31 --- TagNode.java 14 Mar 2004 20:31:38 -0000 1.32 *************** *** 186,190 **** // convert all double quotes into &quot; quote = '"'; ! ref = Translate.encode (quote); // JDK 1.4: value = value.replaceAll ("\"", ref); buffer = new StringBuffer (value.length() * 5); --- 186,190 ---- // convert all double quotes into &quot; quote = '"'; ! 
ref = "&quot;"; // Translate.encode (quote); // JDK 1.4: value = value.replaceAll ("\"", ref); buffer = new StringBuffer (value.length() * 5); *************** *** 192,196 **** { ch = value.charAt (i); ! if ('"' == ch) buffer.append (ref); else --- 192,196 ---- { ch = value.charAt (i); ! if (quote == ch) buffer.append (ref); else |
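The loop changed in both files replaces every double quote in an attribute value with an entity reference, so the value can then be wrapped in double quotes. A minimal standalone sketch of that idea follows; the class and method names are hypothetical, and the literal `&quot;` is assumed to be the reference that `Translate.encode` used to supply.

```java
// Standalone sketch; QuoteEscaper is a hypothetical name, not an HTML Parser class.
final class QuoteEscaper
{
    private QuoteEscaper () {}

    /** Replace every double quote in value with the &quot; entity reference. */
    static String escapeDoubleQuotes (String value)
    {
        final char quote = '"';
        final String ref = "&quot;"; // assumed result of Translate.encode ('"')

        // the capacity is only a sizing hint, the builder grows if needed
        StringBuilder buffer = new StringBuilder (value.length () * 5);
        for (int i = 0; i < value.length (); i++)
        {
            char ch = value.charAt (i);
            if (quote == ch)
                buffer.append (ref);
            else
                buffer.append (ch);
        }
        return (buffer.toString ());
    }

    public static void main (String[] args)
    {
        // a value holding both kinds of quotes, the case this code path handles
        System.out.println (escapeDoubleQuotes ("say \"it's\" here"));
    }
}
```

Hard-coding the reference is what lets the lexer jar drop its dependency on `Translate.class`; comparing against the `quote` variable rather than a second `'"'` literal keeps the character and its replacement defined in one place.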
From: <der...@us...> - 2004-03-14 20:40:38
|
Update of /cvsroot/htmlparser/htmlparser In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv11828 Modified Files: build.xml Log Message: Remove requirement for Translate.class to be in htmllexer.jar. Index: build.xml =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/build.xml,v retrieving revision 1.60 retrieving revision 1.61 diff -C2 -d -r1.60 -r1.61 *** build.xml 29 Feb 2004 15:09:56 -0000 1.60 --- build.xml 14 Mar 2004 20:31:38 -0000 1.61 *************** *** 236,240 **** <include name="org/htmlparser/util/SpecialHashtable.class"/> <include name="org/htmlparser/util/LinkProcessor.class"/> - <include name="org/htmlparser/util/Translate.class"/> <include name="org/htmlparser/util/EncodingChangeException.class"/> <include name="org/htmlparser/util/sort/**/*.class"/> --- 236,239 ---- |
From: <der...@us...> - 2004-03-14 16:40:39
|
Update of /cvsroot/htmlparser/htmlparser/docs In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv32369/docs Modified Files: changes.txt release.txt Log Message: Update version to 1.4 final release. Index: changes.txt =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/changes.txt,v retrieving revision 1.198 retrieving revision 1.199 diff -C2 -d -r1.198 -r1.199 *** changes.txt 29 Feb 2004 16:48:28 -0000 1.198 --- changes.txt 14 Mar 2004 16:31:39 -0000 1.199 *************** *** 13,16 **** --- 13,34 ---- ******************************************************************************* + Release Build 1.4 - 20040314 + -------------------------------- + + 2004-03-14 10:53 derrickoswald + + * src/org/htmlparser/beans/LinkBean.java: + + Add retry on EncodingChangeException, just like StringBean. + + 2004-03-14 10:42 derrickoswald + + * src/org/htmlparser/: tests/lexerTests/AttributeTests.java, + lexer/nodes/PageAttribute.java: + + Fix bug #911565 isValued() and isNull() don't work. + Rework predicates. Add testPredicates() to attribute tests. + Don't rely on value of quote character when getting assignment string. Add testSetQuote(). + Integration Build 1.4 - 20040229 -------------------------------- Index: release.txt =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/release.txt,v retrieving revision 1.57 retrieving revision 1.58 diff -C2 -d -r1.57 -r1.58 *** release.txt 29 Feb 2004 16:48:28 -0000 1.57 --- release.txt 14 Mar 2004 16:31:40 -0000 1.58 *************** *** 1,3 **** ! HTMLParser Version 1.4 (Integration Build Feb 29, 2004) ********************************************* --- 1,3 ---- ! HTMLParser Version 1.4 (Release Build Mar 14, 2004) ********************************************* *************** *** 69,75 **** Bug Fixes --------- ! 900125 Style Tag Children not grouped ! 
900128 RemarkNode.setText() does not set Text 902121 StringBean throws NullPointerException. 899413 bug in javascript end detection. 891058 Bug in lexer --- 69,76 ---- Bug Fixes --------- ! 911565 isValued() and isEmpty() don't work 902121 StringBean throws NullPointerException. + 900128 RemarkNode.setText() does not set Text + 900125 Style Tag Children not grouped 899413 bug in javascript end detection. 891058 Bug in lexer *************** *** 138,141 **** --- 139,143 ---- [29] Nick Burch [30] Gernot Fricke + [31] Anthony Labarre If you find any bugs, please go to |
From: <der...@us...> - 2004-03-14 16:40:39
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv32369/src/org/htmlparser Modified Files: Parser.java Log Message: Update version to 1.4 final release. Index: Parser.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/Parser.java,v retrieving revision 1.88 retrieving revision 1.89 diff -C2 -d -r1.88 -r1.89 *** Parser.java 29 Feb 2004 16:48:43 -0000 1.88 --- Parser.java 14 Mar 2004 16:31:40 -0000 1.89 *************** *** 81,85 **** */ public final static String ! VERSION_TYPE = "Integration Build" ; --- 81,85 ---- */ public final static String ! VERSION_TYPE = "Release Build" ; *************** *** 88,92 **** */ public final static String ! VERSION_DATE = "Feb 29, 2004" ; --- 88,92 ---- */ public final static String ! VERSION_DATE = "Mar 14, 2004" ; |
From: <der...@us...> - 2004-03-14 16:02:04
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24953 Modified Files: LinkBean.java Log Message: Add retry on EncodingChangeException, just like StringBean. Index: LinkBean.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans/LinkBean.java,v retrieving revision 1.27 retrieving revision 1.28 diff -C2 -d -r1.27 -r1.28 *** LinkBean.java 4 Jan 2004 03:23:09 -0000 1.27 --- LinkBean.java 14 Mar 2004 15:53:06 -0000 1.28 *************** *** 38,41 **** --- 38,42 ---- import org.htmlparser.Parser; import org.htmlparser.tags.LinkTag; + import org.htmlparser.util.EncodingChangeException; import org.htmlparser.util.ParserException; import org.htmlparser.visitors.ObjectFindingVisitor; *************** *** 71,75 **** protected Parser mParser; ! /** Creates new StringBean */ public LinkBean () { --- 72,76 ---- protected Parser mParser; ! /** Creates new LinkBean */ public LinkBean () { *************** *** 86,89 **** --- 87,91 ---- { Parser parser; + ObjectFindingVisitor visitor; Vector vector; LinkTag link; *************** *** 91,96 **** parser = new Parser (url); ! ObjectFindingVisitor visitor = new ObjectFindingVisitor(LinkTag.class); ! parser.visitAllNodesWith(visitor); Node [] nodes = visitor.getTags(); vector = new Vector(); --- 93,107 ---- parser = new Parser (url); ! visitor = new ObjectFindingVisitor (LinkTag.class); ! try ! { ! parser.visitAllNodesWith (visitor); ! } ! catch (EncodingChangeException ece) ! { ! parser.reset (); ! visitor = new ObjectFindingVisitor (LinkTag.class); ! parser.visitAllNodesWith (visitor); ! } Node [] nodes = visitor.getTags(); vector = new Vector(); *************** *** 273,287 **** } ! // /** ! // * Unit test. ! // */ ! // public static void main (String[] args) ! // { ! // LinkBean lb = new LinkBean (); ! // lb.setURL ("http://cbc.ca"); ! // URL[] urls = lb.getLinks (); ! 
// for (int i = 0; i < urls.length; i++) ! // System.out.println (urls[i]); ! // } } --- 284,304 ---- } ! /** ! * Unit test. ! * @param args Pass arg[0] as the URL to process. ! */ ! public static void main (String[] args) ! { ! if (0 >= args.length) ! System.out.println ("Usage: java -classpath htmlparser.jar org.htmlparser.beans.LinkBean <http://whatever_url>"); ! else ! { ! LinkBean lb = new LinkBean (); ! lb.setURL (args[0]); ! URL[] urls = lb.getLinks (); ! for (int i = 0; i < urls.length; i++) ! System.out.println (urls[i]); ! } ! } } |
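The retry added to LinkBean follows a simple pattern: attempt the visit once and, if the encoding turns out to differ from the one assumed, reset the parser and visit again with a fresh collector. The sketch below illustrates only that control flow. Every type in it (`EncodingChanged`, `FakeParser`) is a hypothetical stand-in, not the HTML Parser API, and the example URLs are made up.

```java
// Standalone sketch: every type here is a hypothetical stand-in, not the HTML Parser API.
final class RetrySketch
{
    private RetrySketch () {}

    /** Stand-in for org.htmlparser.util.EncodingChangeException. */
    static final class EncodingChanged extends Exception
    {
        EncodingChanged (String message)
        {
            super (message);
        }
    }

    /** Stand-in for a parser that learns the real charset part-way through the input. */
    static final class FakeParser
    {
        private boolean charsetKnown = false;

        /** Rewind to the start of the input; the charset learned so far is kept. */
        void reset ()
        {
        }

        void visitAllLinks (java.util.List<String> sink) throws EncodingChanged
        {
            sink.add ("http://example.com/first");
            if (!charsetKnown)
            {
                charsetKnown = true;
                throw new EncodingChanged ("charset changed mid-parse");
            }
            sink.add ("http://example.com/second");
        }
    }

    /** Visit once; on an encoding change, reset and visit again with a fresh list. */
    static java.util.List<String> extractLinks (FakeParser parser) throws EncodingChanged
    {
        java.util.List<String> links = new java.util.ArrayList<String> ();
        try
        {
            parser.visitAllLinks (links);
        }
        catch (EncodingChanged ece)
        {
            parser.reset ();
            links = new java.util.ArrayList<String> (); // discard the partial results
            parser.visitAllLinks (links);
        }
        return (links);
    }

    public static void main (String[] args) throws EncodingChanged
    {
        System.out.println (extractLinks (new FakeParser ()));
    }
}
```

Creating a new collector before the second pass is the important detail: reusing the first one would keep links gathered under the wrong encoding and duplicate them after the retry. A second exception is deliberately not caught, so a persistent failure still reaches the caller.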
From: <der...@us...> - 2004-03-14 15:51:36
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv23217/lexer/nodes Modified Files: PageAttribute.java Log Message: Fix bug #911565 isValued() and isNull() don't work. Rework predicates. Add testPredicates() to attribute tests. Don't rely on value of quote character when getting assignment string. Add testSetQuote(). Index: PageAttribute.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/PageAttribute.java,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** PageAttribute.java 2 Jan 2004 16:24:53 -0000 1.7 --- PageAttribute.java 14 Mar 2004 15:42:38 -0000 1.8 *************** *** 240,244 **** public String getAssignment () { - int end; String ret; --- 240,243 ---- *************** *** 248,255 **** if ((null != mPage) && (0 <= mNameEnd) && (0 <= mValueStart)) { ! end = mValueStart; ! if (0 != getQuote ()) ! end--; ! ret = mPage.getText (mNameEnd, end); setAssignment (ret); // cache the value } --- 247,255 ---- if ((null != mPage) && (0 <= mNameEnd) && (0 <= mValueStart)) { ! ret = mPage.getText (mNameEnd, mValueStart); ! // remove a possible quote included in the assignment ! // since mValueStart points at the real start of the value ! if (ret.endsWith ("\"") || ret.endsWith ("'")) ! ret = ret.substring (0, ret.length () - 1); setAssignment (ret); // cache the value } *************** *** 266,270 **** public void getAssignment (StringBuffer buffer) { ! int end; String assignment; --- 266,271 ---- public void getAssignment (StringBuffer buffer) { ! int length; ! char ch; String assignment; *************** *** 274,281 **** if ((null != mPage) && (0 <= mNameEnd) && (0 <= mValueStart)) { ! end = mValueStart; ! if (0 != getQuote ()) ! end--; ! mPage.getText (buffer, mNameEnd, end); } } --- 275,285 ---- if ((null != mPage) && (0 <= mNameEnd) && (0 <= mValueStart)) { ! 
mPage.getText (buffer, mNameEnd, mValueStart); ! // remove a possible quote included in the assignment ! // since mValueStart points at the real start of the value ! length = buffer.length () - 1; ! ch = buffer.charAt (length); ! if (('\'' == ch) || ('"' == ch)) ! buffer.setLength (length); } } *************** *** 507,512 **** public boolean isStandAlone () { ! return ((null != super.getName ()) && (null == super.getAssignment ()) ! || ((null != mPage) && (0 <= mNameEnd) && (0 > mValueStart))); } --- 511,520 ---- public boolean isStandAlone () { ! return (!isWhitespace () // not whitespace ! && (null == super.getAssignment ()) // and no explicit assignment provided ! && !isValued () // and has no value ! && ((null == mPage) // and either its not coming from a page ! // or it is coming from a page and it doesn't have an assignment part ! || ((null != mPage) && (0 <= mNameEnd) && (0 > mValueStart)))); } *************** *** 518,523 **** public boolean isEmpty () { ! return (((null != super.getAssignment ()) && (null == super.getValue ())) ! || ((null != mPage) && ((0 <= mValueStart) && (0 > mValueEnd)))); } --- 526,535 ---- public boolean isEmpty () { ! return (!isWhitespace () // not whitespace ! && !isStandAlone () // and not standalone ! && (null == super.getValue ()) // and no explicit value provided ! && ((null == mPage) // and either its not coming from a page ! // or it is coming from a page and has no value ! || ((null != mPage) && (0 > mValueEnd)))); } *************** *** 529,534 **** public boolean isValued () { ! return ((null != super.getValue ()) ! || ((null != mPage) && ((0 <= mValueStart) && (0 <= mValueEnd)))); } --- 541,547 ---- public boolean isValued () { ! return ((null != super.getValue ()) // an explicit value provided ! // or it is coming from a page and has a non-empty value ! || ((null != mPage) && ((0 <= mValueStart) && (0 <= mValueEnd)) && (mValueStart != mValueEnd))); } |
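The reworked `getAssignment()` no longer consults the stored quote character; it takes the text between the end of the name and the start of the value and drops a trailing quote if one is there. A minimal standalone sketch of that trimming step follows. The class and method names are hypothetical, `StringBuilder` stands in for `StringBuffer`, and the empty-buffer guard is an addition of this sketch rather than something shown in the diff.

```java
// Standalone sketch; AssignmentTrim is a hypothetical name, not an HTML Parser class.
final class AssignmentTrim
{
    private AssignmentTrim () {}

    /** Return the assignment text without a trailing single or double quote. */
    static String trimTrailingQuote (String assignment)
    {
        if (assignment.endsWith ("\"") || assignment.endsWith ("'"))
            return (assignment.substring (0, assignment.length () - 1));
        return (assignment);
    }

    /** In-place variant: shorten the buffer by one if it ends with a quote. */
    static void trimTrailingQuote (StringBuilder buffer)
    {
        int last = buffer.length () - 1;
        if (0 > last)
            return; // guard added in this sketch: nothing to inspect
        char ch = buffer.charAt (last);
        if (('\'' == ch) || ('"' == ch))
            buffer.setLength (last);
    }

    public static void main (String[] args)
    {
        // for src="images/third" the text between name and value is ="
        System.out.println (trimTrailingQuote ("=\""));
        StringBuilder buffer = new StringBuilder ("src = '");
        trimTrailingQuote (buffer);
        System.out.println (buffer);
    }
}
```

Looking at the text itself means the result stays correct after `setQuote()` has changed the stored quote character, which is the situation `testSetQuote()` exercises.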
From: <der...@us...> - 2004-03-14 15:51:36
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv23217/tests/lexerTests Modified Files: AttributeTests.java Log Message: Fix bug #911565 isValued() and isNull() don't work. Rework predicates. Add testPredicates() to attribute tests. Don't rely on value of quote character when getting assignment string. Add testSetQuote(). Index: AttributeTests.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests/AttributeTests.java,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** AttributeTests.java 7 Feb 2004 12:53:09 -0000 1.12 --- AttributeTests.java 14 Mar 2004 15:42:37 -0000 1.13 *************** *** 626,628 **** --- 626,750 ---- assertStringEquals ("toHtml()", html, img.toHtml ()); } + + /** + * see bug #911565 isValued() and isNull() don't work + */ + public void testPredicates () throws ParserException + { + String html1 = "<img alt=\"\" src=\"images/third\" readonly>"; + String html2 = "<img src=\"images/third\" readonly alt=\"\">"; + String html3 = "<img readonly alt=\"\" src=\"images/third\">"; + String htmls[] = { html1, html2, html3 }; + + for (int i = 0; i < htmls.length; i++) + { + createParser (htmls[i]); + parseAndAssertNodeCount (1); + assertTrue ("Node should be an ImageTag", node[0] instanceof ImageTag); + ImageTag img = (ImageTag)node[0]; + Attribute src = img.getAttributeEx ("src"); + Attribute alt = img.getAttributeEx ("alt"); + Attribute readonly = img.getAttributeEx ("readonly"); + assertTrue ("src whitespace", !src.isWhitespace ()); + assertTrue ("src not valued", src.isValued ()); + assertTrue ("src empty", !src.isEmpty ()); + assertTrue ("src standalone", !src.isStandAlone ()); + assertTrue ("alt whitespace", !alt.isWhitespace ()); + assertTrue ("alt valued", !alt.isValued ()); + assertTrue ("alt empty", !alt.isEmpty ()); + assertTrue ("alt 
standalone", !alt.isStandAlone ()); + assertTrue ("readonly whitespace", !readonly.isWhitespace ()); + assertTrue ("readonly valued", !readonly.isValued ()); + assertTrue ("readonly empty", !readonly.isEmpty ()); + assertTrue ("readonly not standalone", readonly.isStandAlone ()); + // try assigning the name and checking again + src.setName ("SRC"); + assertTrue ("setName() failed", "SRC=\"images/third\"".equals (src.toString ())); + assertTrue ("src whitespace", !src.isWhitespace ()); + assertTrue ("src not valued", src.isValued ()); + assertTrue ("src empty", !src.isEmpty ()); + assertTrue ("src standalone", !src.isStandAlone ()); + alt.setName ("ALT"); + assertTrue ("setName() failed", "ALT=\"\"".equals (alt.toString ())); + assertTrue ("alt whitespace", !alt.isWhitespace ()); + assertTrue ("alt valued", !alt.isValued ()); + assertTrue ("alt empty", !alt.isEmpty ()); + assertTrue ("alt standalone", !alt.isStandAlone ()); + readonly.setName ("READONLY"); + assertTrue ("setName() failed", "READONLY".equals (readonly.toString ())); + assertTrue ("readonly whitespace", !readonly.isWhitespace ()); + assertTrue ("readonly valued", !readonly.isValued ()); + assertTrue ("readonly empty", !readonly.isEmpty ()); + assertTrue ("readonly not standalone", readonly.isStandAlone ()); + // try assigning the assignment and checking again + src.setAssignment (" = "); + assertTrue ("setAssignment() failed", "SRC = \"images/third\"".equals (src.toString ())); + assertTrue ("src whitespace", !src.isWhitespace ()); + assertTrue ("src not valued", src.isValued ()); + assertTrue ("src empty", !src.isEmpty ()); + assertTrue ("src standalone", !src.isStandAlone ()); + alt.setAssignment (" = "); + assertTrue ("setAssignment() failed", "ALT = \"\"".equals (alt.toString ())); + assertTrue ("alt whitespace", !alt.isWhitespace ()); + assertTrue ("alt valued", !alt.isValued ()); + assertTrue ("alt empty", !alt.isEmpty ()); + assertTrue ("alt standalone", !alt.isStandAlone ()); + 
readonly.setAssignment ("="); + assertTrue ("setAssignment() failed", "READONLY=".equals (readonly.toString ())); + assertTrue ("readonly whitespace", !readonly.isWhitespace ()); + assertTrue ("readonly valued", !readonly.isValued ()); + assertTrue ("readonly not empty", readonly.isEmpty ()); + assertTrue ("readonly standalone", !readonly.isStandAlone ()); + // try assigning the value and checking again + createParser (htmls[i]); + parseAndAssertNodeCount (1); + assertTrue ("Node should be an ImageTag", node[0] instanceof ImageTag); + img = (ImageTag)node[0]; + src = img.getAttributeEx ("src"); + alt = img.getAttributeEx ("alt"); + readonly = img.getAttributeEx ("readonly"); + src.setValue ("cgi-bin/redirect"); + assertTrue ("setValue() failed", "src=\"cgi-bin/redirect\"".equals (src.toString ())); + assertTrue ("src whitespace", !src.isWhitespace ()); + assertTrue ("src not valued", src.isValued ()); + assertTrue ("src empty", !src.isEmpty ()); + assertTrue ("src standalone", !src.isStandAlone ()); + alt.setValue ("no image"); + assertTrue ("setValue() failed", "alt=\"no image\"".equals (alt.toString ())); + assertTrue ("alt whitespace", !alt.isWhitespace ()); + assertTrue ("alt not valued", alt.isValued ()); + assertTrue ("alt empty", !alt.isEmpty ()); + assertTrue ("alt standalone", !alt.isStandAlone ()); + readonly.setValue ("true"); // this may be bogus, really need to set assignment too, see below + assertTrue ("setValue() failed", "readonlytrue".equals (readonly.toString ())); + assertTrue ("readonly whitespace", !readonly.isWhitespace ()); + assertTrue ("readonly not valued", readonly.isValued ()); + assertTrue ("readonly empty", !readonly.isEmpty ()); + assertTrue ("readonly standalone", !readonly.isStandAlone ()); + readonly.setAssignment ("="); + assertTrue ("setAssignment() failed", "readonly=true".equals (readonly.toString ())); + assertTrue ("readonly whitespace", !readonly.isWhitespace ()); + assertTrue ("readonly not valued", readonly.isValued ()); 
+ assertTrue ("readonly empty", !readonly.isEmpty ()); + assertTrue ("readonly standalone", !readonly.isStandAlone ()); + } + } + + /** + * see bug #911565 isValued() and isNull() don't work + */ + public void testSetQuote () throws ParserException + { + String html = "<img alt=\"\" src=\"images/third\" toast>"; + + createParser (html); + parseAndAssertNodeCount (1); + assertTrue ("Node should be an ImageTag", node[0] instanceof ImageTag); + ImageTag img = (ImageTag)node[0]; + Attribute src = img.getAttributeEx ("src"); + src.setQuote ('\0'); + assertTrue ("setQuote('\\0') failed", "src=images/third".equals (src.toString ())); + src.setQuote ('\''); + assertTrue ("setQuote('\\'') failed", "src='images/third'".equals (src.toString ())); + } } |
From: <der...@us...> - 2004-02-29 17:07:10
|
Update of /cvsroot/htmlparser/htmlparser/docs In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv3933/docs Modified Files: changes.txt release.txt Log Message: Update version to 1.4-20040229 Index: changes.txt =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/changes.txt,v retrieving revision 1.197 retrieving revision 1.198 diff -C2 -d -r1.197 -r1.198 *** changes.txt 16 Feb 2004 22:46:07 -0000 1.197 --- changes.txt 29 Feb 2004 16:48:28 -0000 1.198 *************** *** 13,16 **** --- 13,78 ---- ******************************************************************************* + Integration Build 1.4 - 20040229 + -------------------------------- + + 2004-02-29 10:09 derrickoswald + + * build.xml, src/doc-files/overview.html, + src/org/htmlparser/parserapplications/StringExtractor.java, + src/org/htmlparser/nodeDecorators/package.html, + src/org/htmlparser/tags/CompositeTag.java: + + Javadoc changes. + Fix the "low hanging fruit" javadoc issues. + + 2004-02-29 09:16 derrickoswald + + * src/org/htmlparser/: scanners/StyleScanner.java, + tags/StyleTag.java, tests/tagTests/StyleTagTest.java: + + Fix bug #900125 Style Tag Children not grouped + Added StyleScanner, a near copy of ScriptScanner. + Added testStyleChildren() in StyleTagTest to check it's operation. + + 2004-02-29 07:52 derrickoswald + + * src/org/htmlparser/: lexer/nodes/RemarkNode.java, + lexer/nodes/StringNode.java, tags/ImageTag.java, tags/LinkTag.java, + tests/ParserTest.java: + + Fix bug #900128 RemarkNode.setText() does not set Text + Add override setText() to StringNode and RemarkNode. + Add unit tests to excercise the new code. + Remove remaining XX_FILTER constants. + + 2004-02-28 20:38 derrickoswald + + * src/org/htmlparser/tags/ScriptTag.java: + + Correct booboo in ScriptTag toHtml() injected by fix to bug #902121. 
+ + 2004-02-28 10:52 derrickoswald + + * src/org/htmlparser/: beans/StringBean.java, filters/package.html, + lexer/nodes/TagNode.java, scanners/ScriptDecoder.java, + scanners/ScriptScanner.java, tags/ScriptTag.java, tags/Tag.java, + tests/ParserTestCase.java, + tests/scannersTests/ScriptScannerTest.java: + + Fix bug #902121 StringBean throws NullPointerException. + Added ScriptDecoder to handle Microsoft Script Encoder encrypted tags. + Added accessor to ScriptTag's scriptCode property to be able to override it. + Ensured that a Tag always has a non-null name. + Skip STYLE tags in StringBean, just like SCRIPT. + + 2004-02-18 07:34 derrickoswald + + * src/org/htmlparser/: lexer/Lexer.java, + tests/lexerTests/LexerTests.java: + + Fix bug #899413 bug in javascript end detection. + Patch submitted by Gernot Fricke handles escaped quotes in strings when + lexing with smartquote turned on. Added test case in LexerTests. + Integration Build 1.4 - 20040216 -------------------------------- Index: release.txt =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/release.txt,v retrieving revision 1.56 retrieving revision 1.57 diff -C2 -d -r1.56 -r1.57 *** release.txt 16 Feb 2004 22:46:08 -0000 1.56 --- release.txt 29 Feb 2004 16:48:28 -0000 1.57 *************** *** 1,3 **** ! HTMLParser Version 1.4 (Integration Build Feb 16, 2004) ********************************************* --- 1,3 ---- ! HTMLParser Version 1.4 (Integration Build Feb 29, 2004) ********************************************* *************** *** 57,60 **** --- 57,62 ---- JDK stack, eliminating most StackOverflow exceptions. Also, a CompositeTag's "startTag()" is "this", and the CompositeTagScanner just adds children. + The ScriptScanner will now decrypt Microsoft Script Encoder encrypted script + tags. The plaintext is available via ScriptTag.getScriptCode(). 
Filters A new powerful filtering capability has been added, which makes extracting *************** *** 67,70 **** --- 69,76 ---- Bug Fixes --------- + 900125 Style Tag Children not grouped + 900128 RemarkNode.setText() does not set Text + 902121 StringBean throws NullPointerException. + 899413 bug in javascript end detection. 891058 Bug in lexer 865279 Documentation |
From: <der...@us...> - 2004-02-29 17:06:56
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv3933/src/org/htmlparser Modified Files: Parser.java Log Message: Update version to 1.4-20040229 Index: Parser.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/Parser.java,v retrieving revision 1.87 retrieving revision 1.88 diff -C2 -d -r1.87 -r1.88 *** Parser.java 16 Feb 2004 22:46:08 -0000 1.87 --- Parser.java 29 Feb 2004 16:48:43 -0000 1.88 *************** *** 88,92 **** */ public final static String ! VERSION_DATE = "Feb 16, 2004" ; --- 88,92 ---- */ public final static String ! VERSION_DATE = "Feb 29, 2004" ; |
From: <der...@us...> - 2004-02-29 17:06:56
|
Update of /cvsroot/htmlparser/htmlparser/docs/docs In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv3933/docs/docs Modified Files: SamplePrograms.html UsingCookiesWithParser.html Added Files: RSSFeeds.html Log Message: Update version to 1.4-20040229 --- NEW FILE: RSSFeeds.html --- <html><head><title>RSSFeeds</title></head><body> <div class="wikitext"> <p><b>Parsing RSS Feeds <p>Out of the box, the parser only understands XML tags that have the same name as HTML tags. So this example: <pre> import org.htmlparser.Parser; import org.htmlparser.filters.NodeClassFilter; import org.htmlparser.tags.TitleTag; import org.htmlparser.util.NodeIterator; import org.htmlparser.util.NodeList; import org.htmlparser.util.ParserException; /* * RSS (RDF Site Summary - formerly called Rich Site Summary) is a method of * describing news or other Web content that is available for "feeding" * (distribution or syndication) from an online publisher to Web users. * RSS is an application of the Extensible Markup Language (XML) that adheres * to the World Wide Web Consortium's Resource Description Framework (RDF). * Originally developed by Netscape for its browser's Netcenter channels, * the RSS specification is now available for anyone to use. 
*/ public class ResourceDescriptionFrameworkSiteSummary { public static void main (String[] args) throws ParserException { Parser parser; NodeList list; parser = new Parser ("http://sourceforge.net/export/rss2_sftopstats.php?feed=mostactive_weekly"); list = parser.extractAllNodesThatMatch (new NodeClassFilter (TitleTag.class)); for (NodeIterator iterator = list.elements (); iterator.hasMoreNodes (); ) System.out.println (iterator.nextNode ().toPlainTextString ()); } } <p>Will only find the TITLE tags, which may be what we want: <pre> Rank 1: Gaim (100% activity) Rank 2: Azureus - BitTorrent Client (99.9934% activity) Rank 3: eGroupWare: Enterprise Collaboration (99.9867% activity) Rank 4: WinMerge (99.9801% activity) Rank 5: phpMyAdmin (99.9735% activity) Rank 6: guliverkli (99.9668% activity) Rank 7: phpGedView (99.9602% activity) Rank 8: AMSN (99.9536% activity) Rank 9: dotproject (99.9469% activity) Rank 10: ScummVM (99.9403% activity) <p>However, with some custom tags defined, it can handle the heirarchy of the XML: <pre> import org.htmlparser.Node; import org.htmlparser.Parser; import org.htmlparser.PrototypicalNodeFactory; import org.htmlparser.filters.NodeClassFilter; import org.htmlparser.tags.CompositeTag; import org.htmlparser.tags.Tag; import org.htmlparser.tags.TitleTag; import org.htmlparser.util.NodeIterator; import org.htmlparser.util.NodeList; import org.htmlparser.util.ParserException; /* * RSS (RDF Site Summary - formerly called Rich Site Summary) is a method of * describing news or other Web content that is available for "feeding" * (distribution or syndication) from an online publisher to Web users. * RSS is an application of the Extensible Markup Language (XML) that adheres * to the World Wide Web Consortium's Resource Description Framework (RDF). * Originally developed by Netscape for its browser's Netcenter channels, * the RSS specification is now available for anyone to use. 
*/ class Item extends CompositeTag { public String[] getIds () { return (new String[] { "ITEM" }); } } class Title extends CompositeTag { public String[] getIds () { return (new String[] { "TITLE" }); } } class Description extends CompositeTag { public String[] getIds () { return (new String[] { "DESCRIPTION" }); } } class Link extends CompositeTag { public String[] getIds () { return (new String[] { "LINK" }); } } class Guid extends CompositeTag { public String[] getIds () { return (new String[] { "GUID" }); } } class PubDate extends CompositeTag { public String[] getIds () { return (new String[] { "PUBDATE" }); } } public class ResourceDescriptionFrameworkSiteSummary { public static void main (String[] args) throws ParserException { Parser parser; PrototypicalNodeFactory factory; NodeList list; Item item; NodeList kids; Node node; Tag tag; String name; parser = new Parser ("http://sourceforge.net/export/rss2_projsummary.php?group_id=24399"); factory = new PrototypicalNodeFactory (true); // empty factory.registerTag (new Item ()); factory.registerTag (new Title ()); factory.registerTag (new Description ()); factory.registerTag (new Link ()); factory.registerTag (new Guid ()); factory.registerTag (new PubDate ()); parser.setNodeFactory (factory); list = parser.extractAllNodesThatMatch (new NodeClassFilter (Item.class)); for (NodeIterator iterator = list.elements (); iterator.hasMoreNodes (); ) { item = (Item)iterator.nextNode (); kids = item.getChildren (); if (null != kids) for (int i = 0; i < kids.size (); i++) { node = kids.elementAt (i); if (node instanceof Tag) { tag = (Tag)node; name = tag.getTagName (); if (name.equals ("TITLE") || name.equals ("DESCRIPTION")) System.out.println (tag.toPlainTextString ()); } } } } } <p>This isn't as pretty as it could be, but you get the idea: <pre> Project name: HTML Parser Project description: HTML Parser is a library, written in Java, which allows you to parse HTML (HTML 4.0 supported). 
It has been used by people on live projects. Developers appreciate how easy it is to use. The architecture is flexible, allowing you to extend it easily. Developers on project: 16 Project administrators: &#60;a href=&#34;http://sourceforge.net/users/derrickoswald/&#34;&#62;derrickoswald&#60;/a&#62;, &#60;a href=&#34;http://sourceforge.net/users/somik/&#34;&#62;somik&#60;/a&#62; Activity percentile (last week): 98.3413% Most recent daily statistics (24 Jan 2004): Ranking: 251, Activity percentile: 98.34%, Downloadable files: 25615 total downloads to date Most recent daily statistics (24 Jan 2004): Download count: 19 Mailing lists (public): 4 Public mailing lists: htmlparser-developer, htmlparser-announce, htmlparser-user, htmlparser-cvs Discussion forums (public): 2, containing 110 messages Public discussion forums: Open Discussion, Help, htmlparser-user, htmlparser-developer Tracker: Bugs (1 open/158 total) Tracker description: Bug Tracking System Tracker: Support Requests (1 open/20 total) Tracker description: Tech Support Tracking System Tracker: Patches (0 open/0 total) Tracker description: Patch Tracking System Tracker: Feature Requests (2 open/10 total) Tracker description: Feature Request Tracking System CVS (8169 commits/809 adds) Most recent daily statistics (24 Jan 2004): Commit count: 0; Add count: 0 &#60;br&#62;&#60;a href=&#34;http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/htmlparser/&#34;&#62;[Web-based access to repository]&#60;/a&#62; <div id="actionbar" class="toolbar"> <hr class="printer" noshade="noshade" /> <p class="editdate">Last edited on Tuesday, January 27, 2004 6:04:21 pm. 
<hr class="toolbar" noshade="noshade" /> </body></html> Index: SamplePrograms.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/docs/SamplePrograms.html,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** SamplePrograms.html 26 Jan 2004 01:02:09 -0000 1.7 --- SamplePrograms.html 29 Feb 2004 16:48:43 -0000 1.8 *************** *** 22,25 **** --- 22,27 ---- <li><a href="JavaBeans.html" class="wiki">JavaBeans</a> + <li><a href="RSSFeeds.html" class="named-wiki" title="RSSFeeds">Parsing RSS Feeds</a> + <li><a href="WebCrawler.html" class="wiki">WebCrawler</a> - ignore this, it's old *************** *** 33,37 **** <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Wednesday, January 7, 2004 6:12:30 pm. <hr class="toolbar" noshade="noshade" /> --- 35,39 ---- <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Tuesday, January 27, 2004 5:25:45 pm. <hr class="toolbar" noshade="noshade" /> Index: UsingCookiesWithParser.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/docs/UsingCookiesWithParser.html,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** UsingCookiesWithParser.html 26 Oct 2003 19:46:17 -0000 1.5 --- UsingCookiesWithParser.html 29 Feb 2004 16:48:43 -0000 1.6 *************** *** 6,10 **** <p><b>Using Cookies with the Parser ! <p><b>Problem: (by <span class="wikiunknown"><u>ShanSivakolundhu) <br /> In order to access a particular site I neet to have --- 6,10 ---- <p><b>Using Cookies with the Parser ! <p><b>Problem: (by ShanSivakolundhu) <br /> In order to access a particular site I neet to have *************** *** 16,20 **** URLConnection.connect(); ! <p><b>Solution: (by <span class="wikiunknown"><u>BobLewis) <br /> In order to send cookies in your Http requests, all --- 16,20 ---- URLConnection.connect(); ! 
<p><b>Solution: (by BobLewis) <br /> In order to send cookies in your Http requests, all *************** *** 22,92 **** URL Connection. ! <p>Generally what I've done is first create a ! HttpURLConnection, create some Cookie objects that are ! needed, and set the HTTP Header using those objects ! (See below for code to format the header value). ! ! <p>Then I'll create the Parser using the URLConnection ! something like this: <pre> ! DefaultHTMLParserFeedback feedback = ! new DefaultHTMLParserFeedback(DefaultHTMLParserFeedback.DEBUG); ! ! HTMLReader reader = null; ! HTMLParser parser = null; ! String charset = HttpUtil.getCharacterSet(urlConn); ! ! InputStreamReader isr = ! new InputStreamReader(urlConn.getInputStream(), charset); ! reader = new HTMLReader(isr, 8192); ! parser = new HTMLParser(reader, feedback); ! <p>The <span class="wikiunknown"><u>HttpUtil.getCharacterSet method used above is ! basically just taken from the method of the same name ! in the HTMLParser class. That method is protected, so ! I had to duplicate it elsewhere. ! <pre> /** ! * set cookies to send in a HttpURLConnection<br> ! * This method should only be called before any ! * parameters are posted ! * and before the connection is made. ! * @param urlConn the HttpURLConnection to send ! * the cookies through ! * @param cookies the cookies to send */ ! public static void postCookies(HttpURLConnection urlConn, Cookie[] cookies) { ! if ((cookies == null) || (cookies.length ==0)) { ! return; ! } ! String[] cookieHeaders = new String[cookies.length]; ! urlConn.setRequestProperty("cookie",generateCookieHeader(cookies)); ! } /** ! * generate a HTTP cookie header value string ! * from an array of cookies ! * @param cookies the cookies which should be set ! * in the header value ! * @return A string containing the HTTP Cookie ! * Header value */ ! private static String generateCookieHeader(Cookie[] cookies) { ! StringBuffer buf = new StringBuffer(); ! for (int i=0; i < cookies.length;i++) { ! 
buf.append(cookies[i].getName()); ! buf.append("="); ! buf.append(cookies[i].getValue()); ! if (i+1 != cookies.length) { ! buf.append("; "); } - else buf.append(" "); } ! return buf.toString(); } --- 22,138 ---- URL Connection. ! <p>Create the URL and open the connection, but before passing ! the connection to the parser, set the "Cookie" request property: <pre> ! import java.net.URL; ! import java.net.URLConnection; ! import javax.servlet.http.Cookie; ! import org.htmlparser.Parser; ! import org.htmlparser.util.NodeIterator; ! /** ! * Demonstrate cookie usage with the HTML Parser. ! */ ! public class CookieDemo ! { /** ! * The cookies. ! * You'll need to get these from your browser's cookie jar or somewhere. ! * Only the cookies that apply to the URL you are using and haven't expired ! * are supposed to be passed in the request. ! * This is only part of a real cookie, much longer than shown. */ ! public static Cookie[] cookies = ! { ! new Cookie ("user", "%2536%2535%2538%2531%2539%2530%253a etc."), ! }; /** ! * Generate a HTTP cookie header value string from an array of cookies. ! * <pre> ! * The syntax for the header is: ! * ! * cookie = "Cookie:" cookie-version ! * 1*((";" | ",") cookie-value) ! * cookie-value = NAME "=" VALUE [";" path] [";" domain] ! * cookie-version = "$Version" "=" value ! * NAME = attr ! * VALUE = value ! * path = "$Path" "=" value ! * domain = "$Domain" "=" value ! * ! * </pre> ! * @param cookies The cookies which should be set in the header value. ! * @return A string containing the HTTP Cookie Header value. ! * @see <a href="http://www.ietf.org/rfc/rfc2109.txt">RFC 2109</a> */ ! public static String generateCookieHeader (Cookie[] cookies) ! { ! int version; ! boolean quote; ! StringBuffer ret; ! ret = new StringBuffer (); ! ! version = 0; ! for (int i = 0; i < cookies.length; i++) ! version = Math.max (version, cookies[i].getVersion ()); ! if (0 != version) ! { ! ret.append ("$Version=\""); ! ret.append (version); ! ret.append ("\""); ! 
} ! for (int i = 0; i < cookies.length; i++) ! { ! if (0 != ret.length ()) ! ret.append ("; "); ! ret.append (cookies[i].getName ()); ! ret.append ("="); ! if (0 != version) ! ret.append ("\""); ! ret.append (cookies[i].getValue ()); ! if (0 != version) ! ret.append ("\""); ! if (0 != version) ! { ! if ((null != cookies[i].getPath ()) ! && (0 != cookies[i].getPath ().length ())) ! { ! ret.append ("; $Path=\""); ! ret.append (cookies[i].getPath ()); ! ret.append ("\""); ! } ! if ((null != cookies[i].getDomain ()) ! && (0 != cookies[i].getDomain ().length ())) ! { ! ret.append ("; $Domain=\""); ! ret.append (cookies[i].getDomain ()); ! ret.append ("\""); ! } } } ! ! return (ret.toString ()); } + public static void main (String[] args) throws Exception + { + Parser parser; + URL url; + URLConnection connection; + + parser = new Parser (); + url = new URL ("http://slashdot.org"); + connection = url.openConnection (); + connection.setRequestProperty ("Cookie", generateCookieHeader (cookies)); + parser.setConnection (connection); + for (NodeIterator iterator = parser.elements (); iterator.hasMoreNodes (); ) + System.out.println (iterator.nextNode ()); + } + } + *************** *** 95,99 **** <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Wednesday, April 2, 2003 3:04:24 pm. <hr class="toolbar" noshade="noshade" /> --- 141,145 ---- <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Monday, January 26, 2004 7:26:47 pm. <hr class="toolbar" noshade="noshade" /> |
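The generateCookieHeader method in the UsingCookiesWithParser.html diff above depends on javax.servlet.http.Cookie, which is not always on the classpath. As a rough, self-contained sketch of the simplest case it handles (version 0 cookies, where no $Version, $Path or $Domain attributes are emitted), the joining logic can be tried with plain name/value pairs. The class name CookieHeaderSketch and the method buildHeader are made up for this illustration and are not part of the HTML Parser sources; the sketch also uses generics, which postdate the original commit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical stand-alone illustration, not part of HTML Parser.
 * Covers only the version 0 case of the committed generateCookieHeader.
 */
public class CookieHeaderSketch
{
    /**
     * Join name/value pairs as "name=value; name2=value2".
     * No quoting and no $Version, $Path or $Domain attributes are added.
     */
    public static String buildHeader (Map<String, String> cookies)
    {
        StringBuilder ret = new StringBuilder ();
        for (Map.Entry<String, String> entry : cookies.entrySet ())
        {
            // separate successive cookies with "; " as the diff above does
            if (0 != ret.length ())
                ret.append ("; ");
            ret.append (entry.getKey ());
            ret.append ("=");
            ret.append (entry.getValue ());
        }
        return (ret.toString ());
    }

    public static void main (String[] args)
    {
        // LinkedHashMap keeps insertion order, so the header order is predictable
        Map<String, String> cookies = new LinkedHashMap<String, String> ();
        cookies.put ("user", "abc");
        cookies.put ("session", "xyz");
        System.out.println (buildHeader (cookies)); // prints "user=abc; session=xyz"
    }
}
```

The resulting string would be passed to URLConnection.setRequestProperty ("Cookie", ...) before the connection is handed to the parser, as the committed main method does.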
From: <der...@us...> - 2004-02-29 15:28:07
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv18375/src/org/htmlparser/tags

Modified Files:
  CompositeTag.java
Log Message:
Javadoc changes. Fix the "low hanging fruit" javadoc issues.

Index: CompositeTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/CompositeTag.java,v
retrieving revision 1.75
retrieving revision 1.76
diff -C2 -d -r1.75 -r1.76
*** CompositeTag.java 25 Jan 2004 21:33:11 -0000 1.75
--- CompositeTag.java 29 Feb 2004 15:09:57 -0000 1.76
***************
*** 238,242 ****
   * sensitive. Otherwise, the search string and the node text are converted
   * to uppercase using the locale provided.
!  * @parem locale The locale for uppercase conversion.
   * @return A collection of nodes whose string contents or
   * representation have the <code>searchString</code> in them.
--- 238,242 ----
   * sensitive. Otherwise, the search string and the node text are converted
   * to uppercase using the locale provided.
!  * @param locale The locale for uppercase conversion.
   * @return A collection of nodes whose string contents or
   * representation have the <code>searchString</code> in them.
|
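The javadoc touched by this diff describes a search that, when not case sensitive, converts both the search string and the node text to uppercase using a supplied locale. A minimal sketch of that comparison follows; it is not the CompositeTag implementation, and the class name LocaleSearchSketch and the method contains are invented for illustration.

```java
import java.util.Locale;

/**
 * Hypothetical illustration of a locale-aware, optionally
 * case-insensitive substring test. Not taken from CompositeTag.
 */
public class LocaleSearchSketch
{
    /**
     * @param text The text to search in.
     * @param search The string to look for.
     * @param caseSensitive If <code>false</code>, both strings are
     * converted to uppercase with the given locale before comparing.
     * @param locale The locale for uppercase conversion.
     * @return <code>true</code> if <code>search</code> occurs in <code>text</code>.
     */
    public static boolean contains (String text, String search, boolean caseSensitive, Locale locale)
    {
        if (!caseSensitive)
        {
            text = text.toUpperCase (locale);
            search = search.toUpperCase (locale);
        }
        return (-1 != text.indexOf (search));
    }

    public static void main (String[] args)
    {
        System.out.println (contains ("Hello World", "WORLD", false, Locale.ENGLISH)); // prints "true"
        System.out.println (contains ("Hello World", "WORLD", true, Locale.ENGLISH)); // prints "false"
    }
}
```

The locale parameter matters because uppercase mappings differ between languages; under a Turkish locale, for example, a lowercase "i" maps to a dotted capital rather than the ASCII "I".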
From: <der...@us...> - 2004-02-29 15:28:07
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/nodeDecorators In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv18375/src/org/htmlparser/nodeDecorators Added Files: package.html Log Message: Javadoc changes. Fix the "low hanging fruit" javadoc issues. --- NEW FILE: package.html --- <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <HTML> <HEAD> <!-- HTMLParser Library $Name: $ - A java-based parser for HTML http://sourceforge.org/projects/htmlparser Copyright (C) 2004 Somik Raha Revision Control Information $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/nodeDecorators/package.html,v $ $Author: derrickoswald $ $Date: 2004/02/29 15:09:57 $ $Revision: 1.1 $ This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version. This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA --> <TITLE>Node Decorators Package</TITLE> </HEAD> <BODY> The nodeDecorators package contains classes that use the Decorator pattern. The nodeDecorators package contains example decorators that alter node behaviour. 
For example, the DecodingNode class overrides the toPlainTextString() method
of all nodes it wraps and applies the Translate class decode() method to the
original node output:
<pre>
StringBuffer content = new StringBuffer (1024);
StringNodeFactory factory = new StringNodeFactory ();
factory.setDecode (true);
createParser ("http://whatever");
parser.setNodeFactory (factory);
NodeIterator iterator = parser.elements ();
while (iterator.hasMoreNodes ())
    content.append (iterator.nextNode ().toPlainTextString ());
System.out.println (content.toString ());
</pre>
Decorators are a powerful way of performing the same operation on every node.
<pre>
</pre>
</BODY>
</HTML>
|
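The package description above says a decorator wraps a node and overrides toPlainTextString() to post-process the wrapped node's output. A stand-alone sketch of that idea follows. Everything in it is hypothetical: TextNode, PlainNode and DecodingDecorator are stand-ins invented for this note, not the library's Node, DecodingNode or Translate classes, and only three character references are decoded.

```java
/**
 * Hypothetical Decorator pattern illustration; none of these
 * types come from the HTML Parser library.
 */
public class DecoratorSketch
{
    /** Stand-in for a parser node that can render itself as text. */
    interface TextNode
    {
        String toPlainTextString ();
    }

    /** A node that simply returns the raw text it was given. */
    static class PlainNode implements TextNode
    {
        private final String text;

        PlainNode (String text)
        {
            this.text = text;
        }

        public String toPlainTextString ()
        {
            return (text);
        }
    }

    /** Wraps any TextNode and decodes a few character references in its output. */
    static class DecodingDecorator implements TextNode
    {
        private final TextNode delegate;

        DecodingDecorator (TextNode delegate)
        {
            this.delegate = delegate;
        }

        public String toPlainTextString ()
        {
            // decode &amp; last so that "&amp;lt;" becomes "&lt;" and not "<"
            return (delegate.toPlainTextString ()
                .replace ("&lt;", "<")
                .replace ("&gt;", ">")
                .replace ("&amp;", "&"));
        }
    }

    /** Convenience helper: wrap raw text in a decorated node and render it. */
    public static String render (String raw)
    {
        return (new DecodingDecorator (new PlainNode (raw)).toPlainTextString ());
    }

    public static void main (String[] args)
    {
        System.out.println (render ("a &lt; b &amp; c")); // prints "a < b & c"
    }
}
```

Because the decorator implements the same interface as the node it wraps, code that iterates over nodes and calls toPlainTextString() needs no changes, which is what lets a node factory apply the decoration to every node it creates.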
From: <der...@us...> - 2004-02-29 15:28:06
|
Update of /cvsroot/htmlparser/htmlparser/src/doc-files In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv18375/src/doc-files Modified Files: overview.html Log Message: Javadoc changes. Fix the "low hanging fruit" javadoc issues. Index: overview.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/doc-files/overview.html,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** overview.html 31 Jan 2004 16:31:20 -0000 1.2 --- overview.html 29 Feb 2004 15:09:56 -0000 1.3 *************** *** 11,17 **** --- 11,19 ---- <h2>Components</h2> The HTML Parser distribution is composed of: + <ul> <li>a low level {@link org.htmlparser.lexer.Lexer lexer} that converts characters into tags</li> <li>a high level {@link org.htmlparser.Parser parser} that provides a heirarchical document view</li> <li>several example applications</li> + </ul> <p> <h2>Building</h2> *************** *** 33,41 **** summarizes the purpose and target issues for each list. <ul> ! <li><A HREF="http://sourceforge.net/pm/task.php?group_project_id=21604&group_id=24399&func=browse"> Applications</A> - Work associated with the sample applications included with the HTML Parser download is tracked by this list. This would also include proposals for other example applications.</li> ! <li><A HREF="http://sourceforge.net/pm/task.php?group_project_id=21648&group_id=24399&func=browse"> Release</A> - Work to be done before a major release is tracked by this list. Items included here must be resolved before the major release is considered --- 35,43 ---- summarizes the purpose and target issues for each list. <ul> ! <li><A HREF="http://sourceforge.net/pm/task.php?group_project_id=21604&group_id=24399&func=browse" target="_top"> Applications</A> - Work associated with the sample applications included with the HTML Parser download is tracked by this list. This would also include proposals for other example applications.</li> ! 
<li><A HREF="http://sourceforge.net/pm/task.php?group_project_id=21648&group_id=24399&func=browse" target="_top"> Release</A> - Work to be done before a major release is tracked by this list. Items included here must be resolved before the major release is considered *************** *** 44,48 **** or scalability enhancements, memory usage issues and other 'quality' issues that are not associated with a specific bug.</li> ! <li><A HREF="http://sourceforge.net/pm/task.php?group_project_id=21601&group_id=24399&func=browse"> API</A> - Work needed to enhance or fix the parser API is tracked by this list. Standards compliance, additional classes, method signatures, changes to --- 46,50 ---- or scalability enhancements, memory usage issues and other 'quality' issues that are not associated with a specific bug.</li> ! <li><A HREF="http://sourceforge.net/pm/task.php?group_project_id=21601&group_id=24399&func=browse" target="_top"> API</A> - Work needed to enhance or fix the parser API is tracked by this list. Standards compliance, additional classes, method signatures, changes to *************** *** 51,55 **** should be limited to those changes that could impact the developer community that relies on existing behaviour from the parser.</li> ! <li><A HREF="http://sourceforge.net/pm/task.php?group_project_id=21602&group_id=24399&func=browse"> Documentation</A> - Work associated with documenting the parser and it's example code and sample applications is tracked by this list. Javadocs, the --- 53,57 ---- should be limited to those changes that could impact the developer community that relies on existing behaviour from the parser.</li> ! <li><A HREF="http://sourceforge.net/pm/task.php?group_project_id=21602&group_id=24399&func=browse" target="_top"> Documentation</A> - Work associated with documenting the parser and it's example code and sample applications is tracked by this list. Javadocs, the *************** *** 59,64 **** </ul> <p> ! 
The <A HREF="http://sourceforge.net/tracker/?group_id=24399&atid=381402">Request ! For Enhancement</A> list contains items that are proposed for future versions of the parser. Users may add to this list what they feel are extensions beyond simple bug fixing. Some user entered bugs are also transferred to this list if --- 61,66 ---- </ul> <p> ! The <A HREF="http://sourceforge.net/tracker/?group_id=24399&atid=381402" target="_top"> ! Request For Enhancement</A> list contains items that are proposed for future versions of the parser. Users may add to this list what they feel are extensions beyond simple bug fixing. Some user entered bugs are also transferred to this list if |
From: <der...@us...> - 2004-02-29 15:28:06
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv18375/src/org/htmlparser/parserapplications

Modified Files:
  StringExtractor.java
Log Message:
Javadoc changes. Fix the "low hanging fruit" javadoc issues.

Index: StringExtractor.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications/StringExtractor.java,v
retrieving revision 1.46
retrieving revision 1.47
diff -C2 -d -r1.46 -r1.47
*** StringExtractor.java 2 Jan 2004 16:24:54 -0000 1.46
--- StringExtractor.java 29 Feb 2004 15:09:56 -0000 1.47
***************
*** 30,33 ****
--- 30,39 ----
  import org.htmlparser.util.ParserException;

+ /**
+  * Extract plaintext strings from a web page.
+  * Illustrative program to gather the textual contents of a web page.
+  * Uses a {@link org.htmlparser.beans.StringBean StringBean} to accumulate
+  * the user visible text (what a browser would display) into a single string.
+  */
  public class StringExtractor
  {
|
From: <der...@us...> - 2004-02-29 15:28:05
|
Update of /cvsroot/htmlparser/htmlparser In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv18375 Modified Files: build.xml Log Message: Javadoc changes. Fix the "low hanging fruit" javadoc issues. Index: build.xml =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/build.xml,v retrieving revision 1.59 retrieving revision 1.60 diff -C2 -d -r1.59 -r1.60 *** build.xml 9 Feb 2004 02:09:43 -0000 1.59 --- build.xml 29 Feb 2004 15:09:56 -0000 1.60 *************** *** 345,355 **** <taglet name="HtmlTaglet" path="${resources}:${src}"/> <group title="Main Package" packages="org.htmlparser"/> ! <group title="Example Applications" packages="org.htmlparser.parserapplications"/> <group title="Tags" packages="org.htmlparser.tags,org.htmlparser.tags.data"/> <group title="Lexer" packages="org.htmlparser.lexer,org.htmlparser.lexer.nodes"/> <group title="Scanners" packages="org.htmlparser.scanners"/> <group title="Beans" packages="org.htmlparser.beans"/> ! <group title="Visitors" packages="org.htmlparser.visitors"/> ! <group title="Utility Packages (of developer interest only)" packages="org.htmlparser.util,org.htmlparser.util.sort"/> <link href="http://java.sun.com/j2se/1.4.2/docs/api/"/> </javadoc> --- 345,355 ---- <taglet name="HtmlTaglet" path="${resources}:${src}"/> <group title="Main Package" packages="org.htmlparser"/> ! <group title="Example Applications" packages="org.htmlparser.parserapplications,org.htmlparser.lexerapplications.tabby,org.htmlparser.lexerapplications.thumbelina"/> <group title="Tags" packages="org.htmlparser.tags,org.htmlparser.tags.data"/> <group title="Lexer" packages="org.htmlparser.lexer,org.htmlparser.lexer.nodes"/> <group title="Scanners" packages="org.htmlparser.scanners"/> <group title="Beans" packages="org.htmlparser.beans"/> ! <group title="Patterns" packages="org.htmlparser.visitors,org.htmlparser.nodeDecorators,org.htmlparser.filters"/> ! 
<group title="Utility" packages="org.htmlparser.util,org.htmlparser.util.sort"/> <link href="http://java.sun.com/j2se/1.4.2/docs/api/"/> </javadoc> |