From: <dhaval.h.udani@or...> - 2002-12-12 13:52:30
|
Hi,

I agree with the views expressed here and agree on the 2 basic requirements as pointed out here:

> 1. return the original HTML

I think as per the rest of the parser this activity should be done using the toHTML() method.

> 2. return the text appearing within it that is not a default part of the tag

This should be done with the toPlainTextString() method.

Apart from this I haven't understood much of this thread.

Bye,
Dhaval |
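As a rough illustration of the two-method contract discussed in this thread, here is a minimal sketch. The names NodeContractDemo, SketchNode and SketchRemark are hypothetical and exist only for this example; they are not HTMLParser classes, and the real HTMLRemarkNode may behave differently.

```java
// Hypothetical sketch: NodeContractDemo, SketchNode and SketchRemark are
// illustrative names only, not classes from HTMLParser.
class NodeContractDemo {
    public static void main(String[] args) {
        SketchNode remark = new SketchRemark(" a comment ");
        System.out.println(remark.toHTML());            // the markup as written
        System.out.println(remark.toPlainTextString()); // only the remark text
    }
}

/** The two default views of a node that the thread asks for. */
interface SketchNode {
    /** Returns the original markup of the node. */
    String toHTML();

    /** Returns the text inside the node, without the tag syntax. */
    String toPlainTextString();
}

/** A remark (comment) node that keeps the text between the delimiters. */
final class SketchRemark implements SketchNode {
    private final String body;

    SketchRemark(String body) {
        this.body = body;
    }

    @Override
    public String toHTML() {
        return "<!--" + body + "-->";
    }

    @Override
    public String toPlainTextString() {
        return body;
    }
}
```

Under this assumed design, any pretty-printed or debugging form would live in a separate method, so neither default method has to add a label such as "Comment Tag :".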
From: Sam Joseph <gaijin@yh...> - 2002-12-12 13:40:21
|
Hi Somik,

Somik Raha wrote:

>>Thanks for the help. I think I would like to see the
>>toPlainTextString() method remain. Although I'm not quite sure of the
>>difference between HTMLRemarkNode.toString and
>>HTMLRemarkNode.toPlainTextString.
>
>This is actually based on your suggestion (eons back..) -
>toPlainTextString() is the uniform way of getting string representation of a
>page - meaningful and hopefully semantic data. I think you'd probably want
>to use toPlainTextString() instead of toString() - as toString() always
>gives some output for all the tags, while toPlainTextString() works only for
>specific ones like string nodes, link text and strings inside forms. It was
>also enabled earlier for comments, but was taken out last week. I am
>thinking of putting it back in. What this will mean is that if folks have
>commented tags - you will get that sort of data in your string filter. I
>think you can live with that (?)
>
>Also - I am thinking of a better approach - wherein, should one require pure
>strings within a comment, one could create a new parser, that operates on
>the contents of the string node (it would be an interesting approach to
>try..)

I'm not sure that I'm following you. But then it's late here ....

It would seem that, whatever other considerations there might be, one would want to have some method on HTMLRemarkNode that allows you to grab the pure unadulterated text of the remark without anything else. The HTMLRemarkNode.toString() method I'm using now seems to be prepending the string "Comment Tag :" to the string that is returned.

It's nice to have convenience methods to pretty print things. But shouldn't the two default methods on any node be to:

1. return the original HTML
2. return the text appearing within it that is not a default part of the tag

Naturally there will be variation depending on the node, but it seems odd to have prettified print responses as the default (maybe they're not and I'm just getting confused) - ideally they would be called with a parameter or special method like prettyPrint().

I'm not sure what the downside is to having a toPlainTextString() call in the HTMLRemarkNode. Remember I don't have such a wonderful understanding of the HTMLParser itself. For example I'm not sure what you mean when you say that the remark text data would appear in your string filter. I'm not sure what a string filter is ...

At the moment it seems I have to explicitly check for HTMLRemarkNodes and then process them if I want to ....

CHEERS> SAM |
From: Somik Raha <somik@ya...> - 2002-12-12 05:21:26
|
Hi Sam,

> Also, I solved my problem with the debugging output. The problem was
> with the code I was using to output the final data. The print() command
...

Oops..

> Thanks for the help. I think I would like to see the
> toPlainTextString() method remain. Although I'm not quite sure of the
> difference between HTMLRemarkNode.toString and
> HTMLRemarkNode.toPlainTextString.

This is actually based on your suggestion (eons back..) - toPlainTextString() is the uniform way of getting string representation of a page - meaningful and hopefully semantic data. I think you'd probably want to use toPlainTextString() instead of toString() - as toString() always gives some output for all the tags, while toPlainTextString() works only for specific ones like string nodes, link text and strings inside forms. It was also enabled earlier for comments, but was taken out last week. I am thinking of putting it back in. What this will mean is that if folks have commented tags - you will get that sort of data in your string filter. I think you can live with that (?)

Also - I am thinking of a better approach - wherein, should one require pure strings within a comment, one could create a new parser, that operates on the contents of the string node (it would be an interesting approach to try..)

Regards,
Somik

----- Original Message -----
From: "Sam Joseph" <gaijin@...>
To: <htmlparser-developer@...>
Sent: Wednesday, December 11, 2002 8:05 PM
Subject: Re: [Htmlparser-developer] HTML Comments/Remarks

> Hi Somik,
>
> Thanks for the help. I think I would like to see the
> toPlainTextString() method remain. Although I'm not quite sure of the
> difference between HTMLRemarkNode.toString and
> HTMLRemarkNode.toPlainTextString.
>
> Trying out both in my code I see that toPlainTextString() seems to
> generate a blank while toString() gives me the contents of the
> remark/comment. To be specific about my objectives, I'm trying to
> handle meta-data by the creative commons group which currently involved
> placing a big chunk of rdf/xml in a remark within the page. I'm very
> much hoping to be able to extract that comment verbatim and then pass it
> over to my rdf/xml parser.
>
> I'll be happy as long as I can achieve that.
>
> Also, I solved my problem with the debugging output. The problem was
> with the code I was using to output the final data. The print() command
> was being called on links and meta-tags, and the way that ant formatted
> things it made it look like the associated System.out calls were being
> made during the parsing process rather than at the end. Sorry about
> that, all fixed now, so don't worry about looking at the code that I
> sent you in my previous email.
>
> Thanks again for all your help. I'm looking forward to fully
> integrating HTMLParser with NeuroGrid over the next two days.
>
> CHEERS> SAM |
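The trade-off described above (comment text leaking into a page's plain-text form) can be sketched as follows. This is only an illustration under assumed names: PageTextDemo, its Node class and the includeRemarks flag are hypothetical and are not part of HTMLParser.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names come from HTMLParser.
class PageTextDemo {
    /** A stand-in node: kind is "text", "tag" or "remark"; content is its raw text. */
    static final class Node {
        final String kind;
        final String content;

        Node(String kind, String content) {
            this.kind = kind;
            this.content = content;
        }
    }

    /** Concatenates text nodes, and remark bodies too when includeRemarks is set. */
    static String plainText(List<Node> nodes, boolean includeRemarks) {
        StringBuilder sb = new StringBuilder();
        for (Node n : nodes) {
            boolean isText = n.kind.equals("text");
            boolean isWantedRemark = includeRemarks && n.kind.equals("remark");
            if (isText || isWantedRemark) {
                sb.append(n.content);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Node> page = Arrays.asList(
            new Node("tag", "<p>"),
            new Node("text", "Hello"),
            new Node("tag", "</p>"),
            new Node("remark", "<a href=\"old.html\">old link</a>"));
        System.out.println(plainText(page, false)); // visible text only
        System.out.println(plainText(page, true));  // commented-out markup leaks in
    }
}
```

With the flag set, the commented-out link markup ends up in the page text, which is the interference that motivated removing the remark implementation; with it cleared, remark content has to be fetched some other way.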
From: Somik Raha <somik@ya...> - 2002-12-12 05:15:28
|
Hi Sam,

The parse() is not being called, but the print() method is. From three places:

NeuroGridHTMLParserTest.printLinks()
NeuroGridHTMLParserTest.printMetaTags()
NeuroGridHTMLParser.searchForSummaryContents()

If you mask this, output will be as you desire.

Regards,
Somik

----- Original Message -----
From: "Sam Joseph" <gaijin@...>
To: <htmlparser-developer@...>
Sent: Wednesday, December 11, 2002 3:41 PM
Subject: [Htmlparser-developer] Re: Htmlparser-developer digest, Vol 1 #136 - 3 msgs

> Hi Somik,
>
> Sorry that my mails are not attaching to the thread properly. I'm on
> digest, so when I reply to the digest message I think a new thread gets
> automatically started, and the sourceforge mail interface doesn't let me
> reply directly to your messages.
>
> Thanks for your suggestion below. As far as I can see from the code the
> parse method on HTMLParser is not being called. In fact it uses exactly
> the thing you describe in your mail. I didn't really write this code.
> It's still basically the NeuroGridHTMLParser that you wrote a while
> back, modified into my coding format.
>
> Please find the code appended to this email. Both the links I have been
> parsing are specified in the NeuroGridHTMLParserTest.java file.
>
> Thanks in advance.
>
> CHEERS> SAM
>
> Somik wrote:
>
> >Sorry, I just saw your other mail again with the
> >output. I see the problem -
> >
> >You must be calling the parse method in
> >HTMLParser.java. That is only a demo. As mentioned in
> >the docs, you should be doing something like :
> >
> >for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
> >  HTMLNode node = e.nextHTMLNode();
> >  // create summary here
> >}
> >
> >The call to parse has the printing stuff which prints
> >all the details of the nodes (calling node.print()).
> >
> >If this does not help, can you post your complete
> >parsing program ?
>
> --------------------------------------------------------------------------------
> /*
>  * (c) Copyright 2001 MyCorporation.
>  * All Rights Reserved.
>  */
> package com.neurogrid.parser;
> /**
>  * @version 1.0
>  * @author
>  */
> public class Summary {
>   private String heading;
>   private String contents;
>   /**
>    * Constructor for Summary.
>    */
>   public Summary(String heading, String contents) {
>     this.heading = heading;
>     this.contents = contents;
>   }
>
>   /**
>    * Gets the heading.
>    * @return Returns a String
>    */
>   public String getHeading() {
>     return heading;
>   }
>
>   /**
>    * Sets the heading.
>    * @param heading The heading to set
>    */
>   public void setHeading(String heading) {
>     this.heading = heading;
>   }
>
>   /**
>    * Gets the contents.
>    * @return Returns a String
>    */
>   public String getContents() {
>     return contents;
>   }
>
>   /**
>    * Sets the contents.
>    * @param contents The contents to set
>    */
>   public void setContents(String contents) {
>     this.contents = contents;
>   }
>
>   public String toString() {
>     String retString;
>     if (heading.length()>0) retString = heading+"\n"+contents;
>     else retString = contents;
>     return retString;
>   }
> }
> --------------------------------------------------------------------------------
> package com.neurogrid.parser;
>
> /*
>  * Copyright (C) 2000 NeuroGrid <sam@...>
>  *
>  * This program is free software; you can redistribute it and/or
>  * modify it under the terms of the GNU General Public License
>  * as published by the Free Software Foundation; either version 2
>  * of the License, or (at your option) any later version.
>  *
>  * This program is distributed in the hope that it will be useful,
>  * but WITHOUT ANY WARRANTY; without even the implied warranty of
>  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>  * GNU General Public License for more details.
>  *
>  * You should have received a copy of the GNU General Public License
>  * along with this program; if not, write to the Free Software
>  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
>  *
>  * You may find further details about this software at
>  * http://www.neurogrid.net/
>  */
>
> import junit.framework.*;
>
> // Import log4j classes.
> import org.apache.log4j.Category;
> import org.apache.log4j.BasicConfigurator;
> import org.apache.log4j.PropertyConfigurator;
>
> import org.htmlparser.*;
> import org.htmlparser.tags.*;
> import org.htmlparser.scanners.*;
> import org.htmlparser.util.*;
> import java.util.Enumeration;
> import java.util.Vector;
>
> /**
>  * @version 1.0
>  * @author
>  */
> public class NeuroGridHTMLParser
> {
>   private static final String cvsInfo = "$Id:$";
>   public static String getCvsInfo()
>   {
>     return cvsInfo;
>   }
>
>   private static Category o_cat = Category.getInstance(NeuroGridHTMLParser.class.getName());
>
>   /**
>    * initialize the logging system
>    *
>    * @param p_conf configuration filename
>    */
>   public static void init(String p_conf)
>   {
>     BasicConfigurator.configure();
>     PropertyConfigurator.configure(p_conf);
>     o_cat.info("NeuroGridHTMLParser logging Initialized");
>   }
>
>   private String o_url;
>   private String o_full_text;
>   private Vector o_meta_tags;
>   private Vector o_link_tags;
>   private Summary o_summary;
>   private StringBuffer o_summary_heading;
>   private StringBuffer o_summary_contents;
>   private HTMLParser o_parser = null;
>   private boolean o_h1_tag_found = false;
>   private boolean o_start_summary_search = false;
>   private int o_summary_count = 0;
>
>
>   /**
>    * This constructor is only to enable test cases.
>    * For clients, pls use NeuroGridHTMLParser(String)
>    * or NeuroGridHTMLParser(String,boolean)
>    *
>    * @param p_parser
>    */
>   public NeuroGridHTMLParser(HTMLParser p_parser)
>     throws Exception
>   {
>     this("",false);
>     o_parser = p_parser;
>   }
>
>   /**
>    *
>    * @param p_url
>    */
>   public NeuroGridHTMLParser(String p_url)
>     throws Exception
>   {
>     this(p_url,true);
>   }
>
>   /**
>    *
>    * @param p_url
>    * @param p_start_parsing
>    */
>   public NeuroGridHTMLParser(String p_url, boolean p_start_parsing)
>     throws Exception
>   {
>     o_url = p_url;
>     o_meta_tags = new Vector();
>     o_link_tags = new Vector();
>     o_summary_heading = new StringBuffer();
>     o_summary_contents = new StringBuffer();
>     if (p_start_parsing) parse();
>   }
>
>   private class BlankHTMLParserFeedback
>     implements HTMLParserFeedback
>   {
>     public void info(String message)
>     {
>       //System.out.println("INFO: " + message);
>     }
>
>     public void warning(String message)
>     {
>       //System.out.println("WARNING: " + message);
>     }
>
>     public void error(String message, HTMLParserException e)
>     {
>       //System.out.println("ERROR: " + message);
>       e.printStackTrace();
>     }
>   }
>
>
>
>   /**
>    * parse the page
>    */
>   public final void parse()
>     throws Exception
>   {
>     if (o_parser==null)
>       o_parser = new HTMLParser(o_url, new BlankHTMLParserFeedback());
>
>     o_parser.addScanner(new HTMLMetaTagScanner("-t"));
>     o_parser.addScanner(new HTMLLinkScanner("-l"));
>     o_parser.addScanner(new HTMLTitleScanner("-a"));
>     parseURLForData();
>     o_summary = createSummary();
>   }
>
>   /**
>    * parse the URL for data
>    */
>   private void parseURLForData()
>     throws Exception
>   {
>     HTMLNode x_node;
>     for (HTMLEnumeration e = o_parser.elements();e.hasMoreNodes();)
>     {
>       x_node = e.nextHTMLNode();
>       checkForTitle(x_node);
>       checkForMetaTag(x_node);
>       checkForLinkTag(x_node);
>       checkForTag(x_node);
>       if(o_h1_tag_found == true)
>       {
>         o_h1_tag_found = processH1Tag(x_node);
>       }
>       else
>       {
>         if (o_start_summary_search)
>         {
>           searchForSummaryContents(x_node);
>         }
>         addToFullText(x_node);
>       }
>
>     }
>   }
>
>   /**
>    * parse the URL for data
>    *
>    * @param HTMLNode
>    */
>   protected void checkForTitle(HTMLNode p_node)
>   {
>     if(p_node instanceof HTMLTitleTag)
>     {
>       String x_title = ((HTMLTitleTag)p_node).getTitle();
>       o_cat.debug("appending title: " + x_title);
>       // I think it would be better to do one or the other of H1 and title.
>       //FIXXXXXXXXX
>       o_summary_heading.append(x_title+"\n");
>     }
>   }
>
>   /**
>    * add this nodes text to the full text
>    *
>    * @param HTMLNode
>    */
>   private void addToFullText(HTMLNode p_node)
>   {
>     if(p_node instanceof HTMLStringNode)
>     {
>       o_full_text += ((HTMLStringNode)p_node).getText();
>     }
>   }
>
>   /**
>    * search for summary contents
>    *
>    * @param HTMLNode
>    */
>   private void searchForSummaryContents(HTMLNode p_node)
>   {
>     if(p_node instanceof HTMLStringNode)
>     {
>       //o_cat.debug("*** SEARCHING FOR SUMMARY ***");
>       p_node.print();
>       String x_contents = ((HTMLStringNode)p_node).getText();
>       if(x_contents.length()>0 && isAlphabetical(x_contents) && !isEmpty(x_contents))
>       {
>         //o_cat.debug("x_contents = "+x_contents);
>         o_summary_count++;
>         o_summary_contents.append(x_contents+"\n");
>         if(o_summary_count==2)
>         {
>           o_start_summary_search=false;
>         }
>       }
>     }
>   }
>
>   /**
>    * check if this string is just spaces
>    *
>    * @param p_text
>    *
>    * @return boolean
>    */
>   private boolean isEmpty(String p_text)
>   {
>     boolean x_empty = true;
>     for (int i=0;i<p_text.length();i++)
>     {
>       if (p_text.charAt(i) != ' ')
>       {
>         x_empty = false;
>       }
>     }
>     return x_empty;
>   }
>
>   /**
>    * check if this string is alphabetical
>    *
>    * @param p_text
>    *
>    * @return boolean
>    */
>   private boolean isAlphabetical(String p_text)
>   {
>     char x_ch;
>     p_text = p_text.toUpperCase();
>     boolean x_return = true;
>     for(int i=0;i<p_text.length();i++)
>     {
>       x_ch = p_text.charAt(i);
>       if (!((x_ch>='A' && x_ch <='Z')|| (x_ch==' ' || x_ch=='.'
>           || x_ch==',')))
>       {
>         x_return =false;
>       }
>     }
>     return x_return;
>   }
>
>   /**
>    * check for a tag
>    *
>    * @param p_node
>    */
>   private void checkForTag(HTMLNode p_node)
>   {
>     if(p_node instanceof HTMLTag)
>     {
>       HTMLTag x_tag = (HTMLTag)p_node;
>       checkForH1Tag(x_tag);
>       checkForBodyTag(x_tag);
>     }
>   }
>
>   /**
>    * check for a body tag
>    *
>    * @param p_node
>    */
>   private void checkForBodyTag(HTMLTag p_tag)
>   {
>     if(p_tag.getText().toUpperCase().indexOf("BODY")!=-1)
>     {
>       o_start_summary_search = true;
>     }
>   }
>
>   /**
>    * check for an H1 tag
>    *
>    * @param p_node
>    */
>   private void checkForH1Tag(HTMLTag tag)
>   {
>     if (tag.getText().toUpperCase().equals("H1"))
>     {
>       o_h1_tag_found = true;
>     }
>   }
>
>   /**
>    * check for a meta tag
>    *
>    * @param p_node
>    */
>   private void checkForMetaTag(HTMLNode p_node)
>   {
>     HTMLMetaTag x_meta_tag;
>     if(p_node instanceof HTMLMetaTag)
>     {
>       x_meta_tag = (HTMLMetaTag) p_node;
>       o_meta_tags.addElement(x_meta_tag);
>     }
>   }
>
>   /**
>    * check for a link tag
>    *
>    * @param p_node
>    */
>   private void checkForLinkTag(HTMLNode p_node)
>   {
>     HTMLLinkTag x_link_tag;
>     if(p_node instanceof HTMLLinkTag)
>     {
>       x_link_tag = (HTMLLinkTag)p_node;
>       o_link_tags.addElement(x_link_tag);
>     }
>   }
>
>   /**
>    * process an H1 tag
>    *
>    * @param p_node
>    *
>    * @return boolean
>    */
>   private boolean processH1Tag(HTMLNode p_node)
>   {
>     boolean x_h1_tag_found = true;
>     if(p_node instanceof HTMLStringNode)
>     {
>       o_summary_heading.append(((HTMLStringNode)p_node).getText());
>       o_cat.debug("appending title: " + ((HTMLStringNode)p_node).getText());
>       // I think it would be better to do one or the other of H1 and title.
>       //FIXXXXXXXXX
>     }
>     if(p_node instanceof HTMLEndTag)
>     {
>       HTMLEndTag x_end_tag =(HTMLEndTag)p_node;
>       //o_cat.debug("x_end_tag.toString(): " + x_end_tag.toString());
>       //o_cat.debug("x_end_tag.toHTML(): " + x_end_tag.toHTML());
>       //o_cat.debug("x_end_tag.toPlainTextString(): " + x_end_tag.toPlainTextString());
>       //o_cat.debug("x_end_tag.getTagName(): " + x_end_tag.getTagName());
>       //o_cat.debug("x_end_tag.getText(): " + x_end_tag.getText());
>       if(x_end_tag.getTagName().toUpperCase().equals("H1"))
>       {
>         x_h1_tag_found = false;
>       }
>     }
>     return x_h1_tag_found;
>   }
>
>
>
>   /**
>    * get the Summary
>    *
>    * @return Summary
>    */
>   public Summary getSummary()
>   {
>     return o_summary;
>   }
>
>   /**
>    * get the Full text
>    *
>    * @return String
>    */
>   public String getFullText()
>   {
>     return o_full_text;
>   }
>
>   /**
>    * get a vector of the links
>    *
>    * @return Vector
>    */
>   public Vector links()
>   {
>     return o_link_tags;
>   }
>
>   /**
>    * get a vector of meta tags
>    *
>    * @return Vector
>    */
>   public Vector metaTags()
>   {
>     return o_meta_tags;
>   }
>
>   /**
>    * create a summary
>    *
>    * @return Summary
>    */
>   private Summary createSummary()
>   {
>     return new Summary(o_summary_heading.toString(),o_summary_contents.toString());
>   }
>
>
>   /**
>    * main
>    *
>    * @param args
>    */
>   public static void main(String[] args)
>   {
>     try
>     {
>       if (args.length==0)
>       {
>         o_cat.debug("Syntax:");
>         o_cat.debug("java -jar neuroparser.jar URL");
>         System.exit(-1);
>       }
>       o_cat.debug("Parsing "+args[0]+"..");
>       o_cat.debug("");
>       NeuroGridHTMLParser parser = new NeuroGridHTMLParser(args[0]);
>       o_cat.debug("Printing links from "+args[0]);
>       o_cat.debug("");
>
>       printLinks(parser);
>       printMetaTags(args, parser);
>       printSummary(parser);
>       printFullText(parser);
>     }
>     catch(Exception e)
>     {e.printStackTrace();}
>   }
>
>   public static void printSummary(NeuroGridHTMLParser parser)
>   {
>     o_cat.debug("");
>     o_cat.debug("Summary");
>     o_cat.debug("-------");
>     o_cat.debug(parser.getSummary());
>     o_cat.debug("");
>   }
>
>   public static void printFullText(NeuroGridHTMLParser parser)
>   {
>     o_cat.debug("");
>     o_cat.debug("Full Text");
>     o_cat.debug("-------");
>     o_cat.debug(parser.getFullText());
>     o_cat.debug("");
>   }
>
>   public static void printMetaTags(String[] args, NeuroGridHTMLParser parser)
>   {
>     HTMLMetaTag metaTag;
>     o_cat.debug("");
>     o_cat.debug("Printing metaTags from "+args[0]);
>     o_cat.debug("");
>     for(Enumeration e = parser.metaTags().elements();e.hasMoreElements();)
>     {
>       metaTag = (HTMLMetaTag)e.nextElement();
>       metaTag.print();
>     }
>   }
>
>   public static void printLinks(NeuroGridHTMLParser parser)
>   {
>     HTMLLinkTag link;
>     for(Enumeration e =parser.links().elements();e.hasMoreElements();)
>     {
>       link = (HTMLLinkTag)e.nextElement();
>       link.print();
>     }
>   }
> }
> --------------------------------------------------------------------------------
> package com.neurogrid.parser;
>
> /*
>  * Copyright (C) 2000 NeuroGrid <sam@...>
>  *
>  * This program is free software; you can redistribute it and/or
>  * modify it under the terms of the GNU General Public License
>  * as published by the Free Software Foundation; either version 2
>  * of the License, or (at your option) any later version.
>  *
>  * This program is distributed in the hope that it will be useful,
>  * but WITHOUT ANY WARRANTY; without even the implied warranty of
>  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>  * GNU General Public License for more details.
>  *
>  * You should have received a copy of the GNU General Public License
>  * along with this program; if not, write to the Free Software
>  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
>  *
>  * You may find further details about this software at
>  * http://www.neurogrid.net/
>  */
>
> import junit.framework.*;
>
> // Import log4j classes.
> import org.apache.log4j.Category;
> import org.apache.log4j.BasicConfigurator;
> import org.apache.log4j.PropertyConfigurator;
>
> import org.htmlparser.*;
> import org.htmlparser.tags.*;
> import org.htmlparser.scanners.*;
> import java.util.Enumeration;
> import java.util.Vector;
>
>
> /**
>  * @version 1.0
>  * @author
>  */
> public class NeuroGridHTMLParserTest
>   extends TestCase
> {
>   private static final String cvsInfo = "$Id:$";
>   public static String getCvsInfo()
>   {
>     return cvsInfo;
>   }
>
>   private static Category o_cat = Category.getInstance(NeuroGridHTMLParserTest.class.getName());
>
>   /**
>    * initialize the logging system
>    *
>    * @param p_conf configuration filename
>    */
>   public static void init(String p_conf)
>   {
>     BasicConfigurator.configure();
>     PropertyConfigurator.configure(p_conf);
>     o_cat.info("NeuroGridHTMLParserTest logging Initialized");
>   }
>
>   public static void main(String[] args)
>   {
>     NeuroGridHTMLParserTest.start();
>     NeuroGridHTMLParserTest.init(args[0]);
>     NeuroGridHTMLParserTest.testStuff();
>   }
>
>   /**
>    * Subclasses must invoke this from their constructor.
>    */
>   public NeuroGridHTMLParserTest(String p_name)
>   {
>     super(p_name);
>   }
>
>   protected void setUp()
>   {
>     start();
>   }
>
>   protected static void start()
>   {
>     try
>     {
>       NeuroGridHTMLParserTest.init("conf/log4j.properties");
>       NeuroGridHTMLParser.init("conf/log4j.properties");
>     }
>     catch(Exception e){e.printStackTrace();}
>   }
>
>   /**
>    * test some stuff
>    */
>   public static void testStuff()
>   {
>     try
>     {
>       // String x_url = "http://belle.designwest.com/examples/test04b.html";
>       String x_url = "http://home.att.ne.jp/red/gaijin/tribal-hardware/index.htm";
>
>       o_cat.debug("Parsing "+x_url+"..");
>       o_cat.debug("");
>       NeuroGridHTMLParser parser = new NeuroGridHTMLParser(x_url);
>       o_cat.debug("Printing links from "+x_url);
>       o_cat.debug("");
>
>       printLinks(parser);
>       printMetaTags(x_url, parser);
>       printSummary(parser);
>       printFullText(parser);
>     }
>     catch(Exception e)
>     {e.printStackTrace();}
>   }
>
>
>   public static void printSummary(NeuroGridHTMLParser parser)
>   {
>     o_cat.debug("");
>     o_cat.debug("Summary");
>     o_cat.debug("-------");
>     o_cat.debug(parser.getSummary().getHeading());
>     o_cat.debug("-------");
>     o_cat.debug(parser.getSummary().getContents());
>     o_cat.debug("-------");
>     o_cat.debug("");
>   }
>
>   public static void printFullText(NeuroGridHTMLParser parser)
>   {
>     o_cat.debug("");
>     o_cat.debug("Full Text");
>     o_cat.debug("-------");
>     o_cat.debug(parser.getFullText());
>     o_cat.debug("");
>   }
>
>   public static void printMetaTags(String p_url, NeuroGridHTMLParser parser)
>   {
>     HTMLMetaTag metaTag;
>     o_cat.debug("");
>     o_cat.debug("Printing metaTags from "+p_url);
>     o_cat.debug("");
>     for(Enumeration e = parser.metaTags().elements();e.hasMoreElements();)
>     {
>       metaTag = (HTMLMetaTag)e.nextElement();
>       metaTag.print();
>     }
>   }
>
>   public static void printLinks(NeuroGridHTMLParser parser)
>   {
>     HTMLLinkTag link;
>     for(Enumeration e =parser.links().elements();e.hasMoreElements();)
>     {
>       link = (HTMLLinkTag)e.nextElement();
>       link.print();
>     }
>   }
> }
|
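A side note on the quoted listing: o_full_text is declared as a String field with no initial value, and addToFullText() extends it with +=. In Java, concatenating onto a null String reference produces text that starts with the characters "null". The sketch below shows that behaviour next to a StringBuilder alternative; FullTextDemo and its members are hypothetical names for illustration, not a patch to the posted code.

```java
// Hypothetical sketch: FullTextDemo is an illustrative name, not part of the
// posted NeuroGrid code.
class FullTextDemo {
    private String nullStart; // left null, like an uninitialised String field
    private final StringBuilder buffer = new StringBuilder();

    /** Appends the same text through both accumulation styles. */
    void add(String text) {
        nullStart += text;   // a null reference is rendered as "null" here
        buffer.append(text); // the builder starts out empty
    }

    String viaString() {
        return nullStart;
    }

    String viaBuilder() {
        return buffer.toString();
    }

    public static void main(String[] args) {
        FullTextDemo demo = new FullTextDemo();
        demo.add("Hello ");
        demo.add("world");
        System.out.println(demo.viaString());  // prints "nullHello world"
        System.out.println(demo.viaBuilder()); // prints "Hello world"
    }
}
```

Initialising the field to an empty string, or accumulating in a buffer as the listing already does for the summary heading and contents, would avoid the stray prefix.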
From: Sam Joseph <gaijin@yh...> - 2002-12-12 03:51:43
|
Hi Somik,

Thanks for the help. I think I would like to see the toPlainTextString() method remain. Although I'm not quite sure of the difference between HTMLRemarkNode.toString and HTMLRemarkNode.toPlainTextString.

Trying out both in my code I see that toPlainTextString() seems to generate a blank while toString() gives me the contents of the remark/comment. To be specific about my objectives, I'm trying to handle meta-data by the creative commons group, which currently involves placing a big chunk of rdf/xml in a remark within the page. I'm very much hoping to be able to extract that comment verbatim and then pass it over to my rdf/xml parser.

I'll be happy as long as I can achieve that.

Also, I solved my problem with the debugging output. The problem was with the code I was using to output the final data. The print() command was being called on links and meta-tags, and the way that ant formatted things made it look like the associated System.out calls were being made during the parsing process rather than at the end. Sorry about that, all fixed now, so don't worry about looking at the code that I sent you in my previous email.

Thanks again for all your help. I'm looking forward to fully integrating HTMLParser with NeuroGrid over the next two days.

CHEERS> SAM

Somik Raha wrote:

>Hi Sam,
> HTMLRemarkNode is a special class -it is not a
>scanner.
> It is registered by default - so you dont have to do
>anything - just check if the node object is a remark
>node.
>
> However, last week, I removed the
>toPlainTextString() implementation as it often a lot
>of HTML code is commented out, and I thought it might
>interfere with a simple string representation of a
>page. If that is not the case and you need to use
>toPlainTextString(), pls let us know, and we should
>put that functionality back in.
>
>Regards,
>Somik |
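For the goal described above (pulling a comment out verbatim so it can be handed to an rdf/xml parser), here is a minimal sketch that uses only java.util.regex on the raw page text instead of HTMLParser. CommentExtractor and commentBodies are hypothetical names, and the sample markup is made up. A regular expression is not a full HTML parser, so this can misfire on comment-like text inside scripts or attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not the HTMLParser API, just a standard-library fallback.
class CommentExtractor {
    // DOTALL lets "." span line breaks; the reluctant ".*?" stops at the first "-->".
    private static final Pattern COMMENT =
        Pattern.compile("<!--(.*?)-->", Pattern.DOTALL);

    /** Returns the text between each pair of comment delimiters, unchanged. */
    static List<String> commentBodies(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher m = COMMENT.matcher(html);
        while (m.find()) {
            bodies.add(m.group(1));
        }
        return bodies;
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Hi</p>"
            + "<!-- <rdf:RDF><Work/></rdf:RDF> -->"
            + "</body></html>";
        for (String body : commentBodies(page)) {
            // trim() is optional; it only drops the padding around the payload
            System.out.println(body.trim());
        }
    }
}
```

Each returned string could then be passed to whatever rdf/xml parser is in use.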
From: Somik Raha <somik@ya...> - 2002-12-12 00:04:16
|
Hi Sam,

HTMLRemarkNode is a special class - it is not a scanner. It is registered by default - so you don't have to do anything - just check if the node object is a remark node.

However, last week, I removed the toPlainTextString() implementation, as often a lot of HTML code is commented out, and I thought it might interfere with a simple string representation of a page. If that is not the case and you need to use toPlainTextString(), please let us know, and we should put that functionality back in.

Regards,
Somik

--- Sam Joseph <gaijin@...> wrote:
> Hi Somik
>
> Sorry to ask so much this week, but I was wondering
> is there some operation for picking up HTML comments
> using the HTMLParser (<!-- a comment -->) or are
> they automatically ignored?
>
> I can see from the API that there is HTMLRemarkNode,
> but I can't see any similar tag or scanner. Must a
> special tag/scanner be created to handle
> comments/remarks?
>
> Thanks in advance.
>
> CHEERS> SAM
>
> _______________________________________________
> Htmlparser-developer mailing list
> Htmlparser-developer@...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer |
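The "just check if the node object is a remark node" advice can be sketched with a plain instanceof test. The classes below (DemoNode, DemoStringNode, DemoRemarkNode) are hypothetical stand-ins so that the example compiles on its own; in real code the loop would run over the parser's node enumeration and test against HTMLRemarkNode.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the Demo* classes stand in for the parser's node types.
class RemarkFilterDemo {
    /** Collects the text of every remark node and skips all other nodes. */
    static List<String> remarkTexts(List<DemoNode> nodes) {
        List<String> out = new ArrayList<>();
        for (DemoNode node : nodes) {
            if (node instanceof DemoRemarkNode) {
                out.add(node.getText());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<DemoNode> nodes = Arrays.asList(
            new DemoStringNode("Hello"),
            new DemoRemarkNode(" a comment "),
            new DemoStringNode("world"));
        System.out.println(remarkTexts(nodes));
    }
}

/** Minimal base type that carries a node's text. */
abstract class DemoNode {
    private final String text;

    DemoNode(String text) {
        this.text = text;
    }

    String getText() {
        return text;
    }
}

final class DemoStringNode extends DemoNode {
    DemoStringNode(String text) {
        super(text);
    }
}

final class DemoRemarkNode extends DemoNode {
    DemoRemarkNode(String text) {
        super(text);
    }
}
```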