Re: [Htmlparser-user] Presentation tags
From: Derrick O. <der...@gm...> - 2011-06-30 13:48:15
Hi Chris,

Although you might find it difficult to work with, the way the tags and text are returned makes sense. To avoid splitting the word, the program would need to make a judgement call regarding which text is a word and belongs together and which of the tags to ignore.

You can probably get what you want by using something like the StringBean class, which extracts just the text. I think you'll find the output of the StringBean will have the word "northern" rather than separating it, i.e. it doesn't inject any whitespace. You can use similar code in your program to paste the text back together.

Derrick

On Thu, Jun 30, 2011 at 11:39 AM, Chris Bamford <cba...@mi...> wrote:
> Hi there,
>
> I use Aperture to extract text which runs Htmlparser when processing HTML.
> My question relates to the handling of presentation tags such as <u>, <b>,
> <i> when embedded within words - for example:
>
> <html><body><u>north</u>ern</body></html>
>
> What I would expect is that I should be delivered the word "northern" - but
> instead I get two tokens: "north" and "ern", which is clearly wrong in this
> context.
> It seems that Htmlparser is replacing tags with whitespace - why is this?
>
> Thanks for any help.
>
> - Chris
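
A minimal sketch of the "paste the text back together" idea, for illustration only. It does not use Htmlparser or StringBean; it is a plain-JDK stand-in built on regular expressions, and the class name `InlineTagJoiner`, the method `joinText`, and the list of tags treated as inline are all assumptions made for this example. The judgement call mentioned above shows up as that tag list: tags in the list are dropped without a separator, and every other tag is treated as a word boundary.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Htmlparser.
public class InlineTagJoiner {
    // Assumed set of presentation tags that may sit inside a word.
    // These are removed without inserting any separator.
    private static final Pattern INLINE = Pattern.compile(
        "</?(?:u|b|i|em|strong|span|font)(?:\\s[^>]*)?>",
        Pattern.CASE_INSENSITIVE);

    // Any other tag is treated as a boundary and becomes a single space.
    private static final Pattern OTHER_TAG = Pattern.compile("<[^>]+>");

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String joinText(String html) {
        String s = INLINE.matcher(html).replaceAll("");   // glue inline runs together
        s = OTHER_TAG.matcher(s).replaceAll(" ");         // separate at all other tags
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // prints "northern"
        System.out.println(joinText("<html><body><u>north</u>ern</body></html>"));
    }
}
```

A regex pass like this is only suitable for simple, well-formed markup. In a real program the same rule (no separator for inline tags, a separator for everything else) would more likely be applied while walking the nodes that the parser returns.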