Hi Chris,
Although you might find it difficult to work with, the way the tags and text
are returned makes sense: each run of text between tags is handed back as its
own piece.

To avoid splitting the word, the program would need to make a judgement call
about which pieces of text form a single word and belong together, and which
of the tags to ignore.

You can probably get what you want by using something like the StringBean
class, which extracts just the text. I think you'll find the output of the
StringBean will have the word "northern" rather than separating it, i.e. it
doesn't inject any whitespace. You can use similar code in your program to
paste the text back together.
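
As a rough illustration of pasting the text back together, here is a minimal
sketch in plain Java. It is not the StringBean implementation and does not use
the Htmlparser API; the class name InlineTagJoiner and the method extractText
are made up for this example, and the regex-based tag handling is an
assumption that only holds for simple markup like yours. The idea is to drop
<u>, <b> and <i> without leaving a gap, and to treat every other tag as a word
boundary.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Htmlparser or Aperture.
final class InlineTagJoiner {
    // Presentation tags that may sit inside a word: removed without leaving a gap.
    // The optional group allows attributes, e.g. <b class="x">, but not <body>.
    private static final Pattern INLINE_TAGS =
            Pattern.compile("</?(?:u|b|i)(?:\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    // Assumption: every remaining tag is treated as a word boundary.
    private static final Pattern OTHER_TAGS = Pattern.compile("<[^>]*>");

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private InlineTagJoiner() {}

    static String extractText(String html) {
        // 1. Remove inline presentation tags so "north" and "ern" stay joined.
        String text = INLINE_TAGS.matcher(html).replaceAll("");
        // 2. Replace all other tags with a space so separate blocks stay apart.
        text = OTHER_TAGS.matcher(text).replaceAll(" ");
        // 3. Collapse runs of whitespace and trim the ends.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

Calling InlineTagJoiner.extractText on your example,
<html><body><u>north</u>ern</body></html>, gives back "northern" as one
token. A regex is not a real HTML parser (comments, scripts and entities are
not handled), so treat this as a starting point rather than a drop-in fix.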
Derrick
On Thu, Jun 30, 2011 at 11:39 AM, Chris Bamford <cba...@mi...> wrote:
> Hi there,
>
> I use Aperture to extract text which runs Htmlparser when processing HTML.
> My question relates to the handling of presentation tags such as <u>, <b>,
> <i> when embedded within words - for example:
>
> <html><body><u>north</u>ern</body></html>
>
> What I would expect is that I should be delivered the word "northern" - but
> instead I get two tokens: "north" and "ern", which is clearly wrong in this
> context.
> It seems that Htmlparser is replacing tags with whitespace - why is this?
>
> Thanks for any help.
>
> - Chris
>
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>