[Htmlparser-developer] Re: [Htmlparser-user] getTitle() case sensitive problem
From: Derrick O. <Der...@Ro...> - 2004-01-27 14:13:46
Ayhan,

I think it's a good idea for the parser to always use an English Locale when converting tag names and attribute names to upper case. Thanks for pointing that overload out. I've identified about a dozen places where String.toUpperCase() is called that should pass a locale. I will implement this when the CVS system comes back online. You can track this with bug #883664, "toUpperCase on tag names and attributes depends on locale".

I'm surprised, though, that your experiment didn't also find the end tag, which uses the same mechanism. Can you provide a URL that would illustrate the problem?

My suspicion is that the title tags are (incorrectly) built with Turkish characters. Perhaps you need to overload the TitleTag to recognize various combinations of English and Turkish characters, i.e. Title, TITLE, T\u0130TLE and T\u0131tle. These would convert to upper case differently, based on the locale used. So you might define:

    import org.htmlparser.tags.TitleTag;

    public class TurkishTitleTag extends TitleTag
    {
        /**
         * The set of names handled by this tag.
         */
        private static final String[] mIds =
            new String[] {"TITLE", "T\u0130TLE", "T\u0131TLE"};

        /**
         * The set of tag names that indicate the end of this tag.
         */
        private static final String[] mEnders =
            new String[] {"TITLE", "T\u0130TLE", "T\u0131TLE", "BODY"};

        /**
         * Return the set of names handled by this tag.
         * @return The names to be matched that create tags of this type.
         */
        public String[] getIds ()
        {
            return (mIds);
        }

        /**
         * Return the set of tag names that cause this tag to finish.
         * @return The names of following tags that stop further scanning.
         */
        public String[] getEnders ()
        {
            return (mEnders);
        }
    }

Then you would substitute this tag for the normal TitleTag using something like:

    PrototypicalNodeFactory factory = new PrototypicalNodeFactory ();
    factory.registerTag (new TurkishTitleTag ());
    parser.setNodeFactory (factory);

That should find all combinations of title, even without the toUpperCase(locale) change, unless, of course, there are other special Turkish characters that need to be handled.

Derrick

p.s. Can you also provide a URL for the style problem?

Ayhan Peker wrote:

> Derrick,
> i have tried 1.4..it is the same..So i am unable to extract titles if
> they are lowercase..I have visited the link again regarding style
> problem ...it is the same too..i am still getting style content with
> the text..:(
>
> Ayhan
>
> */Derrick Oswald <Der...@Ro...>/* wrote:
>
>     Ayhan,
>
>     First off, you should probably switch to the 1.4 version, since much has
>     changed since 1.3.
>     You should be able to use any locale when running the parser, if not
>     I'll try to fix it.
>
>     It's likely that the page with the Turkish title should have used :
>
>     (or whatever lang name it is), or the title tag should follow the META
>     tag that sets the character encoding.
>
>     Derrick
>
>     Ayhan Peker wrote:
>
>     > Derrick,
>     > Thanks for the quick response
>     >
>     > I think i know where the problem is:
>     >
>     > I set my linux to turkish locale. In turkish there are two 'i's one
>     > with dot on it and one without a dot. so TITLE IN TURKISH DOES NOT
>     > EQUAL TITLE IN ENGLISH. TitleScanner was returning false when
>     > "title".toUppercase().equals("TITLE")..
>     >
>     > I set the default locale at the begining of the thread to english..It
>     > solved the problem..However I need turkish locale for the text (for
>     > database)
>     > I tried to modify TitleScanner by adding Locale
>     > ( "title".toUppercase(Locale("en")).equals("TITLE").) It finds the
>     > title but fails to detect
>     >
>     > Is there a way i can get the title fields without setting the default
>     > locale to english?
>     >
>     > Regarding style content coming with text:
>     > have a look at the url: http://www.metu.edu.tr/about/admins.php
>     >
>     > Thanks
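
For anyone who wants to reproduce the locale effect discussed in this thread, here is a minimal sketch using only the standard java.util.Locale and String.toUpperCase(Locale) APIs. The class and method names (LocaleUpperCaseDemo, upperTurkish, upperEnglish) are made up for illustration and are not part of the parser; the sketch only shows why a comparison against "TITLE" can fail under a Turkish locale, not how the parser itself does the conversion.

```java
import java.util.Locale;

public class LocaleUpperCaseDemo
{
    /** Upper-case using Turkish casing rules (hypothetical helper). */
    static String upperTurkish (String s)
    {
        return (s.toUpperCase (new Locale ("tr", "TR")));
    }

    /** Upper-case using English casing rules (hypothetical helper). */
    static String upperEnglish (String s)
    {
        return (s.toUpperCase (Locale.ENGLISH));
    }

    public static void main (String[] args)
    {
        // Under Turkish rules 'i' maps to U+0130 (capital I with dot above),
        // so the result no longer equals the ASCII string "TITLE".
        System.out.println (upperTurkish ("title").equals ("TITLE"));
        // With an explicit English locale the comparison succeeds
        // regardless of the JVM default locale.
        System.out.println (upperEnglish ("title").equals ("TITLE"));
    }
}
```

Passing an explicit locale at each conversion leaves the JVM default locale untouched, so Turkish can still be used for text handling elsewhere in the application.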