Re: [Magpierss-general] Unwanted characters and encoded characters.
From: Alan L. <ala...@do...> - 2005-08-15 00:38:08
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type"> </head> <body bgcolor="#ffffff" text="#000000"> Dave Cocuzzi wrote: <blockquote cite="mid...@ge..." type="cite"> <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type"> I have been using Magpie pretty much out of the box, changing only some display options etc. in my own scripts. Everything seems to be working pretty darn well and I am so glad that Magpie is available. Great great little app. That said, there are a couple of things that I have been trying to work out that have thus far stymied me and I am hoping that someone can bail me out.<br> <br> First off, from time to time I run into a feed where a question mark will show up where there is punctuation in the actual xml. One example is the Huffington Post feeds (<a href="http://www.huffingtonpost.com/thenewswire/full_atom.xml">http://www.huffingtonpost.com/thenewswire/full_atom.xml</a>). They publish RSS 1.0 and ATOM feeds and both have their own peculiarities. When I use the RSS 1.0 feeds a question mark appears in my page where there is an apostrophe or an ellipsis in the actual feed. See (<a href="http://localhost/xampp/geckosfeet/website/national.php">http://www.geckosfeet.com/national.php</a>). <br> </blockquote> It's a character encoding issue. The Huffington site uses UTF-8 encoding, and what you are seeing are likely MS Word-induced curly quotes and ellipses, which have no equivalent in ISO-8859-1. 
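To see where the question marks come from, here is a minimal PHP sketch of what presumably happens when UTF-8 feed text is forced into ISO-8859-1. The helper name to_latin1() is made up for illustration and is not part of Magpie; the sketch assumes PHP 7 or later with the mbstring extension loaded.

```php
<?php
// Hypothetical helper (not part of Magpie): convert UTF-8 text to ISO-8859-1.
function to_latin1(string $utf8): string
{
    // Make mbstring's usual substitute character ("?") explicit.
    mb_substitute_character(0x3F);
    // Characters with no ISO-8859-1 equivalent are replaced by the substitute.
    return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
}

// UTF-8 bytes for "It" + curly apostrophe (U+2019) + "s" + ellipsis (U+2026).
$feedText = "It\xE2\x80\x99s\xE2\x80\xA6";

echo to_latin1($feedText), "\n"; // prints "It?s?"
```

Neither the curly apostrophe nor the ellipsis exists in ISO-8859-1, so both collapse to "?", which matches the symptom described above.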
On your site, your HTML HEAD contains:<br> <br> <pre id="line24">&lt;<span class="start-tag">meta</span><span class="attribute-name"> http-equiv</span>=<span class="attribute-value">"charset" </span><span class="attribute-name">content</span>=<span class="attribute-value">"iso-8859-1" </span><span class="attribute-name">/</span>&gt; </pre> While Huffington's is<br> <pre id="line24">&lt;<span class="start-tag">meta</span><span class="attribute-name"> http-equiv</span>=<span class="attribute-value">"content-type" </span><span class="attribute-name">content</span>=<span class="attribute-value">"text/html; charset=utf-8" </span><span class="attribute-name">/</span>&gt;</pre> <br> If you switch the character set declaration on at least your one Magpie page to the same Content-Type form:<br> <pre id="line24">&lt;<span class="start-tag">meta</span><span class="attribute-name"> http-equiv</span>=<span class="attribute-value">"content-type" </span><span class="attribute-name">content</span>=<span class="attribute-value">"text/html; charset=utf-8" </span><span class="attribute-name">/</span>&gt;</pre> <br> and modify your Magpie rss_fetch.inc to specify utf-8 output:<br> <br> define('MAGPIE_OUTPUT_ENCODING', 'UTF-8');<br> <br> <br> I bet it will display as you expect.<br> <br> <blockquote cite="mid...@ge..." type="cite"><font color="#3333ff"></font>Also, I have been using strip_tags() to remove html tags that are embedded in xml. However, occasionally someone will embed encoded characters like &lt; i &gt; or &amp;. Does anyone know a simple way to strip out these encoded characters? 
Barring a simple way, would I have to parse through the strings and do some regular expression matching?<br> </blockquote> Run it through html_entity_decode() before strip_tags(), so that encoded tags turn back into real tags that strip_tags() can remove -- see <a class="moz-txt-link-freetext" href="http://us3.php.net/manual/en/function.html-entity-decode.php">http://us3.php.net/manual/en/function.html-entity-decode.php</a><br> <br> <pre class="moz-signature" cols="72">-- \\ alan levine (<a class="moz-txt-link-abbreviated" href="mailto:ala...@do...">ala...@do...</a>) // "once geologist, now technologist" \\ maricopa community colleges, arizona // mcli web: <a class="moz-txt-link-freetext" href="http://www.mcli.dist.maricopa.edu/">http://www.mcli.dist.maricopa.edu/</a> \\ cogdogblog: <a class="moz-txt-link-freetext" href="http://jade.mcli.dist.maricopa.edu/cdb/">http://jade.mcli.dist.maricopa.edu/cdb/</a> </pre> </body> </html>
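As a rough illustration of the html_entity_decode() suggestion, here is a minimal sketch that decodes entities and then strips the resulting tags. The function name clean_feed_text() and the sample string are hypothetical, not anything Magpie provides, and the sketch assumes PHP 7 or later and UTF-8 input.

```php
<?php
// Hypothetical helper (not part of Magpie): clean one feed text field.
function clean_feed_text(string $raw): string
{
    // Decode entities first, so "&lt;i&gt;" becomes a real "<i>" tag
    // and "&amp;" becomes a plain "&".
    $decoded = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');
    // Now strip_tags() can remove the tags that were hidden as entities.
    return strip_tags($decoded);
}

// Made-up sample resembling an item description with encoded markup.
$sample = 'Tom &amp; Jerry &lt;i&gt;return&lt;/i&gt;';

echo clean_feed_text($sample), "\n"; // prints "Tom & Jerry return"
```

The order matters: calling strip_tags() first would leave the encoded tags untouched, and decoding afterwards would bring them back as live markup.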