I am trying to parse XML to get some news items from an RSS feed. The parsing works, but some items contain code that I would like to remove.
In this part here:
<content:encoded><![CDATA[<p>1950 nahm in Argentinien eine neue Fluglinie den Betrieb auf. Aerolíneas Argentinas entstand damals durch den Zusammenschluss von gleich vier Airlines: Alfa, Zonda, Fama und Aeroposta. 2020 feiert die Nationalairline deshalb ihr 70-jähriges Bestehen.</p>
<p>Anlässlich dieses Jubiläums wird die Fluglinie der Boeing 737-700 mit dem Kennzeichen LV-GOO eine historische Bemalung verpassen. Der Jet wird die gleiche Lackierung tragen wie in den 1980er-Jahren die Boeing 747, die unterem 1982 Papst Johannes Paul II. beförderten und 1986 das argentinische Fußball-Weltmeister-Team um Superstar Diego Maradona.</p>
<p>Aerolíneas Argentinas wird den Lackierprozess mit einer Reihe von Videos begleiten. Das erste zeigt das Flugzeug noch in seiner bisherigen Bemalung:</p>
<style>.embed-container { position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden; max-width: 100%; } .embed-container iframe, .embed-container object, .embed-container embed { position: absolute; top: 0; left: 0; width: 100%; height: 100%; }</style>
<div class="embed-container">
<blockquote class="twitter-tweet">
<p dir="ltr" lang="es">“Hay un único lugar donde ayer y hoy se encuentran, se reconocen y se abrazan. Ese lugar es mañana.” Eduardo Galeano. <a href="https://t.co/30XpyZZ2tz">pic.twitter.com/30XpyZZ2tz</a></p>
<p>— Aerolíneas Argentinas (@Aerolineas_AR) <a href="https://twitter.com/Aerolineas_AR/status/1297884439247282176?ref_src=twsrc%5Etfw">August 24, 2020</a></p></blockquote>
<p><script async src='https://platform.twitter.com/widgets.js' charset='utf-8'></script></p>
</div>
]]></content:encoded>
I would like to remove the whole <script></script> part.
My page looks as follows:
include_once('../includes/simple_html_dom.php'); // Include the library
$html = file_get_html('https://www.aerotelegraph.com/feed'); // Retrieve the DOM from a given URL
$n = 0;
foreach($html->find('item') as $items) {
$title = $items->find('title',0)->innertext;
$lnk = $items->find('link',0)->innertext;
$desc = $items->find('description',0)->plaintext;
$desc = (string) simplexml_load_string("<x>$desc</x>"); // to remove the CDATA
$img = $items->find('image',0)->innertext;
$pdat = $items->find('pubDate',0)->innertext;
$cont = $items->find('content:encoded',0)->innertext;
$cont = (string) simplexml_load_string("<x>$cont</x>");
foreach($items->find('script',0) as $e) {
$e->outertext = '';
echo '$e: ' . $e . '<br/>';
}
if ($img == '') {
$desc = substr($desc, 0, 36) . '...';
echo '<div class="row"><h5><a href="' . $lnk . '" target="news">' . $title . '</h5></a><div class="tltip">' . $desc . '<span class="tltiptext">' . $cont . '</span></div><br/><span class="pdat">' . $pdat . '</span></div>' . "\r\n";
} else {
echo '<div class="row"><h5><a href="' . $lnk . '" target="news">' . $title . '</h5></a><div class="tltip">' . $desc . '<img src="' . $img . '" width="200"><span class="tltiptext">' . $cont . '</span></div><br/><span class="pdat">' . $pdat . '</span></div>' . "\r\n";
}
if (++$n == $nar) break;
}
I'm trying to remove the script part here, following the example I saw:
// Example:
// remove all image
// foreach($html->find('img') as $e)
// $e->outertext = '';
foreach($items->find('script',0) as $e) {
$e->outertext = '';
}
Yet this does not remove the <script> tag. If I change $items to $cont, since the part to remove is in the $cont variable, it throws an error:
'PHP message: PHP Fatal error: Uncaught Error: Call to a member function find() on string in'
Is the $cont variable not holding a valid HTML string that I can work with?
This is probably no longer relevant, but the foreach loop in your example iterates over the first matching element itself instead of over all script elements.
Notice the ,0 in ->find('script',0): with an index, find() returns a single element rather than a list of elements. This is why the error occurs. Here is the correct version:
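A minimal sketch of that loop, assuming the simple_html_dom library from the question is available at the same include path. The helper name remove_script_tags is hypothetical and not part of the library. Because $cont is a plain string (which is what the "find() on string" message reports), the sketch first parses it with str_get_html() and then calls find('script') without an index, so every script element is visited.

```php
<?php
// Assumption: same library file and path as in the question.
include_once('../includes/simple_html_dom.php');

// Hypothetical helper: returns $markup with all <script> elements removed.
function remove_script_tags($markup) {
    // Parse the string so that find() is available on the result.
    $dom = str_get_html($markup);
    if (!$dom) {
        // str_get_html() returns false for input it cannot load (e.g. an empty string).
        return $markup;
    }
    // No index argument: find() returns an array of all matching elements.
    foreach ($dom->find('script') as $e) {
        $e->outertext = '';
    }
    // Serialize the modified DOM back to a string.
    return $dom->save();
}
```

In the question's loop this could be called right after $cont has been extracted, for example $cont = remove_script_tags($cont); before the value is echoed.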