
#53 Removing tags does not work

v1.0_(example)
closed
None
1
2022-04-09
2020-08-27
Matthias
No

I am trying to parse XML to get some news items from an RSS feed. The parsing works, although some items contain code that I would like to remove.

In this part here:

<content:encoded><![CDATA[<p>1950 nahm in Argentinien eine neue Fluglinie den Betrieb auf. Aerolíneas Argentinas entstand damals durch den Zusammenschluss von gleich vier Airlines: Alfa, Zonda, Fama und Aeroposta. 2020 feiert die Nationalairline deshalb ihr 70-jähriges Bestehen.</p>

<p>Anlässlich dieses Jubiläums wird die Fluglinie der Boeing 737-700 mit dem Kennzeichen LV-GOO eine historische Bemalung verpassen. Der Jet wird die gleiche Lackierung tragen wie in den 1980er-Jahren die Boeing 747, die unterem 1982 Papst Johannes Paul II. beförderten und 1986 das argentinische Fußball-Weltmeister-Team um Superstar Diego Maradona.</p>

<p>Aerolíneas Argentinas wird den Lackierprozess mit einer Reihe von Videos begleiten. Das erste zeigt das Flugzeug noch in seiner bisherigen Bemalung:</p>

<style>.embed-container { position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden; max-width: 100%; } .embed-container iframe, .embed-container object, .embed-container embed { position: absolute; top: 0; left: 0; width: 100%; height: 100%; }</style>

<div class="embed-container">
<blockquote class="twitter-tweet">
<p dir="ltr" lang="es">“Hay un único lugar donde ayer y hoy se encuentran, se reconocen y se abrazan. Ese lugar es mañana.” Eduardo Galeano. <a href="https://t.co/30XpyZZ2tz">pic.twitter.com/30XpyZZ2tz</a></p>
<p>— Aerolíneas Argentinas (@Aerolineas_AR) <a href="https://twitter.com/Aerolineas_AR/status/1297884439247282176?ref_src=twsrc%5Etfw">August 24, 2020</a></p></blockquote>
<p><script async src='https://platform.twitter.com/widgets.js' charset='utf-8'></script></p>
</div>

]]></content:encoded>

I would like to remove the whole part for <script></script>.

My page looks as follows:

   include_once('../includes/simple_html_dom.php');        // Include the library
   $html = file_get_html('https://www.aerotelegraph.com/feed');                          // Retrieve the DOM from a given URL
   $n = 0;
   foreach($html->find('item') as $items) {
      $title = $items->find('title',0)->innertext;
      $lnk   = $items->find('link',0)->innertext;
      $desc  = $items->find('description',0)->plaintext;
      $desc  = (string) simplexml_load_string("<x>$desc</x>");   // to remove the CDATA
      $img   = $items->find('image',0)->innertext;
      $pdat  = $items->find('pubDate',0)->innertext;

      $cont  = $items->find('content:encoded',0)->innertext;
      $cont  = (string) simplexml_load_string("<x>$cont</x>");
      foreach($items->find('script',0) as $e) {
         $e->outertext = '';
         echo '$e: ' . $e . '<br/>';
      }

      if ($img == '') {
         $desc = substr($desc, 0, 36) . '...';
         echo '<div class="row"><h5><a href="' . $lnk . '" target="news">' . $title . '</h5></a><div class="tltip">' . $desc . '<span class="tltiptext">' . $cont . '</span></div><br/><span class="pdat">' . $pdat . '</span></div>' . "\r\n";
      } else {
         echo '<div class="row"><h5><a href="' . $lnk . '" target="news">' . $title . '</h5></a><div class="tltip">' . $desc . '<img src="' . $img . '" width="200"><span class="tltiptext">' . $cont . '</span></div><br/><span class="pdat">' . $pdat . '</span></div>' . "\r\n";
      }

      if (++$n == $nar) break;
   }   

I'm trying to remove the script part here, following the example I saw:

// Example:
// remove all image
// foreach($html->find('img') as $e)
//     $e->outertext = '';

      foreach($items->find('script',0) as $e) {
         $e->outertext = '';
      }

Yet this does not remove the <script> tag. If I change the $items to $cont as the part to remove is in the $cont variable, it throws an error:

'PHP message: PHP Fatal error: Uncaught Error: Call to a member function find() on string in'

Is the $cont variable not holding a valid HTML string that I can work with?

Discussion

  • LogMANOriginal

    LogMANOriginal - 2022-04-09
    • status: open --> closed
    • assigned_to: LogMANOriginal
     
  • LogMANOriginal

    LogMANOriginal - 2022-04-09

    This is probably no longer relevant, but the foreach loop in your example iterates over the first script element itself instead of over all script elements.

          foreach($items->find('script',0) as $e) {
             $e->outertext = '';
             echo '$e: ' . $e . '<br/>';
          }
    

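    Notice the ,0 in ->find('script',0). With an index, find() returns a single element rather than a list of all matches, which is why the loop does not visit every script element. Here is the correct version: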

          foreach($items->find('script') as $e) {
             $e->outertext = '';
             echo '$e: ' . $e . '<br/>';
          }
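
    On the separate "Call to a member function find() on string" error: $cont is a plain string at that point, so it has to be parsed before anything can be searched in it. The following is only a minimal sketch of that idea. It assumes PHP's bundled DOM extension (DOMDocument) instead of simple_html_dom, assumes the fragment is UTF-8, and the helper name stripScriptTags is hypothetical, made up for illustration.

```php
<?php
// Sketch: strip <script> elements from an HTML fragment stored in a string.
// Assumption: uses PHP's bundled DOM extension, not simple_html_dom.

// Hypothetical helper for illustration; not part of any library.
function stripScriptTags(string $html): string
{
    $doc = new DOMDocument();

    // Feed markup is rarely strictly valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix is a common trick to make libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so copy it before removing nodes.
    foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    // libxml wraps the fragment in <html><body>; return only the body's children.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage with a fragment shaped like the feed item in the question.
$cont = "<p>Das erste zeigt das Flugzeug:</p>"
      . "<p><script async src='https://platform.twitter.com/widgets.js' charset='utf-8'></script></p>";
echo stripScriptTags($cont), "\n";
```

    Note that this sketch returns only what the parser places in the body, so elements that libxml moves into the head (for example a leading style block) would be dropped as well.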
    
     
