
#47 Filtering Noise from Script Tag (JSON)

v1.0_(example)
closed
None
1
2019-10-03
2019-08-23
Dario Zadro
No

When attempting to parse all Heading tags of certain sites, there's an opening Heading tag in <script> data, and forceTagsClosed doesn't seem to apply.

How to avoid this H tag parsing of the script tag?

Example site: techcrunch dot com with H2 wrapped in script tags (JSON data).

find('H2') for the above site example collects garbage.

Amazing script, btw! Please advise.

Discussion

  • Dario Zadro

    Dario Zadro - 2019-09-17

    Any chance I can get an answer on this? Seems there's a function remove_noise() but ALL the script data is still in the final output. How do you get the cleaned output?

     
  • LogMANOriginal

    LogMANOriginal - 2019-09-21

    Sorry for the late reply. I investigated this issue today. As it turns out, the parser is working fine; the webpage simply contains a much larger payload than is normally expected.

    The main page of techcrunch contains a 1.8 MB script payload. This is way too big for the standard settings. Here is a script that works for me:

    <?php
    // Normal regex doesn't work for such a large script section. The regex parser hits the backtrack limit, so we need to increase it temporarily.
    ini_set('pcre.backtrack_limit', '10485760'); // 10MB
    
    // The total file size of the webpage is also much bigger than normally allowed. Use this definition to increase the upper boundary of the parser.
    define('MAX_FILE_SIZE', 10485760); // 10MB
    
    include_once 'simple_html_dom.php';
    $html = file_get_html('https://techcrunch.com/');
    
    // This will list all headers
    foreach($html->find('h2') as $h2) echo $h2->plaintext . PHP_EOL;
    
    // Use this code to remove script tags from the DOM (e.g. if you need to save the DOM for later use). In my tests the file size went down from 1.9 MB to 30 KB.
    foreach($html->find('script') as $script) $script->remove();
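
To illustrate why the backtrack limit matters, here is a minimal, self-contained sketch that does not use simple_html_dom at all. The helper `strip_scripts()` and its pattern are hypothetical stand-ins for regex-based noise removal, not the library's actual code. The sketch also assumes the PCRE JIT is switched off, since the limit may be enforced differently with the JIT on. The limit is set artificially low so that a small synthetic page triggers the same failure a very large script payload would trigger at the default setting.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): strips <script> blocks
// with a lazy regex and returns the result together with the PCRE error code.
function strip_scripts(string $html): array
{
    $clean = preg_replace('~<script[^>]*>.*?</script>~is', '', $html);
    return [$clean, preg_last_error()];
}

// Assumption: disable the PCRE JIT so the interpreter's backtrack limit applies.
ini_set('pcre.jit', '0');

// Synthetic page: a script payload followed by a real heading.
$page = '<script>' . str_repeat('x', 5000) . '</script><h2>Real heading</h2>';

// Artificially low limit, so the lazy match runs out of budget.
ini_set('pcre.backtrack_limit', '1000');
[$failed, $errLow] = strip_scripts($page);

// Raised limit, as in the script above.
ini_set('pcre.backtrack_limit', '10485760');
[$clean, $errHigh] = strip_scripts($page);

// With the low limit, preg_replace() gives up and returns null.
var_dump($failed === null, $errLow === PREG_BACKTRACK_LIMIT_ERROR);
// With the raised limit, only the heading should remain.
var_dump($clean, $errHigh === PREG_NO_ERROR);
```

When the limit is hit, `preg_replace()` returns `null` rather than a partially cleaned string, so checking `preg_last_error()` after regex-heavy steps is a cheap way to spot this kind of silent failure.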
    

    Let me know if this works for you.

     

    Last edit: LogMANOriginal 2019-09-21
  • Dario Zadro

    Dario Zadro - 2019-09-22

    Thank you so much! I did increase the file size limit, but I wasn't aware of the regex limit. Increasing the PCRE limit fixed my script. Much appreciated!

     
  • LogMANOriginal

    LogMANOriginal - 2019-10-03

    I'm glad to hear it works for you. I appreciate the offer for StackOverflow, but I'm not active on that platform. Thanks anyway.

     
  • LogMANOriginal

    LogMANOriginal - 2019-10-03
    • status: open --> closed
    • assigned_to: LogMANOriginal
     
  • Dario Zadro

    Dario Zadro - 2019-10-03

    Thanks again. I'll post the solution you provided to Stackoverflow and reference this ticket.

     
