PHP Simple HTML DOM Parser / Support Requests / #47 Filtering Noise from Script Tag (JSON)

Dario Zadro - 2019-09-17

Any chance I can get an answer on this? Seems there's a function remove_noise() but ALL the script data is still in the final output. How do you get the cleaned output?

Sorry for the late reply. Today I investigated this issue. As it turns out, the parser is working fine, the webpage simply contains much more payload than normally expected.

The main page of TechCrunch contains a 1.8 MB script payload. That is far too big for the default limits of both PHP's regex engine and the parser. Here is a script that works for me:

<?php
// Normal regex doesn't work for such a large script section. The regex parser hits the backtrack limit, so we need to increase it temporarily.
ini_set('pcre.backtrack_limit', '10485760'); // 10MB

// The total file size of the webpage is also much bigger than normally allowed. Use this definition to increase the upper boundary of the parser.
define('MAX_FILE_SIZE', 10485760); // 10MB

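include_once 'simple_html_dom.php';
$html = file_get_html('https://techcrunch.com/');

// file_get_html() returns false if the page could not be loaded or is larger than MAX_FILE_SIZE.
if ($html === false) exit('Failed to load the page.' . PHP_EOL);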

// This will list all headers
foreach($html->find('h2') as $h2) echo $h2->plaintext . PHP_EOL;

// Use this code to remove script tags from the DOM (e.g. if you need to save the DOM for later use). In my tests the file size went down from 1.9MB to 30KB.
foreach($html->find('script') as $script) $script->remove();
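
For anyone who wants to see the backtrack limit failure directly rather than through the parser, here is a minimal sketch. The function name strip_scripts() and the regex are my own illustration and not the library's code; the point is only that preg_replace() returns null when PCRE gives up, and preg_last_error() tells you whether the backtrack limit was the cause.

```php
<?php
// Hypothetical helper, not part of simple_html_dom: strips <script> blocks
// with a regex and reports PCRE failures instead of silently returning null.
function strip_scripts(string $html): string
{
    $clean = preg_replace('/<script\b[^>]*>.*?<\/script>/is', '', $html);

    if ($clean === null) {
        // preg_replace() returns null on error; preg_last_error() says why.
        $reason = preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR
            ? 'pcre.backtrack_limit exceeded'
            : 'PCRE error code ' . preg_last_error();
        throw new RuntimeException('Could not strip scripts: ' . $reason);
    }

    return $clean;
}

// Small input, well within the default limits.
echo strip_scripts('<p>a</p><script>var x = 1;</script><p>b</p>') . PHP_EOL;
```

If this throws on a large page, raising pcre.backtrack_limit as shown above is the first thing to try.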

Let me know if this works for you.

Last edit: LogMANOriginal 2019-09-21

Dario Zadro - 2019-09-22

Thank you so much! I did increase the file size limit, but I wasn't aware of the regex limit. Increasing the PCRE limit fixed my script. Much appreciated!

Dario Zadro - 2019-09-22

Also, quick note: if you're on Stack Overflow and want to answer there, I'll immediately accept the answer: https://stackoverflow.com/questions/57966246/php-simple-html-dom-parser-remove-script-data

LogMANOriginal - 2019-10-03

I'm glad to hear it works for you. I appreciate the offer for Stack Overflow, but I'm not active on that platform. Thanks anyway.

LogMANOriginal - 2019-10-03

status: open --> closed

assigned_to: LogMANOriginal

Dario Zadro - 2019-10-03

Thanks again. I'll post the solution you provided to Stack Overflow and reference this ticket.
