When attempting to parse all Heading tags of certain sites, there's an opening Heading tag in <script> data, and forceTagsClosed doesn't seem to apply. How to avoid this H tag parsing of the script tag? Example site: techcrunch dot com with H2 wrapped in script tags (JSON data). find('H2') for the above site example collects garbage. Amazing script, btw! Please advise.
Any chance I can get an answer on this? Seems there's a function remove_noise() but ALL the script data is still in the final output. How do you get the cleaned output?
Sorry for the late reply. I investigated this issue today. As it turns out, the parser is working fine; the webpage simply contains a much larger payload than normally expected.
The main page of techcrunch contains a 1.8 MB script payload. This is way too big for the standard settings (the parser's file size limit and PHP's PCRE limit). Here is a script that works for me:
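The script attached to the original post is not preserved here. As a rough sketch of the fix discussed in this thread (raising the parser's file size limit and PHP's PCRE backtrack limit), something like the following should work. The 5,000,000 values, the library path, and the assumption that your copy of simple_html_dom.php only defines MAX_FILE_SIZE when it is not already defined are illustrative assumptions, not taken from the original script:

```php
<?php
// Assumption: simple_html_dom.php sits next to this script and only defines
// MAX_FILE_SIZE if it is not defined yet, so the constant must be set BEFORE
// the library is included. The value is illustrative; it just needs to
// exceed the size of the page being parsed.
defined('MAX_FILE_SIZE') || define('MAX_FILE_SIZE', 5000000);

// Raise PHP's PCRE backtrack limit so the regex-based noise removal can
// match very large <script> blocks. The value is again illustrative.
ini_set('pcre.backtrack_limit', '5000000');

$library = __DIR__ . '/simple_html_dom.php'; // hypothetical location

if (is_file($library)) {
    require_once $library;

    $html = file_get_html('https://techcrunch.com');

    if ($html === false) {
        // Empty response, or the page is still larger than MAX_FILE_SIZE.
        echo 'Could not load the page', PHP_EOL;
    } else {
        // With script data removed, only real headings should be listed.
        foreach ($html->find('h2') as $heading) {
            echo trim($heading->plaintext), PHP_EOL;
        }
    }
} else {
    echo 'simple_html_dom.php not found at ', $library, PHP_EOL;
}
```

As far as I understand it, when the regex that strips script blocks gives up on a very large match, the script contents stay in the document, and that leftover JSON is what find('h2') ends up matching. So both limits need to be comfortably above the page size.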
Let me know if this works for you.
Last edit: LogMANOriginal 2019-09-21
Thank you so much! I did increase the file size limit, but I wasn't aware of the regex limit. Increasing the PCRE limit fixed my script. Much appreciated!
Also, quick note: If you're on StackOverflow and want to answer there, I'll immediately accept the answer: https://stackoverflow.com/questions/57966246/php-simple-html-dom-parser-remove-script-data
I'm glad to hear it works for you. I appreciate the offer for StackOverflow, but I'm not active on that platform. Thanks anyway.
Thanks again. I'll post the solution you provided to StackOverflow and reference this ticket.