"Creation of dynamic property" warning in PHP 8.2 (version 1.9.1)
Thanks for your bug report. This is actually a typo. The variable should be called $optional_closing_tags. There is a recent commit in master that illustrates the fix. This should also work in PHP 8.2 and higher. [8dc21bcb714c4edcb4318bdc3f198f4f78762381]
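To illustrate the kind of typo described above, here is a minimal sketch. The class name, the methods, and the misspelled name in the comment are hypothetical and not taken from the library; only `$optional_closing_tags` comes from the report. It shows that assigning to a declared property is safe, whereas a misspelled name would silently create a dynamic property, which PHP 8.2 reports as deprecated.

```php
<?php
// Hypothetical class for illustration; not the library's actual code.
class ParserSketch
{
    // Declared property: assigning to it never creates a dynamic property.
    protected $optional_closing_tags = array();

    public function setOptionalClosingTags(array $tags)
    {
        // A misspelled name here (e.g. $this->optional_closing_tag) would create
        // a dynamic property, which PHP 8.2 reports as deprecated.
        $this->optional_closing_tags = $tags;
    }

    public function getOptionalClosingTags()
    {
        return $this->optional_closing_tags;
    }
}

$parser = new ParserSketch();
$parser->setOptionalClosingTags(array('p' => true));
```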
Incorrect handling of <br> tags next to line breaks
Looks good now! However, you must set the Unicode flag, or else preg_replace() may return an invalid Unicode string, which may cause the second preg_replace() to return NULL, and a deprecation error for the third preg_replace(). Good catch. Fixed via [b8d048e46b7f1964c28ea041d39ccb1d05f9a0ed]. And about the manual: I see now that the navigation sidebar is aligned far down upon page load, so that only the documentation for the functions (isset etc.) is immediately visible, not the more useful "Quick...
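Regarding the Unicode flag mentioned above, the following sketch shows one way the problem can arise. The input string and patterns are made up for illustration and are not the library's actual expressions. A byte-oriented pattern can split a multibyte character, and a later call with the `u` flag then returns NULL because its subject is no longer valid UTF-8. Passing that NULL on to another preg_replace() call is what triggers the deprecation notice on PHP 8.1 and later.

```php
<?php
// "a", a UTF-8 encoded no-break space (bytes 0xC2 0xA0), then "b".
$input = "a\xC2\xA0b";

// Without the u flag the pattern works on bytes: only 0xA0 is replaced,
// leaving a stray 0xC2 byte, so the result is no longer valid UTF-8.
$bytewise = preg_replace('/\xA0/', ' ', $input);

// A later call that does use the u flag rejects the invalid subject
// and returns NULL.
$followUp = preg_replace('/\s+/u', ' ', $bytewise);

// With the u flag from the start, the no-break space is matched as a
// single code point and the result stays valid.
$unicode = preg_replace('/\x{A0}/u', ' ', $input);
```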
HtmlNode: Replace and collapse unicode whitespace in plaintext
PS! Would be nice if you could link to the manual from the "Support" section, because it was hard to find. https://sourceforge.net/projects/simplehtmldom/support Turns out that page is managed by SF. There is no way to change the contents of that page 😔 I added a "Manual" tab instead.
PS! Would be nice if you could link to the manual from the "Support" section, because it was hard to find. https://sourceforge.net/projects/simplehtmldom/support Good idea! I'll do that. The space thing works now, but the BR tag is still not handled well. Try the code in the original post and compare the output when (un)-commenting the commented line. I'm comparing the output of plaintext with what is displayed in the browser and it looks exactly the same. Please note that I have removed wordwrap()...
[67c0f4e21091a9cc66151610a653724a0acb1b69] fixes the whitespace issue. Let me know if this works for you.
HtmlNode: Replace and collapse whitespace in plaintext
Shouldn't plaintext convert newlines to spaces? Did you change this recently? Surely this is a bug/regression, or am I missing something completely? The plaintext implementation is completely rewritten but it passes all tests. Your particular case probably isn't covered by any of the tests right now. I'll check this as well. At the very least <br> seems to work right. PS! Where on the SourceForge page is the link to the manual (the one with the clickable tabs with examples, etc.)? I hope you didn't...
Please try again with current master. From what I can tell, the output looks right: [masked sample output]...
iconv() detected an illegal character in input string
This is fixed via [c53a612e6fe61d5b1efc0c3270e20aa34e4e84ee]. Instead of using //IGNORE, it needs to be wrapped inside a try-catch block, so that the character set is detected properly. Eventually, this will be replaced by a better solution, but this works for now. Thanks again for reporting!
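As a rough sketch of the try-catch idea (not the code from the commit): iconv() on its own raises a notice and returns false instead of throwing, so this example assumes a temporary error handler that converts the notice into an ErrorException. The helper name `convert_or_null` is hypothetical, and the iconv extension is assumed to be available.

```php
<?php
// Hypothetical helper, not the library's code: returns the converted string,
// or null if iconv() reports a problem.
function convert_or_null($from, $to, $text)
{
    // Turn iconv()'s notice/warning into an exception for the duration of the call.
    set_error_handler(function ($severity, $message, $file, $line) {
        throw new ErrorException($message, 0, $severity, $file, $line);
    });

    try {
        $converted = iconv($from, $to, $text);
    } catch (ErrorException $e) {
        $converted = false;
    } finally {
        // Always put the previous handler back, whether or not iconv() failed.
        restore_error_handler();
    }

    return $converted === false ? null : $converted;
}

$ok = convert_or_null('UTF-8', 'UTF-8', 'plain text');
$bad = convert_or_null('UTF-8', 'UTF-8', "\xFF"); // 0xFF never occurs in valid UTF-8
```

A caller can then treat null as "this charset guess was wrong" and try the next candidate, instead of silently dropping characters with //IGNORE.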
HtmlDocument: Let the parser decode entities
HtmlNode: Fix empty if-statement
HtmlDocument: Inline token_equal, _slash, and _attr
HtmlDocument: Use try-catch block for iconv
docs: Include recent changes
HtmlDocument: Don't use magic functions
HtmlDocument: Fix broken $forceTagsClosed = false
HtmlDocument: Use native functions for tag names and attribute values
HtmlNode: Stop removing UTF-8 BOM from the end of a string
HtmlDocument: Add shortcuts for the parser
HtmlDocument: Inline skip method
HtmlDocument: Use shortcuts for seek methods
Fix memory parsing test
HtmlDocument: Don't assign nodes by reference
HtmlDocument: Don't remove noise before parsing.
Incorrect handling of <br> tags next to line breaks
Thanks for reporting. I fixed your original message. You are right, the current implementation of <br> is wrong. I haven't tested this yet but it should give slightly better results if you define DEFAULT_BR_TEXT like this: define("DEFAULT_BR_TEXT", PHP_EOL) At the very least, this makes it platform independent. That said, there is additional work to do in the parser to handle all cases (like the <br> a case).
iconv() detected an illegal character in input string
Thanks for reporting. It took me a while to figure out what is going on. Am I right to assume that you are running on PHP 8.x? In previous versions that error would not have been reported because of the error suppression operator (@). (Un-)fortunately the behavior of this operator changed in PHP 8: https://php.watch/versions/8.0/fatal-error-suppression The behavior for //IGNORE depends on the specific implementation of iconv, some of which completely ignore this flag. Still, this is a good hack to...
PHP 7.x compatibility
parsing stops after first multibyte character
Comments on MAX_FILE_SIZE
Notify when zero elements were found
Role attribute
Slashdot example updated
Thanks for the feedback! The example in 1.9 is probably not functional anymore, but there is an updated version in the current master that still works. Here is the link for future reference: https://sourceforge.net/p/simplehtmldom/repository/ci/master/tree/example/scraping/example_scraping_slashdot.php
Is this project active anymore?
That choice is entirely up to you.
How to avoid break on 404 errors?
Good to know you found a solution :)
Traversing the DOM within a series of columns
You probably figured it out in the meantime, but here is a complete example that will give you what you want.

    <?php
    include_once 'simple_html_dom.php';

    $doc = <<<EOD
    <tr>
      <td></td>
      <td id="column2" class="style3">A</td>
      <td id="column2" class="style2">B</td>
      <td>
        <a href="#link">Description of Link</a>
      </td>
    </tr>
    EOD;

    $html = str_get_html($doc);

    $href = $html->find('a', 0)->href;
    $description = $html->find('a', 0)->innertext;

    echo $href . PHP_EOL . $description . PHP_EOL;
    // #link
    // Description...
Removing tags does not work
This is probably no longer relevant, but the foreach loop in your example iterates over the first script element only instead of over all script elements:

    foreach($items->find('script',0) as $e) {
        $e->outertext = '';
        echo '$e: ' . $e . '<br/>';
    }

Notice the ,0 in ->find('script',0). With an index, find() returns a single element rather than an array of elements, which is why the error occurs. Here is the correct version:

    foreach($items->find('script') as $e) {
        $e->outertext = '';
        echo '$e: ' . $e . '<br/>';
    }
Timezone change
Uncaught Error: Call to a member function find() on string in ... Stack trace: #0 {main} thrown in
HTTP Request failed
Scraping a JSON script
Missing whitespace in plaintext property
966c5e39493eff7dc1eb77e0004bdc0015037b34 fixes various issues related to spaces and line breaks when generating plain text. For all the examples provided here, it produces the correct output. It also properly collapses superfluous spaces and line breaks, so that the output should be much more readable, especially for awkwardly formatted HTML documents.
HtmlNode: Improve plain text output of text()
I completely misunderstood the original report and finally figured it out. Thanks for all your feedback! This issue is resolved via [d6dcf50d6b03eb1d0c575abb7011abb658fefcf1] [4ad20901f0e63356cb3eb15a1cf4d9bf3a9837cc] by comparing the string length with PHP_MAXPATHLEN before calling is_file(). Edit: Had to fix the fix because the original fix was broken :)
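The length check described above can be sketched roughly as follows. The helper name `is_loadable_file` is hypothetical, and the exact comparison operator used in the commit is an assumption. The idea is simply to never hand is_file() a string that is too long to be a path.

```php
<?php
// Hypothetical helper name; the length check mirrors the fix described above.
function is_loadable_file($candidate)
{
    // Strings longer than PHP_MAXPATHLEN cannot be valid paths, so is_file()
    // is skipped for them and no warning is raised.
    if (strlen($candidate) > PHP_MAXPATHLEN) {
        return false;
    }

    return is_file($candidate);
}

$tooLong = str_repeat('a', PHP_MAXPATHLEN + 1);
$tempFile = tempnam(sys_get_temp_dir(), 'shd');
```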
HtmlDocument: Fixing the fix :)
I completely misunderstood the original report and finally figured it out. Thanks for all your feedback! This issue is resolved via [d6dcf50d6b03eb1d0c575abb7011abb658fefcf1] by comparing the string length with PHP_MAXPATHLEN before calling is_file().
is_file(): File name is longer than the maximum allowed path length on this platform (4096)
HtmlDocument: Check PHP_MAXPATHLEN before is_file()
Fix large file parsing test
Fix spelling mistakes.
HtmlNode: Optimize control flow for seek().
HtmlNode: Simplify charset checks before calling iconv.
HtmlNode: Break up complex if-else-statement into more readable chunks.
Remove unnecessary curly braces syntax.
Cleanup duplicate branches in switch statements.
HtmlNode: Verify that constructor argument is instance of HtmlDocument
HtmlDocument: Simplify regex expressions
docs: Fix broken page links
phpunit: Remove unnecessary default value assignment
examples: Initialize $data variable
HtmlNode: Reduce complexity of CSS selector regex
docs: Fix table formatting in markdown files
docs: Move page titles to mkdocs.yml
docs: Update Google Analytics to GA4 and display prev/next buttons
Reorganize docs
HtmlNode: Use only HtmlElement to determine block-level elements
Use HtmlElement::isRawTextElement() in HtmlDocument and HtmlNode
Add new class to handle HTML elements
Fixed character translation error in iconv()
I can see how this is annoying. Unfortunately, UTF-8//IGNORE silently discards characters that cannot be represented in the target charset, which may result in incorrect output. As you already know, UTF-8//TRANSLIT also doesn't always work and heavily depends on the actual implementation of iconv and system settings (some of which completely ignore //TRANSLIT). This unreliability of iconv is why it is better to have a notice reported here and leave the choice to the caller. You can actually override...
Pseudo-classes are currently not supported. Refer to the documentation for the find method for a list of supported selectors. In this case, I suggest using the lastChild method, which will give you the same result as :last-child.
:last-child selector doesn't work
Wrong variable name at str_get_html
Patch for PHP 8
Patch for PHP 8
Great, glad to hear it works for you :) The reason for adding the condition over changing the default values is to make sure it works even when a caller passes null as an argument.
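A small sketch of why the condition matters; the function name `normalize_input` is hypothetical and not from the library. A default parameter value only applies when the argument is omitted, so an explicit null still reaches the function body, and passing null on to an internal function such as trim() is deprecated as of PHP 8.1.

```php
<?php
// Hypothetical function for illustration; names are not taken from the library.
function normalize_input($str = '')
{
    // The default value only applies when the argument is omitted. An explicit
    // null still arrives here, so it is handled by a condition.
    if ($str === null) {
        $str = '';
    }

    return trim($str);
}

$omitted = normalize_input();
$explicitNull = normalize_input(null);
$padded = normalize_input("  text \n");
```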
.github: Add PHP compatibility check to workflow
phpcompatibility: Update and clarify compatibility standards
composer: Downgrade phpcs to version 2.x
I see, that makes sense. Thanks for clarifying. This is actually a deprecation warning and not an error. It occurs when calling trim(null) (in the case that $str = null) because trim() expects a non-nullable string. This warning was added in PHP 8.1: https://www.php.net/releases/8.1/en.php#deprecations_and_bc_breaks Passing null to non-nullable internal function parameters is deprecated. There are likely other places that are affected by this. This particular case, however, is fixed in [1765ac4494a05d5c84408398127e6539f6bc1238]....
HtmlDocument: Don't pass null to trim()
Possible XSS vulnerability
Thanks for reporting this issue. While I agree that this is a bug in the attribute handler, it is not an XSS vulnerability, at least not for this project. This issue is fixed in [a706de9bcb3b74ad10e04cc0b2de0d1b35007ab4].
HtmlNode: Add quotes to unquoted attribute value depending on content
README: Replace Travis-CI badge with GitHub Workflow badge
Wrong variable name at str_get_html
The parameter name, $str, looks correct to me. What version of the library are you using?

    function str_get_html(
        $str,
        $lowercase = true,
        $forceTagsClosed = true,
        $target_charset = DEFAULT_TARGET_CHARSET,
        $stripRN = true,
        $defaultBRText = DEFAULT_BR_TEXT,
        $defaultSpanText = DEFAULT_SPAN_TEXT)
    {
        $dom = new simple_html_dom(
            null,
            $lowercase,
            $forceTagsClosed,
            $target_charset,
            $stripRN,
            $defaultBRText,
            $defaultSpanText
        );

        if (empty($str) || strlen($str) > MAX_FILE_SIZE) {
            $dom->clear();
            return false;...