DELETE THIS BUG REPORT
DELETE THIS REQUEST
DELETE THIS FEATURE REQUEST
OK, wrong test. The method is called like $sections = $html->find('section')->firstChild(); but I got the same error because the result of find() is an array. So it is not the same as the CSS pseudo-selector :first-child. How can I use it to get the same result as CSS? TNX
Sorry, wrong goal. Close this. The correct answer is here: https://sourceforge.net/p/simplehtmldom/support-requests/63/
Get the elements of the upper level like with CSS pseudo selector :first-child
Find first child element like CSS does not respect order
Decoding HTML entities corrupts text in HTML
This bug persists even with well-formed HTML with single root element: <?php $s_htm = <<<EOT <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"> <body> <div class="c1"></div> <div class="c2"></div> </body> </html> EOT; ...
$node->find() finds element next to $node
I've also submitted a patch to address this issue here: https://sourceforge.net/p/simplehtmldom/feature-requests/68/
Added feature to enable/disable htmlentity operations
The output doesn't match the input even when the input hasn't been modified
PHP 8.x support
Get only text in leaf nodes (avoid duplication)
Is Github page safe to use for downloads?
End tags erroneously included in plaintext
There is a typo in the command: the "p" is missing in the second "simple". It should be: composer require simplehtmldom/simplehtmldom dev-master
"This is not a bug in simplehtmldom." Yes, it is. You're not setting a user agent in either the curl code or the stream_context code. Any properly configured server will reject the requests, which makes the project useless. You need to either add a generic user agent (I recommend Googlebot) or provide a way for users to pass in their own user agent to the function. See lines 72 and 111 of the revised HtmlWeb.php file.
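For readers following the user-agent point above, here is a minimal sketch of passing a caller-supplied user agent through a stream context. It uses only PHP's built-in stream_context_create() and its documented 'http' options. The function name build_http_context and the example user-agent string are hypothetical, and this is not the code from HtmlWeb.php.

```php
<?php
// Hypothetical helper (not part of simplehtmldom): builds a stream context
// that sends the given User-Agent header with HTTP requests.
function build_http_context(string $userAgent, float $timeoutSeconds = 10.0)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,      // documented 'http' context option
            'timeout'    => $timeoutSeconds, // read timeout in seconds
        ],
    ]);
}

// Example user agent; the string itself is an assumption for illustration.
$context = build_http_context('ExampleScraper/1.0');
$options = stream_context_get_options($context);
echo $options['http']['user_agent'] . PHP_EOL;

// The context could then be passed to a fetch call, for example:
// $body = file_get_contents('https://example.com/', false, $context);
```

No network request is made here; the fetch call is left commented out so the sketch runs offline.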
Actually it has to be done something like this, because this function can be called from inside the library, and we want to get the first call that is outside the library. PS! Is the maintainer active these days? Has been quiet for a while. diff --git a/HtmlNode.php b/HtmlNode.php index 9649d37..99dbda4 100644 --- a/HtmlNode.php +++ b/HtmlNode.php @@ -549,3 +554,12 @@ class HtmlNode { - return $this->find($selector, $idx, $lowercase) ?: null; + if(!$element = $this->find($selector, $idx, $lowercase))...
diff --git a/HtmlNode.php b/HtmlNode.php index 9649d37..aef8b17 100644 --- a/HtmlNode.php +++ b/HtmlNode.php @@ -549,3 +549,12 @@ class HtmlNode + function first($selector, $idx = 0, $lowercase = false) + { + return $this->expect($selector, $idx, $lowercase); + } Missed semicolon and preformatting.
Convenience function for getting first element
diff --git a/simple_html_dom.php b/simple_html_dom.php index bce4d9e..97d6e1d 100644 --- a/simple_html_dom.php +++ b/simple_html_dom.php @@ -117,3 +117,3 @@ function file_get_html( $dom->clear(); - return false; + $contents = ""; } @@ -144,5 +144,4 @@ function str_get_html( $dom->clear(); - return false; + $contents = ""; } - return $dom->load($str, $lowercase, $stripRN); Better version with tabs. PS! The inline editor and preview function on the site seems to hide the first line of content :|
Always tell user where he expected non-existing element
Preg_match error occurs after saving new contentblock
Never return false on documents
I see. One would use $e->innertext to get the text inside the tag.
No anchor text returned
"Creation of dynamic property" warning in PHP 8.2 (version 1.9.1)
Thanks for your bug report. This is actually a typo. The variable should be called $optional_closing_tags. There is a recent commit in master that illustrates the fix. This should also work in PHP 8.2 and higher. [8dc21bcb714c4edcb4318bdc3f198f4f78762381]
disregard
If I understand attribute selectors correctly, this is not working again: the *, ^ and $ operators all return 2. Example: echo count( str_get_html('<html><body><span class="first second">Hello!</span><span id="third">ME OH MI!</span></body></html>')->find('span[class^=second]') ); I have been trying to use attribute selectors to 'find' a div whose id consists of random numbers followed by -slideshow (e.g. 8099435804-slideshow), and I haven't been able to get it to work. In my case it returns all divs in the...
"Creation of dynamic property" warning in PHP 8.2 (version 1.9.1)
Incorrect handling of <br> tags next to line breaks
Looks good now! However, you must set the Unicode flag, or else preg_replace() may return an invalid Unicode string, which may cause the second preg_replace() to return NULL, and a deprecation error for the third preg_replace(). Good catch. Fixed via [b8d048e46b7f1964c28ea041d39ccb1d05f9a0ed]. And about the manual: I see now that the navigation sidebar is aligned far down upon page load, so that only the documentation for the functions (isset etc.) is immediately visible, not the more useful "Quick...
HtmlNode: Replace and collapse unicode whitespace in plaintext
Looks good now! However, you must set the Unicode flag, or else preg_replace() may return an invalid Unicode string, which may cause the second preg_replace() to return NULL, and a deprecation error for the third preg_replace(). diff --git a/HtmlNode.php b/HtmlNode.php index 9bc6a1a..9649d37 100644 --- a/HtmlNode.php +++ b/HtmlNode.php @@ -351 +351 @@ class HtmlNode - $ret = preg_replace('/\s+/', ' ', $ret); + $ret = preg_replace('/\s+/u', ' ', $ret); And about the manual: I see now that the navigation...
Looks good now! However, you must set the Unicode flag here or else preg_replace() returns NULL for certain strings, which causes (deprecation) errors further down. diff --git a/HtmlNode.php b/HtmlNode.php index 9bc6a1a..9649d37 100644 --- a/HtmlNode.php +++ b/HtmlNode.php @@ -351 +351 @@ class HtmlNode - $ret = preg_replace('/\s+/', ' ', $ret); + $ret = preg_replace('/\s+/u', ' ', $ret); And about the manual: I see now that the navigation sidebar is aligned far down upon page load, so that only...
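The NULL return described in the comments above can be reproduced with built-in functions alone. This is a small sketch rather than the library's code: once a string contains invalid UTF-8, a preg_replace() call with the u modifier fails and returns NULL. The input strings are made up for illustration.

```php
<?php
// A lone 0xC2 lead byte followed by a space is not valid UTF-8.
$invalid = "Hello\xC2 World";

// With the u modifier the subject must be valid UTF-8, so this call fails.
$result = preg_replace('/\s+/u', ' ', $invalid);
$error  = preg_last_error();

var_dump($result);                        // NULL
var_dump($error === PREG_BAD_UTF8_ERROR); // bool(true)

// On valid input the same pattern collapses runs of whitespace as intended.
$collapsed = preg_replace('/\s+/u', ' ', "Hello \n\t World");
var_dump($collapsed);                     // string(11) "Hello World"
```

Passing that NULL on as the subject of a further preg_replace() call is what raises the deprecation notice on PHP 8.1 and later.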
PS! Would be nice if you could link to the manual from the "Support" section, because it was hard to find. https://sourceforge.net/projects/simplehtmldom/support Turns out that page is managed by SF. There is no way to change the contents of that page 😔 I added a "Manual" tab instead.
PS! Would be nice if you could link to the manual from the "Support" section, because it was hard to find. https://sourceforge.net/projects/simplehtmldom/support Good idea! I'll do that. The space thing works now, but the BR tag is still not handled well. Try the code in the original post and compare the output when (un)-commenting the commented line. I'm comparing the output of plaintext with what is displayed in the browser and it looks exactly the same. Please note that I have removed wordwrap()...
The space thing works now, but the BR tag is still not handled well. Try the code in the original post and compare the output when (un)-commenting the commented line. PS! Would be nice if you could link to the manual from the "Support" section, because it was hard to find. https://sourceforge.net/projects/simplehtmldom/support
[67c0f4e21091a9cc66151610a653724a0acb1b69] fixes the whitespace issue. Let me know if this works for you.
HtmlNode: Replace and collapse whitespace in plaintext
Shouldn't plaintext convert newlines to spaces? Did you change this recently? Surely this is a bug/regression, or am I missing something completely? The plaintext implementation is completely rewritten but it passes all tests. Your particular case probably isn't covered by any of the tests right now. I'll check this as well. At the very least <br> seems to work right. PS! Where on the SourceForge page is the link to the manual (the one with the clickable tabs with examples, etc.)? I hope you didn't...
Shouldn't plaintext convert newlines to spaces? Did you change this recently? Surely this is a bug/regression, or am I missing something completely? $text = "<p>Hello" . "\n" . "World</p>"; $plain = str_get_html($text)->plaintext; echo "PLAINTEXT:\n" . $plain . "\n\n"; echo "WORDWRAP:\n" . wordwrap($plain, 80) . "\n"; PS! Where on the SourceForge page is the link to the manual (the one with the clickable tabs with examples, etc.)? I hope you didn't remove this, because I use it as a reference all...
Please try again with current master. From what I can tell, the output looks right: ***** ** ********,*** ****** ********. ******* **** *** ** ** ******* *** *** ****** ***. *** **** *** ****** ****. *.***. ** **** *** ***** **** ***** ********* *** ************ ** *** *** *** *** ** ****. *** ** ****, ******** ******* *** ******** ********** * ** ******* ****. ******** ** *** *** **** ***********, ** *** *** *** * *** *** ********* .*** ***** ******* *** ** **** ***.*** ****** *** ** **** *** ********...
iconv() detected an illegal character in input string
This is fixed via [c53a612e6fe61d5b1efc0c3270e20aa34e4e84ee]. Instead of using //IGNORE, it needs to be wrapped inside a try-catch block, so that the character set is detected properly. Eventually, this will be replaced by a better solution, but this works for now. Thanks again for reporting!
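As a rough illustration of the try-catch approach mentioned above (a sketch under assumptions, not the code from the referenced commit): iconv() reports illegal input through PHP's error mechanism rather than by throwing, so the sketch installs a temporary error handler that turns the error into an ErrorException. The function name convert_or_null is hypothetical.

```php
<?php
// Hypothetical helper (not part of simplehtmldom): converts $text between
// character sets and returns null instead of raising an error on bad input.
function convert_or_null(string $text, string $from, string $to): ?string
{
    // Turn any notice or warning raised by iconv() into an exception.
    set_error_handler(static function (int $severity, string $message, string $file, int $line): bool {
        throw new ErrorException($message, 0, $severity, $file, $line);
    });

    try {
        $converted = iconv($from, $to, $text);
        // Some iconv implementations may return false without raising an error.
        return $converted === false ? null : $converted;
    } catch (ErrorException $e) {
        return null;
    } finally {
        // Always put the previous error handler back.
        restore_error_handler();
    }
}

$ok  = convert_or_null("caf\xC3\xA9", 'UTF-8', 'ISO-8859-1'); // valid UTF-8 input
$bad = convert_or_null("\xFF\xFE\xFD", 'UTF-8', 'ISO-8859-1'); // invalid UTF-8 input
var_dump($ok !== null, $bad === null);
```

A caller could treat a null result as "this guess at the source character set was wrong" and try the next candidate.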
HtmlDocument: Use try-catch block for iconv
HtmlNode: Fix empty if-statement
docs: Include recent changes
HtmlDocument: Let the parser decode entities
HtmlDocument: Inline token_equal, _slash, and _attr
HtmlDocument: Don't use magic functions
HtmlNode: Stop removing UTF-8 BOM from the end of a string
HtmlDocument: Fix broken $forceTagsClosed = false
HtmlDocument: Use shortcuts for seek methods
HtmlDocument: Inline skip method
HtmlDocument: Add shortcuts for the parser
HtmlDocument: Use native functions for tag names and attribute values
Fix memory parsing test
HtmlDocument: Don't remove noise before parsing.
HtmlDocument: Don't assign nodes by reference
Thanks for your good work! The error message first appeared after I upgraded PHP from 8.0 to 8.1 last week.
Incorrect handling of <br> tags next to line breaks
Thanks for reporting. I fixed your original message. You are right, the current implementation of <br> is wrong. I haven't tested this yet but it should give slightly better results if you define DEFAULT_BR_TEXT like this: define("DEFAULT_BR_TEXT", PHP_EOL) At the very least, this makes it platform independent. That said, there is additional work to do in the parser to handle all cases (like the <br> a case).
iconv() detected an illegal character in input string
Thanks for reporting. It took me a while to figure out what is going on. Am I right to assume that you are running on PHP 8.x? In previous versions that error would not have been reported because of the error suppression operator (@). (Un-)fortunately the behavior of this operator changed in PHP 8: https://php.watch/versions/8.0/fatal-error-suppression The behavior for //IGNORE depends on the specific implementation of iconv, some of which completely ignore this flag. Still, this is a good hack to...
The text should say "The BR tag is not...". Evidently this editor interprets HTML tags as-is, and initial postings can't be edited :/
Incorrect handling of <br> tags next to line breaks
iconv() detected an illegal character in input string
PHP 7.x compatibility
parsing stops after first multibyte character
Comments on MAX_FILE_SIZE
Notify when zero elements were found
Role attribute
Slashdot example updated
Thanks for the feedback! The example in 1.9 is probably not functional anymore, but there is an updated version in the current master that still works. Here is the link for future reference: https://sourceforge.net/p/simplehtmldom/repository/ci/master/tree/example/scraping/example_scraping_slashdot.php
That choice is entirely up to you.
Is this project active anymore?
How to avoid break on 404 errors?
Good to know you found a solution :)
Traversing the Dom within a series of columns
You probably figured it out in the meantime, but here is a complete example that will give you what you want. <?php include_once 'simple_html_dom.php'; $doc = <<<EOD <tr> <td></td> <td id="column2" class="style3">A</td> <td id="column2" class="style2">B</td> <td> <a href="#link">Description of Link</a> </td> </tr> EOD; $html = str_get_html($doc); $href = $html->find('a', 0)->href; $description = $html->find('a', 0)->innertext; echo $href . PHP_EOL . $description . PHP_EOL; // #link // Description...
This is probably no longer relevant, but the foreach loop in your example iterates over the first script element only instead of over all script elements. foreach($items->find('script',0) as $e) { $e->outertext = ''; echo '$e: ' . $e . '<br/>'; } Notice the ,0 in ->find('script',0): with an index, find() returns a single element rather than an array of elements. This is why the error occurs. Here is the correct version: foreach($items->find('script') as $e) { $e->outertext = ''; echo '$e: ' . $e . '<br/>'; }
Removing tags does not work
Timezone change