Get only text in leaf nodes (avoid duplication)
End tags erroneously included in plaintext
Actually it has to be done something like this, because this function can be called from inside the library, and we want to get the first call that is outside the library. PS! Is the maintainer active these days? Has been quiet for a while. diff --git a/HtmlNode.php b/HtmlNode.php index 9649d37..99dbda4 100644 --- a/HtmlNode.php +++ b/HtmlNode.php @@ -549,3 +554,12 @@ class HtmlNode { - return $this->find($selector, $idx, $lowercase) ?: null; + if(!$element = $this->find($selector, $idx, $lowercase))...
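A minimal sketch of the "first call outside the library" idea, assuming the library lives in a single directory. The helper name `firstExternalFrame` is hypothetical and not part of the library; it only walks a `debug_backtrace()`-style array.

```php
<?php
/**
 * Return the first backtrace frame whose file lies outside $libraryDir,
 * or null if every frame belongs to the library.
 * Hypothetical helper for illustration only.
 */
function firstExternalFrame(array $trace, string $libraryDir): ?array
{
    $prefix = rtrim($libraryDir, '/\\') . DIRECTORY_SEPARATOR;

    foreach ($trace as $frame) {
        // Frames for internal callbacks may have no 'file' key; skip them.
        if (!isset($frame['file'])) {
            continue;
        }
        // A frame is "external" when its path does not start with the library prefix.
        if (strncmp($frame['file'], $prefix, strlen($prefix)) !== 0) {
            return $frame;
        }
    }

    return null;
}

// Possible use inside a library method (assumption: the library sits in __DIR__):
// $frame = firstExternalFrame(debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS), __DIR__);
```

The returned frame carries the `file` and `line` of the user's own call, which is what an "element not found" notice would want to report.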
diff --git a/HtmlNode.php b/HtmlNode.php index 9649d37..aef8b17 100644 --- a/HtmlNode.php +++ b/HtmlNode.php @@ -549,3 +549,12 @@ class HtmlNode + function first($selector, $idx = 0, $lowercase = false) + { + return $this->expect($selector, $idx, $lowercase); + } Missed semicolon and preformatting.
Convenience function for getting first element
diff --git a/simple_html_dom.php b/simple_html_dom.php index bce4d9e..97d6e1d 100644 --- a/simple_html_dom.php +++ b/simple_html_dom.php @@ -117,3 +117,3 @@ function file_get_html( $dom->clear(); - return false; + $contents = ""; } @@ -144,5 +144,4 @@ function str_get_html( $dom->clear(); - return false; + $contents = ""; } - return $dom->load($str, $lowercase, $stripRN); Better version with tabs. PS! The inline editor and preview function on the site seem to hide the first line of content :|
Always tell the user where a non-existent element was expected
Never return false on documents
Looks good now! However, you must set the Unicode flag, or else preg_replace() may return an invalid Unicode string, which may cause the second preg_replace() to return NULL, and a deprecation error for the third preg_replace(). diff --git a/HtmlNode.php b/HtmlNode.php index 9bc6a1a..9649d37 100644 --- a/HtmlNode.php +++ b/HtmlNode.php @@ -351 +351 @@ class HtmlNode - $ret = preg_replace('/\s+/', ' ', $ret); + $ret = preg_replace('/\s+/u', ' ', $ret); And about the manual: I see now that the navigation...
Looks good now! However, you must set the Unicode flag here or else preg_replace() returns NULL for certain strings, which causes (deprecation) errors further down. diff --git a/HtmlNode.php b/HtmlNode.php index 9bc6a1a..9649d37 100644 --- a/HtmlNode.php +++ b/HtmlNode.php @@ -351 +351 @@ class HtmlNode - $ret = preg_replace('/\s+/', ' ', $ret); + $ret = preg_replace('/\s+/u', ' ', $ret); And about the manual: I see now that the navigation sidebar is aligned far down upon page load, so that only...
The space thing works now, but the BR tag is still not handled well. Try the code in the original post and compare the output when (un)-commenting the commented line. PS! Would be nice if you could link to the manual from the "Support" section, because it was hard to find. https://sourceforge.net/projects/simplehtmldom/support
Shouldn't plaintext convert newlines to spaces? Did you change this recently? Surely this is a bug/regression, or am I missing something completely? $text = "<p>Hello" . "\n" . "World</p>"; $plain = str_get_html($text)->plaintext; echo "PLAINTEXT:\n" . $plain . "\n\n"; echo "WORDWRAP:\n" . wordwrap($plain, 80) . "\n"; PS! Where on the SourceForge page is the link to the manual (the one with the clickable tabs with examples, etc.)? I hope you didn't remove this, because I use it as a reference all...
The text should say "The BR tag is not...". Evidently this editor interprets HTML tags as-is, and initial postings can't be edited :/
Incorrect handling of <br> tags next to line breaks
I know the difference :) I thought it would be better to use a blank string to begin with rather than checking for null later, but you know your own code better and you probably have your reasons. Anyway, no more warnings with the new version, so I'm happy!
Without this patch I get an error message like: HtmlDocument.php(269):trim(): Passing null to parameter #1 ($string) of type string is deprecated $ php -v PHP 8.1.4 (cli) (built: Apr 4 2022 05:02:21) (NTS)
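For reference, the deprecation comes from passing null straight into trim() on PHP 8.1+. A minimal sketch of one way to avoid it, assuming null should simply be treated as an empty string (the wrapper name `trimOrEmpty` is made up for illustration and is not what the patch itself does):

```php
<?php
// Hypothetical wrapper: fall back to an empty string before trimming,
// so trim() never receives null.
function trimOrEmpty(?string $value): string
{
    return trim($value ?? '');
}

var_dump(trimOrEmpty(null));
var_dump(trimOrEmpty("  foo \n"));
```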
Very happy that you're still maintaining this project. It's my go-to library for parsing HTML and I use it every day. See also my small contribution #193 for compatibility with PHP 8.
Patch for PHP 8
find("ul a") finds a outside ul
How do I find p tags whose class contains neither foo nor bar? $html->find("p:not([class~=foo])"); # excludes foo, but still includes bar
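Until a selector covers this, one workaround is to filter the matched elements in plain PHP. The sketch below only contains the class test itself; `hasNoneOfClasses` is a hypothetical helper, and the commented usage assumes that `$p->class` yields the raw class attribute.

```php
<?php
/**
 * True if the class attribute contains none of the excluded class names.
 * Hypothetical helper for illustration only.
 */
function hasNoneOfClasses(?string $classAttr, array $excluded): bool
{
    // Split on whitespace, mirroring how [class~=foo] compares whole words.
    // A missing attribute is treated as having no classes at all.
    $classes = preg_split('/\s+/', trim($classAttr ?? ''), -1, PREG_SPLIT_NO_EMPTY);

    return array_intersect($classes, $excluded) === [];
}

// Possible use with the parser (assumption: $p->class is the class attribute):
// $wanted = array_filter(
//     $html->find("p"),
//     fn($p) => hasNoneOfClasses((string) $p->class, ["foo", "bar"])
// );
```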
Very nice solution!
Notify when zero elements were found
Comments on MAX_FILE_SIZE
Very nice! This is most useful. Feel free to close this issue and I'll reopen it if I see any problems!
Feel free to close this issue and I'll reopen it if I see any problems!
No problem :) If /u recognises \xc2\xa0 as one unit, your patch should work. Feel free to close this issue and I'll reopen it if I see any problems!
This example code is clearly wrong. Ignore it for the time being, and I'll update it as necessary. If not, you may close it in a week or so.
Access to array of matched elements
Match elements that don't contain a certain value
Another example. The following code breaks if nbsp is not handled as a character sequence. $html = str_get_html("«Hello, World»"); echo $html->plaintext; The fix I'm using at the moment: diff --git a/simple_html_dom.php b/simple_html_dom.php index c909d18..8e747f3 100644 --- a/simple_html_dom.php +++ b/simple_html_dom.php @@ -502,6 +502,8 @@ class simple_html_dom_node // Reduce whitespace at start/end to a single (or none) space - $ret = preg_replace('/[ \t\n\r\0\x0B\xC2\xA0]+$/', ' ',...
I can confirm that my enclosed example doesn't scream "ARGH!!" anymore. Sounds like you had a good understanding of the problem. I'll let you know if I run into similar issues.
:) 9d94f71 has the same problem as Feature Request #62 (which is really a Bug Report). trim() is not multibyte safe, and so trim($foo, "\xc2\xa0") removes \xc2 and \xa0 individually. See PHP manual pages for trim() for a proper solution. Implementing your own trim function may be necessary. Consider something like: $pattern = "(?:[\t\r\n ]|\xc2\xa0)+"; $foo = " \xc2\xa0\t\rfoo\xc2\xa0 "; $foo = preg_replace("/^{$pattern}|{$pattern}$/", "", $foo); echo "[$foo]\n";
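To make the difference concrete, here is a small self-contained sketch of a trim that treats the two-byte NBSP sequence as one unit. The function name `trimWithNbsp` is an assumption for illustration; it is not what the library uses.

```php
<?php
/**
 * Trim ASCII whitespace and UTF-8 encoded NBSP (\xC2\xA0) from both ends.
 * Hypothetical helper for illustration only.
 */
function trimWithNbsp(string $s): string
{
    // The alternation matches \xC2\xA0 only as a pair, so a lone \xC2 or \xA0
    // byte that belongs to another character is never stripped.
    $ws = '(?:[ \t\r\n]|\xC2\xA0)+';

    // D makes $ match only at the very end of the subject.
    return preg_replace('/^' . $ws . '|' . $ws . '$/D', '', $s) ?? $s;
}

echo '[' . trimWithNbsp(" \xc2\xa0\t\rfoo\xc2\xa0 ") . "]\n";

// By contrast, trim() works on single bytes and cuts "à" (\xC3\xA0) in half:
echo bin2hex(trim("\xc3\xa0", "\xc2\xa0")) . "\n";
echo bin2hex(trimWithNbsp("\xc3\xa0")) . "\n";
```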
1) It fixes the problem. 2) Your changes are not necessary, at least not for this isolated case. It's simple: The \s destroys Unicode sequences if you don't apply the u flag. You may have to do this in other places as well: There are six instances of preg_replace() using \s. Some of these may deal with ASCII-only strings, though. You should know what to do :)
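A rough way to check this on a given machine is sketched below. It makes no claim about what the byte-mode result looks like, since what \s matches for bytes above 0x7F depends on the locale; it only shows how to compare the two variants and how to test the result for valid UTF-8.

```php
<?php
$text = "Hello\xC2\xA0\n  World";   // contains an NBSP encoded as UTF-8

// Byte mode: the subject is processed byte by byte, so under some locales
// a lone \xA0 can be replaced, leaving an orphaned \xC2 behind.
$byteMode = preg_replace('/\s+/', ' ', $text);

// UTF-8 mode: the subject is processed as code points and is never split.
$utf8Mode = preg_replace('/\s+/u', ' ', $text);

// Inspect the raw bytes of both results.
echo bin2hex($byteMode) . "\n";
echo bin2hex($utf8Mode) . "\n";

// preg_match('//u', ...) returns 1 only if the subject is valid UTF-8.
var_dump(preg_match('//u', $byteMode) === 1);
var_dump(preg_match('//u', $utf8Mode) === 1);
```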
index a078078..708e993 100644 --- a/simple_html_dom.php +++ b/simple_html_dom.php @@ -2218,3 +2218,3 @@ class simple_html_dom // https://www.w3.org/TR/xml/#AVNormalize - $value = preg_replace("/[\r\n\t\s]+/", ' ', $value); + $value = preg_replace("/[\r\n\t\s]+/u", ' ', $value); $value = trim($value);
index a078078..708e993 100644 --- a/simple_html_dom.php +++ b/simple_html_dom.php @@ -2219 +2219 @@ class simple_html_dom - $value = preg_replace("/[\r\n\t\s]+/", ' ', $value); + $value = preg_replace("/[\r\n\t\s]+/u", ' ', $value);
I found the culprit :) $value = preg_replace("/[\r\n\t\s]+/", ' ', $value); // THE PROBLEM $value = preg_replace("/[\r\n\t\s]+/u", ' ', $value); // THE SOLUTION
Same problem. My UNIX locale settings are below, but unsetting them made no difference. My php.ini is a standard one. The problem does not occur on a Debian machine with PHP 7.0.x. LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_CTYPE=UTF-8 When doing print_r($html) and piping to less, the error is already there, in the object tree (simplified): [root] => simple_html_dom_node Object [children] => Array [0] => simple_html_dom_node Object [attr] => Array [content] => <C2> <C3> <C3> á It's curious that the 2nd...
Noted. So remove() must be used with caution, or not at all, until further notice.
Same problem. My UNIX locale settings are below, but unsetting them made no difference. My php.ini is a standard one. The problem does not occur on a Debian machine with PHP 7.0.x. LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_CTYPE=UTF-8 Do you think the misrepresentation happens during the building of the object tree (str_get_html) or when fetching the value (->content)?
Problem with the remove function
OK, I found the problem. It occurs when setting a locale. You should now be able to look into why the content property is extracted incorrectly. setlocale(LC_ALL, "fr_FR.UTF-8"); $latin = utf8_encode("\xa0\xc5\xe0\xe1"); for($i=0; $i<strlen($latin); $i+=2) printf("%02x%02x ", ord($latin[$i]), ord($latin[$i+1])); echo "\n"; $string = sprintf('<meta content="%s">', $latin); $html = str_get_html($string); $content = $html->find("meta", 0)->content; for($i=0; $i<strlen($content); $i+=2) printf("%02x%02x...
OK, I found the problem: When setting a European locale, content is extracted incorrectly. setlocale(LC_ALL, "fr_FR.UTF-8"); $latin = utf8_encode("\xa0\xc5\xe0\xe1"); for($i=0; $i<strlen($latin); $i+=2) printf("%02x%02x ", ord($latin[$i]), ord($latin[$i+1])); echo "\n"; $string = sprintf('<meta content="%s">', $latin); $html = str_get_html($string); $content = $html->find("meta", 0)->content; for($i=0; $i<strlen($content); $i+=2) printf("%02x%02x ", ord($content[$i]), ord($content[$i+1])); echo "\n";...
UTF-8 here too. $foo = $html->find("meta", 0)->content; for($i=0; $i<strlen($foo); $i+=2) printf("%02x%02x\n", ord($foo[$i]), ord($foo[$i+1])); c220 <-- c2a1 c2a2 c2a3 ... c2bd c2be c2bf c380 c381 c382 c383 c384 c320 <-- c386 c387 c388 c389 ... c39d c39e c39f c320 <-- c3a1 c3a2 c3a3 ... c3bd c3be c3bf
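The byte dump above can also be produced without the manual loop. A small sketch, with hypothetical helper names, that prints the bytes in two-byte groups and flags strings that are no longer valid UTF-8:

```php
<?php
// Hypothetical helpers for inspecting extracted attribute values.

/** Hex dump in groups of two bytes, e.g. "c2a0 c385". */
function hexPairs(string $bytes): string
{
    // bin2hex() yields two hex digits per byte; four digits make one pair of bytes.
    return implode(' ', str_split(bin2hex($bytes), 4));
}

/** True if the string is a valid UTF-8 byte sequence. */
function isValidUtf8(string $bytes): bool
{
    // An empty pattern with the u flag matches only if the subject is valid UTF-8.
    return preg_match('//u', $bytes) === 1;
}

echo hexPairs("\xc2\xa0\xc3\x85") . "\n";
var_dump(isValidUtf8("\xc2\xa0"));   // intact NBSP
var_dump(isValidUtf8("\xc2\x20"));   // \xC2 followed by a plain space, as in "c220" above
```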
I'm using the latest Git master (0e03308). Piping to less, I get the following. Notice C2 and C3 which indicate that they are single bytes (ie. an incomplete Unicode character sequence). <C2> ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄ<C3> ÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß<C3> áâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
I'm using the latest Git master. $html = str_get_html("<p>Hello, World</p><p>\xc2\xa0</p>"); foreach($html->find("p") as $p) printf("[%s]\n", $p->plaintext); The output I expect is: [Hello, World] What I get is: [Hello, World] [ ] <-- nbsp inside the brackets I don't want the empty element. I want the parser to consider nbsp as space and trim it, ultimately excluding it from the array of found elements.
Unicode characters not extracted correctly
Consider nbsp to be whitespace
$html = file_get_html("https://www.rogerebert.com/reviews/dark-phoenix-2019"); foreach($html->find("div[itemprop=reviewBody] > p") as $p) printf("%s\n\n", wordwrap($p->plaintext)); I found an instance where whitespace is not removed (marked with underscore). Can you fix this? ...and “_X-Men: Apocalypse_,” Simon Kinberg_’s directorial debut... ...Jean Grey, Professor X, Raven (_Jennifer Lawrence_)... ...named Vuk (who takes the body of Jessica Chastain_) is encouraging...
I'm so happy with these changes. The package is now like a dream to use, because you instantly get the content you want without struggling with manual trimming and decoding every single time.
When will everything be merged?
Just post a notice when you're all done merging everything to master. I'm so happy this project is active. I really thought it was abandonware when I first started using it, as bug reports were years old with no response, etc. For whatever reason things have gotten back on track, I'm grateful, and happy to help. Actually, parsing HTML is so fundamental, and this library is so user-friendly, that I think it should be an integral part of PHP. (PHP doesn't have anything out of the box for this.)
Looks like it works! Let me know when you merge things, so I don't have to choose between decoding and trimming :)
Thanks for a good follow-up. I'll try the latest commit. I actually know one of the main W3C guys, so if necessary I could ask for his opinion on the matter. Misplaced whitespace typically ends up in HTML because of sloppiness by non-tech people, e.g. a journalist copy-pasting an article title (in a proportional font) into a CMS, not noticing the minuscule leading/trailing whitespace. So you could say that this is where the trim function really belongs: At the origin of the fault. I doubt any developer...
Thanks for the fix! Does the W3C HTML specification say that whitespace inside quotes is in fact part of the actual value? If so, I sort of concede, but not very happily, I must admit. Though whitespace inside quotes is no doubt due to sloppiness on the page author's part, in the real world you always want trimmed values to avoid messing up database fields, plain-text terminal output, markup, and other sources that would carry the whitespace with them. I appreciate you're trying to follow standards,...
Any news on trimming?
$html = " <p> <span> </span> foo </p> "; $html = str_get_html($html); $html->find("span", 0)->remove(); printf("(%s)\n", $html->find("p", 0)->plaintext); $html = str_get_html(' <meta name="description" content=" bar ">'); printf("(%s)\n", $html->find("meta[name=description]", 0)->content);
$html = " <p> <figure> </figure> foo </p> "; $html = str_get_html($html); $html->find("figure", 0)->remove(); printf("(%s)\n", $html->find("p", 0)->plaintext); $html = str_get_html(' <meta name="description" content=" bar ">'); printf("(%s)\n", $html->find("meta[name=description]", 0)->content);
Ever thought about migrating the whole thing to GitHub? SourceForge feels kind of outdated, though there may be aspects of this I don't know about...
I did a couple of quick tests and it seems to work as expected. Looks like it didn't require much coding either, so all good and everyone's happy! I'll let you know if anything breaks.
This is just not my day: error: the requested upstream branch 'origin/EntityDecoding' does not exist. Starting over seemed to work better: $ git clone git://git.code.sf.net/p/simplehtmldom/repository $ cd repository $ git checkout EntityDecoding $ git fetch --all On a side note, it would be good if the basename of the URL was simple_html_dom or simplehtmldom (whichever is the official), rather than repository.
This is just not my day: error: the requested upstream branch 'origin/EntityDecoding' does not exist. I started over, which seemed to work better: $ git clone git://git.code.sf.net/p/simplehtmldom/repository $ cd repository $ git checkout EntityDecoding $ git fetch --all On a side note, it would be good if the basename of the URL was simple_html_dom or simplehtmldom (whichever is the official), rather than repository.
This is just not my day: error: the requested upstream branch 'origin/EntityDecoding' does not exist. I also tried starting over, to no avail: $ git clone git://git.code.sf.net/p/simplehtmldom/repository $ cd repository $ git fetch --all $ git branch * master On a side note, it would be good if the basename of the URL was simplehtmldom or simple_html_dom, rather than repository.
I still had to do git pull origin EntityDecoding. Maybe this has something to do with .gitconfig and definition of remotes.
Looks fine now. Post a notice when you have trimming in place.
Trimming doesn't seem to take place. You wrote "it [trimming] needs to be applied before decoding". Does this mean it's not implemented yet?
Found some things that are not decoded as expected: $html = str_get_html('<meta name="description" content="Häagen-Dazs">'); echo $html->find("meta[name=description]", 0)->content . "\n"; echo $html->find("meta[name=description]", 0)->getAttribute("content") . "\n"; Results in: Häagen-Dazs Häagen-Dazs
Very exciting! Will try this. I wouldn't worry about edge cases like &amp; for normal use. The only relevant case seems to be markup that is referring verbatim to entities.
It doesn't seem to break any of my scripts. However, trimming doesn't occur. You wrote "it [trimming] needs to be applied before decoding". Does this mean it's not implemented yet?
I had to do git checkout -b EntityDecoding and then git pull origin EntityDecoding. Let me know if there's an easier way to pull a branch.
PS! Which Git commands do I use to get this branch/commit? A normal pull gets just the master branch, and there is no such commit there.
Very exciting! Will try this. I wouldn't worry about edge cases like &amp; for normal use. Hope other people will test it too and shed light on problems that are likely, if any.
Very exciting! Will try this. I wouldn't worry about edge cases like &amp; for normal use. Hope other people will test it too and shed light on problems that are likely, if any. PS! Which Git commands do I use to get this branch/commit? A normal pull gets just the master branch, and there is no such commit there.
Very exciting! Will try this and let you know how it works for me. Stuff like &amp; are very unlikely edge cases I wouldn't worry about for normal use. Hope other people will test it too and shed light on problems that are likely, if any. PS! Which Git commands do I use to get this branch/commit? A normal pull gets just the master branch, and there is no such commit there.
Very exciting! Will try this and let you know how it works for me. Stuff like &amp; are very unlikely edge cases I wouldn't worry about for normal use. Hope other people will test it too and shed light on problems that are likely, if any.
I doubt a change in performance would have any significant real world impact. I think the correct thing is to always decode and trim. Not doing so is relying too much on the HTML, which will break in other ways if the HTML changes. As for breaking code, decoding a string that is already decoded will practically always return the same string. So I think you could actually get away with just changing the code. Unless a lot of people really expect a lot of non-trimmed, non-decoded strings. I would shout...
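The claim about decoding twice can be checked with PHP's own html_entity_decode(). This is only a sketch of the idea, not the library's decoding code, and the wrapper name `decodeText` is made up. It also shows the one kind of input where a second pass does change the result: markup that refers to an entity verbatim.

```php
<?php
// Hypothetical wrapper around PHP's built-in entity decoder.
function decodeText(string $s): string
{
    return html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Normal case: a second decode leaves the string unchanged.
$once  = decodeText('H&auml;agen-Dazs');
$twice = decodeText($once);
var_dump($once === $twice);

// Edge case: text that talks about an entity literally.
$literal = decodeText('&amp;auml;');     // one pass gives the literal text "&auml;"
var_dump($literal === '&auml;');
var_dump(decodeText($literal) === $once); // a second pass would turn it into "ä", so this is false
```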
Actually, trim() would also be a desired default. I have never needed not to remove surrounding whitespace from an accessed value (unless when assuming/hoping there's never going to be any).
Always decode content values from the DOM tree
Works now. Either it was the limit, or the source code changed. Will let you know if something similar happens.
Wikipedia breaks the parser
Removed elements aren't properly removed
I think what I mostly would expect is the plain text to look like the text as displayed in the browser, in other words, a single whitespace no matter how many whitespaces are in the source. Possibly, there may be instances where multiple whitespace characters are desired (like you hint at), but I can't think of any at the moment. Of course the replace will work around the problem.
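For anyone hitting the same thing, the workaround amounts to something like the sketch below: collapse every run of whitespace to a single space after reading plaintext. The function name `collapseWhitespace` is hypothetical, and the fallback to the input on invalid UTF-8 is my own assumption.

```php
<?php
// Hypothetical post-processing step for values read from ->plaintext.
function collapseWhitespace(string $s): string
{
    // Replace each run of whitespace with one space; the u flag keeps
    // multibyte characters intact. preg_replace() returns null on invalid
    // UTF-8, in which case the input is kept as-is.
    $collapsed = preg_replace('/\s+/u', ' ', $s) ?? $s;

    return trim($collapsed);
}

echo collapseWhitespace("I am saying   \n Hello World  \t to you.") . "\n";
```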
Seems to work! However, the example below creates multiple whitespace where there should be only one. I don't know if this is a different bug, but it's at least somewhat related. $str = '<p>I am saying <!-- --> <span> <a href=""> Hello World </a> </span> <!-- --> to you.</p>'; $html = str_get_html($str); echo $html->find("p", 0)->plaintext . "\n"; I am saying Hello World to you.
Under "Files", the latest version (1.8.1) of simplehtmldom is dated 2019-01-13 (three weeks old), which seemed outdated. So I did svn checkout https://svn.code.sf.net/p/simplehtmldom/code/trunk simplehtmldom-code, which I assumed would give me the latest version. I now see that this gave me an ancient version @version 1.5 ($Rev: 210 $), for unknown reasons. I also now noticed that there is a Git repository, which puzzles me, as I thought SourceForge had no affiliation with Git. When cloning this,...
Fatal error: Stream does not support seeking
Similarly, the following produces "World.Hello". Should produce "World. Hello". $file = '<a href="">World. </a>Hello';
Missing whitespace in plaintext property
Isn't there a profiling/kernel expert on the team or somewhere in the community?...
Accessible and fast CLI-mode
$html->find("*") does not find all tags
Cursor starts running off / Unable to reset emulator
My mistake this time. I forgot ./configure --with-readline. Now sqsh works with cursor...