Activity for olalav

  • olalav olalav created ticket #67

    Get only text in leaf nodes (avoid duplication)

  • olalav olalav created ticket #205

    End tags erroneously included in plaintext

  • olalav olalav posted a comment on ticket #203

    Actually it has to be done something like this, because this function can be called from inside the library, and we want to get the first call that is outside the library. PS! Is the maintainer active these days? Has been quiet for a while. diff --git a/HtmlNode.php b/HtmlNode.php index 9649d37..99dbda4 100644 --- a/HtmlNode.php +++ b/HtmlNode.php @@ -549,3 +554,12 @@ class HtmlNode { - return $this->find($selector, $idx, $lowercase) ?: null; + if(!$element = $this->find($selector, $idx, $lowercase))...
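
    The idea of reporting the first call outside the library can be sketched as below. This is only an illustration under assumptions: firstExternalFrame() and the sample paths are hypothetical and not part of simplehtmldom, the trace array merely mimics the shape of debug_backtrace() output, and paths are assumed to use forward slashes.

```php
<?php
// Hypothetical helper (not part of simplehtmldom): return the first stack
// frame whose file lies outside the given library directory.
function firstExternalFrame(array $trace, string $libraryDir): ?array
{
    // Assumes forward-slash paths; the trailing slash avoids prefix collisions.
    $prefix = rtrim($libraryDir, '/') . '/';
    foreach ($trace as $frame) {
        // Frames for internal functions may have no 'file' entry; skip them.
        if (!isset($frame['file'])) {
            continue;
        }
        if (strncmp($frame['file'], $prefix, strlen($prefix)) !== 0) {
            return $frame;
        }
    }
    return null;
}

// Hand-written trace shaped like debug_backtrace() output (illustrative paths).
$trace = [
    ['file' => '/lib/simplehtmldom/HtmlNode.php', 'line' => 554, 'function' => 'expect'],
    ['file' => '/lib/simplehtmldom/HtmlNode.php', 'line' => 560, 'function' => 'first'],
    ['file' => '/app/scrape.php', 'line' => 12, 'function' => 'first'],
];
$frame = firstExternalFrame($trace, '/lib/simplehtmldom');
printf("%s:%d\n", $frame['file'], $frame['line']);   // /app/scrape.php:12
```

    In real use the trace would come from debug_backtrace(), and the resulting file and line would go into the error message.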

  • olalav olalav posted a comment on ticket #204

    diff --git a/HtmlNode.php b/HtmlNode.php
    index 9649d37..aef8b17 100644
    --- a/HtmlNode.php
    +++ b/HtmlNode.php
    @@ -549,3 +549,12 @@ class HtmlNode
    + function first($selector, $idx = 0, $lowercase = false)
    + {
    + return $this->expect($selector, $idx, $lowercase);
    + }

    Missed semicolon and preformatting.

  • olalav olalav created ticket #204

    Convenience function for getting first element

  • olalav olalav posted a comment on ticket #201

    diff --git a/simple_html_dom.php b/simple_html_dom.php index bce4d9e..97d6e1d 100644 --- a/simple_html_dom.php +++ b/simple_html_dom.php @@ -117,3 +117,3 @@ function file_get_html( $dom->clear(); - return false; + $contents = ""; } @@ -144,5 +144,4 @@ function str_get_html( $dom->clear(); - return false; + $contents = ""; } - return $dom->load($str, $lowercase, $stripRN); Better version with tabs. PS! The inline editor and preview function on the site seem to hide the first line of content :|

  • olalav olalav created ticket #203

    Always tell user where he expected non-existing element

  • olalav olalav created ticket #201

    Never return false on documents

  • olalav olalav modified a comment on ticket #199

    Looks good now! However, you must set the Unicode flag, or else preg_replace() may return an invalid Unicode string, which may cause the second preg_replace() to return NULL, and a deprecation error for the third preg_replace(). diff --git a/HtmlNode.php b/HtmlNode.php index 9bc6a1a..9649d37 100644 --- a/HtmlNode.php +++ b/HtmlNode.php @@ -351 +351 @@ class HtmlNode - $ret = preg_replace('/\s+/', ' ', $ret); + $ret = preg_replace('/\s+/u', ' ', $ret); And about the manual: I see now that the navigation...
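
    A minimal sketch of the failure mode described above, using made-up input bytes: once a string is no longer valid UTF-8, a later preg_replace() call that uses the u flag returns NULL instead of a string.

```php
<?php
// A lone 0xC2 byte is an incomplete UTF-8 sequence (illustrative input).
$broken = "Hello\xC2 World";

// With the u flag, PCRE validates the subject as UTF-8 and fails on bad input.
$result = preg_replace('/\s+/u', ' ', $broken);
var_dump($result === null);                           // bool(true)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)

// Valid UTF-8 containing a no-break space (U+00A0) stays valid UTF-8.
$valid = "Hello\xC2\xA0 \n World";
$clean = preg_replace('/\s+/u', ' ', $valid);
var_dump($clean !== null && preg_match('//u', $clean) === 1);  // bool(true)
```

    Passing that NULL on to trim() or another preg_replace() is what produces the deprecation notice on PHP 8.1.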

  • olalav olalav posted a comment on ticket #199

    Looks good now! However, you must set the Unicode flag here or else preg_replace() returns NULL for certain strings, which causes (deprecation) errors further down. diff --git a/HtmlNode.php b/HtmlNode.php index 9bc6a1a..9649d37 100644 --- a/HtmlNode.php +++ b/HtmlNode.php @@ -351 +351 @@ class HtmlNode - $ret = preg_replace('/\s+/', ' ', $ret); + $ret = preg_replace('/\s+/u', ' ', $ret); And about the manual: I see now that the navigation sidebar is aligned far down upon page load, so that only...

  • olalav olalav posted a comment on ticket #199

    The space thing works now, but the BR tag is still not handled well. Try the code in the original post and compare the output when (un)-commenting the commented line. PS! Would be nice if you could link to the manual from the "Support" section, because it was hard to find. https://sourceforge.net/projects/simplehtmldom/support

  • olalav olalav posted a comment on ticket #199

    Shouldn't plaintext convert newlines to spaces? Did you change this recently? Surely this is a bug/regression, or am I missing something completely? $text = "<p>Hello" . "\n" . "World</p>"; $plain = str_get_html($text)->plaintext; echo "PLAINTEXT:\n" . $plain . "\n\n"; echo "WORDWRAP:\n" . wordwrap($plain, 80) . "\n"; PS! Where on the SourceForge page is the link to the manual (the one with the clickable tabs with examples, etc.)? I hope you didn't remove this, because I use it as a reference all...

  • olalav olalav posted a comment on ticket #199

    The text should say "The BR tag is not...". Evidently this editor interprets HTML tags as-is, and initial postings can't be edited :/

  • olalav olalav created ticket #199

    Incorrect handling of <br> tags next to line breaks

  • olalav olalav posted a comment on ticket #193

    I know the difference :) I thought it would be better to use a blank string to begin with rather than checking for null later, but you know your own code better and you probably have your reasons. Anyway, no more warnings with the new version, so I'm happy!

  • olalav olalav modified a comment on ticket #193

    Without this patch I get an error message like: HtmlDocument.php(269):trim(): Passing null to parameter #1 ($string) of type string is deprecated $ php -v PHP 8.1.4 (cli) (built: Apr 4 2022 05:02:21) (NTS)
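
    One common way to avoid this notice is to coalesce null to an empty string before trimming. The wrapper below is a hypothetical sketch, not the patch that was actually applied to HtmlDocument.php.

```php
<?php
// Hypothetical wrapper: trim a value that may be null without triggering
// the PHP 8.1 "Passing null to parameter #1" deprecation notice.
function safeTrim(?string $value): string
{
    return trim($value ?? '');
}

var_dump(safeTrim(null));        // string(0) ""
var_dump(safeTrim("  foo \n"));  // string(3) "foo"
```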

  • olalav olalav posted a comment on ticket #193

    Without this patch I get an error message like: HtmlDocument.php(269):trim(): Passing null to parameter #1 ($string) of type string is deprecated $ php -v PHP 8.1.4 (cli) (built: Apr 4 2022 05:02:21) (NTS)

  • olalav olalav posted a comment on ticket #186

    Very happy that you're still maintaining this project. It's my go-to library for parsing HTML and I use it every day. See also my small contribution #193 for compatibility with PHP 8.

  • olalav olalav created ticket #193

    Patch for PHP 8

  • olalav olalav created ticket #186

    find("ul a") finds a outside ul

  • olalav olalav posted a comment on ticket #63

    How do I find p tags whose class contains neither foo nor bar? $html->find("p:not([class~=foo])"); # excludes foo, but includes bar
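
    For comparison, the same "neither foo nor bar" filter can be written with PHP's bundled DOM extension and XPath. This is a sketch of the intended result only. It does not use simplehtmldom, and whether simplehtmldom accepts chained :not() selectors is not confirmed here. hasClassToken() is a hypothetical helper.

```php
<?php
// Build an XPath test for a whitespace-separated class token,
// similar in spirit to the CSS selector [class~=name].
function hasClassToken(string $name): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

$doc = new DOMDocument();
$doc->loadHTML('<p class="foo">a</p><p class="bar x">b</p><p class="baz">c</p><p>d</p>');

$xpath = new DOMXPath($doc);
$query = sprintf('//p[not(%s) and not(%s)]', hasClassToken('foo'), hasClassToken('bar'));

$texts = [];
foreach ($xpath->query($query) as $p) {
    $texts[] = $p->textContent;
}
echo implode(',', $texts), "\n";   // c,d
```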

  • olalav olalav posted a comment on ticket #65

    Very nice solution!

  • olalav olalav created ticket #65

    Notify when zero elements were found

  • olalav olalav created ticket #64

    Comments on MAX_FILE_SIZE

  • olalav olalav posted a comment on ticket #63

    Very nice! This is most useful. Feel free to close this issue and I'll reopen it if I see any problems!

  • olalav olalav posted a comment on ticket #61

    Feel free to close this issue and I'll reopen it if I see any problems!

  • olalav olalav posted a comment on ticket #62

    No problem :) If /u recognises \xc2\xa0 as one unit, your patch should work. Feel free to close this issue and I'll reopen it if I see any problems!

  • olalav olalav posted a comment on ticket #46

    This example code is clearly wrong. Ignore it for the time being, and I'll update it as necessary. If not, you may close it in a week or so.

  • olalav olalav created ticket #46

    Access to array of matched elements

  • olalav olalav created ticket #63

    Match elements that don't contain a certain value

  • olalav olalav posted a comment on ticket #62

    Another example. The following code breaks if nbsp is not handled as a character sequence. $html = str_get_html("&#xAB;Hello, World&#xBB;"); echo $html->plaintext; The fix I'm using at the moment: diff --git a/simple_html_dom.php b/simple_html_dom.php index c909d18..8e747f3 100644 --- a/simple_html_dom.php +++ b/simple_html_dom.php @@ -502,6 +502,8 @@ class simple_html_dom_node // Reduce whitespace at start/end to a single (or none) space - $ret = preg_replace('/[ \t\n\r\0\x0B\xC2\xA0]+$/', ' ',...

  • olalav olalav posted a comment on ticket #172

    I can confirm that my enclosed example doesn't scream "ARGH!!" anymore. Sounds like you had a good understanding of the problem. I'll let you know if I run into similar issues.

  • olalav olalav modified a comment on ticket #61

    :) 9d94f71 has the same problem as Feature Request #62 (which is really a Bug Report). trim() is not multibyte safe, and so trim($foo, "\xc2\xa0") removes \xc2 and \xa0 individually. See the PHP manual page for trim() for a proper solution. Implementing your own trim function may be necessary. Consider something like:

        $pattern = "(?:[\t\r\n ]|\xc2\xa0)+";
        $foo = " \xc2\xa0\t\rfoo\xc2\xa0 ";
        $foo = preg_replace("/^$pattern|$pattern$/", "", $foo);
        echo "[$foo]\n";
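
    The byte-wise behaviour of trim() can be seen on a single character. The sketch below also shows one possible code-point-aware alternative. trimWithNbsp() is a hypothetical helper that assumes valid UTF-8 input, not something the library provides.

```php
<?php
// "à" is C3 A0 in UTF-8. trim() treats the mask "\xc2\xa0" as two separate
// bytes, so it strips the trailing A0 and leaves a broken sequence.
$corrupted = trim("\xc3\xa0", "\xc2\xa0");
echo bin2hex($corrupted), "\n";   // c3

// Hypothetical helper: strip whitespace and U+00A0 from both ends as whole
// code points. Returns the input unchanged if PCRE rejects it.
function trimWithNbsp(string $s): string
{
    $out = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $s);
    return $out ?? $s;
}

echo '[', trimWithNbsp(" \xc2\xa0\t\rfoo\xc2\xa0 "), "]\n";  // [foo]
echo bin2hex(trimWithNbsp("\xc3\xa0")), "\n";                // c3a0
```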

  • olalav olalav posted a comment on ticket #61

    :) 9d94f71 has the same problem as Feature Request #62 (which is really a Bug Report). trim() is not multibyte safe, and so trim($foo, "\xc2\xa0") removes \xc2 and \xa0 individually. See PHP manual pages for trim() for a proper solution.

  • olalav olalav posted a comment on ticket #62

    1) It fixes the problem. 2) Your changes are not necessary, at least not for this isolated case. It's simple: The \s destroys Unicode sequences if you don't apply the u flag. You may have to do this other places as well: There are six instances of preg_replace() using \s. Some of these may deal with ASCII-only strings, though. You should know what to do :)

  • olalav olalav modified a comment on ticket #62

    index a078078..708e993 100644
    --- a/simple_html_dom.php
    +++ b/simple_html_dom.php
    @@ -2218,3 +2218,3 @@ class simple_html_dom
      // https://www.w3.org/TR/xml/#AVNormalize
    - $value = preg_replace("/[\r\n\t\s]+/", ' ', $value);
    + $value = preg_replace("/[\r\n\t\s]+/u", ' ', $value);
      $value = trim($value);

  • olalav olalav modified a comment on ticket #62

    index a078078..708e993 100644
    --- a/simple_html_dom.php
    +++ b/simple_html_dom.php
    @@ -2219 +2219 @@ class simple_html_dom
    - $value = preg_replace("/[\r\n\t\s]+/", ' ', $value);
    + $value = preg_replace("/[\r\n\t\s]+/u", ' ', $value);

  • olalav olalav modified a comment on ticket #62

    I found the culprit :)

        $value = preg_replace("/[\r\n\t\s]+/", ' ', $value); // THE PROBLEM
        $value = preg_replace("/[\r\n\t\s]+/u", ' ', $value); // THE SOLUTION

  • olalav olalav modified a comment on ticket #62

    Same problem. My UNIX locale settings are below, but unsetting them made no difference. My php.ini is a standard one. The problem does not occur on a Debian machine with PHP 7.0.x. LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_CTYPE=UTF-8 When doing print_r($html) and piping to less, the error is already there, in the object tree (simplified): [root] => simple_html_dom_node Object [children] => Array [0] => simple_html_dom_node Object [attr] => Array [content] => <C2> <C3> <C3> á It's curious that the 2nd...

  • olalav olalav posted a comment on ticket #172

    Noted. So remove() must be used with caution, or not at all, until further notice.

  • olalav olalav posted a comment on ticket #62

    Same problem. My UNIX locale settings are below, but unsetting them made no difference. My php.ini is a standard one. The problem does not occur on a Debian machine with PHP 7.0.x. LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_CTYPE=UTF-8 Do you think the misrepresentation happens during the building of the object tree (str_get_html) or when fetching the value (->content)?

  • olalav olalav created ticket #172

    Problem with the remove function

  • olalav olalav modified a comment on ticket #62

    OK, I found the problem. It occurs when setting a locale. You should now be able to look into why the content property is extracted incorrectly. setlocale(LC_ALL, "fr_FR.UTF-8"); $latin = utf8_encode("\xa0\xc5\xe0\xe1"); for($i=0; $i<strlen($latin); $i+=2) printf("%02x%02x ", ord($latin[$i]), ord($latin[$i+1])); echo "\n"; $string = sprintf('<meta content="%s">', $latin); $html = str_get_html($string); $content = $html->find("meta", 0)->content; for($i=0; $i<strlen($content); $i+=2) printf("%02x%02x...

  • olalav olalav posted a comment on ticket #62

    OK, I found the problem: When setting a European locale, content is extracted incorrectly. setlocale(LC_ALL, "fr_FR.UTF-8"); $latin = utf8_encode("\xa0\xc5\xe0\xe1"); for($i=0; $i<strlen($latin); $i+=2) printf("%02x%02x ", ord($latin[$i]), ord($latin[$i+1])); echo "\n"; $string = sprintf('<meta content="%s">', $latin); $html = str_get_html($string); $content = $html->find("meta", 0)->content; for($i=0; $i<strlen($content); $i+=2) printf("%02x%02x ", ord($content[$i]), ord($content[$i+1])); echo "\n";...

  • olalav olalav posted a comment on ticket #62

    UTF-8 here too. $foo = $html->find("meta", 0)->content; for($i=0; $i<strlen($foo); $i+=2) printf("%02x%02x\n", ord($foo[$i]), ord($foo[$i+1])); c220 <-- c2a1 c2a2 c2a3 ... c2bd c2be c2bf c380 c381 c382 c383 c384 c320 <-- c386 c387 c388 c389 ... c39d c39e c39f c320 <-- c3a1 c3a2 c3a3 ... c3bd c3be c3bf
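
    The byte dump above can also be produced with bin2hex(), which avoids indexing past the end of an odd-length string. The sketch below is an illustration with hand-written byte strings. hexPairs() is a hypothetical helper, and the "damaged" value only imitates the c220 pattern shown in the output.

```php
<?php
// Hypothetical helper: render a string as space-separated two-byte hex groups.
function hexPairs(string $bytes): string
{
    return implode(' ', str_split(bin2hex($bytes), 4));
}

// NBSP, Å, à, á encoded as UTF-8 (the test characters used in the ticket).
$expected = "\xc2\xa0\xc3\x85\xc3\xa0\xc3\xa1";
echo hexPairs($expected), "\n";   // c2a0 c385 c3a0 c3a1

// A value in which the second byte of NBSP was replaced by a plain space.
$damaged = "\xc2\x20\xc3\x85";
echo hexPairs($damaged), "\n";    // c220 c385

// An empty pattern with the u flag is a cheap UTF-8 validity check.
var_dump(preg_match('//u', $expected) === 1);  // bool(true)
var_dump(preg_match('//u', $damaged) === 1);   // bool(false)
```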

  • olalav olalav posted a comment on ticket #62

    I'm using the latest Git master (0e03308). Piping to less, I get the following. Notice C2 and C3 which indicate that they are single bytes (ie. an incomplete Unicode character sequence). <C2> ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄ<C3> ÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß<C3> áâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

  • olalav olalav modified a comment on ticket #61

    I'm using the latest Git master. $html = str_get_html("<p>Hello, World</p><p>\xc2\xa0</p>"); foreach($html->find("p") as $p) printf("[%s]\n", $p->plaintext); The output I expect is: [Hello, World] What I get is: [Hello, World] [ ] <-- nbsp inside the brackets I don't want the empty element. I want the parser to consider nbsp as space and trim it, ultimately excluding it from the array of found elements.

  • olalav olalav modified a comment on ticket #61

    I'm using the latest Git master. $html = str_get_html("<p>Hello, World</p><p>\xc2\xa0</p>"); foreach($html->find("p") as $p) printf("[%s]\n", $p->plaintext); The output I expect is: [Hello, World] What I get is: [Hello, World] [ ] <-- nbsp inside the brackets I don't want the empty element. I want the parser to consider nbsp as space and trim it, ultimately excluding it in the array of found elements.

  • olalav olalav modified a comment on ticket #61

    I'm using the latest Git master. $html = str_get_html("<p>Hello, World</p><p>\xc2\xa0</p>"); foreach($html->find("p") as $p) printf("[%s]\n", $p->plaintext); The output I expect is: [Hello, World] What I get is: [Hello, World] [ ] <-- nbsp inside the brackets I don't want the empty element. I want the parser to consider nbsp as space and trim it, ultimately not including it in the array of found elements.

  • olalav olalav posted a comment on ticket #61

    I'm using the latest Git master. $html = str_get_html("<p>Hello, World</p><p>\xc2\xa0</p>"); foreach($html->find("p") as $p) printf("[%s]\n", $p->plaintext); The output I expect is: [Hello, World] What I get is: [Hello, World] [ ] I don't want the empty element. I want the parser to consider nbsp as space and trim it, ultimately not including it in the array of found elements.

  • olalav olalav created ticket #62

    Unicode characters not extracted correctly

  • olalav olalav created ticket #61

    Consider nbsp to be whitespace

  • olalav olalav posted a comment on ticket #52

    $html = file_get_html("https://www.rogerebert.com/reviews/dark-phoenix-2019"); foreach($html->find("div[itemprop=reviewBody] > p") as $p) printf("%s\n\n", wordwrap($p->plaintext)); I found an incident where whitespace is not removed (marked with underscore). Can you fix this? ...and “_X-Men: Apocalypse_,” Simon Kinberg_’s directorial debut... ...Jean Grey, Professor X, Raven (_Jennifer Lawrence_)... ...named Vuk (who takes the body of Jessica Chastain_) is encouraging...

  • olalav olalav posted a comment on ticket #52

    I'm so happy with these changes. The package is now like a dream to use, because you instantly get the content you want without struggling with manual trimming and decoding every single time.

  • olalav olalav posted a comment on ticket #52

    When will everything be merged?

  • olalav olalav posted a comment on ticket #52

    Just post a notice when you're all done merging everything to master. I'm so happy this project is active. I really thought it was abandonware when I first started using it, as bug reports were years old with no response, etc. For whatever reason things have gotten back on track, I'm grateful, and happy to help. Actually, parsing HTML is so fundamental, and this library is so user-friendly, that I think it should be an integral part of PHP. (PHP doesn't have anything out of the box for this.)

  • olalav olalav posted a comment on ticket #52

    Looks like it works! Let me know when you merge things, so I don't have to choose between decoding and trimming :)

  • olalav olalav modified a comment on ticket #52

    Thanks for a good follow-up. I'll try the latest commit. I actually know one of the main W3C guys, so if necessary I could ask for his opinion on the matter. Misplaced whitespace typically ends up in HTML because of sloppiness by non-tech people, e.g. a journalist copy-pasting an article title (in a proportional font) into a CMS, not noticing the minuscule leading/trailing whitespace. So you could say that this is where the trim function really belongs: At the origin of the fault. I doubt any developer...

  • olalav olalav posted a comment on ticket #52

    Thanks for a good follow-up. I'll try the latest commit. I actually know one of the main W3C guys, so if necessary I could ask for his opinion on the matter. Misplaced whitespace typically ends up in HTML because of sloppiness by non-tech people, e.g. a journalist copy-pasting an article title (in a proportional font) into a CMS, not noticing the minuscule leading/trailing whitespace. So you could say that this is where the trim function really belongs: At the origin of the fault. I doubt any developer...

  • olalav olalav posted a comment on ticket #52

    Thanks for the fix! Does the W3C HTML specification say that whitespace inside quotes is in fact part of the actual value? If so, I sort of concede, but not very happily, I must admit. Though whitespace inside quotes is no doubt due to sloppiness on the page author's part, in the real world you always want trimmed values to avoid messing up database fields, plain-text terminal output, markup, and other sources that would carry the whitespace with them. I appreciate you're trying to follow standards,...

  • olalav olalav posted a comment on ticket #52

    Any news on trimming?

  • olalav olalav modified a comment on ticket #52

    $html = " <p> <span> </span> foo </p> "; $html = str_get_html($html); $html->find("span", 0)->remove(); printf("(%s)\n", $html->find("p", 0)->plaintext); $html = str_get_html(' <meta name="description" content=" bar ">'); printf("(%s)\n", $html->find("meta[name=description]", 0)->content);

  • olalav olalav modified a comment on ticket #52

    $html = " <p> <figure> </figure> foo </p> "; $html = str_get_html($html); $html->find("figure", 0)->remove(); printf("(%s)\n", $html->find("p", 0)->plaintext); $html = str_get_html(' <meta name="description" content=" bar ">'); printf("(%s)\n", $html->find("meta[name=description]", 0)->content);

  • olalav olalav posted a comment on ticket #52

    $html = " < p > < figure > < /figure > foo < /p > "; $html = str_get_html($html); $html->find("figure", 0)->remove(); printf("(%s)\n", $html->find("p", 0)->plaintext); $html = str_get_html(' < meta name = " description " content = " bar " > '); printf("(%s)\n", $html->find("meta[name=description]", 0)->content);

  • olalav olalav posted a comment on ticket #52

    Ever thought about migrating the whole thing to Github? Sourceforge feels kind of outdated, though there may be aspects of this I don't know about...

  • olalav olalav posted a comment on ticket #167

    I did a couple of quick tests and it seems to work as expected. Looks like it didn't require much coding either, so all good and everyone's happy! I'll let you know if anything breaks.

  • olalav olalav modified a comment on ticket #52

    This is just not my day: error: the requested upstream branch 'origin/EntityDecoding' does not exist. Starting over seemed to work better: $ git clone git://git.code.sf.net/p/simplehtmldom/repository $ cd repository $ git checkout EntityDecoding $ git fetch --all On a side note, it would be good if the basename of the URL was simple_html_dom or simplehtmldom (whichever is the official), rather than repository.

  • olalav olalav modified a comment on ticket #52

    This is just not my day: error: the requested upstream branch 'origin/EntityDecoding' does not exist. I started over, which seemed to work better: $ git clone git://git.code.sf.net/p/simplehtmldom/repository $ cd repository $ git checkout EntityDecoding $ git fetch --all On a side note, it would be good if the basename of the URL was simple_html_dom or simplehtmldom (whichever is the official), rather than repository.

  • olalav olalav posted a comment on ticket #52

    This is just not my day: error: the requested upstream branch 'origin/EntityDecoding' does not exist. I also tried starting over, to no avail: $ git clone git://git.code.sf.net/p/simplehtmldom/repository $ cd repository $ git fetch --all $ git branch * master On a side note, it would be good if the basename of the URL was simplehtmldom or simple_html_dom, rather than repository.

  • olalav olalav posted a comment on ticket #52

    I still had to do git pull origin EntityDecoding. Maybe this has something to do with .gitconfig and definition of remotes.

  • olalav olalav posted a comment on ticket #52

    Looks fine now. Post a notice when you have trimming in place.

  • olalav olalav modified a comment on ticket #52

    Trimming doesn't seem to take place. You wrote " it [trimming] needs to be applied before decoding". Does this mean it's not implemented yet?

  • olalav olalav posted a comment on ticket #52

    Found some things that are not decoded as expected: $html = str_get_html('<meta name="description" content="H&auml;agen-Dazs">'); echo $html->find("meta[name=description]", 0)->content . "\n"; echo $html->find("meta[name=description]", 0)->getAttribute("content") . "\n"; Results in: H&auml;agen-Dazs H&auml;agen-Dazs
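
    For reference, the expected decoded value can be reproduced with PHP's own html_entity_decode(). This sketch is independent of simplehtmldom and only illustrates the result the comment is asking for, plus the double-encoding edge case discussed elsewhere in this ticket.

```php
<?php
$flags = ENT_QUOTES | ENT_HTML5;

// Decode the attribute value from the example above.
$decoded = html_entity_decode('H&auml;agen-Dazs', $flags, 'UTF-8');
echo $decoded, "\n";   // Häagen-Dazs

// Decoding text that is already decoded leaves it unchanged here.
var_dump(html_entity_decode($decoded, $flags, 'UTF-8') === $decoded);  // bool(true)

// Double-encoded markup such as &amp;amp; loses one level per pass.
$once = html_entity_decode('&amp;amp;', $flags, 'UTF-8');
echo $once, "\n";      // &amp;
```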

  • olalav olalav modified a comment on ticket #52

    Very exciting! Will try this. I wouldn't worry about edge cases like &amp;amp; for normal use. The only relevant case seems to be markup that refers to entities verbatim.

  • olalav olalav posted a comment on ticket #52

    It doesn't seem to break any of my scripts. However, trimming doesn't occur. You wrote " it [trimming] needs to be applied before decoding". Does this mean it's not implemented yet?

  • olalav olalav posted a comment on ticket #52

    I had to do git checkout -b EntityDecoding and then git pull origin EntityDecoding. Let me know if there's an easier way to pull a branch.

  • olalav olalav posted a comment on ticket #52

    PS! Which Git commands do I use to get this branch/commit? A normal pull gets just the master branch, and there is no such commit there.

  • olalav olalav modified a comment on ticket #52

    Very exciting! Will try this. I wouldn't worry about edge cases like &amp;amp; for normal use. Hope other people will test it too and shed light on problems that are likely, if any.

  • olalav olalav modified a comment on ticket #52

    Very exciting! Will try this. I wouldn't worry about edge cases like &amp;amp; for normal use. Hope other people will test it too and shed light on problems that are likely, if any. PS! Which Git commands do I use to get this branch/commit? A normal pull gets just the master branch, and there is no such commit there.

  • olalav olalav modified a comment on ticket #52

    Very exciting! Will try this and let you know how it works for me. Stuff like &amp;amp; are very unlikely edge cases I wouldn't worry about for normal use. Hope other people will test it too and shed light on problems that are likely, if any. PS! Which Git commands do I use to get this branch/commit? A normal pull gets just the master branch, and there is no such commit there.

  • olalav olalav posted a comment on ticket #52

    Very exciting! Will try this and let you know how it works for me. Stuff like &amp;amp; are very unlikely edge cases I wouldn't worry about for normal use. Hope other people will test it too and shed light on problems that are likely, if any.

  • olalav olalav posted a comment on ticket #52

    I doubt a change in performance would have any significant real world impact. I think the correct thing is to always decode and trim. Not doing so is relying too much on the HTML, which will break in other ways if the HTML changes. As for breaking code, decoding a string that is already decoded will practically always return the same string. So I think you could actually get away with just changing the code. Unless a lot of people really expect a lot of non-trimmed, non-decoded strings. I would shout...

  • olalav olalav posted a comment on ticket #52

    Actually, trim() would also be a desired default. I have never needed not to remove surrounding whitespace from an accessed value (unless when assuming/hoping there's never going to be any).

  • olalav olalav created ticket #52

    Always decode content values from the DOM tree

  • olalav olalav posted a comment on ticket #168

    Works now. Either it was the limit, or the source code changed. Will let you know if something similar happens.

  • olalav olalav created ticket #168

    Wikipedia breaks the parser

  • olalav olalav created ticket #167

    Removed elements aren't properly removed

  • olalav olalav posted a comment on ticket #163

    I think what I mostly would expect is the plain text to look like the text as displayed in the browser, in other words, a single space no matter how many whitespace characters are in the source. Possibly, there may be instances where multiple whitespace characters are desired (like you hint at), but I can't think of any at the moment. Of course the replace will work around the problem.
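
    The workaround mentioned above might look like the following sketch. collapseWhitespace() is a hypothetical post-processing helper applied to a plaintext value, not a library function.

```php
<?php
// Hypothetical helper: collapse runs of whitespace to a single space,
// roughly matching how a browser renders the text.
function collapseWhitespace(string $text): string
{
    // The u flag keeps multibyte sequences intact. Fall back to the input
    // if PCRE rejects it (for example, invalid UTF-8).
    $collapsed = preg_replace('/\s+/u', ' ', $text);
    return trim($collapsed ?? $text);
}

echo collapseWhitespace("I am saying    Hello World    to you."), "\n";
// I am saying Hello World to you.
```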

  • olalav olalav posted a comment on ticket #163

    Seems to work! However, the example below creates multiple whitespace where there should be only one. I don't know if this is a different bug, but it's at least somewhat related. $str = '<p>I am saying <!-- --> <span> <a href=""> Hello World </a> </span> <!-- --> to you.</p>'; $html = str_get_html($str); echo $html->find("p", 0)->plaintext . "\n"; I am saying Hello World to you.

  • olalav olalav posted a comment on ticket #164

    Under "Files", the latest version (1.8.1) of simplehtmldom is dated 2019-01-13 (three weeks old), which seemed outdated. So I did svn checkout https://svn.code.sf.net/p/simplehtmldom/code/trunk simplehtmldom-code, which I assumed would give me the latest version. I now see that this gave me an ancient version @version 1.5 ($Rev: 210 $), for unknown reasons. I also now noticed that there is a Git repository, which puzzles me, as I thought SourceForge had no affiliation with Git. When cloning this,...

  • olalav olalav created ticket #164

    Fatal error: Stream does not support seeking

  • olalav olalav posted a comment on ticket #163

    Similarly, the following produces "World.Hello". Should produce "World. Hello". $file = '<a href="">World. </a>Hello';

  • olalav olalav created ticket #163

    Missing whitespace in plaintext property

  • olalav olalav posted a comment on ticket #118

    Isn't there a profiling/kernel expert on the team or somewhere in the community?...

  • olalav olalav created ticket #118

    Accessible and fast CLI-mode

  • olalav olalav created ticket #157

    $html->find("*") does not find all tags

  • olalav olalav created ticket #556

    Cursor starts running off / Unable to reset emulator

  • olalav olalav posted a comment on ticket #29

    My mistake this time. I forgot ./configure --with-readline. Now sqsh works with cursor...
