From: T.J.Hunt <T.J...@op...> - 2008-02-18 17:44:17
David Heath wrote:
>
> T.J.Hunt wrote:
>
> > Adding the U flag will change the effect of a few latin-1 regexps.
> >
> > Can anyone think of any problematic cases? If not, I think we should
> > just plough on and include it.
>
> the problems will arise with strings which are valid latin-1 but not
> valid utf-8.
>
> Rules of UTF-8 encoding (from wikipedia):
>
> [Snip]
>
> So, if we just added the u flag, then pages containing such a
> sequence would no longer parse correctly, perhaps not at all.
>
> Surely the issue is that really simpletest needs an awareness of the
> encoding of its input documents (ie the html pages it's reading). It

This is not the issue here. The issue here is solely regular expression
pattern matching. Input encoding is a very important issue, but it is
not relevant to the question of the u flag on regular expressions.

Most of the time, PHP strings are treated as arrays of bytes, and it
does not matter to PHP what actual characters those bytes represent.
Normally this is fine (for example, strpos works as expected) as long as
you yourself are consistent about your character sets, and these days
the most sensible choice is to do everything as UTF-8.

However, not everything is fine. For example, substr is not safe,
because it counts bytes, not characters. On the other hand,
substr($str, strpos(...), strlen(...)) is safe, because the offset and
the length are both measured in bytes. Also, things like stripos are not
Unicode safe, because some languages covered by Unicode have very
complicated case rules.

So, parsing affects how your input is turned into a sequence of bytes in
memory. Regexp matching is about how two sequences of bytes in memory (a
string and a pattern) are compared.

Tim.
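
To make the byte-versus-character points above concrete, here is a minimal sketch. It assumes PHP with the mbstring extension loaded; the sample strings and variable names are made up for illustration and are not taken from the simpletest code. Non-ASCII characters are written as byte escapes so the result does not depend on how the file itself is saved.

```php
<?php
// "naive cafe" with i-diaeresis and e-acute, spelled out as UTF-8 bytes.
$str = "na\xC3\xAFve caf\xC3\xA9";

// strlen counts bytes; mb_strlen counts characters.
$bytes = strlen($str);               // 12
$chars = mb_strlen($str, 'UTF-8');   // 10

// A hand-picked offset can cut a two-byte character in half.
$broken = substr($str, 0, 3);        // "na" plus the first byte of the i-diaeresis
$brokenIsValid = mb_check_encoding($broken, 'UTF-8');   // false

// strpos and strlen both work in bytes, so together they stay consistent
// and the cut lands on character boundaries (given valid UTF-8 input).
$needle = "caf\xC3\xA9";
$safe = substr($str, strpos($str, $needle), strlen($needle));

// Without the u modifier "." matches one byte; with it, one character.
preg_match('/^na(.)/', $str, $m1);   // $m1[1] is the single byte 0xC3
preg_match('/^na(.)/u', $str, $m2);  // $m2[1] is the two-byte character

// A latin-1 subject that is not valid UTF-8: the match fails outright
// once the u modifier is present, which is the case David describes.
$latin1 = "caf\xE9";
$withoutU = preg_match('/caf/', $latin1);   // 1
$withU    = preg_match('/caf/u', $latin1);  // false
$error    = preg_last_error();              // PREG_BAD_UTF8_ERROR
```

The last three lines show why the two concerns are separate: the same bytes match or fail depending only on the modifier, regardless of how they got into memory.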