Re: [Simpletest-support] working on adding U flag to regexes to allow unicode

David Heath wrote:
> > T.J.Hunt wrote:
> > > Adding the U flag will change the effect of a few latin-1 regexps.
> > 
> > Can anyone think of any problematic cases? If not, I think we should
> > just plough on and include it.
> 
> the problems will arise with strings which are valid latin-1 but not
> valid utf-8. 
> 
> Rules of UTF-8 encoding (from wikipedia):
> 
> [Snip]
> 
> So, if we just added the u flag, then pages containing such a
> sequence would no longer parse correctly, perhaps not at all. 
> 
> Surely the issue is that really simpletest needs an awareness of the
> encoding of its input documents (ie the html pages it's reading). It

This is not the issue here. The issue here is solely regular expression
pattern matching.

Input encoding is a very important issue, but it is not relevant to the
question of the u flag (PCRE's UTF-8 modifier) on regular expressions.

Most of the time, PHP strings are treated as arrays of bytes, and it
does not matter to PHP what actual characters those bytes represent.
Normally this is fine (for example strpos works as expected) as long as
you yourself are consistent about your character sets. These days, the
most sensible choice is to do everything as UTF-8.

However, not everything is fine. For example, substr is not safe,
because it counts bytes, not characters. On the other hand, substr($str,
strpos(...), strlen(...)) is safe.
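
To make the substr point concrete, here is a minimal sketch. The sample
string and variable names are only illustrative, and the bytes are
written as escapes so the example does not depend on how the file itself
is saved.

```php
<?php
// "café au lait" in UTF-8: "é" is the two bytes C3 A9.
$str = "caf\xC3\xA9 au lait";

// substr() counts bytes. "café" is 5 bytes, so a length of 4 cuts "é" in half.
$broken = substr($str, 0, 4);
var_dump(bin2hex($broken)); // string(8) "636166c3"

// Offsets and lengths taken from strpos()/strlen() are byte counts too.
// In valid UTF-8 a valid needle can only match on a character boundary,
// so slicing with those numbers never splits a character.
$needle = "au";
$safe = substr($str, strpos($str, $needle), strlen($needle));
var_dump($safe); // string(2) "au"
```

The hard-coded length 4 is the problem case: it was chosen by counting
characters, but substr interprets it as bytes.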

Also, things like stripos are not Unicode safe, because some languages
covered by Unicode have very complicated case rules.
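
A small sketch of the stripos problem, assuming the mbstring extension
is loaded (mb_stripos comes from it). The French sample words are just
an illustration and are a simple case compared with the harder rules in
some other languages.

```php
<?php
// Assumption: the mbstring extension is available for mb_stripos().
$haystack = "\xC3\x89COLE";   // "ÉCOLE" in UTF-8
$needle   = "\xC3\xA9cole";   // "école" in UTF-8

// Byte-oriented search: the bytes for "É" (C3 89) and "é" (C3 A9) differ,
// and byte-level case folding does not relate them.
var_dump(stripos($haystack, $needle));

// Character-aware search: mbstring decodes UTF-8 and folds case per character.
var_dump(mb_stripos($haystack, $needle, 0, 'UTF-8')); // int(0)
```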

So, parsing affects how your input is turned into a sequence of bytes in
memory. Regexp matching is about how two sequences of bytes in memory (a
string and a pattern) are compared.
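
As a rough illustration of what the u modifier changes in that
comparison, here is a sketch using preg_match directly. The strings are
made up for the example and are not taken from SimpleTest.

```php
<?php
// "\xE9" is "é" in Latin-1, but a lone E9 byte is not valid UTF-8.
$latin1 = "caf\xE9";
$utf8   = "caf\xC3\xA9";

// Without u, pattern and subject are compared byte by byte.
var_dump(preg_match('/caf./', $latin1)); // int(1)

// With u, PCRE rejects a subject that is not valid UTF-8.
var_dump(preg_match('/caf./u', $latin1)); // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)

// On UTF-8 input, "." matches one character with u, but only one byte without.
var_dump(preg_match('/^caf.$/u', $utf8)); // int(1)
var_dump(preg_match('/^caf.$/', $utf8));  // int(0)
```

The second call is the case raised earlier in the thread: a string that
is fine as Latin-1 makes the match fail outright once u is added.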

Tim.