[Mantisbt-dev] thoughts on escaping

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

I talked to Ken and Jeroen on IRC about escaping strings and played a 
bit with it in my working copy.

Here are my observations:

1) where magic_quotes_gpc is turned on we should simulate turning it off 
(basically just check for it being on when getting a POST variable and 
strip the slashes off)

2) API functions should escape all strings using mysql_escape_string() 
(unless we're moving to ADODB of course) before inserting

3) API functions should unescape all strings using stripslashes() before 
returning SELECTed data

------
Goal +
------

The result of these three policies is that all data in pages (once all 
the SQL is pulled out into API functions) is unescaped and we know 
exactly what we're dealing with.  We can pull something out of the DB 
and know that we can safely pass it into another API function to be 
inserted somewhere else.  We can insert a string that came in as a POST 
variable or a hardcoded string and know we'll have the same results.

----------------
Implementation +
----------------

(1) can be easily implemented if we use get_var_*() functions (whatever 
they may be called)... see my other post "getting variables with functions"

(2) Can be easily implemented.  I created db_escape_string(), 
db_escape_int() and db_escape_bool() which convert and escape as 
necessary for each type.  I'm not sure if the 3 functions are necessary 
but it again documents what type we expect things to be and makes sure 
that data is valid for that type.  Might solve problems later...

(3) Is relatively easy to implement though it sometimes needs a loop 
over the result of a query to unescape each field.  I had created 
db_unescape_string(), _int(), and _bool() as well to handle the process 
in reverse but I'm undecided whether it's worth it.  Again it would be 
nice to have the variables be the right type but we're pretty sure the 
data is valid because it was checked when it was put in the DB. <shrug>

----------
Concerns +
----------

(I) The main concern here is existing installations with data in the DB 
that has been processed by the old method.  I don't believe this is 
actually a big problem and my tests seemed to indicate the same thing:

* Data with slashes in the DB will still be unslashed when it's removed
* Data without slashes will be okay because stripslashes() can be called
   multiple times safely
* Data with html entities in the DB should still be displayed properly
   by the browser since it's the thing that is supposed to understand
   entities anyway.  When the data is submitted again it will be written
   back to the DB without the entities - a nice smooth upgrade path.

(II) The other concern I have is that a lot more strings will have to 
have entities put into them before being printed because they aren't put 
in before being inserted in the DB.  I think this is supposed to be 
happening with display_string() already but I'm not sure if it is 
throughout.  Maybe we can write an equivalent of echo() that calls 
htmlentities() before printing the string to make things shorter...

Also, now that I think about it, this will cause existing entities (from 
issue I, above) to be "double escaped".  i.e. "black & red" would have 
been stored in the DB by the old system as "black &amp; red" and would 
now get displayed as "black &amp;amp; red".  I suppose we could add 
something to the upgrade scripts to pull the data out, strip the 
entities, and put it back in...

------------
Misc Notes +
------------

(a) Pages that are obtaining POST data that should have html tags, etc 
removed need to do that themselves.  We need to evaluate the existing 
functions to see whether they are still correct and appropriate.

(b) Strings need to have some characters replaced with HTML entities 
before being placed into text boxes, etc.  I would suggest we do this as 
the variable is being printed so that it is close to the tag and is easy 
to make sure we are using the right function.  It also means we don't 
have to come up with another variable prefix to indicate HTML-ized 
variables.  We could come up with short names like textbox($foo), 
textarea($foo), etc so it doesn't get to cluttered.

(c) our use of string_display() is a bit inconsistent.  I'm not sure if 
it's always being called, and it's being called on variables like 
$v_version.  I'm assuming we wouldn't want to replace the #3 in "Beta 
#3" with a link to bug #3.  We probably want to break out the stuff that 
adds those links into a separate function that we can call only when we 
actually want it.  Or maybe we really do *always* want it...?

------------
Conclusion +
------------

This raises all the isses I've been able to come up with and suggests 
solutions where I've been able to come up with any.  I think all we need 
to do is hash this out and come up with a clear policy of exactly what 
kind of data should exist where.  My preference is:

* Escaped (but otherwise unmodified) data in the DB
* Clean data in the pages
* Clean data given to API functions
* Clean data returned from API functions
* HTML entity data printed only into the html

Anyway, if you've made it this far, congratulations.  Comments, as 
always are very much appreciated.  Just trying to generate discussion so 
we can drill down to what we need to do to clean things up a bit.

Bed now,

Julian

-- 
ju...@be...
Beta4 Productions (http://www.beta4.com)