From: Julian F. <ju...@be...> - 2002-08-22 08:50:59
|
I talked to Ken and Jeroen on IRC about escaping strings and played a bit with it in my working copy. Here are my observations: 1) where magic_quotes_gpc is turned on we should simulate turning it off (basically just check for it being on when getting a POST variable and strip the slashes off) 2) API functions should escape all strings using mysql_escape_string() (unless we're moving to ADODB of course) before inserting 3) API functions should unescape all strings using stripslashes() before returning SELECTed data ------ Goal + ------ The result of these three policies is that all data in pages (once all the SQL is pulled out into API functions) is unescaped and we know exactly what we're dealing with. We can pull something out of the DB and know that we can safely pass it into another API function to be inserted somewhere else. We can insert a string that came in as a POST variable or a hardcoded string and know we'll have the same results. ---------------- Implementation + ---------------- (1) can be easily implemented if we use get_var_*() functions (whatever they may be called)... see my other post "getting variables with functions" (2) Can be easily implemented. I created db_escape_string(), db_escape_int() and db_escape_bool() which convert and escape as necessary for each type. I'm not sure if the 3 functions are necessary but it again documents what type we expect things to be and makes sure that data is valid for that type. Might solve problems later... (3) Is relatively easy to implement though it sometimes needs a loop over the result of a query to unescape each field. I had created db_unescape_string(), _int(), and _bool() as well to handle the process in reverse but I'm undecided whether it's worth it. Again it would be nice to have the variables be the right type but we're pretty sure the data is valid because it was checked when it was put in the DB. <shrug> ---------- Concerns + ---------- (I) The main concern here is existing installations with data in the DB that has been processed by the old method. I don't believe this is actually a big problem and my tests seemed to indicate the same thing: * Data with slashes in the DB will still be unslashed when it's removed * Data without slashes will be okay because stripslashes() can be called multiple times safely * Data with html entities in the DB should still be displayed properly by the browser since it's the thing that is supposed to understand entities anyway. When the data is submitted again it will be written back to the DB without the entities - a nice smooth upgrade path. (II) The other concern I have is that a lot more strings will have to have entities put into them before being printed because they aren't put in before being inserted in the DB. I think this is supposed to be happening with display_string() already but I'm not sure if it is throughout. Maybe we can write an equivalent of echo() that calls htmlentities() before printing the string to make things shorter... Also, now that I think about it, this will cause existing entities (from issue I, above) to be "double escaped". i.e. "black & red" would have been stored in the DB by the old system as "black & red" and would now get displayed as "black &amp; red". I suppose we could add something to the upgrade scripts to pull the data out, strip the entities, and put it back in... ------------ Misc Notes + ------------ (a) Pages that are obtaining POST data that should have html tags, etc removed need to do that themselves. We need to evaluate the existing functions to see whether they are still correct and appropriate. (b) Strings need to have some characters replaced with HTML entities before being placed into text boxes, etc. I would suggest we do this as the variable is being printed so that it is close to the tag and is easy to make sure we are using the right function. It also means we don't have to come up with another variable prefix to indicate HTML-ized variables. We could come up with short names like textbox($foo), textarea($foo), etc so it doesn't get to cluttered. (c) our use of string_display() is a bit inconsistent. I'm not sure if it's always being called, and it's being called on variables like $v_version. I'm assuming we wouldn't want to replace the #3 in "Beta #3" with a link to bug #3. We probably want to break out the stuff that adds those links into a separate function that we can call only when we actually want it. Or maybe we really do *always* want it...? ------------ Conclusion + ------------ This raises all the isses I've been able to come up with and suggests solutions where I've been able to come up with any. I think all we need to do is hash this out and come up with a clear policy of exactly what kind of data should exist where. My preference is: * Escaped (but otherwise unmodified) data in the DB * Clean data in the pages * Clean data given to API functions * Clean data returned from API functions * HTML entity data printed only into the html Anyway, if you've made it this far, congratulations. Comments, as always are very much appreciated. Just trying to generate discussion so we can drill down to what we need to do to clean things up a bit. Bed now, Julian -- ju...@be... Beta4 Productions (http://www.beta4.com) |