From: Matthew M. <ma...@tu...> - 2006-03-23 18:09:09
|
Hi all, I have written the first draft of the Key class documentation. View it here: http://res.stddev.appstate.edu/cvs/fallout/docs/Key.txt?rev=1.1 I realize there are probably grammar problems and the like but I wanted to get it up for people to read. My apologies to the students and staff that had to learn how to use it the hard way. Matt -- Matthew McNaney Electronic Student Services Appalachian State University http://phpwebsite.appstate.edu |
From: Eloi G. <el...@re...> - 2006-03-27 23:00:36
|
Hi, Matt! I reviewed the file and some (probably stupid) questions came up: Will there be a problem 2 years down the road when we start having sites with 500,000 key records? How much longer will it take to search the Key table then? Is there a way we can decrease the indexing space requirements for the "module" field? Internal integer-based moduile ids? Using an enumeration field? (well, that would be pretty much the same thing) Why are we not using the user ids of the author & editors? I think we should because I've had to change the usernames of 3 people over the last 2 years, and it was a very long process. I think it breaks some sort of database normilization rule as well. -Eloi George- |
From: Matthew M. <ma...@tu...> - 2006-03-28 15:46:37
|
On Mon, 2006-03-27 at 19:06 -0500, Eloi George wrote: > Will there be a problem 2 years down the road when we start having sites > with 500,000 key records? How much longer will it take to search the > Key table then? 500,000 times longer than if there was just one record ;-) Seriously, I'm not sure. However, this table is indexed by the id and modules using the key class keep this value in their own table. That should be the only column searched on a regular basis. I am certainly not an expert on optimization, but it would seem faster to have one table of items than several tables with duplicate information. > Is there a way we can decrease the indexing space requirements for the > "module" field? Internal integer-based moduile ids? Using an > enumeration field? (well, that would be pretty much the same thing) This would be a huge change and set me back a ways. Many processes depend on the module title and I would have restructure the program to use an integer index instead. > Why are we not using the user ids of the author & editors? I think we > should because I've had to change the usernames of 3 people over the > last 2 years, and it was a very long process. I think it breaks some > sort of database normilization rule as well. Two reasons I did it this way. First, if I store the id, I have to join with the users table to grab the display name. Not a big deal, but it is certainly easier for developers to just have that information local. Second, I was worried about deleted users. If a user has been removed, their name will still be associated with the key. With an id, I would get a FALSE when trying to pull the display name. I can change it, but then getting the display name would require an extra database query per key listing. I am open to discussing options. For example, perhaps just adding creator_id and updater_id columns? Thanks for the feedback Eloi, Matt -- Matthew McNaney Electronic Student Services Appalachian State University http://phpwebsite.appstate.edu |
From: Eloi G. <el...@re...> - 2006-03-30 17:34:31
|
Matthew McNaney wrote: > Seriously, I'm not sure. However, this table is indexed by the id and > modules using the key class keep this value in their own table. That > should be the only column searched on a regular basis. I am certainly > not an expert on optimization, but it would seem faster to have one > table of items than several tables with duplicate information. It's only faster for this query: SELECT * FROM keys WHERE key_id IN (8,434,353543,...,4434322) where the records involved have different module names (but you can't search on that). If they're all from the same module, all efficiency is lost because of the unnecessarily large index. On a very large Key table phpWS will slow down when any of the following happens: - you include anything other than an integer-indexed column in the WHERE clause, - you include more than one column in the WHERE clause, - your db server uses a hard drive that's optimized for high throughput instead of fast seeks. It's more efficient to break the table up into module-specific subsets, eg. "mod_<module_name>_key" because the majority of queries generated are for either 1) a specific key_id where the module is known or 2) a search across key records (for active, create_date, update_date, etc.) where the module is known. Only a couple modules will search across different modulenames and even then they only need results from a subset of the installed modules on the site. These searches can still be done just as efficiently with UNION ALL clauses. *More* efficiently if you count the fact that indexes for unwanted modules are automatically excluded. >> Why are we not using the user ids of the author & editors? I think we >> should because I've had to change the usernames of 3 people over the >> last 2 years, and it was a very long process. I think it breaks some >> sort of database normilization rule as well. >> > Two reasons I did it this way. First, if I store the id, I have to join > with the users table to grab the display name. Not a big deal, but it is > certainly easier for developers to just have that information local. > Second, I was worried about deleted users. If a user has been removed, > their name will still be associated with the key. With an id, I would > get a FALSE when trying to pull the display name. > But, users should never be deleted, just made inactive. Let's say you delete user "jack33" & a year later someone else registers with that name. That person automatically inherits all forum posts, articles, calendar entries, etc. of the previous "jack33". Chances are that they'll start deleting/changing all that old content, because its not theirs. Indexing content on user_id will assure that that doesn't happen. Even if the siteadmin decides to reuse names, a JOINed query would still be able to tell the difference between "jack33" and "jack33 (dead user)" > I can change it, but then getting the display name would require an > extra database query per key listing. I am open to discussing options. > For example, perhaps just adding creator_id and updater_id columns? That sounds great! "creator" and "updater" could remain as cached data as long as the Coding Practices indicate that ownership & permissions checks must always be made against "creator_id", as the username held in "creator" may be incorrect, obsolete or a duplicate. creator_id should be indexed though, as many queries use this in the WHERE clause. -Eloi George- |
From: Eloi G. <el...@re...> - 2006-05-02 18:38:58
|
Hi Matt! I never heard a reply on this message. Just wanted to make sure you didn't miss it. Cheers, Eloi Eloi George wrote: > Matthew McNaney wrote: >> Seriously, I'm not sure. However, this table is indexed by the id and >> modules using the key class keep this value in their own table. That >> should be the only column searched on a regular basis. I am certainly >> not an expert on optimization, but it would seem faster to have one >> table of items than several tables with duplicate information. > > It's only faster for this query: > SELECT * FROM keys WHERE key_id IN (8,434,353543,...,4434322) > where the records involved have different module names (but you can't > search on that). If they're all from the same module, all efficiency > is lost because of the unnecessarily large index. > > On a very large Key table phpWS will slow down when any of the > following happens: > - you include anything other than an integer-indexed column in the > WHERE clause, > - you include more than one column in the WHERE clause, > - your db server uses a hard drive that's optimized for high > throughput instead of fast seeks. > > It's more efficient to break the table up into module-specific > subsets, eg. "mod_<module_name>_key" because the majority of queries > generated are for either 1) a specific key_id where the module is > known or 2) a search across key records (for active, > create_date, update_date, etc.) where the module is known. > Only a couple modules will search across different modulenames and > even then they only need results from a subset of the installed > modules on the site. These searches can still be done just as > efficiently with UNION ALL clauses. *More* efficiently if you count > the fact that indexes for unwanted modules are automatically excluded. > >>> Why are we not using the user ids of the author & editors? I think >>> we should because I've had to change the usernames of 3 people over >>> the last 2 years, and it was a very long process. I think it breaks >>> some sort of database normilization rule as well. >>> >> Two reasons I did it this way. First, if I store the id, I have to join >> with the users table to grab the display name. Not a big deal, but it is >> certainly easier for developers to just have that information local. >> Second, I was worried about deleted users. If a user has been removed, >> their name will still be associated with the key. With an id, I would >> get a FALSE when trying to pull the display name. >> > > But, users should never be deleted, just made inactive. Let's say you > delete user "jack33" & a year later someone else registers with that > name. That person automatically inherits all forum posts, articles, > calendar entries, etc. of the previous "jack33". Chances are that > they'll start deleting/changing all that old content, because its not > theirs. Indexing content on user_id will assure that that doesn't > happen. Even if the siteadmin decides to reuse names, a JOINed query > would still be able to tell the difference between "jack33" and > "jack33 (dead user)" > >> I can change it, but then getting the display name would require an >> extra database query per key listing. I am open to discussing options. >> For example, perhaps just adding creator_id and updater_id columns? > > That sounds great! "creator" and "updater" could remain as cached > data as long as the Coding Practices indicate that ownership & > permissions checks must always be made against "creator_id", as the > username held in "creator" may be incorrect, obsolete or a duplicate. > > creator_id should be indexed though, as many queries use this in the > WHERE clause. > > -Eloi George- > |
From: Matthew M. <ma...@tu...> - 2006-05-02 20:54:51
|
This is the first I have seen of it. I am looking into your concerns and will write up my thoughts tomorrow. Real quick: >From what I have looked at the most common queries to phpws_key are: where id = ? where module = ? where id = ? and restricted = ? Understand rewriting key is a major undertaking so I must be positive the delay would be worth it. More tomorrow, Matt On Tue, 2006-05-02 at 14:46 -0500, Eloi George wrote: > Hi Matt! > I never heard a reply on this message. Just wanted to make sure you > didn't miss it. > Cheers, Eloi > > > > Eloi George wrote: > > Matthew McNaney wrote: > >> Seriously, I'm not sure. However, this table is indexed by the id and > >> modules using the key class keep this value in their own table. That > >> should be the only column searched on a regular basis. I am certainly > >> not an expert on optimization, but it would seem faster to have one > >> table of items than several tables with duplicate information. > > > > It's only faster for this query: > > SELECT * FROM keys WHERE key_id IN (8,434,353543,...,4434322) > > where the records involved have different module names (but you can't > > search on that). If they're all from the same module, all efficiency > > is lost because of the unnecessarily large index. > > > > On a very large Key table phpWS will slow down when any of the > > following happens: > > - you include anything other than an integer-indexed column in the > > WHERE clause, > > - you include more than one column in the WHERE clause, > > - your db server uses a hard drive that's optimized for high > > throughput instead of fast seeks. > > > > It's more efficient to break the table up into module-specific > > subsets, eg. "mod_<module_name>_key" because the majority of queries > > generated are for either 1) a specific key_id where the module is > > known or 2) a search across key records (for active, > > create_date, update_date, etc.) where the module is known. > > Only a couple modules will search across different modulenames and > > even then they only need results from a subset of the installed > > modules on the site. These searches can still be done just as > > efficiently with UNION ALL clauses. *More* efficiently if you count > > the fact that indexes for unwanted modules are automatically excluded. > > > >>> Why are we not using the user ids of the author & editors? I think > >>> we should because I've had to change the usernames of 3 people over > >>> the last 2 years, and it was a very long process. I think it breaks > >>> some sort of database normilization rule as well. > >>> > >> Two reasons I did it this way. First, if I store the id, I have to join > >> with the users table to grab the display name. Not a big deal, but it is > >> certainly easier for developers to just have that information local. > >> Second, I was worried about deleted users. If a user has been removed, > >> their name will still be associated with the key. With an id, I would > >> get a FALSE when trying to pull the display name. > >> > > > > But, users should never be deleted, just made inactive. Let's say you > > delete user "jack33" & a year later someone else registers with that > > name. That person automatically inherits all forum posts, articles, > > calendar entries, etc. of the previous "jack33". Chances are that > > they'll start deleting/changing all that old content, because its not > > theirs. Indexing content on user_id will assure that that doesn't > > happen. Even if the siteadmin decides to reuse names, a JOINed query > > would still be able to tell the difference between "jack33" and > > "jack33 (dead user)" > > > >> I can change it, but then getting the display name would require an > >> extra database query per key listing. I am open to discussing options. > >> For example, perhaps just adding creator_id and updater_id columns? > > > > That sounds great! "creator" and "updater" could remain as cached > > data as long as the Coding Practices indicate that ownership & > > permissions checks must always be made against "creator_id", as the > > username held in "creator" may be incorrect, obsolete or a duplicate. > > > > creator_id should be indexed though, as many queries use this in the > > WHERE clause. > > > > -Eloi George- > > > > > > ------------------------------------------------------- > Using Tomcat but need to do more? Need to support web services, security? > Get stuff done quickly with pre-integrated technology to make your job easier > Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo > http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642 > _______________________________________________ > Phpwebsite-developers mailing list > Php...@li... > https://lists.sourceforge.net/lists/listinfo/phpwebsite-developers -- Matthew McNaney Electronic Student Services Appalachian State University http://phpwebsite.appstate.edu |
From: Shaun M. <sh...@ae...> - 2006-05-02 22:03:13
|
On 2 May 2006, at 21:52, Matthew McNaney wrote: > > Understand rewriting key is a major undertaking so I must be positive > the delay would be worth it. > I'll second Eloi's call for separate key tables for each module. Most modules won't be a problem but there's one specific module which needs all the help it can get and that's ANY forum module. Potentially you've got millions of posts in a forum. Inefficient queries and searches can bring a site to it's knees. Keeping the queries as efficient as possible and using the indexes is very important. While I'm on searches. If you're re-doing searching, a time limited search would be useful so that, by default at least, a user isn't searching every single item ever published on a site. I found that on a site with 3-4 years of cruft, and a few hundred thousand forum posts, just setting the default for a search to 'the last 30 days' instead of 'since the dawn of time' resulted in a load less resources being used. A little bit of social engineering goes a long way to make up for cheap-ass servers (not that I have any of those now) or bad code. Shaun aegis design - http://www.aegisdesign.co.uk aegis hosting - http://www.aegishosting.co.uk |