Re: [limesurvey-developers] Alternative Database Design
The leading Open Source survey tool
Brought to you by:
c_schmitz
From: Macasek, M. A. <mma...@mi...> - 2007-09-12 12:57:25
|
Hello all, Based on this email and some other ones that were sent directly to me the general consensus is to use a horizontal approach. Now that I have a better understanding of the discussion that took place/is taking place I will try to sit down with some db folks here at MITRE to see how they feel and potentially run some additional tests.=20 One important point to make with regard to the response table is that the table is fairly isolated in swapping it out for a different approach (vertical or horizontal) in the future will not be that complicated. I would like to shift the focus onto the entire db schema for a moment. As I previously mentioned I have taken some time to review the proposed schema and I have a series of suggestions that I would like to make. Some of these suggestions deal with refactoring of somewhat complex tables and some to do with providing more flexibility in the data model for a wider range of potential questions.=20 I have attached 3 documents. These documents contain an initial db schema that incorporates many of these ideas. The sql file is the schema definition that can be dropped into a mysql database. The PDF is a visual representation of the model. Finally the doc is a schema walk though that also provides some motivation for some of the decisions I made.=20 Please take some time to review this schema and let me know your feedback. This schema does include the vertical response table but it does not require it (some complications may arise with a horizontal table vs. the vertical one in this new schema which I have pointed out in the doc). I am going to try and get on IRC later so hopefully we can discuss this further. Michael -----Original Message----- From: lim...@li... [mailto:lim...@li...] On Behalf Of dol...@in... Sent: Wednesday, September 12, 2007 5:12 AM To: lim...@li... Subject: Re: [limesurvey-developers] Alternative Database Design Hello Everyone, Just to introduce myself, i'm David Olivier, and i've been a developer on this projects for a long time now. I would just like to give my opinions about this. Horizontal Pros: 1. Easy to work with internally, both in retrieving and storing. 2. Easy for users to develop software that work with the results. (which i think quite a few do) 3. Can add / remove question with preset number of fields fairly easy on the fly. Horizontal Cons: 1. Questions with a variable number of answers are impossible to cater for. I will use a looping question example here, where a question can loop x amount of times, its unrealistic to add a column when there x > columns allocated. Vertical Pros: 1. Most flexibility with modifying a live survey. 2. Good scalability. 3. Finding text data for coding purposes is quite easy. Vertical Cons: 1. Large tables. 2. Losing a table will corrupt multiple surveys. 3. Have to convert vertical data into horizontal data through sql queries to work with the data. So where do i stand on this... I have been working with the horizontal structure for quite some times, and have no issues with it. Allot of our users develop their own tools to work with the results, some only use it for a data collection. I have read carsten's mail, and i agree that adding columns on the fly is a good simple and straight forward approach, but i think this will only work decently for question types that only take up a couple of columns. For example if i where to add 20 more answer on a multi response question on a live survey (assuming it was busy), then yes i believe this will be a problem. And also the possibility that a the answers of a question can grow for a looping question. Seems i'm not really getting to a point here, and just going through things in my head (i'm very tired today). So here is my suggestion: We use horizontal storage, but we only use only one field per question. The software should make sure there is always a surplus of fields available for mapping. - In this case, we would only do an alteration to the table when we run out of columns, so then we can add another x amount of columns. We store xml data for the question inside these fields. (thus keeping it one column a question) We can allow for custom field mapping? - if we want to, since all questions will be mapped to fields. With this approach, we would have as many columns as questions, rather than having as many columns as answers. This will save us allot on db performance. Storing data in xml will give us flexibility in adding metadata to answers received. Catering for custom field mapping will give our users flexibility in using the software for their specific needs. Regards, David Olivier Macasek, Michael A. wrote: > > All, > > Well this is my second attempt to write this email, MS Outlook Web > Mail decide it had timed out so when I hit send the last time my > rather long email was completely lost! So let me try again.... > > Sorry for not participating in the discussions on IRC. MITRE has IRC > blocked so I am unable to connect. I am working on a solution to this > problem so hopefully I will be able to participate in the future. > Carsten suggested that I tunnel out but given the sensitivity of > MITRE's work I would be in a world of trouble! I should have a > temporary solution in place tomorrow so I can participate in the > discussions with a more long term solution in the works. > > I have reviewed your comments below and I feel that you are correct on > most of your points. Rather than address all the points below made by > Rob and Carsten I would like to take a step back and try to understand > why it is currently thought that a horizontal response table is > preferred over a vertical one. I am sure you have had this > conversation many times but I just want to try and understand the pros > and cons for each approach that were discussed. > > Based on my past experience I believe I can anticipate some of your > pros and cons: > > Horizontal: retrieving of a users response set is very simple, just > get the row and you have everything. Problems arise as soon as a user > creates a new survey with more question/answers than there are columns > in the table so it needs to be altered in real-time. I see tests were > done that showed performance while altering a table is not bad. The > problem with this test is it was done with just one user. Try altering > the table in real-time while 400 users are taking a survey and adding > records to that table. A point was also made that we wanted to avoid > corruption, in my experience that is generally easy to do, you only > tend to run into corruption problem if a user was in the db messing > around. After all this is what dbs are designed to do/avoid. > Dynamically altering tables in real-time introduces performance and > data integrity issues. I believe this is what Rob was trying to > address when he suggested that we limit the response table to 100 > columns. However this is not a real solution either in that you now > have a partly horizontal and partly vertical table. Why not go all > out, go vertical. > > Vertical: this would provide fast inserts and avoids the dynamic > altering of tables. Retrieving a users response set would be a little > more involved with this type of table. I imagine that it was brought > up that a vertical table will have performance issues when the table > gets a lot of records. I would like to address the first point; fast > inserts, this is a must for this type of application. This application > requires that we quickly record the users results and move on. The > point of the tool is to collect users responses so priority should be > given to inserts. Slightly slower reporting in favor of quick inserts > in acceptable in this case. The argument that the vertical table will > be slower, in my view, can be over come with some db optimization > (indexes, query optimization, 'smarter' stored procedures/functions, etc). > > In my experience a vertical response table is the way to go for this > type application. I would like to present this problem to some MITRE > db folks to see what they think. Before I do that I would first like > to be sure I have not missed any parts of this discussion. Please let > me know what I missed. > > In addition to the above comments I have been reviewing the proposed > db schema and I have several suggestion for improvements. I am still > working on finalizing those comments, I should be done with it > tomorrow morning (GMT-5) so I will send them out then. > > Here is some additional contact info for me: > Work Phone: 1-781-271-6321 > Work Email: mma...@mi... > Personal Email: ma...@gm... > AIM: michaelmacasek > Jabber: ma...@gm... > > If other protocols are being used let me know and I will get accounts > there too. > > Just a quick poll; where is everyone from? I'm from Boston, MA USA. > > Talk to you tomorrow, > > Michael > > > -----Original Message----- > From: lim...@li... on behalf of > Carsten Schmitz > Sent: Tue 9/11/2007 2:49 PM > To: lim...@li... > Subject: Re: [limesurvey-developers] Alternative Database Design > > Hello Robert, Hello Michael, > > we had a long discussion on IRC about your proposed data model. I think > it is an interesting idea indeed. > However we found a few flaws which we would like to point out and > propose another solution. > > > We see a few disadvantages there: > > - Forcing tables to have always the same number of columns - that large > tables could lead to preformance degradation. Even if I should have a > survey that is very small I still have tables with 100 columns in there > - for each datatype. > - Data overlapping into the next row: This make selection very tricky > with survey that extends the 100 column limit. It adds the unwanted > vertical dimension here and you would still have to select vertically > here for special queries. This porblem adds because most surveys always > feature same kind of questions and so same kind of datatypes... normally > the 100 column limit will be hit most of the time with medium survey > sizes. > - Field mapping: By mapping the result rows to an result main table you > add another step of complexity slowing down selection. > - Selection difficulty: You can select horizontally. But for evaluation > purposes you will have a complex statement to bring the data from 4 > different tables together plus the mapping select before that. So it not > horizontally but still vertica.. only that the horizontal space is used > better > - I assume that you want to put all responses off all surveys into these > tables. You will have _a lot_ of data on your hand after a short time -=20 > and the worst is that there is no separation of data. Assume that the > tables get corrupted for any reason - a line of code that writes data to > the wrong column. ALL of your surveys will be corrupted at once and > maybe even unnoticed for some time. > > > > How about this easy solution: > > Adding columns on the fly as needed for the result table - one or more > result tables per survey. > > I know that adding columns on the fly seems to be a very big "no-no" . > But why not? Yes.. adding columns on a big result table may be slow . > But most of the time all the possible outcomes are explored with the > first 1000 surveys. I did performance test on a 100,000 records table > with 40 fields of all different kinds and it needed like 2 seconds max > for adding a columns on an old Xenon 933 Mhz server. It will be even > faster on a decent server being up-to date. > > And for the maximum limit on columns.. just lap over into an additional > result table if needed. > Still you can connect these multiple result tables with a simple select > and a join statement. > > We can think about allowing the administrator of a survey to add columns > on the fly too or to delete these or mark these as deleted. (give him > the option: do you want to delete this question and the results for this > question? Or just want to make this question invisible/mark as deleted?) > > Advantages: > > - Flexible - Expandable on the fly > - Select data straight forward from the table(s) > - 'Unlimited' columns > > > Another thing: > Your biggest point is the possibility to save unlimited answers. > The problem I see with this discussion is: Where do we set our limit? > What is the exact goal? There is no "unlimited". It always a > theoretical thing. You can say the number of question is at least > limited by the server itself and its storing capacity so we may never > put the word unlimited on anything. > > Additionally the claim that every table is creating a file with MyIsam > tables is true but it is not really problem... even with thousands and > thousands of surveys the file space won't give up on it - not even on > Windows :-). Most other database do not use files per table so it is > only MyISAM. > > Please consider seriously our suggestion. > > Btw... it would be really great if you guys could join IRC and discuss > with us... normally around 18:00 GMT you will find most of the > developers there on weekday and immediate feedback is a good thing for > discussion. Discussing by email can be painfully slow and strenuous. > > Best regards > > Carsten > > > > > > Thew, Robert wrote: > > Hello all, > > > > Here's an alternative db design that addresses some of the limits of > > the current version. This is not a complete re-working; it just > > concerns the tables that save the responses. It attempts to fix the > > problems of the auto-generated response tables, but still maintain the > > efficiency. > > > > The idea is to have a response table for each data type, like Integer, > > Text, Data. Each response type table would store 100 columns of data > > per row, and each survey group would map its questions to the response > > type tables. > > > > There would be a RESPONSE table which would store all the meta info > > about each survey response: the user, the survey, the timestamp. The > > RESPONSE_ID from this table would be the foreign key used in all the > > response tables. > > > > For instance, the RESPONSE_INT table would be defined: > > RESPONSE_ID | ROW_ID | ORDER_ID | COL_1 | COL_2 | ... | COL_100 > > > > The GROUP_RESPONSE_MAP table would be: > > GROUP_ID | QUESTION_ID | TYPE | ROW | COLUMN > > > > This would map each question in a group to a particular row and column > > in the response table for the question's type. > > > > This fixes a couple of problems. The number of answers allowed in a > > survey is no longer limited by the amount of columns allowed in a table > > by the database. If there are more than 100 questions of a type, then > > the response takes up another row. The number of tables in the > > tablespace will not grow as surveys are created and modified either. > > Now as surveys are created, data is added to tables, not tables to > > tablespaces. This should also make it easier to share data between > > LimeSurvey instances, because it's only data that changes, not schemas. > > > > This also allows for the possibility of looping in a survey. By mapping > > the individual groups to the responses, any group in a survey can be > > looped, and be answered multiple times without an artificial limit > > placed on the number of responses. > > > > At the same time, it should come close to the efficiency of the > > original design, because data is still being read horizontally, and the > > number of queries won't grow beyond the number of data types in the > > survey. > > > > Thanks, > > Robert Thew > > > > Ps. Could one of the project admins add me to the project in > > SourceForge as a developer? My user name is robthew. > > > > > ----------------------------------------------------------------------- -- > > This SF.net email is sponsored by: Microsoft > > Defy all challenges. Microsoft(R) Visual Studio 2005. > > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > > _______________________________________________ > > limesurvey-developers mailing list > > lim...@li... > > https://lists.sourceforge.net/lists/listinfo/limesurvey-developers > > > > =20 > > ----------------------------------------------------------------------- -- > This SF.net email is sponsored by: Microsoft > Defy all challenges. Microsoft(R) Visual Studio 2005. > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > _______________________________________________ > limesurvey-developers mailing list > lim...@li... > https://lists.sourceforge.net/lists/listinfo/limesurvey-developers > > ----------------------------------------------------------------------- - > > ----------------------------------------------------------------------- -- > This SF.net email is sponsored by: Microsoft > Defy all challenges. Microsoft(R) Visual Studio 2005. > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > ----------------------------------------------------------------------- - > > _______________________________________________ > limesurvey-developers mailing list > lim...@li... > https://lists.sourceforge.net/lists/listinfo/limesurvey-developers > =20 ----------------------------------------------------------------------- -- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2005. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ _______________________________________________ limesurvey-developers mailing list lim...@li... https://lists.sourceforge.net/lists/listinfo/limesurvey-developers |