From: Rutger V. <rut...@gm...> - 2009-04-29 17:51:52
Hi all,

Bill, Mark, Val and I just had a conference call on pending issues pre-beta. We decided there really aren't any: we're ready to start beta testing. The topic then moved to what to do after testing. One of the main issues is how we will move the actual data in TreeBASE2 from the SDSC database instance (i.e. DB2 sitting on a computer in San Diego) to the NESCent instance (i.e. PG sitting on a computer in Durham).

One possibility is that we use the scripts we've been using to import TreeBASE1 data into TreeBASE2. Unfortunately, loading the data that way takes a fair amount of time (think weeks) and human intervention.

A second possibility would be to write a program that, from NESCent, connects to the SDSC instance through JDBC and loads the data record by record. This might take a long time too, and it'll depend on the JDBC connection staying up for that entire time.

To kick off the discussion I'd like to suggest a third possibility: we implement functionality where each table can be dumped to some delimited format (CSV, say); the dumps are made available for download as a compressed archive; and the NESCent machine downloads that archive and loads the tables into PG. It seems to me that we want the dump+zip+serve-up functionality anyway, so this would be a good way to make that happen.

Any thoughts?

Thanks,

Rutger

--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com
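A minimal sketch of the dump half of option 3, assuming the project's Java/JDBC stack: stream a table out of the DB2 instance as gzipped CSV. The table name, CSV quoting rules, and file layout below are illustrative assumptions, not the actual TreeBASE2 schema or code.

    // Rough sketch only: dump one table from the SDSC DB2 instance to gzipped CSV
    // over JDBC. Quoting rules and file layout here are assumptions.
    import java.io.*;
    import java.sql.*;
    import java.util.zip.GZIPOutputStream;

    public class TableDumper {
        public static void dumpTable(Connection db2, String table, File outFile)
                throws SQLException, IOException {
            try (Statement st = db2.createStatement();
                 ResultSet rs = st.executeQuery("SELECT * FROM " + table);
                 Writer w = new OutputStreamWriter(
                         new GZIPOutputStream(new FileOutputStream(outFile)), "UTF-8")) {
                int cols = rs.getMetaData().getColumnCount();
                while (rs.next()) {
                    StringBuilder line = new StringBuilder();
                    for (int i = 1; i <= cols; i++) {
                        if (i > 1) line.append(',');
                        String val = rs.getString(i);
                        // Quote every value so embedded commas, quotes and newlines
                        // survive the round trip; NULL becomes an empty field.
                        line.append(val == null ? "" : '"' + val.replace("\"", "\"\"") + '"');
                    }
                    w.write(line.append('\n').toString());
                }
            }
        }
    }

The per-table files could then be bundled into a single archive and served over HTTP for the NESCent machine to fetch.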
From: Rutger V. <rut...@gm...> - 2009-04-29 19:01:27
(Forwarding to the mailing list.)

On Wed, Apr 29, 2009 at 11:48 AM, Jon Auman <jon...@du...> wrote:
> As the sysadmin, I've got a preference for #3. Importing from a CSV file
> in PostgreSQL is trivial. You also avoid import aborts caused by
> PostgreSQL errors, and you always know what stage you are at during the
> import.
>
> How big do you think the total dump files will be? For PostgreSQL, our
> data dump sizes depend upon the type of data in the database. We've got
> a 400 MB database with a fair amount of binary data that dumps to a
> 200 MB file, and we've got a 1 GB database with no binary data that
> dumps to a 20 MB file. Do you have an idea of the size of the current
> DB2 database on disk, and what kind of data is in there (text or binary)?

I don't know the exact size of the current database, but it's larger than
your cases.

> Also, will this be a one-time operation or ongoing?

One time-ish.

Rutger

--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com
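For the load half, a minimal sketch assuming the PostgreSQL JDBC driver's CopyManager API: stream a gzipped CSV dump into a table with COPY ... FROM STDIN. The table name and file path are placeholders.

    // Rough sketch only: load a gzipped CSV dump into PostgreSQL using the JDBC
    // driver's CopyManager (COPY ... FROM STDIN). Table name and file path are
    // placeholders; index and constraint handling are left out.
    import java.io.*;
    import java.sql.Connection;
    import java.util.zip.GZIPInputStream;
    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    public class TableLoader {
        public static void loadTable(Connection pg, String table, File csvGz)
                throws Exception {
            CopyManager copy = pg.unwrap(PGConnection.class).getCopyAPI();
            try (Reader in = new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(csvGz)), "UTF-8")) {
                long rows = copy.copyIn("COPY " + table + " FROM STDIN WITH CSV", in);
                System.out.println(table + ": " + rows + " rows loaded");
            }
        }
    }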
From: William P. <wil...@ya...> - 2009-04-29 20:04:42
On Apr 29, 2009, at 3:01 PM, Rutger Vos wrote:

>> How big do you think the total dump files will be? For PostgreSQL, our
>> data dump sizes depend upon the type of data in the database. We've got
>> a 400 MB database with a fair amount of binary data that dumps to a
>> 200 MB file, and we've got a 1 GB database with no binary data that
>> dumps to a 20 MB file. Do you have an idea of the size of the current
>> DB2 database on disk, and what kind of data is in there (text or binary)?
>
> I don't know the exact size of the current database, but it's larger
> than your cases.

I think DB2's footprint is on the order of 100 GB -- is that right?

Our largest table is probably the matrix element data -- let's say that
there are 3,000 matrices with an average of 100 taxa and 1,000 bases --
that by itself is 300,000,000 records. The next biggest is probably the
edges table -- we have about 5,500 trees; supposing the average tree has
80 leaves, that comes to about 5,500 * ((80 * 2) - 1) = 875,000 edge
records. After that, the taxon_variant table has about 550,000 records,
etc.

bp
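These are back-of-envelope figures; one quick way to check them would be to count rows directly over JDBC. The sketch below is hypothetical and uses the table names cited later in this thread (MATRIXELEMENT, PHYLOTREENODE, MATRIXROW, TAXONVARIANT).

    // Hypothetical check of the estimates above: count rows in the biggest tables.
    // args are the JDBC URL, user name, and password for the DB2 instance.
    import java.sql.*;

    public class RowCounts {
        public static void main(String[] args) throws SQLException {
            String[] tables = {"MATRIXELEMENT", "PHYLOTREENODE", "MATRIXROW", "TAXONVARIANT"};
            try (Connection conn = DriverManager.getConnection(args[0], args[1], args[2]);
                 Statement st = conn.createStatement()) {
                for (String t : tables) {
                    try (ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM " + t)) {
                        rs.next();
                        System.out.printf("%-16s %,d rows%n", t, rs.getLong(1));
                    }
                }
            }
        }
    }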
From: Val T. <va...@ci...> - 2009-04-29 20:27:46
MJD had the exact figures a few months ago. What are they, Mark?

Val

On Apr 29, 2009, at 4:04 PM, William Piel wrote:

> I think DB2's footprint is on the order of 100 GB -- is that right?
> [...]
From: Mark D. <mj...@ge...> - 2009-04-29 20:44:19
Val Tannen wrote:
> MJD had the exact figures a few months ago. What are they, Mark?

I don't have the complete figures right now. But I calculated the value
for the matrix element table last week, when I thought we were discussing
this, and the figure is about 36 GB. The other tables are much smaller.

Whatever happened to last week's discussion? Did my messages not get
through to the list?
From: Mark D. <mj...@ge...> - 2009-04-29 21:00:37
Mark Dominus wrote:
> Val Tannen wrote:
>> MJD had the exact figures a few months ago. What are they, Mark?
>
> I don't have the complete figures right now. But I calculated the value
> for the matrix element table last week, when I thought we were discussing
> this, and the figure is about 36 GB.

Sorry, my mistake: DB2 reports a table size of 25.8 GB. I will have it
produce figures for the other tables. The next-largest is PHYLOTREENODE.
All the other tables are reasonable sizes.
From: Mark D. <mj...@ge...> - 2009-04-30 16:38:50
Mark Dominus wrote:
> I will have it produce figures for the other tables.

Quick summary: 27 GB.

==============================================================================

Regenerating the statistics was taking too long, perhaps because of some
sort of deadlock. But using the statistics from last week, here is a list
of the tables whose sizes exceed 1 MB:

  Table                          Records     Estimated size (bytes)
  CHARSET_COLRANGE               216,902              5,639,452
  COLUMNRANGE                    252,922              9,611,036
  DISCRETECHARSTATE              168,094             10,758,016
  MATRIXCOLUMN                 5,630,858            310,000,000
  MATRIXELEMENT              300,000,000         25,800,000,000
  MATRIXROW                      182,396             39,032,744
  PHYLOCHAR                       32,405              1,944,300
  PHYLOTREE                        5,603              1,933,035
  PHYLOTREENODE                  431,761             53,970,124
  SUB_TAXONLABEL                 102,749              2,671,474
  TAXON                          424,785             33,558,016
  TAXONLABEL                     232,262             16,722,864
  TAXONLABELSET_TAXONLABEL       201,327              6,039,810
  TAXONVARIANT                   548,169             62,491,264

The second column is the number of records; the third is DB2's estimate
of the total data size, excluding LOBs. To calculate this, it takes the
average row size, adds ten bytes of overhead, and multiplies by the number
of records. The row size includes the descriptor for any LOB fields but
not the LOB data itself.

To estimate the LOB data, I got DB2 to give me a list of LOB fields; there
were four. Then I did 'select sum(length(FIELD)) from TABLE' for each.
The results are as follows:

  PHYLOTREE.NEWICKSTRING          6,001,170
  STUDY_NEXUSFILE.NEXUS         333,845,436
  HELP.HELPTEXT                      24,071
  MATRIXROW.SYMBOLSTRING        307,037,386

Total size estimate exclusive of LOB data:  26,357.582372 MB
Total LOB estimate:                            646.908063 MB

TOTAL ESTIMATED SIZE: 27 GB
From: Mark D. <mj...@ge...> - 2009-04-29 21:04:26
Mark Dominus wrote:
> The other tables are much smaller.

PHYLOTREENODE has 431,761 records and a table size of 53.9 MB. MATRIXROW
has 182,396 records and a table size of 39 MB.

The TAXON, TAXONLABEL, and TAXONVARIANT tables are big, but we agreed
today to recreate them at NESCent rather than transferring them from
SDSC. The other tables are not too big; I will send a complete report in
the morning.
From: Mark D. <mj...@ge...> - 2009-04-29 20:39:30
Rutger Vos wrote:
>> Also, will this be a one-time operation or ongoing?
>
> One time-ish.

Yes. But Hilmar says:

> the deliverable that we want from this is a fully scripted process that
> takes the dumps from DB2, does its thing, and at the end the data is
> fully imported into the NESCent PostgreSQL instance. The reason is that
> we want this to be repeatable. I.e., we will test first, fix, rerun,
> etc., until we can switch over in a well-coordinated fashion that
> involves only minimal downtime.
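A rough outline of what that repeatable process could look like, reusing the hypothetical loadTable() sketch from earlier in the thread: each run truncates and reloads every table from its dump, so the whole thing can be rerun until switchover. The table list, file naming, and foreign-key handling are assumptions.

    // Rough outline of a repeatable load: wipe and reload every table from its
    // dump, committing per table. The real thing would have to truncate in an
    // order that respects foreign keys (or drop/defer constraints first).
    import java.io.File;
    import java.sql.Connection;
    import java.sql.Statement;

    public class FullReload {
        public static void reload(Connection pg, File dumpDir, String[] tables)
                throws Exception {
            pg.setAutoCommit(false);
            for (String table : tables) {
                try (Statement st = pg.createStatement()) {
                    st.execute("TRUNCATE TABLE " + table);  // rerun-safe: start empty
                }
                TableLoader.loadTable(pg, table, new File(dumpDir, table + ".csv.gz"));
                pg.commit();  // commit per table so a failure is easy to localize
            }
        }
    }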
From: Mark D. <mjd...@ge...> - 2009-04-29 20:58:37
Rutger Vos wrote:
> One possibility is that we use the scripts we've been using to import
> TreeBASE1 data into TreeBASE2. Unfortunately, loading the data that
> way takes a fair amount of time (think weeks) and human intervention.

This is definitely my last choice.

> A second possibility would be to write a program that, from NESCent,
> connects to the SDSC instance through JDBC and loads the data record
> by record. This might take a long time too,

I discussed this at some length in my message of 24 April, and I was
hoping for a response from Hilmar. Here are my comments from my earlier
message:

> The drawback of this comes if we need to import the SDSC data a second
> time for some reason. It would all have to be transferred over the
> network a second time. A dump file need only be transferred once, and
> then could be stored at NESCent and loaded as many times as needed. The
> benefit would be that there would be no need to worry about escape-code
> conventions or strange characters or anything like that, and there
> would be no need to ship around a bunch of big files.

I regret that I wasn't clear about the size of the "big files". The
MATRIXELEMENT table has 3e8 records of at least 61 bytes each, which
means the dump file is at least 18.3 GB.

> it'll depend on the JDBC connection staying up for that entire time.

No; we would write the program to do the loading incrementally. We would
have to do that anyway, since we would have the same problem just copying
the dump file: we cannot expect to copy a 20 GB dump file cross-country
in one piece.

> To kick off the discussion I'd like to suggest a third possibility: we
> implement functionality where each table can be dumped to some
> delimited format (CSV, say);

Hilmar specifically suggested dumping the data in SQL format:

> Ideally we can get DB2 to dump the data as SQL standard-compliant
> INSERT statements,

CSV is a poor format, and if we are going to write custom dump-and-load
programs, I hope we do something better. DB2 will already dump the data
into compact "IXF" files; I am still waiting for a reply from Hilmar
about whether this will be useful to us. DB2 does not have any other
useful built-in dumping capabilities.
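A hypothetical sketch of that incremental approach: copy MATRIXELEMENT in primary-key ranges, batching inserts and committing per chunk, so a dropped JDBC connection only costs the current chunk rather than the whole transfer. The column names and key-based paging below are assumptions, not the real TreeBASE2 schema.

    // Hypothetical sketch of incremental copying: pull MATRIXELEMENT in
    // primary-key ranges from DB2 and batch-insert into PostgreSQL, committing
    // per chunk. Column names are placeholders. Assumes autocommit is off on
    // the PostgreSQL connection.
    import java.sql.*;

    public class IncrementalCopy {
        static final int CHUNK = 50000;

        public static long copyChunk(Connection db2, Connection pg, long afterId)
                throws SQLException {
            long lastId = afterId;
            try (PreparedStatement read = db2.prepareStatement(
                     "SELECT MATRIXELEMENT_ID, MATRIXROW_ID, MATRIXCOLUMN_ID, VALUE"
                     + " FROM MATRIXELEMENT WHERE MATRIXELEMENT_ID > ?"
                     + " ORDER BY MATRIXELEMENT_ID FETCH FIRST " + CHUNK + " ROWS ONLY");
                 PreparedStatement write = pg.prepareStatement(
                     "INSERT INTO MATRIXELEMENT"
                     + " (MATRIXELEMENT_ID, MATRIXROW_ID, MATRIXCOLUMN_ID, VALUE)"
                     + " VALUES (?, ?, ?, ?)")) {
                read.setLong(1, afterId);
                try (ResultSet rs = read.executeQuery()) {
                    while (rs.next()) {
                        lastId = rs.getLong(1);
                        write.setLong(1, lastId);
                        write.setLong(2, rs.getLong(2));
                        write.setLong(3, rs.getLong(3));
                        write.setString(4, rs.getString(4));
                        write.addBatch();
                    }
                }
                write.executeBatch();
                pg.commit();  // one commit per chunk; a failure only loses this chunk
            }
            return lastId;    // persist this so a rerun can resume where it stopped
        }
    }

A driver loop would call copyChunk() repeatedly, saving the returned id after each commit, until no new rows come back.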