From: Jon A. <jon...@du...> - 2009-04-29 21:36:06
Ok, so I'll pick up this thread instead of the one earlier today that hijacked this conversation. I was swamped last week with other 'stuff', so I did not jump in to the conversation. My experience with PostgreSQL-to-PostgreSQL dumps has been spotty. PostgreSQL can dump to plain text, tar, or binary format. Sometimes the sequences try to get imported before the table and the whole import fails, in text mode. Some databases import just fine in text mode, though.

That's why I thought the CSV file seemed like a good place to try. It's also the method documented on PostgreSQL's wiki: http://wiki.postgresql.org/wiki/Image:DB2UDB-to-PG.pdf Of course, you would have to be careful to pick a delimiter that does not occur in the data values.

I'm not opposed to the web service slurping the data in over the wire, but it seems like more work and more difficult to troubleshoot. Of course, I'm not much of a programmer, so I'm looking at things from a sys admin point of view. That's also a good point about encoding.

-Jon

-------------------------------------------------------
Jon Auman, Systems Administrator
National Evolutionary Synthesis Center, Duke University
http://www.nescent.org
jon...@ne...
------------------------------------------------------

On Apr 24, 2009, at 2:21 PM, Mark Dominus wrote:

> Hilmar Lapp wrote:
>> On Apr 22, 2009, at 4:45 PM, Mark Dominus wrote:
>>> I understand there is probably some way to dump the TB2-format data as it currently exists at SDSC, transfer the dump files to NESCent, and bulk-load them into the database on the NESCent side.
>>
>> In theory yes, but in practice each RDBMS has its own dump format. Ideally we can get DB2 to dump the data as SQL standard-compliant INSERT statements, but I don't know DB2 enough yet to know whether it does that, and aside from that there's more than the data itself, such as the sequence(s), grants, etc. that may not dump in a format that's readily ingestible by Pg.
>
> It dumps the sequences, grants, foreign key constraints, and so forth, as SQL; see trunk/schema.sql.
>
> But for dumping the data, it seems as though we can get any format we want, as long as it is IXF.
>
> So it then occurred to me that it would not be hard to write a program that would scan all the records in a table and write out a series of SQL INSERT statements.
>
> But rather than do that, it seems to me that it might make more sense to skip the text representation and just write a program that would run at NESCent, scan the tables over the network, and execute the appropriate INSERT statements directly, without ever serializing the data in between.
>
> The drawback of this comes if we need to import the SDSC data a second time for some reason. It would all have to be transferred over the network a second time. A dump file need only be transferred once, and then could be stored at NESCent and loaded as many times as needed. The benefit would be that there would be no need to worry about escape code conventions or strange characters or anything like that, and there would be no need to ship around a bunch of big files.
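For reference, the PostgreSQL side of the delimited import Jon describes is a single COPY (or psql \copy) statement. A minimal sketch, assuming a hypothetical table name, column list, file path, and a '|' delimiter chosen so it does not collide with the data values:

    -- Server-side bulk load; table, columns, path, and delimiter are
    -- illustrative placeholders, not the real TB2 schema.
    COPY matrixelement (matrixelement_id, matrixrow_id, matrixcolumn_id, value)
        FROM '/srv/dumps/matrixelement.dmp'
        WITH DELIMITER '|' NULL '' CSV;

    -- psql's \copy variant reads the file on the client instead of the
    -- database server (must be written on a single line):
    -- \copy matrixelement from '/srv/dumps/matrixelement.dmp' with delimiter '|' null '' csv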
From: Mark D. <mj...@ge...> - 2009-04-29 21:04:26
Mark Dominus wrote:
> The other tables are much smaller.

PHYLOTREENODE has 431761 records and a table size of 53.9 MB. MATRIXROW has 182396 records and a table size of 39 MB.

The TAXON, TAXONLABEL, and TAXONVARIANT tables are big but we agreed today to recreate them at NESCent rather than transferring them from SDSC. The other tables are not too big; I will send a complete report in the morning.
From: Mark D. <mj...@ge...> - 2009-04-29 21:00:37
Mark Dominus wrote:
> Val Tannen wrote:
>> MJD had the exact figures a few months ago. What are they, Mark?
>
> I don't have the complete figures right now. But I calculated the values for the matrix element table last week because I thought we were discussing this last week and the figure is about 36 GB.

Sorry, my mistake. DB2 reports a table size of 25.8 GB. I will have it produce figures for the other tables. The next-largest is PHYLOTREENODE. All the other tables are reasonable sizes.
From: Mark D. <mjd...@ge...> - 2009-04-29 20:58:37
Rutger Vos wrote:
> One possibility is that we use the scripts we've been using to import TreeBASE1 data into TreeBASE2. Unfortunately, loading the data that way takes a fair amount of time (think weeks) and human intervention.

This is definitely my last choice.

> A second possibility would be to write a program that, from NESCent, connects to the SDSC instance through JDBC and loads the data record by record. This might take a long time too,

I discussed this at some length in my message of 24 April, and I was hoping for a response from Hilmar. Here are my comments from my earlier message:

> The drawback of this comes if we need to import the SDSC data a second time for some reason. It would all have to be transferred over the network a second time. A dump file need only be transferred once, and then could be stored at NESCent and loaded as many times as needed. The benefit would be that there would be no need to worry about escape code conventions or strange characters or anything like that, and there would be no need to ship around a bunch of big files.

I regret that I wasn't clear about the size of the "big files". The MATRIXELEMENT table has 3e8 records with at least 61 bytes each, which means the dump file is at least 18.4 GB.

> it'll depend on the JDBC connection staying up for that entire time.

No; we would write the program to do the loading incrementally. We would have to do that anyway, since we would have the same problem just copying the dump file. We cannot expect to copy a 20 GB dump file cross-country in one piece.

> To kick off the discussion I'd like to suggest a third possibility: we implement functionality where each table can be dumped to some delimited format (CSV, say);

Hilmar specifically suggested dumping the data in SQL format:

> Ideally we can get DB2 to dump the data as SQL standard-compliant INSERT statements,

CSV is a poor format, and if we are going to write custom dump-and-load programs, I hope we do something better. DB2 will already dump the data into compact "IXF" files; I am still waiting for a reply from Hilmar about whether this will be useful to us. DB2 does not have any other useful built-in dumping capabilities.
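One way such an incremental loader could walk a 3e8-row table without keeping a single cursor or transaction open for the whole transfer is keyset pagination on the primary key. A sketch of the queries involved, assuming (hypothetically) that the MATRIXELEMENT primary key column is named MATRIXELEMENT_ID; the '?' markers would be bound by the loader program over JDBC, and the batch size is arbitrary:

    -- On the DB2 (SDSC) side: fetch the next batch after the last key
    -- already copied, in a fresh statement each time.
    SELECT *
      FROM MATRIXELEMENT
     WHERE MATRIXELEMENT_ID > ?          -- last key successfully loaded
     ORDER BY MATRIXELEMENT_ID
     FETCH FIRST 10000 ROWS ONLY;

    -- On the PostgreSQL (NESCent) side: record progress so an interrupted
    -- run can resume; this bookkeeping table is invented for the sketch.
    CREATE TABLE migration_progress (
        table_name  varchar(64) PRIMARY KEY,
        last_key    bigint NOT NULL
    );
    UPDATE migration_progress SET last_key = ? WHERE table_name = 'MATRIXELEMENT';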
From: Mark D. <mj...@ge...> - 2009-04-29 20:44:19
Val Tannen wrote:
> MJD had the exact figures a few months ago. What are they, Mark?

I don't have the complete figures right now. But I calculated the values for the matrix element table last week because I thought we were discussing this last week, and the figure is about 36 GB. The other tables are much smaller.

Whatever happened to last week's discussion? Did my messages not get through to the list?
From: Mark D. <mj...@ge...> - 2009-04-29 20:39:30
Rutger Vos wrote:
>> Also, will this be a one time operation or ongoing?
>
> One time-ish.

Yes. But Hilmar says:

> the deliverable that we want from this is a fully scripted process that takes the dumps from DB2, does its thing, and at the end the data is fully imported into the NESCent PostgreSQL instance. The reason is that we want this to be repeatable. I.e., we will test first, fix, rerun, etc, until we can switch over in a well-coordinated fashion that involves only minimal downtime.
From: Val T. <va...@ci...> - 2009-04-29 20:27:46
MJD had the exact figures a few months ago. What are they, Mark?

Val

On Apr 29, 2009, at 4:04 PM, William Piel wrote:

> On Apr 29, 2009, at 3:01 PM, Rutger Vos wrote:
>
>>> How big do you think the total dump files will be? For Postgresql, our data dump sizes depend upon the type of data in the database. We've got a 400 MB database with a fair amount of binary data that dumps to a 200 MB file, and we've got a 1 GB database with no binary data that dumps to a 20 MB file. Do you have an idea of the size of the current DB2 database on disk, and what kind of data is in there (text or binary)?
>>
>> I don't know the exact size of the current database, but it's larger than your cases.
>
> I think DB2's footprint is on the order of 100 GB -- is that right?
>
> Our largest table is probably the matrix element data -- let's say that there are 3,000 matrices with an average of 100 taxa and 1000 bases -- that by itself is 300,000,000 records. The next biggest is probably the edges table -- we have about 5,500 trees, supposing the average tree has 80 leaves. That comes to about 5500 * ((80 * 2) - 1) = 875,000 edge records. After that, the taxon_variant table has about 550,000 records, etc.
>
> bp
From: William P. <wil...@ya...> - 2009-04-29 20:04:42
On Apr 29, 2009, at 3:01 PM, Rutger Vos wrote:

>> How big do you think the total dump files will be? For Postgresql, our data dump sizes depend upon the type of data in the database. We've got a 400 MB database with a fair amount of binary data that dumps to a 200 MB file, and we've got a 1 GB database with no binary data that dumps to a 20 MB file. Do you have an idea of the size of the current DB2 database on disk, and what kind of data is in there (text or binary)?
>
> I don't know the exact size of the current database, but it's larger than your cases.

I think DB2's footprint is on the order of 100 GB -- is that right?

Our largest table is probably the matrix element data -- let's say that there are 3,000 matrices with an average of 100 taxa and 1000 bases -- that by itself is 300,000,000 records. The next biggest is probably the edges table -- we have about 5,500 trees, supposing the average tree has 80 leaves. That comes to about 5500 * ((80 * 2) - 1) = 875,000 edge records. After that, the taxon_variant table has about 550,000 records, etc.

bp
From: Rutger V. <rut...@gm...> - 2009-04-29 19:01:27
(Forwarding to mailing list.)

On Wed, Apr 29, 2009 at 11:48 AM, Jon Auman <jon...@du...> wrote:

> As the sysadmin, I've got a preference for #3. Importing from a CSV file in Postgresql is trivial. You also avoid import aborts caused by postgresql errors and you always know at what stage you are at during the import.
>
> How big do you think the total dump files will be? For Postgresql, our data dump sizes depend upon the type of data in the database. We've got a 400 MB database with a fair amount of binary data that dumps to a 200 MB file, and we've got a 1 GB database with no binary data that dumps to a 20 MB file. Do you have an idea of the size of the current DB2 database on disk, and what kind of data is in there (text or binary)?

I don't know the exact size of the current database, but it's larger than your cases.

> Also, will this be a one time operation or ongoing?

One time-ish.

Rutger

> On Apr 29, 2009, at 1:51 PM, Rutger Vos wrote:
>
> Hi all,
>
> Bill, Mark, Val and I just had a conference call on pending issues pre-beta. We decided there really aren't any: we're ready to start beta testing. The topic then moved to what to do after testing. One of the main issues is how we will move the actual data in TreeBASE2 from the SDSC database instance (i.e. DB2 sitting on a computer in San Diego) to the NESCent instance (i.e. PG sitting on a computer in Durham).
>
> One possibility is that we use the scripts we've been using to import TreeBASE1 data into TreeBASE2. Unfortunately, loading the data that way takes a fair amount of time (think weeks) and human intervention.
>
> A second possibility would be to write a program that, from NESCent, connects to the SDSC instance through JDBC and loads the data record by record. This might take a long time too, and it'll depend on the JDBC connection staying up for that entire time.
>
> To kick off the discussion I'd like to suggest a third possibility: we implement functionality where each table can be dumped to some delimited format (CSV, say); the dumps are made available for download as a compressed archive; the NESCent machine downloads that archive and loads the tables into PG. It seems to me that we want the dump+zip+serve up functionality anyway, so this would be a good way to make that happen.
>
> Any thoughts?
>
> Thanks,
>
> Rutger

--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com
From: Rutger V. <rut...@gm...> - 2009-04-29 17:51:52
Hi all,

Bill, Mark, Val and I just had a conference call on pending issues pre-beta. We decided there really aren't any: we're ready to start beta testing. The topic then moved to what to do after testing. One of the main issues is how we will move the actual data in TreeBASE2 from the SDSC database instance (i.e. DB2 sitting on a computer in San Diego) to the NESCent instance (i.e. PG sitting on a computer in Durham).

One possibility is that we use the scripts we've been using to import TreeBASE1 data into TreeBASE2. Unfortunately, loading the data that way takes a fair amount of time (think weeks) and human intervention.

A second possibility would be to write a program that, from NESCent, connects to the SDSC instance through JDBC and loads the data record by record. This might take a long time too, and it'll depend on the JDBC connection staying up for that entire time.

To kick off the discussion I'd like to suggest a third possibility: we implement functionality where each table can be dumped to some delimited format (CSV, say); the dumps are made available for download as a compressed archive; the NESCent machine downloads that archive and loads the tables into PG. It seems to me that we want the dump+zip+serve-up functionality anyway, so this would be a good way to make that happen.

Any thoughts?

Thanks,

Rutger

--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com
From: Rutger V. <rut...@gm...> - 2009-04-24 20:01:39
It would be very useful if we did have a dump format (I imagined something simple like CSV or some other delimited format). Some databases offer this for downloads (e.g. NCBI taxonomy, ITIS, "Mammal Species of the World") and it's a very useful feature. If we want TreeBASE to be more than a place where trees go to die, this would be one way to facilitate meta-analyses and such.

DB2::Admin on CPAN (http://search.cpan.org/dist/DB2-Admin/) has a facility to dump DB2 tables as delimited files, so we could write a cron job script to do just that and make the output available as an archive.

On Fri, Apr 24, 2009 at 11:21 AM, Mark Dominus <mjd...@ge...> wrote:

> Hilmar Lapp wrote:
>> On Apr 22, 2009, at 4:45 PM, Mark Dominus wrote:
>>> I understand there is probably some way to dump the TB2-format data as it currently exists at SDSC, transfer the dump files to NESCent, and bulk-load them into the database on the NESCent side.
>>
>> In theory yes, but in practice each RDBMS has its own dump format. Ideally we can get DB2 to dump the data as SQL standard-compliant INSERT statements, but I don't know DB2 enough yet to know whether it does that, and aside from that there's more than the data itself, such as the sequence(s), grants, etc. that may not dump in a format that's readily ingestible by Pg.
>
> It dumps the sequences, grants, foreign key constraints, and so forth, as SQL; see trunk/schema.sql.
>
> But for dumping the data, it seems as though we can get any format we want, as long as it is IXF.
>
> So it then occurred to me that it would not be hard to write a program that would scan all the records in a table and write out a series of SQL INSERT statements.
>
> But rather than do that, it seems to me that it might make more sense to skip the text representation and just write a program that would run at NESCent, scan the tables over the network, and execute the appropriate INSERT statements directly, without ever serializing the data in between.
>
> The drawback of this comes if we need to import the SDSC data a second time for some reason. It would all have to be transferred over the network a second time. A dump file need only be transferred once, and then could be stored at NESCent and loaded as many times as needed. The benefit would be that there would be no need to worry about escape code conventions or strange characters or anything like that, and there would be no need to ship around a bunch of big files.

--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com
From: Mark D. <mjd...@ge...> - 2009-04-24 18:22:01
Hilmar Lapp wrote:
> On Apr 22, 2009, at 4:45 PM, Mark Dominus wrote:
>> I understand there is probably some way to dump the TB2-format data as it currently exists at SDSC, transfer the dump files to NESCent, and bulk-load them into the database on the NESCent side.
>
> In theory yes, but in practice each RDBMS has its own dump format. Ideally we can get DB2 to dump the data as SQL standard-compliant INSERT statements, but I don't know DB2 enough yet to know whether it does that, and aside from that there's more than the data itself, such as the sequence(s), grants, etc. that may not dump in a format that's readily ingestible by Pg.

It dumps the sequences, grants, foreign key constraints, and so forth, as SQL; see trunk/schema.sql.

But for dumping the data, it seems as though we can get any format we want, as long as it is IXF.

So it then occurred to me that it would not be hard to write a program that would scan all the records in a table and write out a series of SQL INSERT statements.

But rather than do that, it seems to me that it might make more sense to skip the text representation and just write a program that would run at NESCent, scan the tables over the network, and execute the appropriate INSERT statements directly, without ever serializing the data in between.

The drawback of this comes if we need to import the SDSC data a second time for some reason. It would all have to be transferred over the network a second time. A dump file need only be transferred once, and then could be stored at NESCent and loaded as many times as needed. The benefit would be that there would be no need to worry about escape code conventions or strange characters or anything like that, and there would be no need to ship around a bunch of big files.
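To make the escaping concern concrete: a hand-rolled dumper that emits INSERT statements has to quote every string value correctly before PostgreSQL will accept it. A sketch of what such generated output might look like, with made-up table and column names and a value containing an embedded apostrophe:

    -- Hypothetical output of a custom dump program; note the doubled single
    -- quote, which is what standard SQL requires for a literal apostrophe.
    INSERT INTO taxonlabel (taxonlabel_id, study_id, taxonlabel)
      VALUES (98765, 1234, 'O''Brien''s unidentified isolate');

    -- Wrapping each batch in a transaction keeps a partial load from being
    -- committed if one statement fails.
    BEGIN;
    -- ... many INSERT statements ...
    COMMIT;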
From: Hilmar L. <hl...@du...> - 2009-04-23 21:14:04
On Apr 23, 2009, at 4:58 PM, Rutger Vos wrote:

>> It's not a good way to manage schema versions, though. I.e., there would be none, and accordingly, you then can't have software that uses the database directly rather than through Hibernate and still relies on anything, and you can't make any direct modifications to the schema.
>
> So how do we make the jump to a situation where we have a versioned schema? The way I imagined it would be to let hibernate generate a pg-compatible schema file, run it on a fresh pg instance, and stick the file in the repository.

That's a possibility. It would require that the webapp code actually runs w/o error to that point. Another possibility is to take the schema dump from DB2 and convert from there. The schema dump should be easy to obtain, so that's where I would start. If the result looks like a lot of work, I would then try the hibernate route.

> Now we've made the jump. Then - what happens if we need to alter one table (e.g. we need to make the "abstract" field of "article" longer)? We run an "alter table" command and paste the command at the bottom of the schema file? Or is this a point where we change the original "create" command for the "article" table and rerun the whole script (which might involve dropping and reloading the whole database)?

Yes and no. Yes, you do change the original CREATE command (so that your master script for instantiating the database from scratch stays up-to-date). To migrate existing database instances, you write a script that applies all necessary changes from the previous release to the next, without dropping or losing data. There are examples of what this looks like in BioSQL:

http://tinyurl.com/cxwz6f
http://tinyurl.com/csx2mj

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu :
===========================================================
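A minimal sketch of that two-part convention, using Rutger's "abstract" example (this is not taken from the BioSQL scripts; the column width and the one-row version table are invented for illustration):

    -- migrate-v1.0-to-v1.1.sql: applied to existing database instances only.
    BEGIN;
    ALTER TABLE article ALTER COLUMN abstract TYPE varchar(5000);
    UPDATE schema_version SET version = '1.1';  -- assumes a one-row version table
    COMMIT;

    -- Meanwhile, in the master schema file the original definition is edited,
    -- so a fresh instance is created correctly from scratch:
    -- CREATE TABLE article (
    --     ...,
    --     abstract varchar(5000),
    --     ...
    -- );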
From: Rutger V. <rut...@gm...> - 2009-04-23 20:58:21
> It's not a good way to manage schema versions, though. I.e., there would be none, and accordingly, you then can't have software that uses the database directly rather than through Hibernate and still relies on anything, and you can't make any direct modifications to the schema.

So how do we make the jump to a situation where we have a versioned schema? The way I imagined it would be to let hibernate generate a pg-compatible schema file, run it on a fresh pg instance, and stick the file in the repository.

Now we've made the jump. Then - what happens if we need to alter one table (e.g. we need to make the "abstract" field of "article" longer)? We run an "alter table" command and paste the command at the bottom of the schema file? Or is this a point where we change the original "create" command for the "article" table and rerun the whole script (which might involve dropping and reloading the whole database)?
From: Mark D. <mjd...@ge...> - 2009-04-23 19:45:44
Hilmar Lapp wrote:
>> I understand there is probably some way to dump the TB2-format data as it currently exists at SDSC, transfer the dump files to NESCent, and bulk-load them into the database on the NESCent side.
>
> In theory yes, but in practice each RDBMS has its own dump format. Ideally we can get DB2 to dump the data as SQL standard-compliant INSERT statements, but I don't know DB2 enough yet to know whether it does that, and aside from that there's more than the data itself, such as the sequence(s), grants, etc. that may not dump in a format that's readily ingestible by Pg.

It appears that DB2 will dump the data in only one useful format, called IXF. Do you know if Pg can import that? If not, we may have to do something.

Also, although most of the database is pretty small, there is one table with around 3e8 rows, and it may not be practical to export this to a file or to transport the file.
From: Mark D. <mjd...@ge...> - 2009-04-23 18:44:36
Hilmar Lapp wrote:
> On Apr 23, 2009, at 12:39 PM, Mark Dominus wrote:
>> This we do have.
>
> The semantic data integrity tests or the confidence? :) I'm assuming you meant the former,

Yes.
From: Hilmar L. <hl...@du...> - 2009-04-23 18:04:28
On Apr 23, 2009, at 12:39 PM, Mark Dominus wrote:

>> and semantic data integrity tests (i.e., tests for integrity that go beyond the constraints that the database enforces) so that we can have confidence in the migration result.
>
> This we do have.

The semantic data integrity tests or the confidence? :) I'm assuming you meant the former, as we haven't migrated to Pg yet. Or were you saying you have confidence in the TB1->TB2 migration?

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu :
===========================================================
From: Mark D. <mjd...@ge...> - 2009-04-23 17:07:50
Hilmar Lapp wrote:
> Can you not create a separate (and possibly temporary) table that records the entity name, the primary key, and whatever mark you want to place on the record identified by the key and the entity name?

That's a better idea, thanks.
From: Mark D. <mjd...@ge...> - 2009-04-23 16:40:21
Hilmar Lapp wrote:
>>> 1d. unit testing of TB2 code
>>
>> The current codebase is severely lacking in unit tests. What tests there are are often extremely slow and are more properly system tests than unit tests.
>
> I think it's going to be *very* important to have unit tests,

I agree completely. I was just warning you what to expect.

> and semantic data integrity tests (i.e., tests for integrity that go beyond the constraints that the database enforces) so that we can have confidence in the migration result.

This we do have.
From: Mark D. <mjd...@ge...> - 2009-04-23 16:38:33
Hilmar Lapp wrote:
> For software projects such as TreeBASE it is best practice to manage versioning of the schema as you manage versioning of the software, and to change a schema by writing and applying a migration script.

Okay, that makes sense to me. Thanks.

> The deliverable that we want from this is a fully scripted process that takes the dumps from DB2, does its thing, and at the end the data is fully imported into the NESCent PostgreSQL instance. The reason is that we want this to be repeatable. I.e., we will test first, fix, rerun, etc, until we can switch over in a well-coordinated fashion that involves only minimal downtime.

Very good. What do you think I should do first?

>> [...] I would not like to perform the TB1->TB2 migration a second time, if at all possible.
>
> That would be a bad idea indeed once data have been added to TB2.

I think we were hoping to move from SDSC to NESCent before TB2 opened for business. That is, we have an installation at SDSC now which is suitable for beta-testing, but I think we planned to discard all data uploaded by beta-testers before moving the data to NESCent.
From: Hilmar L. <hl...@du...> - 2009-04-23 15:34:51
On Apr 22, 2009, at 4:45 PM, Mark Dominus wrote:

> [...] Hibernate will generate the database schema definition and create the tables if we ask it to

I know, and in fact that's one possible way to migrate the schema from DB2 to PostgreSQL. It's not a good way to manage schema versions, though. I.e., there would be none, and accordingly, you then can't have software that uses the database directly rather than through Hibernate and still relies on anything, and you can't make any direct modifications to the schema. For software projects such as TreeBASE it is best practice to manage versioning of the schema as you manage versioning of the software, and to change a schema by writing and applying a migration script.

> [...] I understand there is probably some way to dump the TB2-format data as it currently exists at SDSC, transfer the dump files to NESCent, and bulk-load them into the database on the NESCent side.

In theory yes, but in practice each RDBMS has its own dump format. Ideally we can get DB2 to dump the data as SQL standard-compliant INSERT statements, but I don't know DB2 enough yet to know whether it does that, and aside from that there's more than the data itself, such as the sequence(s), grants, etc. that may not dump in a format that's readily ingestible by Pg.

Also, the deliverable that we want from this is a fully scripted process that takes the dumps from DB2, does its thing, and at the end the data is fully imported into the NESCent PostgreSQL instance. The reason is that we want this to be repeatable. I.e., we will test first, fix, rerun, etc, until we can switch over in a well-coordinated fashion that involves only minimal downtime.

> [...] I would not like to perform the TB1->TB2 migration a second time, if at all possible.

That would be a bad idea indeed once data have been added to TB2.

>> 1d. unit testing of TB2 code
>
> The current codebase is severely lacking in unit tests. What tests there are are often extremely slow and are more properly system tests than unit tests.

I think it's going to be *very* important to have unit tests, and semantic data integrity tests (i.e., tests for integrity that go beyond the constraints that the database enforces), so that we can have confidence in the migration result. Otherwise we'll be sort of betting.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu :
===========================================================
From: Hilmar L. <hl...@du...> - 2009-04-23 14:56:07
I agree you want to store that in the database, but I'm not sure why, in order to do that, you have to clutter the data model with properties (columns) that are solely needed for data migration rather than for genuinely representing the data.

Can you not create a separate (and possibly temporary) table that records the entity name, the primary key, and whatever mark you want to place on the record identified by the key and the entity name? Something along those lines would probably be what I would try.

I think it's useful to adopt the same separation of concerns that is commonly accepted for managing versus using a computer system. Pretend you can't modify the data model and that only a DBA has the privilege to do so, and the DBA's knee-jerk reaction to everything that introduces change, subsequent maintenance, idiosyncrasy, and risk, is 'no'. Would that mean that you then can't do what you want to do? Probably not; you would just do it in a non-invasive way. Which, I would argue, is in most cases the better way.

-hilmar

On Apr 22, 2009, at 4:57 PM, Mark Dominus wrote:

> During the development of TB2 there have been several times when I wished that each database table had an extra scratch column.
>
> While checking and correcting the migration from TB1, I often found myself maintaining and merging lists of object ID numbers. For example, I might have a list of objects which needed to be checked, another list of objects which were found to be erroneous, and so on. Rather than maintaining a bunch of text files with object ID numbers, it would have been much more convenient to be able to store this information in the database itself.
>
> Or another example: I have written a garbage-collection utility that identifies and removes inaccessible objects from the database. The easiest way to do this is with a mark-and-sweep algorithm: first mark all the known reachable objects, then all the objects reachable from those, and so on; then make a second pass and delete all the unmarked objects. But to implement this requires storing the mark somewhere. At present, the utility program keeps an external record of which objects have been marked. Storing the marks in the database would be much simpler.
>
> It would require a small change to the code to add an additional field to every persistent object table to enable this sort of use. The obvious time to do it would be at the time the tables are created at NESCent.
>
> The obvious drawback is that it takes up storage. (There are 3e08 matrix element objects.)
>
> Is this a bad idea for some non-obvious reason?

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu :
===========================================================
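A minimal sketch of the side-table approach Hilmar describes, with made-up table and column names (PHYLOTREENODE is used only as an example, and the join that determines reachability would depend on the real schema):

    -- Scratch table kept outside the data model; can be dropped when done.
    CREATE TABLE gc_mark (
        entity_name varchar(64) NOT NULL,   -- e.g. 'PHYLOTREENODE'
        record_id   bigint      NOT NULL,   -- primary key of the marked row
        PRIMARY KEY (entity_name, record_id)
    );

    -- Mark phase: record reachable rows (reachability join only hinted at).
    INSERT INTO gc_mark (entity_name, record_id)
    SELECT 'PHYLOTREENODE', n.phylotreenode_id
      FROM phylotreenode n
      JOIN phylotree t ON n.phylotree_id = t.phylotree_id;

    -- Sweep phase: delete whatever was never marked.
    DELETE FROM phylotreenode
     WHERE phylotreenode_id NOT IN (
           SELECT record_id FROM gc_mark WHERE entity_name = 'PHYLOTREENODE');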
From: Mark D. <mjd...@ge...> - 2009-04-23 03:12:17
From: Mark D. <mj...@ge...> - 2009-04-22 20:58:01
During the development of TB2 there have been several times when I wished that each database table had an extra scratch column.

While checking and correcting the migration from TB1, I often found myself maintaining and merging lists of object ID numbers. For example, I might have a list of objects which needed to be checked, another list of objects which were found to be erroneous, and so on. Rather than maintaining a bunch of text files with object ID numbers, it would have been much more convenient to be able to store this information in the database itself.

Or another example: I have written a garbage-collection utility that identifies and removes inaccessible objects from the database. The easiest way to do this is with a mark-and-sweep algorithm: first mark all the known reachable objects, then all the objects reachable from those, and so on; then make a second pass and delete all the unmarked objects. But to implement this requires storing the mark somewhere. At present, the utility program keeps an external record of which objects have been marked. Storing the marks in the database would be much simpler.

It would require a small change to the code to add an additional field to every persistent object table to enable this sort of use. The obvious time to do it would be at the time the tables are created at NESCent.

The obvious drawback is that it takes up storage. (There are 3e08 matrix element objects.)

Is this a bad idea for some non-obvious reason?
From: Mark D. <mj...@ge...> - 2009-04-22 20:57:52
Hilmar Lapp wrote:
> 1b. database schema definition
> 4. Schema migration

Hibernate will generate the database schema definition and create the tables if we ask it to; we have done this a couple of times at SDSC. Or did I misunderstand this point?

> 1c. data migration from TB1 and testing of result
> 5. Data migration script

Right now the TB1 data is nearly all installed in the new TB2 database at SDSC. Since the SDSC and NESCent schemas will be identical, or nearly so, the data migration from SDSC to NESCent should be straightforward.

I understand there is probably some way to dump the TB2-format data as it currently exists at SDSC, transfer the dump files to NESCent, and bulk-load them into the database on the NESCent side. I would not like to perform the TB1->TB2 migration a second time, if at all possible.

> 1d. unit testing of TB2 code

The current codebase is severely lacking in unit tests. What tests there are are often extremely slow and are more properly system tests than unit tests.