From: Vladimir G. <vla...@du...> - 2010-03-09 22:33:39
On Mar 5, 2010, at 2:14 PM, Vladimir Gapeyev wrote:
> To keep track of any possible data cleaning activity, Hilmar was
> suggesting to set up a directory in SVN -- I will soon follow up with
> a proposal. Besides this, I will now turn to documenting and
> depositing migration scripts, etc., until I hear what else needs to
> be done to finalize the migration.

I have now set up a place to keep track of any data cleaning tasks. It is in treebase-core/db/cleaning. I have already added a task there, with a script from Bill that was applied during the 1st batch of migration last week.

--Vladimir

From: Hilmar L. <hl...@ne...> - 2010-03-09 18:21:59
On Mar 9, 2010, at 12:26 PM, Vladimir Gapeyev wrote:
> If these are minor in the large scheme of things, that's ok with me.

Some of these are minor, and some of these don't seem different to me from the data maintenance tasks that I expect to keep coming up, so we're not ridding ourselves of the need to run such things with due diligence (full use of transactions, prior testing on a separate instance, etc.) by avoiding this one.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
===========================================================

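Hilmar's due-diligence point maps onto a simple SQL pattern: rehearse the cleaning statement inside a transaction, inspect the effect, and only COMMIT on production once the rehearsal on treebase-stage looks right. A minimal sketch follows -- the study table, its name column, and the cleaning task itself are purely illustrative, not an actual TreeBASE task:

  BEGIN;

  -- Hypothetical cleaning task: strip stray whitespace from study names.
  UPDATE study SET name = trim(name) WHERE name <> trim(name);

  -- Sanity check: this should now return 0.
  SELECT count(*) AS still_untrimmed FROM study WHERE name <> trim(name);

  -- On treebase-stage, end the rehearsal without persisting anything;
  -- on production, replace this with COMMIT once the check passes.
  ROLLBACK;
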
From: Vladimir G. <vla...@du...> - 2010-03-09 17:43:56
On Mar 9, 2010, at 12:05 PM, William Piel wrote:
> I thought this was the "normal" way that the migration proceeds --
> i.e. with the following 3 stages:
>
> 1. parse and upload the dump.txt file and associated trees and
> matrices. TB1 has all citations in one line -- this is crammed in
> the title field, although full names and email addresses of authors
> are stored in separate tables but without info on author order.
>
> 2. replace the existing taxon_variant and taxon tables with the
> latest TI dump.
>
> 3. update the citation information with the latest Endnote file.
> Here, author names are abbreviated (per Endnote conventions) but
> author order is known.
>
> Is this the same basic order of tasks which you used for the Dec09
> migration?

That's correct.

> The only difference here is that we can go live before task 3 is
> performed. And note that we could task our undergrads with editing
> the citation info directly with the TreeBASE2 interface instead of
> first editing an Endnote file. i.e., we actually do away with step 3
> as it stands. (Although seeing as our metadata student help (in
> Endnote) has improved metadata for all TreeBASE studies
> considerably, we'll probably want to run a citation update script at
> some point anyway).

See the other reply.

> Do we need to think about how we will run update scripts and data
> cleansing scripts in future?

I presume we should.

> I'm not sure what's the point of "stage" other than the ability to
> test new builds against a (slightly older) version of the production
> data. For update and data cleansing scripts, we will need to apply
> them directly to production (after first triggering a pg_dump, of
> course).

That's the plan for post-release life. At the moment, we use the two instances as two alternative containers for the main data set: one is read-only (conceptually) and the other is the working copy, with these roles flipping back and forth depending on whether a migration is in progress.

--VG

From: Vladimir G. <vla...@du...> - 2010-03-09 17:26:34
On Mar 9, 2010, at 11:24 AM, Hilmar Lapp wrote:
> On Mar 9, 2010, at 10:21 AM, Vladimir Gapeyev wrote:
>> (4) Some bibliography entries in the release will have all info
>> crammed into the title field. This will be fixed after the release.
>>
>> I am worried about the logistics of (4), and would prefer to have it
>> done prior to the release.
>
> What are your main worries? Also, assuming that this would need to
> be done manually (would it?), do we have an estimate of how long it
> would take? If we do this after the release, what are the risks, or
> would the effort increase significantly? (E.g., might this work
> cause clashes with concurrent submissions?)

The operation is scripted, but it will have to be done on the live production instance, so the worries come from that: malfunction or downtime of the production site if anything goes wrong with this operation. Also, is it known that the web front end operates reasonably when all biblio info is in the title field and the other fields are null?

If these are minor in the large scheme of things, that's ok with me.

--VG

From: William P. <wil...@ya...> - 2010-03-09 17:05:50
On Mar 9, 2010, at 10:21 AM, Vladimir Gapeyev wrote:
> (4) Some bibliography entries in the release will have all info
> crammed into the title field. This will be fixed after the release.
>
> I am worried about the logistics of (4), and would prefer to have it
> done prior to the release.

I thought this was the "normal" way that the migration proceeds -- i.e. with the following 3 stages:

1. Parse and upload the dump.txt file and associated trees and matrices. TB1 has all citations in one line -- this is crammed into the title field, although full names and email addresses of authors are stored in separate tables, but without info on author order.

2. Replace the existing taxon_variant and taxon tables with the latest TI dump.

3. Update the citation information with the latest Endnote file. Here, author names are abbreviated (per Endnote conventions) but author order is known.

Is this the same basic order of tasks which you used for the Dec09 migration?

The only difference here is that we can go live before task 3 is performed. And note that we could task our undergrads with editing the citation info directly with the TreeBASE2 interface instead of first editing an Endnote file -- i.e., we actually do away with step 3 as it stands. (Although, seeing as our metadata student help (in Endnote) has improved metadata for all TreeBASE studies considerably, we'll probably want to run a citation update script at some point anyway.)

Do we need to think about how we will run update scripts and data cleansing scripts in future?

I'm not sure what's the point of "stage" other than the ability to test new builds against a (slightly older) version of the production data. For update and data cleansing scripts, we will need to apply them directly to production (after first triggering a pg_dump, of course).

bp

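For stage 3 (or for handing the work to undergrads), the entries still carrying a one-line TB1 citation could presumably be listed with a query along these lines; the citation table and its column names here are guesses for illustration, not the actual TB2 schema:

  -- Assumed schema: per-field bibliographic columns that stage 1 leaves
  -- NULL while cramming the whole TB1 one-liner into title.
  SELECT citation_id, title
    FROM citation
   WHERE title IS NOT NULL
     AND journal IS NULL
     AND author_list IS NULL
   ORDER BY citation_id;
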
From: Hilmar L. <hl...@ne...> - 2010-03-09 16:24:20
On Mar 9, 2010, at 10:21 AM, Vladimir Gapeyev wrote:
> (4) Some bibliography entries in the release will have all info
> crammed into the title field. This will be fixed after the release.
>
> I am worried about the logistics of (4), and would prefer to have it
> done prior to the release.

What are your main worries? Also, assuming that this would need to be done manually (would it?), do we have an estimate of how long it would take? If we do this after the release, what are the risks, or would the effort increase significantly? (E.g., might this work cause clashes with concurrent submissions?)

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
===========================================================

From: William P. <wil...@ya...> - 2010-03-09 15:23:31
I just calculated that since early January, TB1 has acquired 6,603 distinct taxon labels that are new to the database. I'm working on mapping them now.

bp

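A back-of-the-envelope version of that calculation would be a set difference between the current labels and those present at the early-January snapshot. The table and column names below are hypothetical, since the TB1 schema isn't shown in this thread:

  -- Hypothetical TB1-side tables: taxonlabel (current) and
  -- taxonlabel_jan (a snapshot of labels as of early January).
  SELECT count(DISTINCT t.label) AS new_labels
    FROM taxonlabel t
   WHERE NOT EXISTS (SELECT 1
                       FROM taxonlabel_jan j
                      WHERE j.label = t.label);
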
From: Vladimir G. <vla...@du...> - 2010-03-09 15:22:07
On Mar 8, 2010, at 9:28 PM, William Piel wrote:
> At any rate, we have the last migration dump here:
>
> http://treebase.peabody.yale.edu/treebase/migration/Mar-10/
>
> The citations and TI are not there yet, but Vladimir can start
> migrating the trees, characters, and the dump file. Meanwhile I'll
> be working on the TI mapping -- that's critical for release. For
> now, citations can be stored in the title field as one-liners (this
> is not critical for release, so I'll be working on them after the TI
> is done).

Ok, I take it as this:

(1) The post-migration instance at treebase-stage is currently being used for informational purposes only: no data changes that need to be preserved will be done to this instance.

(2) I should now start loading the final data batch into the production instance.

(3) After the final load is done to production, we will again replicate it to staging, where we will do manual pre-release data cleaning.

(4) Some bibliography entries in the release will have all info crammed into the title field. This will be fixed after the release.

I am worried about the logistics of (4), and would prefer to have it done prior to the release.

(As a reminder, the hassle with copying back and forth between the production and staging instances is because the production server is significantly faster, and loading of matrices and trees takes awhile.)

From: William P. <wil...@ya...> - 2010-03-09 02:28:57
On Mar 8, 2010, at 6:37 PM, Hilmar Lapp wrote:
> I've sent it to you, Vladimir, and Youjun. You actually responded to
> it. It's below FYI.

Ah -- yes, I see. I thought you were referring to a more recent communication. Since the migration was not "going smoothly" I ended up waiting for more hopeful signs.

At any rate, we have the last migration dump here:

http://treebase.peabody.yale.edu/treebase/migration/Mar-10/

The citations and TI are not there yet, but Vladimir can start migrating the trees, characters, and the dump file. Meanwhile I'll be working on the TI mapping -- that's critical for release. For now, citations can be stored in the title field as one-liners (this is not critical for release, so I'll be working on them after the TI is done).

It's the last sprint-to-the-finish!

bp

From: William P. <wil...@ya...> - 2010-03-08 05:40:08
Hi all, I'm catching up on things, having been away this weekend.

On Mar 5, 2010, at 2:14 PM, Vladimir Gapeyev wrote:
> A copy of the migrated data is now in the treebasestage DB instance.

Fabulous!

On Mar 6, 2010, at 5:06 PM, Hilmar Lapp wrote:
> Hi Bill,
>
> as I emailed you earlier, the release plan put the date for shutting
> down TB1 on last Wed, followed by creating the dumps for the final
> migration increment. Vladimir tells me that he does not have those
> dumps. Can you update me on where that stands?
>
> -hilmar

I've searched everywhere, but I can't find that email. Did you send it to tre...@li... ?

At any rate, with this 2009 migration done, I'll be shutting down new submissions to TB1 on Monday. I have a number of ready submissions still to process, but then it's the final dump.

Meanwhile, is there a system or strategy for testing the 2009 migrated data -- or do we just manually pick out some studies at random to test? While TB1 submissions are shut down, the final handful of level-9 bugs can be addressed.

bp

From: Vladimir G. <vla...@du...> - 2010-03-05 19:22:46
On Mar 5, 2010, at 10:29 AM, William Piel wrote:
> On Mar 5, 2010, at 10:22 AM, youjun guo wrote:
>> Dear All,
>>
>> A new treebase unit test failure appeared recently. It is due to
>> "submissions do not have a related submitter".
>>
>> The following query result from treebase-db may give you a better
>> idea:
>>
>> select count(*) from submission where user_id is null : 2044
>>
>> Youjun
>
> I think the renewal of data for stage/production has brought back
> this user_id IS NULL problem. After we're done with migrations, we
> can reassign all TB1 studies to a new user_id.

This must have been caused by the treebase-dev restore from an older backup earlier this week. I'd suggest ignoring this test on treebase-dev for now; resolving the problem on treebase-stage, as part of other data cleaning; and then replicating the data to treebase-dev and trying the tests again.

--VG

From: Vladimir G. <vga...@ne...> - 2010-03-05 19:14:18
A copy of the migrated data is now in the treebasestage DB instance. The http://treebase-stage.nescent.org front end should be up shortly.

We have "named" backups for the pre- and post-migration copies. Also, the treebase-dev instance is reasonably close to the pre-migration copy, as it was restored earlier this week from a production backup.

I suggest that the staging DB instance be used for any necessary inspecting and cleaning -- all interested parties should have suitable access rights. We'll then migrate the Dec-March delta into this modified instance (or to its copy on production). We should probably discuss (recall?) what the next steps are.

To keep track of any possible data cleaning activity, Hilmar was suggesting to set up a directory in SVN -- I will soon follow up with a proposal. Besides this, I will now turn to documenting and depositing migration scripts, etc., until I hear what else needs to be done to finalize the migration.

I am aware of a few issues with the post-migration data that may warrant further attention:

(1) Matrix loading broke a couple of times, running out of heap space (there is probably a memory leak). This left the matrices on which it broke in an inconsistent state. I scraped out all (I hope) data associated with these matrices and loaded them again. Still, this was a deviation from the straight path. The matrices were:

  Filename    matrix_id (scraped)   matrix_id (reloaded)
  M4374.nex   4678                  4718
  M4622.nex   4210                  4717

(2) The citations.txt contained 4 "Book Section" entries, which broke the citation import tool. Since these were only a handful, instead of tracking down the problem I decided to skip these entries (I commented them out). Hopefully, it will be possible to enter them by hand later. FYI, the 4 entries are appended below.

(3) As we previously noted with Bill, it appeared that the migration process picked up a few more matrix and tree files than Bill expected were new ones in the delta. Here are the orphaned trees and matrices that I could detect after the migration.

Compared to the list of matrices with null study_id in treebase-dev (which is almost identical to the pre-migration copy), the post-migration copy has 4 more such matrices:

  matrix_id | nexusfilename
 -----------+---------------
       4227 | M4864.nex
       4280 | M4470.nex
       4456 | M4886.nex
       4528 | M4863.nex

Trees that belong to the fake study (#2264) used by the migration process for initial uploading (and do not belong to this study in treebase-dev):

  phylotree_id | nexusfilename
 --------------+-----------------
          6074 | S1934A11000.tre
          6075 | S1934A11000.tre
          6176 | S1815A10024.tre
          6358 | S1934A11001.tre
          6433 | S1319A11058.tre
          6521 | S1319A11057.tre

I should mention that there are about 26 more matrices and 52 more trees with a null study_id, coming from the pre-migration instance.

--Vladimir

=== These citations were skipped during the migration ====

*Book Section
Columbus, J. T.; Peterson, P. M.; Refulio Rodriguez, N. F.; Cerros Tlatilpa, R.; Kinney, M. S. 2009. Phylogenetics of Muhlenbergiinae (Poaceae, Chloridoideae, Cynodonteae) based on ITS and trnL-F DNA sequences. In: Seberg, O.; Petersen, G.; Barfod, A. S.; Davis, J. I. (eds.), Proceedings of the Fourth International Conference on the Comparative Biology of the Monocotyledons and The Fifth International Symposium on Grass Systematics and Evolution. In press.
Muhlenbergiinae are a subtribe in the grass (Poaceae) subfamily Chloridoideae, tribe Cynodonteae. The morphologically diverse group includes ten genera and 168 species and is restricted almost entirely to the New World, with a center of diversity in Mexico (125 species). With 147 species, Muhlenbergia is by far the largest genus, and is divided into two subgenera, Muhlenbergia and Trichochloa, the latter with two sections. The other, much smaller genera are Aegopogon (4 species), Bealia (1), Blepharoneuron (2), Chaboissaea (4), Lycurus (3), Pereilema (4), Redfieldia (1), Schaffnerella (1), and Schedonnardus (1). We conducted a phylogenetic study of Muhlenbergiinae based on parsimony analysis of DNA sequences of the nuclear ribosomal internal transcribed spacer region (ITS1 + 5.8S + ITS2) and chloroplast trnL intron, trnL 3′ exon, and trnL-trnF intergenic spacer. All genera were sampled, including 52 species of Muhlenbergia representing both subgenera and sections. Muhlenbergia and Pereilema are not monophyletic in the resulting trees. The species of Pereilema and the other small genera are nested within Muhlenbergia in three main lineages. One of the lineages includes a monophyletic Muhlenbergia subgen. Trichochloa. Another lineage comprises species having leaf anatomy predictive of the PCK subtype of C4 photosynthesis. Based on the results of this study, we favor expanding the circumscription of Muhlenbergia to include the other nine genera of the subtribe. S2438

*Book Section
Duvall, M. R.; Leseberg, C. H.; Grennan, C. P.; Morris, L. M. 2009. Molecular evolution and phylogenetics of complete chloroplast genomes in Poaceae. In: Davis, J. (ed.), Fifth International Symposium on Grass Systematics and Evolution. In press.
Phylogenetic issues in Poaceae not resolved by previous multi-gene analyses can be usefully investigated by small genome-scale analyses. In this pilot study, complete or nearly complete chloroplast genomes (plastomes) were sequenced from six selected graminoids. Representatives of Anomochlooideae, Puelioideae, Bambusoideae, both major tribes of Panicoideae, and Joinvilleaceae were newly sampled to supplement previously published plastome data from Ehrhartoideae, Pooideae, and Andropogoneae. For amplification and sequencing, over 200 pairs of primers were designed in conserved regions of published grass plastomes that were positioned to flank overlapping 1200-base-pair fragments around the entire plastome. As expected, gene order and number were highly conserved. Concurrent with the high conservation of the plastome was considerable cumulative variation useful for studies within the family and even within a single tribe. Readily interpreted mutational patterns were observed, such as small inversions of the loop in hairpin-loop regions and indels resulting from slipped-strand mispairings. Phylogenetic analyses were conducted on these and eight previously published plastomes. Maximum or near-maximum support was observed in all likelihood and parsimony bootstrap analyses, including shallow nodes, such as those within a clade corresponding to a complex of four Andropogoneae, and deep nodes, such as the one uniting the bambusoid/ehrhartoid/pooid (BEP) clade. S2429

*Book Section
Prince, L. 2009. Phylogenetic relationships and species delimitation in Canna (Cannaceae). In: Seberg, O.; Petersen, G.; Barfod, A. S.; Davis, J. (eds.), Proceedings of the Fourth International Conference on the Comparative Biology of the Monocotyledons and The Fifth International Symposium on Grass Systematics and Evolution.
Canna lilies are a conspicuous component of the tropical and subtropical humid Neotropics, where they are native, and the Asian Paleotropics, where they have been introduced. Cannas have been cultivated as a food item (rhizome), for wrapping (leaves), and as beads (seeds) for millennia by indigenous people. In both tropical and temperate regions they have a long history as ornamental plants as well. With only a few dozen taxa in a single genus, Cannaceae has much lower generic and species diversity than its sister family, Marantaceae (550 species in 31 genera). Parsimony and Bayesian analyses of nuclear ribosomal internal transcribed spacer (ITS) and chloroplast non-coding sequence data (trnE-T intergenic spacer and rpL16 intron) were used to infer evolutionary relationships among species. Potential causes of non-monophyly of nuclear ITS haplotypes and conflict between nuclear and plastid phylogenies for some samples are discussed. Chloroplast (rbcL, ndhF) DNA data indicate a North American taxon, Canna flaccida, is sister to all other species in the genus. Phylogenetic analyses are consistent with the hypothesis of a South American origin for the genus, followed by dispersal and migration to North and Central America, and the Caribbean. S2373

*Book Section
Roncal, J.; Borchsenius, F.; Asmussen-Lange, C. B.; Balslev, H. 2009. Divergence times in tribe Geonomateae (Arecaceae) coincide with Tertiary geological events. In: The Comparative Biology of the Monocotyledons: Proceedings of the Fourth International Conference. Aarhus University Press. In press.
The Geonomateae is a species-rich palm tribe restricted to the Neotropics, with a concentration of species in western Colombia and adjacent Central America, with extensions along the Andes. We estimated divergence times for the Geonomateae based on a phylogeny resulting from analysis of two low-copy nuclear DNA genes and using a Bayesian relaxed molecular clock method. We obtained calibration points from the fossil record and previous dated phylogenies in the Arecaceae. The results indicated a diversification of the tribe during the Oligocene, at around 31 million years ago. The divergence time of a high-elevation Geonoma clade, from 3.8 to 9.2 million years ago, coincided with the Andean uplift. A clade of Geonoma species from the Brazilian Shield was contemporary with Miocene marine incursions in South America. The most likely scenario to explain the arrival of the Calyptronoma-Calyptrogyne ancestor in the Greater Antilles is a migration through a dry-land connection between Central or South America prior to the formation of the Panamanian isthmus and after 27.8 million years ago. The molecular dating results were consistent with the growing evidence of a Tertiary diversification for most Neotropical biota and contradicted the Pleistocene refugia theory. S2295

From: William P. <wil...@ya...> - 2010-03-05 15:29:52
On Mar 5, 2010, at 10:22 AM, youjun guo wrote:
> Dear All,
>
> A new treebase unit test failure appeared recently. It is due to
> "submissions do not have a related submitter".
>
> The following query result from treebase-db may give you a better
> idea:
>
> select count(*) from submission where user_id is null : 2044
>
> Youjun

I think the renewal of data for stage/production has brought back this user_id IS NULL problem. After we're done with migrations, we can reassign all TB1 studies to a new user_id.

bp

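Once a designated legacy account exists, the reassignment Bill describes is a one-statement job. In the sketch below, the id 9999 is a placeholder for whatever user record gets created, and Youjun's query is reused as the post-condition:

  BEGIN;

  -- Attach all orphaned TB1 submissions to the designated legacy user.
  UPDATE submission SET user_id = 9999 WHERE user_id IS NULL;

  -- Youjun's check, repeated as a post-condition: expect 0 instead of 2044.
  SELECT count(*) FROM submission WHERE user_id IS NULL;

  COMMIT;
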
From: youjun g. <you...@ya...> - 2010-03-05 15:25:56
Please ignore this one if you got my previous message from gmail.

Dear All,

A new treebase unit test failure appeared recently. It is due to "submissions do not have a related submitter".

The following query result from treebase-db may give you a better idea:

select count(*) from submission where user_id is null : 2044

Youjun

From: William P. <wil...@ya...> - 2010-03-04 19:04:44
On Mar 4, 2010, at 12:50 PM, Rutger Vos wrote:
> I don't think I would have expected T or N matrices: I've certainly
> never seen a continuous matrix or a distance matrix. But obviously
> very many (probably the majority) should be Q matrices.

Yeah, we don't have code for dealing with T. We do have code for dealing with N, but none of the TB1 data has N. New studies, however, should be able to submit N-type matrices.

We have plenty of nucleotide or amino acid matrices, but I think they are all treated as S. This is because I think the real distinction is that S is where each scoring is in its own matrix-element record, while Q is where rows are concatenated into long strings and stored in text fields. We reserved Q as the solution in the event that our software could not perform well enough to store large DNA matrices as S type (storing a long string as text is obviously more efficient). The downside of Q is that you're limited to 26 + 10 character states (unless we invented a special type of column delimiter), so our first effort was to try to get all discrete data into S. S is more cleanly normalized, but takes up a lot more memory.

bp

From: Rutger V. <rut...@gm...> - 2010-03-04 17:50:41
I don't think I would have expected T or N matrices: I've certainly never seen a continuous matrix or a distance matrix. But obviously very many (probably the majority) should be Q matrices.

On Thu, Mar 4, 2010 at 4:13 PM, Vladimir Gapeyev <vla...@du...> wrote:
> On Mar 3, 2010, at 11:21 PM, William Piel wrote:
>> I'm confused by the meaning of standard vs character matrices. In
>> NEXUS vernacular, a standard matrix is one that uses discrete
>> characters where the symbols are arbitrary -- i.e. no assumptions
>> about the meaning of symbols (by contrast, a nucleotide matrix is
>> one where A, C, G, T, N, and IUPAC are the assumed symbols).
>>
>> For me, the most logical way to divide up types of matrices is the
>> following:
>>
>> 1. distance (= taxa x taxa) vs character (= taxa x character)
>> 2. and of the character ones: continuous (= floating point) vs
>> discrete (= integers)
>> 3. and of the discrete ones: standard (= arbitrary symbols) vs
>> nucleotide (= a, c, g, t + N and IUPAC) vs amino acid (= 20 letters
>> + X)
>>
>> ... but evidently that's not happening here. I don't understand the
>> distinction between character matrix and standard matrix -- sounds
>> like they are synonymous.
>>
>> I agree with you that it certainly sounds like SetMatrixNChar
>> functions to count the number of characters in a matrix. An UPDATE
>> SQL statement should be able to do this too -- just a matter of
>> counting the number of matrixcolumn records for each matrix_id. But
>> that's just a guess -- I would think that nchar would be passed to
>> the database from headless Mesquite anyway...
>
> Regardless of the original issue with running the SetMatrixNChar
> step (which I do put aside for now), this sounds to me like a
> possible issue with representing matrices in the database. I am not
> going to probe into this further now, but here is what I see, which
> makes me feel odd.
>
> There is a class hierarchy of matrix objects in Java, all of which
> are persisted in the same table Matrix, whose field matrixtype is
> used to discriminate between the classes. Here are the matrixtype's
> discriminator values and the corresponding class hierarchy:
>
> Matrix (abstract)
>     'T' DistanceMatrix
>     'C' CharacterMatrix (abstract)
>         'N' ContinuousMatrix
>         'D' DiscreteMatrix
>         'Q' SequenceMatrix
>         'S' StandardMatrix
>
> Despite this elaborateness, all matrices in the database are marked
> with 'S' only, both in the pre-migration instance and after I
> uploaded new matrices and study metadata. From what Bill says, I
> would have expected to see some matrices marked with T, N, and Q as
> well.
>
> --VG

--
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com

From: Vladimir G. <vla...@du...> - 2010-03-04 16:13:42
On Mar 3, 2010, at 11:21 PM, William Piel wrote:
> I'm confused by the meaning of standard vs character matrices. In
> NEXUS vernacular, a standard matrix is one that uses discrete
> characters where the symbols are arbitrary -- i.e. no assumptions
> about the meaning of symbols (by contrast, a nucleotide matrix is
> one where A, C, G, T, N, and IUPAC are the assumed symbols).
>
> For me, the most logical way to divide up types of matrices is the
> following:
>
> 1. distance (= taxa x taxa) vs character (= taxa x character)
> 2. and of the character ones: continuous (= floating point) vs
> discrete (= integers)
> 3. and of the discrete ones: standard (= arbitrary symbols) vs
> nucleotide (= a, c, g, t + N and IUPAC) vs amino acid (= 20 letters
> + X)
>
> ... but evidently that's not happening here. I don't understand the
> distinction between character matrix and standard matrix -- sounds
> like they are synonymous.
>
> I agree with you that it certainly sounds like SetMatrixNChar
> functions to count the number of characters in a matrix. An UPDATE
> SQL statement should be able to do this too -- just a matter of
> counting the number of matrixcolumn records for each matrix_id. But
> that's just a guess -- I would think that nchar would be passed to
> the database from headless Mesquite anyway...

Regardless of the original issue with running the SetMatrixNChar step (which I do put aside for now), this sounds to me like a possible issue with representing matrices in the database. I am not going to probe into this further now, but here is what I see, which makes me feel odd.

There is a class hierarchy of matrix objects in Java, all of which are persisted in the same table Matrix, whose field matrixtype is used to discriminate between the classes. Here are the matrixtype's discriminator values and the corresponding class hierarchy:

Matrix (abstract)
    'T' DistanceMatrix
    'C' CharacterMatrix (abstract)
        'N' ContinuousMatrix
        'D' DiscreteMatrix
        'Q' SequenceMatrix
        'S' StandardMatrix

Despite this elaborateness, all matrices in the database are marked with 'S' only, both in the pre-migration instance and after I uploaded new matrices and study metadata. From what Bill says, I would have expected to see some matrices marked with T, N, and Q as well.

--VG

From: William P. <wil...@ya...> - 2010-03-04 04:21:23
On Mar 3, 2010, at 10:56 PM, Vladimir Gapeyev wrote:
> I removed this columnless matrix. It turned out to be associated
> (via taxonlabels) with the notorious Submission 22, so it should be
> junk indeed.
>
> Unfortunately, this did not solve the problem with SetMatrixNChar.
> I'd need to dig deeper, but I have been stuck on this thing for too
> long already. So, I am going to move on with the remaining migration
> steps: studies metadata, taxon intelligence, and citations. I hope I
> am not wrong in my understanding that they are independent from
> whatever SetMatrixNChar is supposed to do.
>
> Meanwhile, if someone can explain, in terms of the data, what the
> function of SetMatrixNChar is expected to be, maybe it would be
> possible to reimplement it in SQL instead of searching for the data
> point on which it breaks? It appears the end result is just setting
> the fields nchar and ntax in Matrix. Should I just set them to the
> number of associated records in the MatrixColumn and MatrixRow
> tables, respectively?
>
> (Another thing actually confuses me: it appears that the resetting
> code should work only for character matrices, while all matrices
> present in the DB have matrixtype='S', which stands for "standard"
> matrix.)

I'm confused by the meaning of standard vs character matrices. In NEXUS vernacular, a standard matrix is one that uses discrete characters where the symbols are arbitrary -- i.e. no assumptions about the meaning of symbols (by contrast, a nucleotide matrix is one where A, C, G, T, N, and IUPAC are the assumed symbols).

For me, the most logical way to divide up types of matrices is the following:

1. distance (= taxa x taxa) vs character (= taxa x character)
2. and of the character ones: continuous (= floating point) vs discrete (= integers)
3. and of the discrete ones: standard (= arbitrary symbols) vs nucleotide (= a, c, g, t + N and IUPAC) vs amino acid (= 20 letters + X)

... but evidently that's not happening here. I don't understand the distinction between character matrix and standard matrix -- sounds like they are synonymous.

I agree with you that it certainly sounds like SetMatrixNChar functions to count the number of characters in a matrix. An UPDATE SQL statement should be able to do this too -- just a matter of counting the number of matrixcolumn records for each matrix_id. But that's just a guess -- I would think that nchar would be passed to the database from headless Mesquite anyway...

bp

From: Vladimir G. <vla...@du...> - 2010-03-04 03:56:46
I removed this columnless matrix. It turned out to be associated (via taxonlabels) with the notorious Submission 22, so it should be junk indeed.

Unfortunately, this did not solve the problem with SetMatrixNChar. I'd need to dig deeper, but I have been stuck on this thing for too long already. So, I am going to move on with the remaining migration steps: studies metadata, taxon intelligence, and citations. I hope I am not wrong in my understanding that they are independent from whatever SetMatrixNChar is supposed to do.

Meanwhile, if someone can explain, in terms of the data, what the function of SetMatrixNChar is expected to be, maybe it would be possible to reimplement it in SQL instead of searching for the data point on which it breaks? It appears the end result is just setting the fields nchar and ntax in Matrix. Should I just set them to the number of associated records in the MatrixColumn and MatrixRow tables, respectively?

(Another thing actually confuses me: it appears that the resetting code should work only for character matrices, while all matrices present in the DB have matrixtype='S', which stands for "standard" matrix.)

Thanks,
--VG

On Mar 3, 2010, at 6:29 PM, William Piel wrote:
> It's not impossible that there are matrices that will break or
> malfunction in the importation. But in this case, since the
> tb_matrixid is blank, I don't know what it is. I'm guessing this is
> junk that you can delete.
>
> bp
>
> On Mar 3, 2010, at 5:20 PM, Vladimir Gapeyev wrote:
>> Mark's data import instructions have a fix-up step, to be run after
>> the matrices upload. As I understand, it recomputes character
>> counts for matrices, which are computed incorrectly during the
>> upload. The step is performed by
>> org.cipres.treebase.util.SetMatrixNChar.
>>
>> This step broke during the migration, even though it was fine on my
>> testing data. Investigating the problem suggests it may be due to a
>> matrix that has no columns (= no associated records in the
>> matrixcolumn table).
>>
>>  matrix_id | tb_matrixid | nexusfilename        | title                     | study_id | taxonlabelset_id
>> -----------+-------------+----------------------+---------------------------+----------+------------------
>>       3463 |             | undone-non-stepm.nex | Untitled Character Matrix |          |             3596
>>
>> The file "undone-non-stepm.nex" is not in the DB.
>>
>> This matrix comes from the pre-migration instance.
>>
>> I presume it is junk that I can safely delete, which I'll do before
>> re-trying SetMatrixNChar.

From: William P. <wil...@ya...> - 2010-03-03 23:29:45
It's not impossible that there are matrices that will break or malfunction in the importation. But in this case, since the tb_matrixid is blank, I don't know what it is. I'm guessing this is junk that you can delete.

bp

On Mar 3, 2010, at 5:20 PM, Vladimir Gapeyev wrote:
> Mark's data import instructions have a fix-up step, to be run after
> the matrices upload. As I understand, it recomputes character counts
> for matrices, which are computed incorrectly during the upload. The
> step is performed by org.cipres.treebase.util.SetMatrixNChar.
>
> This step broke during the migration, even though it was fine on my
> testing data. Investigating the problem suggests it may be due to a
> matrix that has no columns (= no associated records in the
> matrixcolumn table).
>
>  matrix_id | tb_matrixid | nexusfilename        | title                     | study_id | taxonlabelset_id
> -----------+-------------+----------------------+---------------------------+----------+------------------
>       3463 |             | undone-non-stepm.nex | Untitled Character Matrix |          |             3596
>
> The file "undone-non-stepm.nex" is not in the DB.
>
> This matrix comes from the pre-migration instance.
>
> I presume it is junk that I can safely delete, which I'll do before
> re-trying SetMatrixNChar.

William H. Piel
Associate Director for Evolutionary Informatics
Peabody 401, Yale University
170 Whitney Ave.
New Haven CT 06511
(203) 436-4957
wil...@ya...

From: Vladimir G. <vla...@du...> - 2010-03-03 22:36:19
Mark's data import instructions have a fix-up step, to be run after the matrices upload. As I understand, it recomputes character counts for matrices, which are computed incorrectly during the upload. The step is performed by org.cipres.treebase.util.SetMatrixNChar.

This step broke during the migration, even though it was fine on my testing data. Investigating the problem suggests it may be due to a matrix that has no columns (= no associated records in the matrixcolumn table):

 matrix_id | tb_matrixid | nexusfilename        | title                     | study_id | taxonlabelset_id
-----------+-------------+----------------------+---------------------------+----------+------------------
      3463 |             | undone-non-stepm.nex | Untitled Character Matrix |          |             3596

The file "undone-non-stepm.nex" is not in the DB.

This matrix comes from the pre-migration instance.

I presume it is junk that I can safely delete, which I'll do before re-trying SetMatrixNChar.

--VG

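Before deleting, an anti-join can confirm whether matrix 3463 is the only columnless one; a sketch using the table names mentioned above:

  -- Matrices with no associated matrixcolumn records (candidates for
  -- the same treatment as matrix 3463).
  SELECT m.matrix_id, m.nexusfilename, m.title
    FROM matrix m
   WHERE NOT EXISTS (SELECT 1
                       FROM matrixcolumn c
                      WHERE c.matrix_id = m.matrix_id);
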
From: William P. <wil...@ya...> - 2010-03-03 21:47:21
On Mar 3, 2010, at 3:20 PM, Vladimir Gapeyev wrote:
> Bill, I'd like to check with you on the number of new matrices and
> trees that were expected to be in the delta. The import tools are
> written to skip files with names that are already in the database.
> So, they uploaded about 590 new matrices and 720 new trees
> (compared, respectively, to 4348 files in the characters directory
> and 5151 files in the trees directory). Does this outcome look about
> right? That is, did the data directories contain files that were
> loaded into the database earlier, and did you NOT expect for them to
> affect the database?

So, this dump file:

http://www.treebase.org/treebase/migration/Dec-09/dump_Dec09_utf8.txt

...only makes reference to a subset of the data found here

/Dec-09/characters/

...and here...

/Dec-09/trees/

I had assumed that the way the migration scripts would work is that they would read the dump file, and then only import each matrix or tree as needed (or instructed) by the dump file -- i.e., the migration scripts need not have skipped over anything, because dump_Dec09_utf8.txt has already done that for you. But I guess in actual fact the migration scripts work differently -- I'm guessing that they first upload all new matrices and trees, and only afterwards wire them together into their proper study record after parsing the dump file.

The December 09 dump file contains instructions to upload 284 studies, 560 matrices, and 714 trees. So it's a bit odd that the migration scripts decided that there were 590 matrices and 720 trees to upload. This means that there are 30 matrices and 6 trees that will be uploaded, yet there is no info in the dump file about what studies or analyses they belong to. If you can save a list of these "orphaned" matrices and trees, I can look into what study they should belong to.

bp

From: Vladimir G. <vga...@ne...> - 2010-03-03 20:36:15
I am making progress on migrating data to the production instance, but it is going significantly slower than estimated. Just loading matrices and trees took about 26 hours of pure running time, instead of the 12 hours I estimated for the whole job. (It appears that my assumption of linear scaling was wrong. I fear that the cost of uploading a new file is somehow dependent on the size of the database, as operations against the production instance look slower than against my empty testing instance.) These 26 hours do not include time for the manual checks and fix-ups that I had to do. It can well take a couple more days from now to get through, provided tools performance and the rate at which problems show up and get mitigated remain the same. (I am currently tracking down a hiccup that may prove trickier -- I'll follow up if so.)

Bill, I'd like to check with you on the number of new matrices and trees that were expected to be in the delta. The import tools are written to skip files with names that are already in the database. So, they uploaded about 590 new matrices and 720 new trees (compared, respectively, to 4348 files in the characters directory and 5151 files in the trees directory). Does this outcome look about right? That is, did the data directories contain files that were loaded into the database earlier, and did you NOT expect for them to affect the database?

--Vladimir

From: Vladimir G. <vla...@du...> - 2010-03-01 22:22:47
Apologies for the delay... Treebase-dev has been restored (from treebase-stage), and the web app restarted. It is up-to-date w.r.t. commits done Monday.

--Vladimir

On Mar 1, 2010, at 1:18 PM, Vladimir Gapeyev wrote:
> FYI: We are going to stop and restore the treebasedev DB instance in
> a few minutes. I'll let you know when it is done.
>
> [Sloppily, I ran the initialization script against it, which must
> have messed up many sequences.]
>
> --VG