From: Vladimir G. <vla...@du...> - 2010-03-03 22:36:19
Mark's data import instructions have a fix-up step, to be run after the matrices upload. As I understand it, the step recomputes character counts for matrices, which are computed incorrectly during the upload. It is performed by org.cipres.treebase.util.SetMatrixNChar.

This step broke during the migration, even though it was fine on my testing data. Investigating the problem suggests it may be due to a matrix that has no columns (= no associated records in the matrixcolumn table).

 matrix_id | tb_matrixid |    nexusfilename     |           title           | study_id | taxonlabelset_id
-----------+-------------+----------------------+---------------------------+----------+------------------
      3463 |             | undone-non-stepm.nex | Untitled Character Matrix |          |             3596

The file "undone-non-step.nex" is not in the DB.

This matrix comes from the pre-migration instance.

I presume it is junk that I can safely delete, which I'll do before re-trying SetMatrixNChar.

--VG
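A minimal sketch of how a columnless matrix such as the one above could be located. The table names matrix and matrixcolumn and the key matrix_id come from this message; the in-memory SQLite database, the trimmed-down schema, the sample rows, and the function names are illustrative assumptions, not the actual TreeBASE schema or tooling.

```python
import sqlite3


def find_columnless_matrices(conn):
    """Return the matrix_id of every matrix with no matrixcolumn records."""
    rows = conn.execute(
        """
        SELECT m.matrix_id
        FROM matrix AS m
        LEFT JOIN matrixcolumn AS c ON c.matrix_id = m.matrix_id
        WHERE c.matrix_id IS NULL      -- no matching column record
        ORDER BY m.matrix_id
        """
    ).fetchall()
    return [row[0] for row in rows]


def build_columnless_demo_db():
    """Build a tiny in-memory database (assumed schema, made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE matrix (matrix_id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE matrixcolumn (
            matrixcolumn_id INTEGER PRIMARY KEY,
            matrix_id INTEGER
        );
        INSERT INTO matrix VALUES (1, 'Demo matrix');
        INSERT INTO matrix VALUES (3463, 'Untitled Character Matrix');
        INSERT INTO matrixcolumn (matrix_id) VALUES (1), (1);
        """
    )
    return conn


if __name__ == "__main__":
    print(find_columnless_matrices(build_columnless_demo_db()))
```

The same SELECT should run unchanged on PostgreSQL, provided the real tables are joined on matrix_id as assumed here.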
From: William P. <wil...@ya...> - 2010-03-03 23:29:45
It's not impossible that there are matrices that will break or malfunction during the import. But in this case, since the tb_matrixid is blank, I don't know what it is. I'm guessing this is junk that you can delete.

bp

William H. Piel
Associate Director for Evolutionary Informatics
Peabody 401, Yale University
170 Whitney Ave.
New Haven CT 06511
(203) 436-4957
wil...@ya...
From: Vladimir G. <vla...@du...> - 2010-03-04 03:56:46
I removed this columnless matrix. It turned out to be associated (via taxonlabels) with the notorious Submission 22, so it should indeed be junk.

Unfortunately, this did not solve the problem with SetMatrixNChar. I'd need to dig deeper, but I have been stuck on this thing for too long already. So I am going to move on with the remaining migration steps: studies metadata, taxon intelligence and citations. I hope I am not wrong in my understanding that they are independent of whatever SetMatrixNChar is supposed to do.

Meanwhile, if someone can explain, in terms of the data, what SetMatrixNChar is expected to do, maybe it would be possible to reimplement it in SQL instead of searching for the data point on which it breaks? It appears that the end result is just setting the fields nchar and ntax in Matrix. Should I just set them to the number of associated records in the MatrixColumn and MatrixRow tables, respectively?

(Another thing actually confuses me: it appears that the resetting code should work only for character matrices, while all matrices present in the DB have matrixtype='S', which stands for "standard" matrix.)

Thanks,
--VG
From: William P. <wil...@ya...> - 2010-03-04 04:21:23
On Mar 3, 2010, at 10:56 PM, Vladimir Gapeyev wrote:

> I removed this columnless matrix. It turned out to be associated (via taxonlabels) with the notorious Submission 22, so it should indeed be junk.
>
> Unfortunately, this did not solve the problem with SetMatrixNChar. I'd need to dig deeper, but I have been stuck on this thing for too long already. So I am going to move on with the remaining migration steps: studies metadata, taxon intelligence and citations. I hope I am not wrong in my understanding that they are independent of whatever SetMatrixNChar is supposed to do.
>
> Meanwhile, if someone can explain, in terms of the data, what SetMatrixNChar is expected to do, maybe it would be possible to reimplement it in SQL instead of searching for the data point on which it breaks? It appears that the end result is just setting the fields nchar and ntax in Matrix. Should I just set them to the number of associated records in the MatrixColumn and MatrixRow tables, respectively?
>
> (Another thing actually confuses me: it appears that the resetting code should work only for character matrices, while all matrices present in the DB have matrixtype='S', which stands for "standard" matrix.)

I'm confused by the meaning of standard vs character matrices. In NEXUS vernacular, a standard matrix is one that uses discrete characters where the symbols are arbitrary -- i.e. no assumptions about the meaning of symbols (by contrast, a nucleotide matrix is one where A, C, G, T, N, and IUPAC are the assumed symbols).

For me, the most logical way to divide up types of matrices is the following:

1. distance (= taxa x taxa) vs character (= taxa x character)
2. and of the character ones: continuous (= floating point) vs discrete (= integers)
3. and of the discrete ones: standard (= arbitrary symbols) vs nucleotide (= a, c, g, t + N and IUPAC) vs amino acid (= 20 letters + X)

... but evidently that's not happening here. I don't understand the distinction between character matrix and standard matrix -- sounds like they are synonymous.

I agree with you that it certainly sounds like SetMatrixNChar functions to count the number of characters in a matrix. An UPDATE SQL statement should be able to do this too -- just a matter of counting the number of matrixcolumn records for each matrix_id. But that's just a guess -- I would think that nchar would be passed from headless Mesquite to the database anyway...

bp
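A minimal sketch of the SQL recount discussed above: set nchar to the number of matrixcolumn records and ntax to the number of matrixrow records for each matrix_id. Whether this matches what SetMatrixNChar really does is not confirmed anywhere in this thread. The field and table names (nchar, ntax, matrix, matrixcolumn, matrixrow) are taken from the messages; the matrix_id foreign key in matrixrow, the SQLite in-memory database, the sample rows, and the function names are assumptions made for illustration.

```python
import sqlite3

# Correlated subqueries recount columns and rows for every matrix.
# COUNT(*) yields 0 for a matrix with no associated records.
RECOUNT_SQL = """
UPDATE matrix
SET nchar = (SELECT COUNT(*) FROM matrixcolumn AS c
             WHERE c.matrix_id = matrix.matrix_id),
    ntax  = (SELECT COUNT(*) FROM matrixrow AS r
             WHERE r.matrix_id = matrix.matrix_id)
"""


def recount_nchar_ntax(conn):
    """Overwrite nchar and ntax from the record counts; return the new values."""
    conn.execute(RECOUNT_SQL)
    conn.commit()
    rows = conn.execute("SELECT matrix_id, nchar, ntax FROM matrix")
    return {matrix_id: (nchar, ntax) for matrix_id, nchar, ntax in rows}


def build_recount_demo_db():
    """Tiny in-memory database with deliberately wrong stored counts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE matrix (
            matrix_id INTEGER PRIMARY KEY, nchar INTEGER, ntax INTEGER
        );
        CREATE TABLE matrixcolumn (
            matrixcolumn_id INTEGER PRIMARY KEY, matrix_id INTEGER
        );
        CREATE TABLE matrixrow (
            matrixrow_id INTEGER PRIMARY KEY, matrix_id INTEGER
        );
        INSERT INTO matrix VALUES (1, 99, 99), (2, NULL, NULL);
        INSERT INTO matrixcolumn (matrix_id) VALUES (1), (1), (1);
        INSERT INTO matrixrow (matrix_id) VALUES (1), (1);
        """
    )
    return conn


if __name__ == "__main__":
    print(recount_nchar_ntax(build_recount_demo_db()))
```

Because the statement touches every matrix, a columnless matrix ends up with nchar = 0 rather than stopping the run; adding a WHERE clause on matrixtype would restrict it to particular matrix classes if that turns out to be needed.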
From: Vladimir G. <vla...@du...> - 2010-03-04 16:13:42
On Mar 3, 2010, at 11:21 PM, William Piel wrote:

> I'm confused by the meaning of standard vs character matrices. In NEXUS vernacular, a standard matrix is one that uses discrete characters where the symbols are arbitrary -- i.e. no assumptions about the meaning of symbols (by contrast, a nucleotide matrix is one where A, C, G, T, N, and IUPAC are the assumed symbols).
>
> For me, the most logical way to divide up types of matrices is the following:
>
> 1. distance (= taxa x taxa) vs character (= taxa x character)
> 2. and of the character ones: continuous (= floating point) vs discrete (= integers)
> 3. and of the discrete ones: standard (= arbitrary symbols) vs nucleotide (= a, c, g, t + N and IUPAC) vs amino acid (= 20 letters + X)
>
> ... but evidently that's not happening here. I don't understand the distinction between character matrix and standard matrix -- sounds like they are synonymous.
>
> I agree with you that it certainly sounds like SetMatrixNChar functions to count the number of characters in a matrix. An UPDATE SQL statement should be able to do this too -- just a matter of counting the number of matrixcolumn records for each matrix_id. But that's just a guess -- I would think that nchar would be passed from headless Mesquite to the database anyway...

Regardless of the original issue with running the SetMatrixNChar step (which I am putting aside for now), this sounds to me like a possible issue with how matrices are represented in the database. I am not going to probe into this further now, but here is what I see, and it strikes me as odd.

There is a class hierarchy of matrix objects in Java, all of which are persisted in the same table, Matrix, whose field matrixtype is used to discriminate between the classes. Here are the matrixtype discriminator values and the corresponding class hierarchy:

Matrix (abstract)
    'T' DistanceMatrix
    'C' CharacterMatrix (abstract)
        'N' ContinuousMatrix
        'D' DiscreteMatrix
            'Q' SequenceMatrix
            'S' StandardMatrix

Despite this elaborateness, all matrices in the database are marked with 'S' only, both in the pre-migration instance and after I uploaded new matrices and study metadata. From what Bill says, I would have expected to see some matrices marked with T, N, and Q as well.

--VG
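A minimal sketch of the kind of check behind the observation that every matrix is marked 'S': tally the rows of the matrix table by their matrixtype discriminator. The discriminator letters and class names are the ones listed in this message; the SQLite in-memory database, the sample rows, and the function names are illustrative assumptions.

```python
import sqlite3

# Discriminator values and class names as listed in the message above.
MATRIX_TYPES = {
    "T": "DistanceMatrix",
    "C": "CharacterMatrix",
    "N": "ContinuousMatrix",
    "D": "DiscreteMatrix",
    "Q": "SequenceMatrix",
    "S": "StandardMatrix",
}


def tally_matrix_types(conn):
    """Count matrices per discriminator; known types with no rows report 0."""
    counts = {code: 0 for code in MATRIX_TYPES}
    query = "SELECT matrixtype, COUNT(*) FROM matrix GROUP BY matrixtype"
    for code, n in conn.execute(query):
        counts[code] = n  # also records any code missing from MATRIX_TYPES
    return counts


def build_matrixtype_demo_db():
    """Tiny in-memory table in which every matrix is stored as type 'S'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE matrix (matrix_id INTEGER PRIMARY KEY, matrixtype TEXT);
        INSERT INTO matrix (matrixtype) VALUES ('S'), ('S'), ('S');
        """
    )
    return conn


if __name__ == "__main__":
    for code, n in tally_matrix_types(build_matrixtype_demo_db()).items():
        print(code, MATRIX_TYPES.get(code, "unknown"), n)
```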
From: Rutger V. <rut...@gm...> - 2010-03-04 17:50:41
I don't think I would have expected T or N matrices: I've certainly never seen a continuous matrix or a distance matrix. But obviously very many (probably the majority) should be Q matrices.

--
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com
From: William P. <wil...@ya...> - 2010-03-04 19:04:44
On Mar 4, 2010, at 12:50 PM, Rutger Vos wrote:

> I don't think I would have expected T or N matrices: I've certainly never seen a continuous matrix or a distance matrix. But obviously very many (probably the majority) should be Q matrices.

Yeah, we don't have code for dealing with T. We do have code for dealing with N, but none of the TB1 data has N. New studies, however, should be able to submit N-type matrices.

We have plenty of nucleotide or amino acid matrices, but I think they are all treated as S. This is because I think the real distinction is that S is where each scoring is in its own matrix-element record, while Q is where rows are concatenated into long strings and stored in text fields. We reserved Q as the solution in the event that our software could not perform well enough to store large DNA matrices as S type. (Storing a long string as text is obviously more efficient.) The downside of Q is that you're limited to 26 + 10 character states (unless we invented a special type of column delimiter), so our first effort was to try to get all discrete data into S. S is more cleanly normalized, but takes up a lot more memory.

bp
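A rough sketch of the two storage styles described above, under stated assumptions: S-style keeps one record per scored cell, while Q-style concatenates each row into a single string with one symbol per column. The record layout, the function names, and the choice of A-Z plus 0-9 as the 26 + 10 symbols are hypothetical illustrations rather than the TreeBASE implementation, and gap or missing-data symbols are left out for simplicity.

```python
import string

# 26 letters + 10 digits: the single-character state symbols assumed here.
Q_SYMBOLS = set(string.ascii_uppercase + string.digits)


def to_s_records(matrix_id, rows):
    """S-style: one (matrix_id, row, column, state) record per cell."""
    return [
        (matrix_id, row_index, col_index, state)
        for row_index, states in enumerate(rows)
        for col_index, state in enumerate(states)
    ]


def to_q_strings(rows):
    """Q-style: one string per row; each state must be a single symbol."""
    strings = []
    for states in rows:
        for state in states:
            if len(state) != 1 or state.upper() not in Q_SYMBOLS:
                raise ValueError(
                    f"state {state!r} cannot be stored without a column delimiter"
                )
        strings.append("".join(states))
    return strings


if __name__ == "__main__":
    demo_rows = [["A", "C", "G"], ["A", "T", "G"]]
    print(len(to_s_records(1, demo_rows)), "cell records in S-style")
    print(to_q_strings(demo_rows))
```

The sketch shows the trade-off in miniature: the S-style record count grows with rows times columns, whereas Q-style needs only one string per row but rejects any state that does not fit in a single symbol.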