From: Vladimir G. <vla...@du...> - 2010-03-09 22:33:39
On Mar 5, 2010, at 2:14 PM, Vladimir Gapeyev wrote:
> To keep track of any possible data cleaning activity, Hilmar was
> suggesting to set up a directory in SVN -- I will soon follow up with
> a proposal. Besides this, I will now turn to documenting and
> depositing migration scripts, etc., until I hear what else needs to
> be done to finalize the migration.

I have now set up a place to keep track of any data cleaning tasks. It is in treebase-core/db/cleaning. I have already added a task there, with a script from Bill that was applied during the 1st batch of migration last week.

--Vladimir

From: Hilmar L. <hl...@ne...> - 2010-03-09 18:21:59
On Mar 9, 2010, at 12:26 PM, Vladimir Gapeyev wrote:
> If these are minor in the large scheme of things, that's ok with me.

Some of these are minor, and some of these don't seem different to me from the data maintenance tasks that I expect to keep coming up, so we're not ridding ourselves of the need to run such things with due diligence (full use of transactions, prior testing on a separate instance, etc.) by avoiding this one.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
===========================================================

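Hilmar's due-diligence point maps onto a simple SQL pattern: rehearse the cleaning statement inside a transaction, inspect the effect, and only COMMIT on production once the rehearsal on treebase-stage looks right. A minimal sketch follows -- the study table, its name column, and the cleaning task itself are purely illustrative, not an actual TreeBASE task:

  BEGIN;

  -- Hypothetical cleaning task: strip stray whitespace from study names.
  UPDATE study SET name = trim(name) WHERE name <> trim(name);

  -- Sanity check: this should now return 0.
  SELECT count(*) AS still_untrimmed FROM study WHERE name <> trim(name);

  -- On treebase-stage, end the rehearsal without persisting anything;
  -- on production, replace this with COMMIT once the check passes.
  ROLLBACK;
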
From: Vladimir G. <vla...@du...> - 2010-03-09 17:43:56
On Mar 9, 2010, at 12:05 PM, William Piel wrote:
> I thought this was the "normal" way that the migration proceeds --
> i.e. with the following 3 stages:
>
> 1. parse and upload the dump.txt file and associated trees and
> matrices. TB1 has all citations in one line -- this is crammed in
> the title field, although full names and email addresses of authors
> are stored in separate tables but without info on author order.
>
> 2. replace the existing taxon_variant and taxon tables with the
> latest TI dump.
>
> 3. update the citation information with the latest Endnote file.
> Here, author names are abbreviated (per Endnote conventions) but
> author order is known.
>
> Is this the same basic order of tasks which you used for the Dec09
> migration?

That's correct.

> The only difference here is that we can go live before task 3 is
> performed. And note that we could task our undergrads with editing
> the citation info directly with the TreeBASE2 interface instead of
> first editing an Endnote file. i.e., we actually do away with step 3
> as it stands. (Although seeing as our metadata student help (in
> Endnote) has improved metadata for all TreeBASE studies
> considerably, we'll probably want to run a citation update script at
> some point anyway).

See the other reply.

> Do we need to think about how we will run update scripts and data
> cleansing scripts in future?

I presume we should.

> I'm not sure what's the point of "stage" other than the ability to
> test new builds against a (slightly older) version of the production
> data. For update and data cleansing scripts, we will need to apply
> them directly to production (after first triggering a pg_dump, of
> course).

That's the plan for post-release life. At the moment, we use the two instances as two alternative containers for the main data set: one is read-only (conceptually) and the other is the working copy, with these roles flipping back and forth depending on whether a migration is in progress.

--VG

From: Vladimir G. <vla...@du...> - 2010-03-09 17:26:34
On Mar 9, 2010, at 11:24 AM, Hilmar Lapp wrote:
> On Mar 9, 2010, at 10:21 AM, Vladimir Gapeyev wrote:
>> (4) Some bibliography entries in the release will have all info
>> crammed into the title field. This will be fixed after the release.
>>
>> I am worried about the logistics of (4), and would prefer to have it
>> done prior to the release.
>
> What are your main worries? Also, assuming that this would need to
> be done manually (would it?), do we have an estimate of how long it
> would take? If we do this after the release, what are the risks, or
> would the effort increase significantly? (E.g., might this work
> cause clashes with concurrent submissions?)

The operation is scripted, but it will have to be done on the live production instance, so the worries come from that: malfunction or downtime of the production site if anything goes wrong with this operation. Also, is it known that the web front end operates reasonably when all biblio info is in the title field and the other fields are null?

If these are minor in the large scheme of things, that's ok with me.

--VG

From: William P. <wil...@ya...> - 2010-03-09 17:05:50
On Mar 9, 2010, at 10:21 AM, Vladimir Gapeyev wrote:
> (4) Some bibliography entries in the release will have all info
> crammed into the title field. This will be fixed after the release.
>
> I am worried about the logistics of (4), and would prefer to have it
> done prior to the release.

I thought this was the "normal" way that the migration proceeds -- i.e. with the following 3 stages:

1. Parse and upload the dump.txt file and associated trees and matrices. TB1 has all citations in one line -- this is crammed into the title field, although full names and email addresses of authors are stored in separate tables, but without info on author order.

2. Replace the existing taxon_variant and taxon tables with the latest TI dump.

3. Update the citation information with the latest Endnote file. Here, author names are abbreviated (per Endnote conventions) but author order is known.

Is this the same basic order of tasks which you used for the Dec09 migration?

The only difference here is that we can go live before task 3 is performed. And note that we could task our undergrads with editing the citation info directly with the TreeBASE2 interface instead of first editing an Endnote file -- i.e., we actually do away with step 3 as it stands. (Although, seeing as our metadata student help (in Endnote) has improved metadata for all TreeBASE studies considerably, we'll probably want to run a citation update script at some point anyway.)

Do we need to think about how we will run update scripts and data cleansing scripts in future?

I'm not sure what's the point of "stage" other than the ability to test new builds against a (slightly older) version of the production data. For update and data cleansing scripts, we will need to apply them directly to production (after first triggering a pg_dump, of course).

bp

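For stage 3 (or for handing the work to undergrads), the entries still carrying a one-line TB1 citation could presumably be listed with a query along these lines; the citation table and its column names here are guesses for illustration, not the actual TB2 schema:

  -- Assumed schema: per-field bibliographic columns that stage 1 leaves
  -- NULL while cramming the whole TB1 one-liner into title.
  SELECT citation_id, title
    FROM citation
   WHERE title IS NOT NULL
     AND journal IS NULL
     AND author_list IS NULL
   ORDER BY citation_id;
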
From: Hilmar L. <hl...@ne...> - 2010-03-09 16:24:20
On Mar 9, 2010, at 10:21 AM, Vladimir Gapeyev wrote:
> (4) Some bibliography entries in the release will have all info
> crammed into the title field. This will be fixed after the release.
>
> I am worried about the logistics of (4), and would prefer to have it
> done prior to the release.

What are your main worries? Also, assuming that this would need to be done manually (would it?), do we have an estimate of how long it would take? If we do this after the release, what are the risks, or would the effort increase significantly? (E.g., might this work cause clashes with concurrent submissions?)

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
===========================================================

From: William P. <wil...@ya...> - 2010-03-09 15:23:31
I just calculated that since early January, TB1 has acquired 6,603 distinct taxon labels that are new to the database. I'm working on mapping them now.

bp

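A back-of-the-envelope version of that calculation would be a set difference between the current labels and those present at the early-January snapshot. The table and column names below are hypothetical, since the TB1 schema isn't shown in this thread:

  -- Hypothetical TB1-side tables: taxonlabel (current) and
  -- taxonlabel_jan (a snapshot of labels as of early January).
  SELECT count(DISTINCT t.label) AS new_labels
    FROM taxonlabel t
   WHERE NOT EXISTS (SELECT 1
                       FROM taxonlabel_jan j
                      WHERE j.label = t.label);
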
From: Vladimir G. <vla...@du...> - 2010-03-09 15:22:07
On Mar 8, 2010, at 9:28 PM, William Piel wrote:
> At any rate, we have the last migration dump here:
>
> http://treebase.peabody.yale.edu/treebase/migration/Mar-10/
>
> The citations and TI are not there yet, but Vladimir can start
> migrating the trees, characters, and the dump file. Meanwhile I'll
> be working on the TI mapping -- that's critical for release. For
> now, citations can be stored in the title field as one-liners (this
> is not critical for release, so I'll be working on them after the TI
> is done).

Ok, I take it as this:

(1) The post-migration instance at treebase-stage is currently being used for informational purposes only: no data changes that need to be preserved will be done to this instance.

(2) I should now start loading the final data batch into the production instance.

(3) After the final load is done to production, we will again replicate it to staging, where we will do manual pre-release data cleaning.

(4) Some bibliography entries in the release will have all info crammed into the title field. This will be fixed after the release.

I am worried about the logistics of (4), and would prefer to have it done prior to the release.

(As a reminder, the hassle with copying back and forth between the production and staging instances is because the production server is significantly faster, and loading of matrices and trees takes awhile.)

From: William P. <wil...@ya...> - 2010-03-09 02:28:57
On Mar 8, 2010, at 6:37 PM, Hilmar Lapp wrote:
> I've sent it to you, Vladimir, and Youjun. You actually responded to
> it. It's below FYI.

Ah -- yes, I see. I thought you were referring to a more recent communication. Since the migration was not "going smoothly" I ended up waiting for more hopeful signs.

At any rate, we have the last migration dump here:

http://treebase.peabody.yale.edu/treebase/migration/Mar-10/

The citations and TI are not there yet, but Vladimir can start migrating the trees, characters, and the dump file. Meanwhile I'll be working on the TI mapping -- that's critical for release. For now, citations can be stored in the title field as one-liners (this is not critical for release, so I'll be working on them after the TI is done).

It's the last sprint-to-the-finish!

bp

From: William P. <wil...@ya...> - 2010-03-08 05:40:08
Hi all, I'm catching up on things, having been away this weekend.

On Mar 5, 2010, at 2:14 PM, Vladimir Gapeyev wrote:
> A copy of the migrated data is now in the treebasestage DB instance.

Fabulous!

On Mar 6, 2010, at 5:06 PM, Hilmar Lapp wrote:
> Hi Bill,
>
> as I emailed you earlier, the release plan put the date for shutting
> down TB1 on last Wed, followed by creating the dumps for the final
> migration increment. Vladimir tells me that he does not have those
> dumps. Can you update me on where that stands?
>
> -hilmar

I've searched everywhere, but I can't find that email. Did you send it to tre...@li... ?

At any rate, with this 2009 migration done, I'll be shutting down new submissions to TB1 on Monday. I have a number of ready submissions still to process, but then it's the final dump.

Meanwhile, is there a system or strategy for testing the 2009 migrated data -- or do we just manually pick out some studies at random to test? While TB1 submissions are shut down, the final handful of level-9 bugs can be addressed.

bp

From: Vladimir G. <vla...@du...> - 2010-03-05 19:22:46
On Mar 5, 2010, at 10:29 AM, William Piel wrote:
> On Mar 5, 2010, at 10:22 AM, youjun guo wrote:
>> Dear All,
>>
>> A new treebase unit test failure appeared recently. It is due to
>> "submissions do not have a related submitter".
>>
>> The following query result from treebase-db may give you a better
>> idea:
>>
>> select count(*) from submission where user_id is null : 2044
>>
>> Youjun
>
> I think the renewal of data for stage/production has brought back
> this user_id IS NULL problem. After we're done with migrations, we
> can reassign all TB1 studies to a new user_id.

This must have been caused by the treebase-dev restore from an older backup earlier this week. I'd suggest ignoring this test on treebase-dev for now; resolving the problem on treebase-stage, as part of other data cleaning; and then replicating the data to treebase-dev and trying the tests again.

--VG

From: Vladimir G. <vga...@ne...> - 2010-03-05 19:14:18
A copy of the migrated data is now in the treebasestage DB instance. The http://treebase-stage.nescent.org front end should be up shortly.

We have "named" backups for the pre- and post-migration copies. Also, the treebase-dev instance is reasonably close to the pre-migration copy, as it was restored earlier this week from a production backup.

I suggest that the staging DB instance be used for any necessary inspecting and cleaning -- all interested parties should have suitable access rights. We'll then migrate the Dec-March delta into this modified instance (or to its copy on production). We should probably discuss (recall?) what the next steps are.

To keep track of any possible data cleaning activity, Hilmar was suggesting to set up a directory in SVN -- I will soon follow up with a proposal. Besides this, I will now turn to documenting and depositing migration scripts, etc., until I hear what else needs to be done to finalize the migration.

I am aware of a few issues with the post-migration data that may warrant further attention:

(1) Matrix loading broke a couple of times, running out of heap space (there is probably a memory leak). This left the matrices on which it broke in an inconsistent state. I scraped out all (I hope) data associated with these matrices and loaded them again. Still, this was a deviation from the straight path. The matrices were:

  Filename    matrix_id (scraped)   matrix_id (reloaded)
  M4374.nex   4678                  4718
  M4622.nex   4210                  4717

(2) The citations.txt contained 4 "Book Section" entries, which broke the citation import tool. Since these were only a handful, instead of tracking down the problem I decided to skip these entries (I commented them out). Hopefully, it will be possible to enter them by hand later. FYI, the 4 entries are appended below.

(3) As we previously noted with Bill, it appeared that the migration process picked up a few more matrix and tree files than Bill expected were new ones in the delta. Here are the orphaned trees and matrices that I could detect after the migration.

Compared to the list of matrices with null study_id in treebase-dev (which is almost identical to the pre-migration copy), the post-migration copy has 4 more such matrices:

  matrix_id | nexusfilename
 -----------+---------------
       4227 | M4864.nex
       4280 | M4470.nex
       4456 | M4886.nex
       4528 | M4863.nex

Trees that belong to the fake study (#2264) used by the migration process for initial uploading (and do not belong to this study in treebase-dev):

  phylotree_id | nexusfilename
 --------------+-----------------
          6074 | S1934A11000.tre
          6075 | S1934A11000.tre
          6176 | S1815A10024.tre
          6358 | S1934A11001.tre
          6433 | S1319A11058.tre
          6521 | S1319A11057.tre

I should mention that there are about 26 more matrices and 52 more trees with a null study_id, coming from the pre-migration instance.

--Vladimir

=== These citations were skipped during the migration ====

*Book Section
Columbus, J. T.; Peterson, P. M.; Refulio Rodriguez, N. F.; Cerros Tlatilpa, R.; Kinney, M. S. 2009. Phylogenetics of Muhlenbergiinae (Poaceae, Chloridoideae, Cynodonteae) based on ITS and trnL-F DNA sequences. In: Seberg, O.; Petersen, G.; Barfod, A. S.; Davis, J. I. (eds.), Proceedings of the Fourth International Conference on the Comparative Biology of the Monocotyledons and The Fifth International Symposium on Grass Systematics and Evolution. In press.
Muhlenbergiinae are a subtribe in the grass (Poaceae) subfamily Chloridoideae, tribe Cynodonteae. The morphologically diverse group includes ten genera and 168 species and is restricted almost entirely to the New World, with a center of diversity in Mexico (125 species). With 147 species, Muhlenbergia is by far the largest genus, and is divided into two subgenera, Muhlenbergia and Trichochloa, the latter with two sections. The other, much smaller genera are Aegopogon (4 species), Bealia (1), Blepharoneuron (2), Chaboissaea (4), Lycurus (3), Pereilema (4), Redfieldia (1), Schaffnerella (1), and Schedonnardus (1). We conducted a phylogenetic study of Muhlenbergiinae based on parsimony analysis of DNA sequences of the nuclear ribosomal internal transcribed spacer region (ITS1 + 5.8S + ITS2) and chloroplast trnL intron, trnL 3′ exon, and trnL-trnF intergenic spacer. All genera were sampled, including 52 species of Muhlenbergia representing both subgenera and sections. Muhlenbergia and Pereilema are not monophyletic in the resulting trees. The species of Pereilema and the other small genera are nested within Muhlenbergia in three main lineages. One of the lineages includes a monophyletic Muhlenbergia subgen. Trichochloa. Another lineage comprises species having leaf anatomy predictive of the PCK subtype of C4 photosynthesis. Based on the results of this study, we favor expanding the circumscription of Muhlenbergia to include the other nine genera of the subtribe. S2438

*Book Section
Duvall, M. R.; Leseberg, C. H.; Grennan, C. P.; Morris, L. M. 2009. Molecular evolution and phylogenetics of complete chloroplast genomes in Poaceae. In: Davis, J. (ed.), Fifth International Symposium on Grass Systematics and Evolution. In press.
Phylogenetic issues in Poaceae not resolved by previous multi-gene analyses can be usefully investigated by small genome-scale analyses. In this pilot study, complete or nearly complete chloroplast genomes (plastomes) were sequenced from six selected graminoids. Representatives of Anomochlooideae, Puelioideae, Bambusoideae, both major tribes of Panicoideae, and Joinvilleaceae were newly sampled to supplement previously published plastome data from Ehrhartoideae, Pooideae, and Andropogoneae. For amplification and sequencing, over 200 pairs of primers were designed in conserved regions of published grass plastomes that were positioned to flank overlapping 1200-base-pair fragments around the entire plastome. As expected, gene order and number were highly conserved. Concurrent with the high conservation of the plastome was considerable cumulative variation useful for studies within the family and even within a single tribe. Readily interpreted mutational patterns were observed, such as small inversions of the loop in hairpin-loop regions and indels resulting from slipped-strand mispairings. Phylogenetic analyses were conducted on these and eight previously published plastomes. Maximum or near-maximum support was observed in all likelihood and parsimony bootstrap analyses, including shallow nodes, such as those within a clade corresponding to a complex of four Andropogoneae, and deep nodes, such as the one uniting the bambusoid/ehrhartoid/pooid (BEP) clade. S2429

*Book Section
Prince, L. 2009. Phylogenetic relationships and species delimitation in Canna (Cannaceae). In: Seberg, O.; Petersen, G.; Barfod, A. S.; Davis, J. (eds.), Proceedings of the Fourth International Conference on the Comparative Biology of the Monocotyledons and The Fifth International Symposium on Grass Systematics and Evolution.
Canna lilies are a conspicuous component of the tropical and subtropical humid Neotropics, where they are native, and the Asian Paleotropics, where they have been introduced. Cannas have been cultivated as a food item (rhizome), for wrapping (leaves), and as beads (seeds) for millennia by indigenous people. In both tropical and temperate regions they have a long history as ornamental plants as well. With only a few dozen taxa in a single genus, Cannaceae has much lower generic and species diversity than its sister family, Marantaceae (550 species in 31 genera). Parsimony and Bayesian analyses of nuclear ribosomal internal transcribed spacer (ITS) and chloroplast non-coding sequence data (trnE-T intergenic spacer and rpL16 intron) were used to infer evolutionary relationships among species. Potential causes of non-monophyly of nuclear ITS haplotypes and conflict between nuclear and plastid phylogenies for some samples are discussed. Chloroplast (rbcL, ndhF) DNA data indicate a North American taxon, Canna flaccida, is sister to all other species in the genus. Phylogenetic analyses are consistent with the hypothesis of a South American origin for the genus, followed by dispersal and migration to North and Central America, and the Caribbean. S2373

*Book Section
Roncal, J.; Borchsenius, F.; Asmussen-Lange, C. B.; Balslev, H. 2009. Divergence times in tribe Geonomateae (Arecaceae) coincide with Tertiary geological events. In: The Comparative Biology of the Monocotyledons: Proceedings of the Fourth International Conference. Aarhus University Press. In press.
The Geonomateae is a species-rich palm tribe restricted to the Neotropics, with a concentration of species in western Colombia and adjacent Central America, with extensions along the Andes. We estimated divergence times for the Geonomateae based on a phylogeny resulting from analysis of two low-copy nuclear DNA genes and using a Bayesian relaxed molecular clock method. We obtained calibration points from the fossil record and previous dated phylogenies in the Arecaceae. The results indicated a diversification of the tribe during the Oligocene, at around 31 million years ago. The divergence time of a high-elevation Geonoma clade, from 3.8 to 9.2 million years ago, coincided with the Andean uplift. A clade of Geonoma species from the Brazilian Shield was contemporary with Miocene marine incursions in South America. The most likely scenario to explain the arrival of the Calyptronoma-Calyptrogyne ancestor in the Greater Antilles is a migration through a dry-land connection between Central or South America prior to the formation of the Panamanian isthmus and after 27.8 million years ago. The molecular dating results were consistent with the growing evidence of a Tertiary diversification for most Neotropical biota and contradicted the Pleistocene refugia theory. S2295

From: William P. <wil...@ya...> - 2010-03-05 15:29:52
On Mar 5, 2010, at 10:22 AM, youjun guo wrote:
> Dear All,
>
> A new treebase unit test failure appeared recently. It is due to
> "submissions do not have a related submitter".
>
> The following query result from treebase-db may give you a better
> idea:
>
> select count(*) from submission where user_id is null : 2044
>
> Youjun

I think the renewal of data for stage/production has brought back this user_id IS NULL problem. After we're done with migrations, we can reassign all TB1 studies to a new user_id.

bp

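Once a designated legacy account exists, the reassignment Bill describes is a one-statement job. In the sketch below, the id 9999 is a placeholder for whatever user record gets created, and Youjun's query is reused as the post-condition:

  BEGIN;

  -- Attach all orphaned TB1 submissions to the designated legacy user.
  UPDATE submission SET user_id = 9999 WHERE user_id IS NULL;

  -- Youjun's check, repeated as a post-condition: expect 0 instead of 2044.
  SELECT count(*) FROM submission WHERE user_id IS NULL;

  COMMIT;
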
From: youjun g. <you...@ya...> - 2010-03-05 15:25:56
Please ignore this one if you got my previous message from gmail.

Dear All,

A new treebase unit test failure appeared recently. It is due to "submissions do not have a related submitter".

The following query result from treebase-db may give you a better idea:

select count(*) from submission where user_id is null : 2044

Youjun

From: William P. <wil...@ya...> - 2010-03-04 19:04:44
On Mar 4, 2010, at 12:50 PM, Rutger Vos wrote:
> I don't think I would have expected T or N matrices: I've certainly
> never seen a continuous matrix or a distance matrix. But obviously
> very many (probably the majority) should be Q matrices.

Yeah, we don't have code for dealing with T. We do have code for dealing with N, but none of the TB1 data has N. New studies, however, should be able to submit N-type matrices.

We have plenty of nucleotide or amino acid matrices, but I think they are all treated as S. This is because I think the real distinction is that S is where each scoring is in its own matrix-element record, while Q is where rows are concatenated into long strings and stored in text fields. We reserved Q as the solution in the event that our software could not perform well enough to store large DNA matrices as S type (storing a long string as text is obviously more efficient). The downside of Q is that you're limited to 26 + 10 character states (unless we invented a special type of column delimiter), so our first effort was to try to get all discrete data into S. S is more cleanly normalized, but takes up a lot more memory.

bp

From: Rutger V. <rut...@gm...> - 2010-03-04 17:50:41
I don't think I would have expected T or N matrices: I've certainly never seen a continuous matrix or a distance matrix. But obviously very many (probably the majority) should be Q matrices.

On Thu, Mar 4, 2010 at 4:13 PM, Vladimir Gapeyev <vla...@du...> wrote:
> On Mar 3, 2010, at 11:21 PM, William Piel wrote:
>> I'm confused by the meaning of standard vs character matrices. In
>> NEXUS vernacular, a standard matrix is one that uses discrete
>> characters where the symbols are arbitrary -- i.e. no assumptions
>> about the meaning of symbols (by contrast, a nucleotide matrix is
>> one where A, C, G, T, N, and IUPAC are the assumed symbols).
>>
>> For me, the most logical way to divide up types of matrices is the
>> following:
>>
>> 1. distance (= taxa x taxa) vs character (= taxa x character)
>> 2. and of the character ones: continuous (= floating point) vs
>> discrete (= integers)
>> 3. and of the discrete ones: standard (= arbitrary symbols) vs
>> nucleotide (= a, c, g, t + N and IUPAC) vs amino acid (= 20 letters
>> + X)
>>
>> ... but evidently that's not happening here. I don't understand the
>> distinction between character matrix and standard matrix -- sounds
>> like they are synonymous.
>>
>> I agree with you that it certainly sounds like SetMatrixNChar
>> functions to count the number of characters in a matrix. An UPDATE
>> SQL statement should be able to do this too -- just a matter of
>> counting the number of matrixcolumn records for each matrix_id. But
>> that's just a guess -- I would think that nchar would be passed to
>> the database from headless Mesquite anyway...
>
> Regardless of the original issue with running the SetMatrixNChar
> step (which I do put aside for now), this sounds to me like a
> possible issue with representing matrices in the database. I am not
> going to probe into this further now, but here is what I see, which
> makes me feel odd.
>
> There is a class hierarchy of matrix objects in Java, all of which
> are persisted in the same table Matrix, whose field matrixtype is
> used to discriminate between the classes. Here are the matrixtype's
> discriminator values and the corresponding class hierarchy:
>
> Matrix (abstract)
>     'T' DistanceMatrix
>     'C' CharacterMatrix (abstract)
>         'N' ContinuousMatrix
>         'D' DiscreteMatrix
>         'Q' SequenceMatrix
>         'S' StandardMatrix
>
> Despite this elaborateness, all matrices in the database are marked
> with 'S' only, both in the pre-migration instance and after I
> uploaded new matrices and study metadata. From what Bill says, I
> would have expected to see some matrices marked with T, N, and Q as
> well.
>
> --VG

--
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com

From: Vladimir G. <vla...@du...> - 2010-03-04 16:13:42
On Mar 3, 2010, at 11:21 PM, William Piel wrote:
> I'm confused by the meaning of standard vs character matrices. In
> NEXUS vernacular, a standard matrix is one that uses discrete
> characters where the symbols are arbitrary -- i.e. no assumptions
> about the meaning of symbols (by contrast, a nucleotide matrix is
> one where A, C, G, T, N, and IUPAC are the assumed symbols).
>
> For me, the most logical way to divide up types of matrices is the
> following:
>
> 1. distance (= taxa x taxa) vs character (= taxa x character)
> 2. and of the character ones: continuous (= floating point) vs
> discrete (= integers)
> 3. and of the discrete ones: standard (= arbitrary symbols) vs
> nucleotide (= a, c, g, t + N and IUPAC) vs amino acid (= 20 letters
> + X)
>
> ... but evidently that's not happening here. I don't understand the
> distinction between character matrix and standard matrix -- sounds
> like they are synonymous.
>
> I agree with you that it certainly sounds like SetMatrixNChar
> functions to count the number of characters in a matrix. An UPDATE
> SQL statement should be able to do this too -- just a matter of
> counting the number of matrixcolumn records for each matrix_id. But
> that's just a guess -- I would think that nchar would be passed to
> the database from headless Mesquite anyway...

Regardless of the original issue with running the SetMatrixNChar step (which I do put aside for now), this sounds to me like a possible issue with representing matrices in the database. I am not going to probe into this further now, but here is what I see, which makes me feel odd.

There is a class hierarchy of matrix objects in Java, all of which are persisted in the same table Matrix, whose field matrixtype is used to discriminate between the classes. Here are the matrixtype's discriminator values and the corresponding class hierarchy:

Matrix (abstract)
    'T' DistanceMatrix
    'C' CharacterMatrix (abstract)
        'N' ContinuousMatrix
        'D' DiscreteMatrix
        'Q' SequenceMatrix
        'S' StandardMatrix

Despite this elaborateness, all matrices in the database are marked with 'S' only, both in the pre-migration instance and after I uploaded new matrices and study metadata. From what Bill says, I would have expected to see some matrices marked with T, N, and Q as well.

--VG

From: William P. <wil...@ya...> - 2010-03-04 04:21:23
On Mar 3, 2010, at 10:56 PM, Vladimir Gapeyev wrote:
> I removed this columnless matrix. It turned out to be associated
> (via taxonlabels) with the notorious Submission 22, so it should be
> junk indeed.
>
> Unfortunately, this did not solve the problem with SetMatrixNChar.
> I'd need to dig deeper, but I have been stuck on this thing for too
> long already. So, I am going to move on with the remaining migration
> steps: studies metadata, taxon intelligence, and citations. I hope I
> am not wrong in my understanding that they are independent from
> whatever SetMatrixNChar is supposed to do.
>
> Meanwhile, if someone can explain, in terms of the data, what the
> function of SetMatrixNChar is expected to be, maybe it would be
> possible to reimplement it in SQL instead of searching for the data
> point on which it breaks? It appears the end result is just setting
> the fields nchar and ntax in Matrix. Should I just set them to the
> number of associated records in the MatrixColumn and MatrixRow
> tables, respectively?
>
> (Another thing actually confuses me: it appears that the resetting
> code should work only for character matrices, while all matrices
> present in the DB have matrixtype='S', which stands for "standard"
> matrix.)

I'm confused by the meaning of standard vs character matrices. In NEXUS vernacular, a standard matrix is one that uses discrete characters where the symbols are arbitrary -- i.e. no assumptions about the meaning of symbols (by contrast, a nucleotide matrix is one where A, C, G, T, N, and IUPAC are the assumed symbols).

For me, the most logical way to divide up types of matrices is the following:

1. distance (= taxa x taxa) vs character (= taxa x character)
2. and of the character ones: continuous (= floating point) vs discrete (= integers)
3. and of the discrete ones: standard (= arbitrary symbols) vs nucleotide (= a, c, g, t + N and IUPAC) vs amino acid (= 20 letters + X)

... but evidently that's not happening here. I don't understand the distinction between character matrix and standard matrix -- sounds like they are synonymous.

I agree with you that it certainly sounds like SetMatrixNChar functions to count the number of characters in a matrix. An UPDATE SQL statement should be able to do this too -- just a matter of counting the number of matrixcolumn records for each matrix_id. But that's just a guess -- I would think that nchar would be passed to the database from headless Mesquite anyway...

bp

From: Vladimir G. <vla...@du...> - 2010-03-04 03:56:46
I removed this columnless matrix. It turned out to be associated (via taxonlabels) with the notorious Submission 22, so it should be junk indeed.

Unfortunately, this did not solve the problem with SetMatrixNChar. I'd need to dig deeper, but I have been stuck on this thing for too long already. So, I am going to move on with the remaining migration steps: studies metadata, taxon intelligence, and citations. I hope I am not wrong in my understanding that they are independent from whatever SetMatrixNChar is supposed to do.

Meanwhile, if someone can explain, in terms of the data, what the function of SetMatrixNChar is expected to be, maybe it would be possible to reimplement it in SQL instead of searching for the data point on which it breaks? It appears the end result is just setting the fields nchar and ntax in Matrix. Should I just set them to the number of associated records in the MatrixColumn and MatrixRow tables, respectively?

(Another thing actually confuses me: it appears that the resetting code should work only for character matrices, while all matrices present in the DB have matrixtype='S', which stands for "standard" matrix.)

Thanks,
--VG

On Mar 3, 2010, at 6:29 PM, William Piel wrote:
> It's not impossible that there are matrices that will break or
> malfunction in the importation. But in this case, since the
> tb_matrixid is blank, I don't know what it is. I'm guessing this is
> junk that you can delete.
>
> bp
>
> On Mar 3, 2010, at 5:20 PM, Vladimir Gapeyev wrote:
>> Mark's data import instructions have a fix-up step, to be run after
>> the matrices upload. As I understand, it recomputes character
>> counts for matrices, which are computed incorrectly during the
>> upload. The step is performed by
>> org.cipres.treebase.util.SetMatrixNChar.
>>
>> This step broke during the migration, even though it was fine on my
>> testing data. Investigating the problem suggests it may be due to a
>> matrix that has no columns (= no associated records in the
>> matrixcolumn table).
>>
>>  matrix_id | tb_matrixid | nexusfilename        | title                     | study_id | taxonlabelset_id
>> -----------+-------------+----------------------+---------------------------+----------+------------------
>>       3463 |             | undone-non-stepm.nex | Untitled Character Matrix |          |             3596
>>
>> The file "undone-non-stepm.nex" is not in the DB.
>>
>> This matrix comes from the pre-migration instance.
>>
>> I presume it is junk that I can safely delete, which I'll do before
>> re-trying SetMatrixNChar.

From: William P. <wil...@ya...> - 2010-03-03 23:29:45
It's not impossible that there are matrices that will break or malfunction in the importation. But in this case, since the tb_matrixid is blank, I don't know what it is. I'm guessing this is junk that you can delete.

bp

On Mar 3, 2010, at 5:20 PM, Vladimir Gapeyev wrote:
> Mark's data import instructions have a fix-up step, to be run after
> the matrices upload. As I understand, it recomputes character counts
> for matrices, which are computed incorrectly during the upload. The
> step is performed by org.cipres.treebase.util.SetMatrixNChar.
>
> This step broke during the migration, even though it was fine on my
> testing data. Investigating the problem suggests it may be due to a
> matrix that has no columns (= no associated records in the
> matrixcolumn table).
>
>  matrix_id | tb_matrixid | nexusfilename        | title                     | study_id | taxonlabelset_id
> -----------+-------------+----------------------+---------------------------+----------+------------------
>       3463 |             | undone-non-stepm.nex | Untitled Character Matrix |          |             3596
>
> The file "undone-non-stepm.nex" is not in the DB.
>
> This matrix comes from the pre-migration instance.
>
> I presume it is junk that I can safely delete, which I'll do before
> re-trying SetMatrixNChar.

William H. Piel
Associate Director for Evolutionary Informatics
Peabody 401, Yale University
170 Whitney Ave.
New Haven CT 06511
(203) 436-4957
wil...@ya...

From: Vladimir G. <vla...@du...> - 2010-03-03 22:36:19
Mark's data import instructions have a fix-up step, to be run after the matrices upload. As I understand, it recomputes character counts for matrices, which are computed incorrectly during the upload. The step is performed by org.cipres.treebase.util.SetMatrixNChar.

This step broke during the migration, even though it was fine on my testing data. Investigating the problem suggests it may be due to a matrix that has no columns (= no associated records in the matrixcolumn table):

 matrix_id | tb_matrixid | nexusfilename        | title                     | study_id | taxonlabelset_id
-----------+-------------+----------------------+---------------------------+----------+------------------
      3463 |             | undone-non-stepm.nex | Untitled Character Matrix |          |             3596

The file "undone-non-stepm.nex" is not in the DB.

This matrix comes from the pre-migration instance.

I presume it is junk that I can safely delete, which I'll do before re-trying SetMatrixNChar.

--VG

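Before deleting, an anti-join can confirm whether matrix 3463 is the only columnless one; a sketch using the table names mentioned above:

  -- Matrices with no associated matrixcolumn records (candidates for
  -- the same treatment as matrix 3463).
  SELECT m.matrix_id, m.nexusfilename, m.title
    FROM matrix m
   WHERE NOT EXISTS (SELECT 1
                       FROM matrixcolumn c
                      WHERE c.matrix_id = m.matrix_id);
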
From: William P. <wil...@ya...> - 2010-03-03 21:47:21
On Mar 3, 2010, at 3:20 PM, Vladimir Gapeyev wrote:
> Bill, I'd like to check with you on the number of new matrices and
> trees that were expected to be in the delta. The import tools are
> written to skip files with names that are already in the database.
> So, they uploaded about 590 new matrices and 720 new trees
> (compared, respectively, to 4348 files in the characters directory
> and 5151 files in the trees directory). Does this outcome look about
> right? That is, did the data directories contain files that were
> loaded into the database earlier, and did you NOT expect for them to
> affect the database?

So, this dump file:

http://www.treebase.org/treebase/migration/Dec-09/dump_Dec09_utf8.txt

...only makes reference to a subset of the data found here

/Dec-09/characters/

...and here...

/Dec-09/trees/

I had assumed that the way the migration scripts would work is that they would read the dump file, and then only import each matrix or tree as needed (or instructed) by the dump file -- i.e., the migration scripts need not have skipped over anything, because dump_Dec09_utf8.txt has already done that for you. But I guess in actual fact the migration scripts work differently -- I'm guessing that they first upload all new matrices and trees, and only afterwards wire them together into their proper study record after parsing the dump file.

The December 09 dump file contains instructions to upload 284 studies, 560 matrices, and 714 trees. So it's a bit odd that the migration scripts decided that there were 590 matrices and 720 trees to upload. This means that there are 30 matrices and 6 trees that will be uploaded, yet there is no info in the dump file about what studies or analyses they belong to. If you can save a list of these "orphaned" matrices and trees, I can look into what study they should belong to.

bp

From: Vladimir G. <vga...@ne...> - 2010-03-03 20:36:15
I am making progress on migrating data to the production instance, but it is going significantly slower than estimated. Just loading matrices and trees took about 26 hours of pure running time, instead of the 12 hours I estimated for the whole job. (It appears that my assumption of linear scaling was wrong. I fear that the cost of uploading a new file is somehow dependent on the size of the database, as operations against the production instance look slower than against my empty testing instance.) These 26 hours do not include time for the manual checks and fix-ups that I had to do. It can well take a couple more days from now to get through, provided tools performance and the rate at which problems show up and get mitigated remain the same. (I am currently tracking down a hiccup that may prove trickier -- I'll follow up if so.)

Bill, I'd like to check with you on the number of new matrices and trees that were expected to be in the delta. The import tools are written to skip files with names that are already in the database. So, they uploaded about 590 new matrices and 720 new trees (compared, respectively, to 4348 files in the characters directory and 5151 files in the trees directory). Does this outcome look about right? That is, did the data directories contain files that were loaded into the database earlier, and did you NOT expect for them to affect the database?

--Vladimir

From: Vladimir G. <vla...@du...> - 2010-03-01 22:22:47
Apologies for the delay... Treebase-dev has been restored (from treebase-stage), and the web app restarted. It is up-to-date w.r.t. commits done Monday.

--Vladimir

On Mar 1, 2010, at 1:18 PM, Vladimir Gapeyev wrote:
> FYI: We are going to stop and restore the treebasedev DB instance in
> a few minutes. I'll let you know when it is done.
>
> [Sloppily, I ran the initialization script against it, which must
> have messed up many sequences.]
>
> --VG