From: Hilmar L. <hl...@du...> - 2009-04-21 21:48:11
|
We can work on the software and data migration tasks in parallel to hardware, absolutely.

-hilmar

On Apr 21, 2009, at 5:03 PM, Val Tannen wrote:

> On Apr 21, 2009, at 4:58 PM, Hilmar Lapp wrote:
>
>> On Mon, Apr 6, 2009 at 7:34 AM, Hilmar Lapp <hl...@du...> wrote:
>>> [...]
>>> 3. Hardware purchase & OS, software setup
>>
>> Just as an update that most (though possibly not all) of you will already know, the outside-service agreement with NESCent is in place, and we have invoiced. As soon as we have the funds we will purchase the hardware.
>
> Does this mean we wait with any transfer of data until the hardware arrives and is set up?
> I must tell you that we may have been lucky that nothing went too wrong with SDSC's installation so far and the faster we become independent of that the better...
>
> Val
>
>> -hilmar
>> --
>> ===========================================================
>> : Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu :
>> ===========================================================

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu :
===========================================================
|
From: Val T. <va...@ci...> - 2009-04-21 21:17:02
|
On Apr 21, 2009, at 4:58 PM, Hilmar Lapp wrote:

> On Mon, Apr 6, 2009 at 7:34 AM, Hilmar Lapp <hl...@du...> wrote:
>> [...]
>> 3. Hardware purchase & OS, software setup
>
> Just as an update that most (though possibly not all) of you will already know, the outside-service agreement with NESCent is in place, and we have invoiced. As soon as we have the funds we will purchase the hardware.

Does this mean we wait with any transfer of data until the hardware arrives and is set up?
I must tell you that we may have been lucky that nothing went too wrong with SDSC's installation so far and the faster we become independent of that the better...

Val

> -hilmar
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu :
> ===========================================================
|
From: Hilmar L. <hl...@du...> - 2009-04-21 21:02:59
|
On Apr 21, 2009, at 4:56 PM, Rutger Vos wrote:

> Hilmar,
>
>>> and subsequently, in broad strokes:
>>> 3. Hardware purchase & OS, software setup
>
> I'm curious about this step: what does it involve in practical terms to get to the point where I can ssh into rv...@tr... (or some such)?

Purchase & delivery of the hardware, virtualization environment to be set up, virtual slices to be created, OS installed and imaged, accounts to be created. Jon would have more details.

We'll also be doing testing of the host slices using our own development sites, and we'll be looking this and next week whether we can fast-track some of the hardware purchases so we can start testing earlier.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu :
===========================================================
|
From: Hilmar L. <hl...@du...> - 2009-04-21 20:58:18
|
On Mon, Apr 6, 2009 at 7:34 AM, Hilmar Lapp <hl...@du...> wrote:

> [...]
> 3. Hardware purchase & OS, software setup

Just as an update that most (though possibly not all) of you will already know, the outside-service agreement with NESCent is in place, and we have invoiced. As soon as we have the funds we will purchase the hardware.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu :
===========================================================
|
From: Rutger V. <rut...@gm...> - 2009-04-21 20:56:43
|
Hilmar,

>> and subsequently, in broad strokes:
>> 3. Hardware purchase & OS, software setup

I'm curious about this step: what does it involve in practical terms to get to the point where I can ssh into rv...@tr... (or some such)?

Rutger

--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com
|
From: Rutger V. <rut...@gm...> - 2009-04-21 20:53:12
|
This is Hilmar's earlier thread on the same topic Val just brought up: migration to NESCent. Let's work off of this, it's more detailed.

On Mon, Apr 6, 2009 at 7:34 AM, Hilmar Lapp <hl...@du...> wrote:

> Hi all,
>
> to start us off on planning and coordinating the next steps for the TreeBASE migration I'll repost part of an email I've sent to most of the people on this list previously, but which hasn't had any serious follow-up discussion yet. As we agreed, this would be the time and the place to have that discussion.
>
> Broadly, the next thing to work on would be moving the source code from the SDSC repository to SourceForge and have all developers work off of that, so that we can then branch off for the migration work.
>
> As for the actual planning, in principle here is the list of things I see on our plate to sort out:
>
> 1. Status update
>    1a. middleware and UI source code, outstanding bugs
>    1b. database schema definition
>    1c. data migration from TB1 and testing of result
>    1d. unit testing of TB2 code
>    1e. user & usability testing of TB2 UI
> 2. Moving source code to Sf.net
>    2a. licensing cleanup issue
>    2b. import of code base into sf.net svn
>    2c. switch development repositories, make SDSC repository read-only
>    2d. content pages (project homepage, documentation)
> 3. Procedures
>    3a. regular activity updates
>    3b. schema changes
>    3c. DAO and middleware API changes
>
> and subsequently, in broad strokes:
> 3. Hardware purchase & OS, software setup
> 4. Schema migration
> 5. Data migration script
> 6. Software migration
> 7. Testing
> 8. Flipping the switch
>
> Any and all thoughts and feedback appreciated.
>
> -hilmar
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu :
> ===========================================================

--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com
|
From: William H. P. <pi...@tr...> - 2009-04-07 14:03:35
|
[my first attempt to send this got quarantined by the list because my sending address was wrong -- sorry if you get this twice]

On Apr 7, 2009, at 1:15 AM, Rutger Vos wrote:

> These messages do NOT persist between sessions, i.e. when a user logs out and logs back in again, the messages are gone. The only way to fix that would be to add a column to the taxon labels table that can persistently flag labels as unresolved homonyms. I personally think that that would be excessive, so I have closed the bug report. Agreed?

agreed.

On Apr 7, 2009, at 2:24 AM, Hilmar Lapp wrote:

>> [...] when users submit any nexus file containing taxon labels (a tree file, a matrix file, or a combination thereof), TreeBASE initially flags these labels as not having been linked to an external taxonomy (i.e. uBio and NCBI). The user is then expected to go to the page with taxon labels, check all labels flagged as not linked (they have a red cross mark next to them) and hit "validate".
>
> I.e., if the user never goes to that page (or goes to the page but never takes any action), the taxon labels will never be validated?

yes.

> In other words, whether TreeBASE has taxon name service-validated taxon labels for a record is in the hands of the user submitting the record?

yes -- it's at the discretion of the editor whether to approve a submission that contains unmapped names. We can't oblige all names to always be mapped because not all names can be mapped, so we leave it up to the submitter to make an effort, and the editor can decide whether an honest effort has been made.

>> TreeBASE will then attempt to look up these labels in uBio - and from there in the NCBI taxonomy. In some cases, it will turn out that a label is a true homonym, i.e. multiple, actual taxa by that name exist - usually a plant taxon and an animal taxon (examples: Aotus, Abronia).
>> TreeBASE will pick the first option of the list of homonyms and warn the user that it did this, urging the user to check by hand whether that choice is the right one.
>
> So in case of homonyms, TreeBASE will basically pick one at random, and leave it to the user to correct it where it picks wrong? I.e., if the user doesn't make the corrections, TreeBASE will have incorrect data after this step 50% or more of the time?

Yes. The presence of a homonym is, in theory, a fairly rare event. One solution (for the future) is to make use of the mapping with the ncbi tables and create little gifs for major groups: arthropods, mollusks, birds, fish, mammals, etc etc, and then show these little gifs in the list of taxon labels. A cursory glance down the list would make it easy to see where things are wrong -- i.e. if a little flower crops up in a list full of mammals, you know something is likely to be wrong.

> I don't know enough details, but if this is true, we can probably agree that 1) TreeBASE shouldn't make data incorrect by the submission or curation UI, let alone store data that it made possibly incorrect,

This process does not modify the submitted data (= the cake). It's adding icing on the cake so that when users are searching TreeBASE, it's easier to find the cake that they want. The cake remains unadulterated, even if the icing is mismatched.

> and 2) if TreeBASE alters the user's data in non-trivial ways (i.e., other than changing formatting etc), then those changes should be traceable and the original values retained, so that they can be reviewed and corrected given better knowledge.

And this is as it is. The only "original values" that can be supplied by a nexus are the taxon labels. Those don't change unless the submitter changes them by hand. The mapping of each label to uBio or NCBI is metadata enhancement that was not part of the original data.

>> [...]
>> The only way to fix that would be to add a column to the taxon labels table that can persistently flag labels as unresolved homonyms. I personally think that that would be excessive, so I have closed the bug report. Agreed?
>
> So, three thoughts here for TreeBASE's (as a project community) consideration.
>
> 1) An issue is open until it is either fixed, or the stakeholders decide to reject the issue for whatever reason. One of the reasons for rejection might be excessive effort needed to fix the issue, but developers shouldn't make the decision. (As a developer, you really don't want to spend your time debating with users whether a certain amount of effort is excessive or not.)

At this point, "excessive effort" is a reason for rejection because we really need to get something out the door soon.

> Who are the TreeBASE stakeholders for the purposes of development prioritization and issue rejection?

The four of us (Val, Rutger, Mark, and I) have been collectively ranking bugs and features as either pre-beta, post-beta, or for some time in the future -- though I contribute more about how important a bug fix or feature is to the user experience and mission of the database, while Rutger and Mark contribute more about feasibility, and Val keeps us all marching forward.

> 2) Each feature needs clear and precise requirements. A bug normally indicates that a certain requirement of the system is not met. A bug is not fixed until the requirements violations being reported are fixed. If the requirements of a feature aren't spelled out clearly to begin with, it becomes difficult later to determine whether an issue reported with the feature is really a new feature request or a requirements violation. (An issue report that really is a new feature request is still a valid issue report. However, it is often prioritized differently than fixing an incorrectly implemented feature whose requirements have previously been agreed upon.)
>
> 3) Stakeholders sign off on requirements. (They may be suggested by, but are not unilaterally declared by developers.)
>
> In this case, if the requirements of the taxon label validation against a taxon name service feature have been agreed upon to include that the "resolved" taxon label being stored not be wrong due to homonymy, then anything short of guaranteeing that is an incorrect implementation that needs to be fixed.
>
> So what are the requirements for that feature?

Well, we've done it in two ways: I describe the general behavior and requirements (e.g. in the message below); and sometimes (e.g. in this case) I'd make a mock-up in Perl and let the developers implement this in Java. In my Perl mockup, the homonym resolution is shown on the same page as the list of taxon labels -- so for me this wasn't an issue. Also, I think my default for homonyms was "no match" -- forcing any mapping to be user-initiated. Because the Java developers (mainly Madhu at SDSC) chose to have the homonym resolution occur on individual pages (not the list), homonym mapping can get lost to the viewer, and this became an issue that otherwise did not reveal itself in the "requirements" mockup.

bp

Feb 22 2008:

Begin forwarded message:

> Here's the thing: in an ideal world, we would take a taxon label (e.g. "Homo sapiens Frenchman343"), send it to uBIO's web services, and uBIO would "resolve" it to a single taxon name (e.g. "Homo sapiens"), and return to us a unique id for that taxon (e.g. 12345). Then we would simply see if that id already exists in our taxa table. If it does, we map the "Homo sapiens Frenchman343" taxon_label record to the "Homo sapiens" taxa record; or, if we don't have a taxa record with id 12345, we would create a new one and then map the taxon_label record to it. Simple.
>
> So that's in an ideal world -- and that was the solution that uBIO's programmers seemed to suggest back when we designed the separate taxon_labels and taxa tables. But no such luck -- it turns out that uBIO does not have a "preferred name" -- all it can really do is recognize the string "Homo sapiens", return a namebankid for that taxon variant, and on a good day return a list of other name variants that are linked to this name (e.g. lexical variants and homotypic synonyms) -- but it does not have a "group ID" to identify this collection of name variants. So for example, author A submits a tree with the label "Homo sapiens Linnaeus, 1758" which resolves to namebankid 109086, author B submits a tree with label "Homo sapiens" but that resolves to namebankid 2481730. Two different IDs for what are two different name variants -- but both really refer to the same species. Yet there is no "species id" or "group id" to indicate their sameness, and there is no reliable way of deciding which name to use as the name to represent the species in our taxa table. Bummer.
>
> Now, if you examine the namebank object for either variant, you'll discover that each cites the other as another name variant -- so retrieving the namebank object for either one allows us to collect all other variants. Also, uBIO has mapped each taxon group with other taxonomic databases out there, for example ncbi.
>
> So here's the solution: for each taxon label, we match it against uBIO and retrieve a namebank object for it -- from there we collect all name variants (which we store in a different table from our taxa table) and we also collect the ncbi taxid and ncbi's preferred name (which we store in the taxa table).
> We're using the taxid to establish the "oneness" of the taxa table -- if there's no taxid, we will scan all the variants in search of one that uBIO has labeled "canonical form" and use that string and its namebankid in our taxa table.
>
> There are a couple of potential snags with this. One is that uBIO's classificationbank (which stores various classifications, including ncbi's) has a nasty habit of substituting their own id numbers and not delivering ncbi's id numbers. But we need ncbi's ids for our higher taxon searching. So how else do we get the ncbi ids? -- well, taxonFinder has a switch that causes it to list "outlinks", including urls to ncbi. We can extract ncbi ids from there. Unfortunately, this is unreliable -- it doesn't always report ncbi links, even when it should. The most consistent way to recover ncbi taxids is to scrape the uBIO web page.
>
> But here's the catch: believe it or not, uBIO does not properly disambiguate between gross homonyms. These are instances where one name means two completely different species, and it can happen because the names of plants, animals, and bacteria are governed by different bodies -- so they can stomp on each other's names. So for example, if I go to the uBIO web page and search for "Aotus", I get back a single result. If you look at that page, you'll notice that it says "Aotus" is a "synonym" of both "Aotus Illiger, 1811" and "Aotus J. E. Smith 1805", which means that the name has two different authors. Turns out Illiger was describing an owl monkey in 1811 while J. E. Smith was describing a bean plant. The only hint that we have two different species is in the outlinks provided by taxonFinder: if there are two ncbi ids, then there is a possible homonym.
> My solution is to use ncbi ids as a flag for a possible homonym problem, but I'm forced to list all taxon variants under each id -- let the author make the best choice, and we can later delete the duplicate taxon variants so that different sets of variants link to different records in the taxa table.
>
> I hope it's not been too confusing. I'll step-by-step go through an example of how this works. First, here are the tables that I'm using:
>
> CREATE TABLE taxa (
>     taxon_id integer NOT NULL,
>     namebankid integer,                 -- from uBIO
>     namestring character varying(255),  -- the "preferred" name for the species
>     taxid integer,                      -- from ncbi
>     groupcode integer                   -- (not used yet - I'm thinking of a kingdom code or such)
> );
>
> CREATE TABLE taxon_variants (
>     taxon_variant_id integer NOT NULL,
>     taxon_id integer,
>     namebankid integer,                     -- from uBIO
>     namestring character varying(255),      -- uBIO's name variant (short style)
>     fullnamestring character varying(255),  -- uBIO's name variant (long style)
>     lexicalqualifier character varying(30)  -- uBIO's qualifier (e.g. "canonical form")
> );
>
> CREATE TABLE taxon_labels (
>     taxon_label_id integer NOT NULL,
>     taxon_variant_id integer,
>     legacy_id character varying(10),    -- this is TreeBASE's old taxon label id
>     study_id integer,                   -- this table is scoped to the study
>     taxon_label character varying(255)  -- the label for the leaf node
> );
>
> ... each "taxa" record relates to one or more "taxon_variants" records; each "taxon_variants" record relates to many "taxon_labels" records. The uBIO web services are described here. This is a special release that is best for our purposes: we need to use the "http://www.ubio.org/webservices/service_internal.php" services not the standard "http://www.ubio.org/webservices/service.php" ones.
>
> For example, suppose that an author submits a tree that has a taxon label called "Tetrao afer AB2342". The "AB2342" is just a suffix code of some sort.
>
> Step 1.
> We should try to match this name against fullnamestring in the taxon_variants table, because if there is a match then we don't have to use uBIO's web services. I suggest matching in two ways: (1) a direct match, and (2) looking to see if there are three or more "words" and then test whether the third word has at least one digit in it. If it does, then chop off everything after the second word, and search on that -- i.e. search on "Tetrao afer". If the search is successful, then we're done -- map the taxon_label record to the found record.
>
> Let's assume that there is no match.
>
> Step 2. Without a match with an existing record, next we should use taxonFinder to try to detect any taxon names in the string "Tetrao afer AB2342". In this case, taxonFinder can easily locate "Tetrao afer" -- but authors often forget to separate the species epithet from a number or letter code. For example, "Tetrao aferAB2342" will cause taxonFinder to only locate "Tetrao" -- which would be wrong. So my suggestion is that the string that we throw at taxonFinder should be modified so that any letter followed by a digit be separated by a space; and any lowercase letter followed by an upper case letter also be separated by a space. In the case of "Tetrao aferAB2342" we would first modify it into "Tetrao afer AB 2342". Now taxonFinder can easily find "Tetrao afer". Here's what the URL looks like:
>
> http://www.ubio.org/webservices/service_internal.php?function=taxonFinder&includeLinks=1&freeText=Tetrao+afer+AB+2342&version=2.0
>
> Follow that, and you'll get an XML that reports the namebankid -- which in this case is 2576335.
>
> Unfortunately the links to ncbi are not included in this result, even though for most records they are. For example, here's the result for Homo sapiens, which shows the link to ncbi taxid 9606. If we detect an ncbi link returned by taxonFinder, we should keep it for use in our taxa entry.
>
> Step 3.
> If taxonFinder fails to return an ncbi taxid (as it did for Tetrao afer), the next step is to scrape uBIO's web page. For example, if you follow this url:
>
> http://www.ubio.org/browser/details.php?namebankID=2576335
>
> ... you'll notice that the ncbi link is present there. The ncbi taxid turns out to be 389023, and if you follow this ncbi link, you discover that ncbi's preferred name is "Pternistis afer".
>
> Step 4. Next, we will use this namebankid (2576335) to fetch the namebank object. Here's what the url looks like:
>
> http://www.ubio.org/webservices/service_internal.php?function=namebank_object&namebankID=2576335&version=2.0&keyCode=2c6d5eccba2627906481774fdcb60669c2ebee72
>
> The keyCode is a special number that identifies me as the user (you have to register to get one). This returns a long XML full of stuff. At the top level, we see: <nameString>Tetrao afer</nameString> and <canonicalForm>Tetrao afer</canonicalForm>. Then there are two important couplets: the <lexicalGroups> section and the <basionymGroup> section -- these sections list different name variants. Each entry has its own namebankid. My strategy is to collect all namebankids and treat them as different name_variants. So for each there are four bits of information: the namebankid, the nameString, the fullNameString, and the lexicalQualifier. For the top level entry, I first treat the <nameString> as a <fullNameString> and the <canonicalForm> as a <nameString>.
> So here are all the names retrieved from the namebank object:
>
> 2576335
> Tetrao afer
> Tetrao afer
> NULL
>
> 275422
> Tetrao afer
> Tetrao afer PLS Müller 1776
> unknown (Default)
>
> 11817
> Pternistis afer
> Pternistis afer (Statius Muller) 1776
> NULL
>
> 12294
> Francolinus afer
> Francolinus afer (PLS Müller 1776)
> NULL
>
> 23417
> Francolinus afer afer
> Francolinus afer afer (PLS Müller 1776)
> NULL
>
> 274343
> Pternistis afer afer
> Pternistis afer afer (PLS Müller 1776)
> NULL
>
> 275422
> Tetrao afer
> Tetrao afer PLS Müller 1776
> NULL
>
> 1559020
> Pternistes afer afer
> Pternistes afer afer (PLS Müller 1776)
> NULL
>
> 1762020
> Pternistes afer
> Pternistes afer
> NULL
>
> 2475119
> Francolinus afer afer
> Francolinus afer afer
> NULL
>
> 2576309
> Pternistis afer afer
> Pternistis afer afer
> NULL
>
> 3345669
> Pternistes afer afer
> Pternistes afer afer
> NULL
>
> ... so in this case there are 12 different name variants. Since I know that ncbi prefers the name "Pternistis afer", I'm going to match this against the <fullNameString> of all the variants, and in this way I discover that the best namebankid to use is 1762020. If I didn't have an ncbi preferred name, I would look for a <lexicalQualifier> that says "canonical form" (in this case none of them do -- most have NULL -- but that's unusual: most namebank objects provide this info). If there's no ncbi preferred name, and if there's no canonical form qualifier, then I'd pick the top level name, in this case 2576335.
>
> That's the procedure. It's a bit convoluted, but I've implemented all this here:
>
> http://www.waterflea.org/tbn/
>
> the password is "sdsc". There are two databases: test database and real database. Test database is just for fooling around and testing different names. Feel free to test all kinds of strings in there. The "real" database is for me to put all the legacy TreeBASE taxon labels in so that I can match them against real names.
> Mark can click the "Dump Data" button to get a download of all these mappings -- and then use them in the data migration process.
>
> You can play with the Test database. When you enter the password ("sdsc") and click "Enter Database", next you'll see a box to paste in some names. I've "pre-entered" some old TreeBASE taxon labels, each followed by a tab and followed by a legacy TreeBASE taxon_id. Click "Submit Names". Now you see a list of matches between the taxon label and a taxon variant taken from uBIO.
>
> So imagine you're a submitting author -- all the labels in your trees wind up in a table like this, get matched against uBIO (etc), and then for each name you have to select a radio button. If the match is a false match, you can always select "No match". If taxonFinder can't find any match, there will be no corresponding uBIO namebankIDs -- and in that case the author is encouraged to check that there is no misspelling.
>
> By clicking "Submit Matches", the matches between legacy TreeBASE ids and their respective taxon variants are recorded in the taxon_labels table, which is then available to Mark when he clicks the "Dump Data" button.
>
> You notice that with this list of names, the matching is almost instantaneous -- that's because I had already clicked the "Submit Names" button before, so all these variants had already been sucked over from uBIO. Since the first step is to match the taxon_labels against the existing local database, we never get to the point of web scraping uBIO. To properly see this in action, you'll need to enter a new set of taxon labels. Feel free to do this, but be careful not to paste in a list that is too long, else the latency will be too great (each is tested against uBIO sequentially, so longer lists take longer to return).
>
> So ... how are you guys to implement this in TreeBASE2??
>
> Where: It has to be used in two places: one is in the taxon summary section, which lists all the taxon labels from all the matrices and trees. The other place is where authors have created row segments -- in cases where they want to assign different species to different row segments.
>
> How: One option is to take my perl dribble and rewrite it in java. The thing is, I'm worried that these web services are quite "dynamic" -- meaning they could change on us at any moment.. so this functionality needs to be designed so that it can be easily and quickly modified. I don't know what you guys think, but my thought is that perhaps this should be written in perl. Each time TreeBASE2 needs taxonomic intelligence, the java talks to a perl script and then the perl script does all the web scraping etc. What do you think?
>
> Issues:
>
> - my perl deals with each name sequentially. This is rather slow -- we need to do multiple names in parallel (but maybe not too many, not to overwhelm uBIO... How about a spawning system where ten names go off at a time, each separated by a millisecond?)
> - I did not design-in how to handle time-outs (e.g. moments when uBIO or ncbi is unavailable)
> - obviously, my code is riddled with security holes
> - uBIO offers SOAP as well as regular HTTP API retrieval (I used HTTP) -- maybe you'd prefer SOAP?
> - ncbi uses a web API -- but note that if we're going to keep an up-to-date local copy of the ncbi taxonomy tree (for higher name searching) we could query that instead -- it would be much faster.
>
> Attached, please find my perl, web page templates, and the database structure written in sql (postgres flavored). The main script that does all the taxonomic intelligence is called parse.pl. I hope I put enough comments for you to figure out how it works.
>
> later,
>
> Bill
|
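The label clean-up in Steps 1 and 2 of the forwarded message, the "two ncbi ids" homonym hint, and the rule for choosing the name that represents a taxon can be sketched in a few lines. This is a minimal Python sketch under stated assumptions, not the parse.pl attached to the message: every function and class name below is hypothetical, the exact-string comparison against fullNameString is an assumption (the message does not say how that match is done), and no uBIO or ncbi service is called.

```python
import re
from typing import List, NamedTuple, Optional, Sequence


class Variant(NamedTuple):
    """One name variant, mirroring the four fields listed in the message."""
    namebankid: int
    namestring: str
    fullnamestring: str
    lexicalqualifier: Optional[str]


def local_match_candidates(label: str) -> List[str]:
    """Step 1: strings to try against the local taxon_variants table.

    Always the label itself; if there are three or more words and the
    third word contains a digit, also the first two words on their own.
    """
    candidates = [label]
    words = label.split()
    if len(words) >= 3 and any(ch.isdigit() for ch in words[2]):
        candidates.append(" ".join(words[:2]))
    return candidates


def split_label_tokens(label: str) -> str:
    """Step 2: separate fused codes before handing the string to a name finder.

    Inserts a space between a letter and a following digit, and between
    a lowercase letter and a following uppercase letter.
    """
    spaced = re.sub(r"(?<=[A-Za-z])(?=\d)", " ", label)
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", spaced)


def is_possible_homonym(ncbi_taxids: Sequence[int]) -> bool:
    """More than one distinct ncbi id among the outlinks hints at a homonym."""
    return len(set(ncbi_taxids)) > 1


def pick_representative(variants: Sequence[Variant],
                        ncbi_preferred_name: Optional[str] = None) -> Variant:
    """Choose the variant that stands for the taxon.

    Order of preference, as described in the message: a variant matching
    ncbi's preferred name, then one qualified as "canonical form", then
    the top-level entry (assumed here to be variants[0]).
    """
    if ncbi_preferred_name is not None:
        for variant in variants:
            # Assumption: plain equality; the real matching may be looser.
            if variant.fullnamestring == ncbi_preferred_name:
                return variant
    for variant in variants:
        if variant.lexicalqualifier == "canonical form":
            return variant
    return variants[0]
```

For the example label in the message, `split_label_tokens("Tetrao aferAB2342")` returns `"Tetrao afer AB 2342"`, and `local_match_candidates("Tetrao afer AB2342")` returns the full label followed by `"Tetrao afer"`. Keeping these as small pure functions, separate from any web-service call, is one way to address the concern that the services could change at any moment: only the fetching code would need to be rewritten.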
From: Hilmar L. <hl...@du...> - 2009-04-07 06:24:37
|
At the risk of starting a fire, a few thoughts from me below.

On Apr 7, 2009, at 1:15 AM, Rutger Vos wrote:

> [...] when users submit any nexus file containing taxon labels (a tree file, a matrix file, or a combination thereof), TreeBASE initially flags these labels as not having been linked to an external taxonomy (i.e. uBio and NCBI). The user is then expected to go to the page with taxon labels, check all labels flagged as not linked (they have a red cross mark next to them) and hit "validate".

I.e., if the user never goes to that page (or goes to the page but never takes any action), the taxon labels will never be validated? In other words, whether TreeBASE has taxon name service-validated taxon labels for a record is in the hands of the user submitting the record?

> TreeBASE will then attempt to look up these labels in uBio - and from there in the NCBI taxonomy. In some cases, it will turn out that a label is a true homonym, i.e. multiple, actual taxa by that name exist - usually a plant taxon and an animal taxon (examples: Aotus, Abronia). TreeBASE will pick the first option of the list of homonyms and warn the user that it did this, urging the user to check by hand whether that choice is the right one.

So in case of homonyms, TreeBASE will basically pick one at random, and leave it to the user to correct it where it picks wrong? I.e., if the user doesn't make the corrections, TreeBASE will have incorrect data after this step 50% or more of the time?

> [...] I have now implemented the following: a map (keys: taxon label IDs, values: taxon label strings) called "homonyms" is created during taxon label validation and carried around the session. Every time a homonym is manually resolved, that entry is deleted from the session.

That doesn't solve the above, does it? Specifically, because of:

> [...]
These messages do NOT persist between sessions is the fact that some of the taxon labels are now wrong lost if the user chooses not to complete the homonym corrections? I don't know enough details, but if this is true, we can probably agree that 1) TreeBASE shouldn't make data incorrect by the submission or curation UI, let alone store data that it made possibly incorrect, and 2) if TreeBASE alters the user's data in non-trivial ways (i.e., other than changing formatting etc), then those changes should be traceable and the original values retained, so that they can be reviewed and corrected given better knowledge. (For example, a synonym assignment in uBio might turn out wrong a month later.) > [...] The only way to fix that would be to add a column to the taxon > labels table that can persistently flag labels as unresolved > homonyms. I personally think > that that would be excessive, so I have closed the bug report. Agreed? So, three thoughts here for TreeBASE's (as a project community) consideration. 1) An issue is open until it is either fixed, or the stakeholders decide to reject the issue for whatever reason. One of the reasons for rejection might be excessive effort needed to fix the issue, but developers shouldn't make the decision. (As a developer, you really don't want to spend your time debating with users whether a certain amount of effort is excessive or not.) Who are the TreeBASE stakeholders for the purposes of development prioritization and issue rejection? 2) Each feature needs clear and precise requirements. A bug normally indicates that a certain requirement of the system is not met. A bug is not fixed until the requirements violations being reported aren't fixed. If the requirements of a feature aren't spelled out clearly to begin with, it becomes difficult later to determine whether an issue reported with the feature is really a new feature request or a requirements violation. 
(An issue report that really is a new feature request is still a valid issue report. However, it is often prioritized differently than fixing an incorrectly implemented feature whose requirements have previously been agreed upon.) 3) Stakeholders sign off on requirements. (They may be suggested by, but are not unilaterally declared by developers.) In this case, if the requirements of the taxon label validation against a taxon name service feature have been agreed upon to include that the "resolved" taxon label being stored not be wrong due to homonymy, than anything short of guaranteeing that is an incorrect implementation that needs to be fixed. So what are the requirements for that feature? -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu : =========================================================== |
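The point in the message above about keeping changes traceable and retaining the original values could look something like the sketch below. It is an illustration only, not the TreeBASE schema or code: the `TaxonLabel` and `LabelChange` classes, their fields, and the `assign` method are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LabelChange:
    """One recorded change to the taxon a label is linked to."""
    old_taxon: Optional[str]
    new_taxon: Optional[str]
    source: str  # e.g. "uBio lookup" or "submitter" (hypothetical values)


@dataclass
class TaxonLabel:
    original_text: str            # the submitted label; never overwritten
    taxon: Optional[str] = None   # current link to an external taxonomy
    needs_review: bool = False    # True while an automatic choice is unconfirmed
    history: List[LabelChange] = field(default_factory=list)

    def assign(self, taxon, source, candidates=1):
        """Link the label to `taxon`, recording what it replaced and why."""
        self.history.append(LabelChange(self.taxon, taxon, source))
        self.taxon = taxon
        # More than one candidate means the choice was a guess among homonyms.
        self.needs_review = candidates > 1
```

With a record like this, an automatic pick among homonyms stays marked for review, and a later correction can still see both the submitted text and every earlier assignment.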
From: Rutger V. <rut...@gm...> - 2009-04-07 05:16:00
|
Hi,

issue 2712225 is, cryptically, about the following issue:

> 2- The warning about "had multiple hits - defaulted to first hit **VALIDATE BY HAND**" is good, except once I try to validate one of them, this message disappears, so I don't know which other ones still need to be validated by hand. Is there a way to continue posting this flag each time the page is revisited?

(https://sourceforge.net/tracker/index.php?func=detail&aid=2712225&group_id=248804&atid=1126676)

As this is the first post to treebase-devel, here's some context: when users submit any nexus file containing taxon labels (a tree file, a matrix file, or a combination thereof), TreeBASE initially flags these labels as not having been linked to an external taxonomy (i.e. uBio and NCBI). The user is then expected to go to the page with taxon labels, check all labels flagged as not linked (they have a red cross mark next to them) and hit "validate".

TreeBASE will then attempt to look up these labels in uBio - and from there in the NCBI taxonomy. In some cases, it will turn out that a label is a true homonym, i.e. multiple, actual taxa by that name exist - usually a plant taxon and an animal taxon (examples: Aotus, Abronia). TreeBASE will pick the first option of the list of homonyms and warn the user that it did this, urging the user to check by hand whether that choice is the right one.

Issue 2712225 complained that this warning disappears after the first time the user viewed the validation results page. In the case of multiple homonyms (e.g. when a nexus file contained both Aotus and Abronia) this is problematic because now the user needs to remember which labels (potentially from a very large list) to resolve. In follow-up discussion, Bill clarified the issue as follows:

> So here's the problem: the user does a "Validate" and gets back a list of "**VALIDATE BY HAND**" messages. He picks the first one to validate by hand, fixes that, then goes back to the list to pick the next one that needs validating by hand. But now the warning messages are gone, and he can't possibly remember which other ones still need hand validation. Ideally the messages would stick around until each case that requires special attention is dealt with -- but I can understand that we don't want to constantly trigger re-validation. One solution is that any label that has multiple mappings gets a little "M" symbol (or whatever) in a column on the Taxon Labels Information page -- that always gets displayed whether hand validation has been used or not (it's just a property of the row). Another solution is that when there are VALIDATE BY HAND messages, these get posted to a pop-up window which can stick around while the user is correcting each one in the main page. The third solution is for me to try to give sufficiently clear instructions that the user should copy/paste the VALIDATE BY HAND messages to a separate text document, so as to save these warnings until all have been fixed.

As a fix for this as of revision 6382, I have now implemented the following: a map (keys: taxon label IDs, values: taxon label strings) called "homonyms" is created during taxon label validation and carried around the session. Every time a homonym is manually resolved, that entry is deleted from the session.

This means that warning messages about unresolved homonyms on the taxon labels screen now persist for the duration of the session - as opposed to disappearing after the first viewing. Every time a homonym is resolved by hand = one less nagging warning message at the top of the taxon labels screen.

These messages do NOT persist between sessions, i.e. when a user logs out and logs back in again, the messages are gone. The only way to fix that would be to add a column to the taxon labels table that can persistently flag labels as unresolved homonyms. I personally think that that would be excessive, so I have closed the bug report. Agreed?

Rutger
|
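The difference between the session-scoped "homonyms" map described above and a persistent flag can be shown with a small sketch. This is not the code from revision 6382 (which is Java); it is a Python illustration in which `HomonymWarnings` and its methods are hypothetical, and a plain dict stands in both for the session and for a flag column on the taxon labels table.

```python
class HomonymWarnings:
    """Tracks taxon labels that still need hand validation.

    `store` maps taxon label IDs to label strings. Passing a fresh dict per
    login mimics session-scoped storage; passing a longer-lived mapping
    stands in for a persistent flag (an assumption for illustration).
    """

    def __init__(self, store):
        self.store = store

    def flag(self, label_id, label):
        """Record that a label had multiple hits during validation."""
        self.store[label_id] = label

    def resolve(self, label_id):
        """Drop the warning once the homonym has been resolved by hand."""
        self.store.pop(label_id, None)

    def messages(self):
        """Warning lines to show at the top of the taxon labels screen."""
        return [f"{label}: had multiple hits - VALIDATE BY HAND"
                for _, label in sorted(self.store.items())]
```

With a new dict per login the warnings disappear at logout, which is the behaviour described in the message; with a shared store they survive until each label is resolved.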
From: Rutger V. <rut...@gm...> - 2009-04-06 22:05:18
|
> I would actually advise against it. It would make public any user or account names, host names, and passwords that were ever mistakenly committed to the repository.

Good point.

> You can (and in fact should) still archive a complete dump of the current repository at the time of switching to Sf.net in the event that you want to go back later and find out about who originated a piece of code or to retrieve a file that used to be there but was deleted later.

Ah, yes. Let's do that.

--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com
|
From: Hilmar L. <hl...@du...> - 2009-04-06 21:57:52
|
On Apr 6, 2009, at 5:48 PM, Rutger Vos wrote:

> I believe this can be done

It can be done, there's a tool to export and one to import the svn repository. It requires admin help, though, i.e., we'd be dependent on SourceForge support staff and their prioritization of time to assist us with doing this.

> but I'm not sure and I don't know how to do it.

I would actually advise against it. It would make public any user or account names, host names, and passwords that were ever mistakenly committed to the repository.

I'd advise to start with a clean slate that we have convinced ourselves is free of cruft (at least as far as entire files are concerned), free of information that would make some system vulnerable to security breach, and free of bogus, obsolete, or inapplicable license information.

You can (and in fact should) still archive a complete dump of the current repository at the time of switching to Sf.net in the event that you want to go back later and find out about who originated a piece of code or to retrieve a file that used to be there but was deleted later.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu :
===========================================================
|
From: Rutger V. <rut...@gm...> - 2009-04-06 21:48:17
|
Does anyone have any experience with moving a project to sourceforge while retaining its revision history? I believe this can be done but I'm not sure and I don't know how to do it.

On Mon, Apr 6, 2009 at 7:34 AM, Hilmar Lapp <hl...@du...> wrote:
> Hi all,
>
> to start us off on planning and coordinating the next steps for the TreeBASE migration I'll repost part of an email I've sent to most of the people on this list previously, but which hasn't had any serious follow-up discussion yet. As we agreed, this would be the time and the place to have that discussion.
>
> Broadly, the next thing to work on would be moving the source code from the SDSC repository to SourceForge and have all developers work off of that, so that we can then branch off for the migration work.
>
> As for the actual planning, in principle here is the list of things I see on our plate to sort out:
>
> 1. Status update
>    1a. middleware and UI source code, outstanding bugs
>    1b. database schema definition
>    1c. data migration from TB1 and testing of result
>    1d. unit testing of TB2 code
>    1e. user & usability testing of TB2 UI
> 2. Moving source code to Sf.net
>    2a. licensing cleanup issue
>    2b. import of code base into sf.net svn
>    2c. switch development repositories, make SDSC repository read-only
>    2d. content pages (project homepage, documentation)
> 3. Procedures
>    3a. regular activity updates
>    3b. schema changes
>    3c. DAO and middleware API changes
>
> and subsequently, in broad strokes:
> 3. Hardware purchase & OS, software setup
> 4. Schema migration
> 5. Data migration script
> 6. Software migration
> 7. Testing
> 8. Flipping the switch
>
> Any and all thoughts and feedback appreciated.
>
> -hilmar
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu :
> ===========================================================
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Treebase-devel mailing list
> Tre...@li...
> https://lists.sourceforge.net/lists/listinfo/treebase-devel
>

--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com
|
From: Hilmar L. <hl...@du...> - 2009-04-06 14:34:47
|
Hi all,

to start us off on planning and coordinating the next steps for the TreeBASE migration I'll repost part of an email I've sent to most of the people on this list previously, but which hasn't had any serious follow-up discussion yet. As we agreed, this would be the time and the place to have that discussion.

Broadly, the next thing to work on would be moving the source code from the SDSC repository to SourceForge and have all developers work off of that, so that we can then branch off for the migration work.

As for the actual planning, in principle here is the list of things I see on our plate to sort out:

1. Status update
   1a. middleware and UI source code, outstanding bugs
   1b. database schema definition
   1c. data migration from TB1 and testing of result
   1d. unit testing of TB2 code
   1e. user & usability testing of TB2 UI
2. Moving source code to Sf.net
   2a. licensing cleanup issue
   2b. import of code base into sf.net svn
   2c. switch development repositories, make SDSC repository read-only
   2d. content pages (project homepage, documentation)
3. Procedures
   3a. regular activity updates
   3b. schema changes
   3c. DAO and middleware API changes

and subsequently, in broad strokes:
3. Hardware purchase & OS, software setup
4. Schema migration
5. Data migration script
6. Software migration
7. Testing
8. Flipping the switch

Any and all thoughts and feedback appreciated.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu :
===========================================================
|
From: Hilmar L. <hl...@du...> - 2009-04-02 16:28:30
|
Welcome to this mailing list - being neither a TreeBASE developer, nor really a TreeBASE user, nor one of the brains behind it, I feel somewhat odd to be making this first post, but nonetheless and for future records, on behalf of the TreeBASE project I'd like to warmly welcome everyone to this list who is here now or joins in the future.

To my knowledge for the first time, the TreeBASE code base is moving to a public source code repository, for now hosted at SourceForge, where everyone can look at the code, reuse it, and help in fixing its problems or adding new features. The TreeBASE development will be discussed on this public mailing list, where everyone can weigh in, contribute ideas, and cheer on the developers.

So while there isn't much here yet just now aside from three mailing lists, this prospect is very exciting to me and I'm hoping that before long TreeBASE will be regarded as a promising model for developing community database resources in a transparent, sustainable, and inclusive fashion that ensures the best value for the community.

Cheers,

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu :
===========================================================
|
From: Rutger V. <rut...@gm...> - 2009-03-24 05:13:57
|
Woohoo.

--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com
|