Re: [Treebase-devel] Citation metadata for 2009 additions


Data import is handled not by Perl scripts but by Java programs. Step
1 is handled by org.cipres.treebase.util.AuxiliaryDataImporter. Steps 2
and 3 are handled by org.cipres.treebase.util.CitationDataImporter.
Both are standalone Java programs (i.e. each has a main() method). They,
and related programs, are located in
treebase-core/src/main/java/org/cipres/treebase/util/*, which also
includes a bare-bones package.html documentation file. These programs
aren't yet documented on the wiki.

On Tue, Feb 2, 2010 at 11:05 PM, William Piel <wil...@ya...> wrote:
> On Feb 2, 2010, at 3:23 PM, Vladimir Gapeyev wrote:
>
> I just went through ... the instructions at
> https://sourceforge.net/apps/mediawiki/treebase/index.php?title=DataDumps
>
> It should have been described there, I guess Mark J. D. forgot to add that
> step.  Can this be inferred from code that Mark has written? (e.g. perhaps
> in his perl directory).  Or do we need to contact Mark again?
> I'll go over the basics as follows:
> Step 1- after the dump.txt has been processed, what you have is a set of
> study records and a set of author records (plus all the other tables: trees,
> matrices, etc). Each study record links to a citation record where (1) the
> entire citation is stored as a single string in the title field, and (2) the
> abstract is stored in the abstract field. Additionally, for each citation
> record there is a set of author records in which author first names are
> written out in full and author emails are included where available (I guess
> this would be in the "person" table) -- unfortunately the order of the
> authors is not known (or does not reflect the real order of the names in the
> associated paper).
> Step 2- what the citations_utf8.zip file contains is a tab-separated text of
> all the citation data in much more granular form: i.e. different columns
> contain different fields -- title, journal, page numbers, etc.  So the easy
> part is to update the citation table, so that instead of storing the entire
> citation in one field, all the bits are parsed out into separate
> fields. Since one column in the citations_utf8.zip file contains the legacy
> study_id, and a field in the study table also contains the legacy study_id,
> you can use the matching between these two in order to know which citation
> record to update with which row of data.  That's the easy part.
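A minimal Python sketch of the Step 2 matching, for illustration only; it
is not the code in CitationDataImporter. It assumes the unzipped file has
no header row and uses the 21-column order listed further down in this
message. The study_to_citation mapping is a hypothetical stand-in for a
query against the study table, and the function names are made up here.

```python
import csv
import io

# Column order as listed in the quoted message (21 tab-separated columns).
COLUMNS = [
    "pub_type", "author", "year", "title", "s_author", "s_title",
    "place_pub", "publisher", "volume", "num_of_vols", "number", "pages",
    "section", "edition", "isbn", "label", "keywords", "abstract",
    "study_id", "url", "doi",
]


def read_citation_rows(text):
    """Parse tab-separated citation text into dicts keyed by column name.

    Assumes there is no header row. Short rows are padded with empty strings.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        padded = row + [""] * (len(COLUMNS) - len(row))
        rows.append(dict(zip(COLUMNS, padded)))
    return rows


def match_rows_to_citations(rows, study_to_citation):
    """Pair each row with the citation of the study sharing its legacy study_id.

    study_to_citation is a hypothetical {legacy study_id: citation id} mapping.
    Returns (updates, unmatched), where updates maps citation id -> row.
    """
    updates, unmatched = {}, []
    for row in rows:
        citation_id = study_to_citation.get(row["study_id"].strip())
        if citation_id is None:
            unmatched.append(row)  # no study carries this legacy id
        else:
            updates[citation_id] = row
    return updates, unmatched
```

Rows whose study_id has no counterpart are returned separately so they can
be checked by hand instead of being dropped silently.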
> Step 3- the difficult part is to update the citation_author
> and  citation_editor tables with the correct order of the authors (these are
> bridge tables between the person table and the citation table).
> The citations_utf8.zip file contains a column that lists the primary authors
> separated with semi-colons, like so:
> "Lapp, H.; Piel, W. H.; Gapeyev, V."
> while the persons table has the following records, in no particular order,
> for example:
> 1  William H.  Piel  wil...@ya...
> 2   Hilmar        Lapp      hl...@ne...
> 3  Vladimir      Gapeyev   vla...@du...
> What's needed is a script that splits the string of abbreviated names
> ("Lapp, H.; Piel, W. H.; Gapeyev, V.") on the semicolons, clips out each
> last name (i.e. the part before the comma), learns the order of the
> names, and then uses that to reorder the full names + email addresses by
> updating the citation_author bridge table. Likewise, the secondary authors
> column in citations_utf8.zip needs to be used to reorder the related records
> in the citation_editor bridge table. Unfortunately, it is not uncommon for
> two authors to have the same last name (e.g. the husband and wife team,
> Barbara and Mike Wingfield, have tons of records in TreeBASE), so in those
> cases you need to match the last names plus the first initial in order to
> know how to reorder the citation_author and citation_editor tables.
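The reordering described in Step 3 could look roughly like the Python
sketch below. It is an illustration under stated assumptions, not the
logic of CitationDataImporter: the person records are plain dicts with
hypothetical keys "id", "first" and "last", and names are compared
case-insensitively. The first initial is consulted only when several
persons on the same citation share a last name.

```python
def parse_author_string(authors):
    """Split 'Lapp, H.; Piel, W. H.' into [(last_name, first_initial), ...]."""
    parsed = []
    for chunk in authors.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        last, _, initials = chunk.partition(",")
        parsed.append((last.strip().lower(), initials.strip()[:1].lower()))
    return parsed


def order_persons(authors, persons):
    """Return person ids in the order given by the abbreviated author string.

    persons: list of dicts with hypothetical keys 'id', 'first', 'last'.
    A name that matches no person, or still matches several after the
    first-initial check, yields None so it can be reviewed by hand.
    """
    ordered = []
    used = set()
    for last, initial in parse_author_string(authors):
        candidates = [p for p in persons
                      if p["id"] not in used
                      and p["last"].strip().lower() == last]
        if len(candidates) > 1:
            # Same last name more than once: fall back to the first initial.
            candidates = [p for p in candidates
                          if p["first"].strip()[:1].lower() == initial]
        if len(candidates) == 1:
            ordered.append(candidates[0]["id"])
            used.add(candidates[0]["id"])
        else:
            ordered.append(None)
    return ordered
```

The resulting list gives the position of each person for the
citation_author bridge table; the same function can be applied to the
s_author column for citation_editor.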
> Some caveats:
> 1. it may be, in fact, that Mark designed his dump.txt parser to divine the
> author order from the full citation string, in which case
> handling citations_utf8.zip is trivial (you just stop at step 2 because the
> authors are already in the correct order). But I'm going to guess that he
> used the citations_utf8.zip for the reordering of authors.
> 2. the person table is supposed to be a "one" table -- meaning each person
> gets one unique record. Unfortunately, the person table does not seem to
> store the legacy author_id from TreeBASE1, so there is no obvious way
> to ensure that new publications of existing authors don't create duplicate
> person records. Please make sure that somewhere in Mark's migration scripts,
> the author names are somehow matched. For example, if the first, last, and
> email fields match, they must be the same author.
> 3. I seem to remember that Mark's scripts created twice as many authors per
> publication (i.e. all authors were duplicated). This may have been fixed by
> running another script, instead of fixing the original bug. So we need to
> beware of this.
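Caveats 2 and 3 can be illustrated with a small Python sketch. It is
hypothetical and not taken from the migration code: PersonRegistry and
drop_duplicate_links are names invented here, and the in-memory dict
stands in for the person table. Persons are treated as identical when
first name, last name and email all match, ignoring case and surrounding
whitespace. How to treat records without an email is left open; this
sketch simply compares the empty strings.

```python
def person_key(first, last, email):
    """Identity key: case and surrounding whitespace are ignored."""
    return (first.strip().lower(), last.strip().lower(), email.strip().lower())


class PersonRegistry:
    """In-memory stand-in for the person table (one record per person)."""

    def __init__(self):
        self._by_key = {}
        self._next_id = 1

    def get_or_create(self, first, last, email):
        """Return the id of the matching person, creating one if needed."""
        key = person_key(first, last, email)
        if key not in self._by_key:
            self._by_key[key] = self._next_id
            self._next_id += 1
        return self._by_key[key]


def drop_duplicate_links(person_ids):
    """Remove repeated ids from one citation's author list.

    Keeps the first occurrence of each id and preserves the order.
    """
    seen = set()
    result = []
    for pid in person_ids:
        if pid not in seen:
            seen.add(pid)
            result.append(pid)
    return result
```

Checking each citation's author list with drop_duplicate_links would also
reveal whether the doubled-author problem from caveat 3 is still present.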
>
> I will list the meaning of the columns in citations_utf8.zip below.
> regards,
> Bill
>
>
>
>
> Here are the columns for citations_utf8.zip:
> 1. pub_type
>
> The choices are: Book, Book Section, Conference Proceedings, Electronic
> Source, Journal Article, Thesis
>
> (these are standard Endnote categories -- I think we use fewer ones, so we
> should treat "Conference Proceedings" as "Book Section," "Electronic Source"
> as "Journal Article", and "Thesis" as "Book" -- or something like that)
>
> 2. author
>
> These are the primary authors, listed like so: "Aanen, D. K.; Kuyper, T. W.;
> Boekhout, T.; Hoekstra, R. F."
>
> 3. year
>
> All are given a year, even those that are "in press"
>
> 4. title
>
> Primary title. This field comes with its own punctuation at the end.
>
> 5. s_author
>
> Secondary authors (e.g. book editors). Same format as authors above.
>
> 6. s_title
>
> Secondary title. For Journal Articles, this column holds the journal name
> (punctuation not included). For Book Section, this holds the title of the
> book (punctuation included)
>
> 7. place_pub
>
> Only for books and book sections
>
> 8. publisher
>
> Only for books and book sections
>
> 9. volume
>
> Update the citation table with this.
>
> 10. num_of_vols
>
> This is an Endnote field that we don't use (no data in this column)
>
> 11. number
>
> Update the citation table with this. (Same as the "issue" number)
>
> 12. pages
>
> Update the citation table with this.
>
> 13. section
>
> This is an Endnote field that we don't use (no data in this column)
>
> 14. edition
>
> Likewise, no data in this column
>
> 15. isbn
>
> Likewise, no data in this column
>
> 16. label
>
> This contains either nothing or "in press". The "in press" label means that
> the volume, number, and pages data are missing -- unfortunately over 1,000
> records have this problem. Let's preserve this so that it can be searched
> on. Later, when we have some work-study students, we can have them search
> for "in press", look up the full citation, and update the records
> accordingly.
>
> 17. keywords
>
> Needs updating in the citation table.
>
> 18. abstract
>
> Probably not needed if the abstract field in the citation table already
> contains text. But if it doesn't, best to update with this version.
>
> 19. study_id
>
> This is the legacy ID needed to match these rows with the correct study
> record
>
> 20. url
>
> If not empty, it contains the correct prefix (e.g. "http://") as needed.
>
> 21. doi
>
> Does not contain a "http://" prefix -- i.e. it starts with 10. (etc).
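Some of the per-column rules above can be written down as small Python
helpers. This is a sketch for illustration, not the importer's actual
behaviour. The pub_type mapping follows the tentative suggestion in the
message ("or something like that"), so treat it as an assumption; all
function names are hypothetical.

```python
# Endnote categories folded into the smaller set suggested in the message.
PUB_TYPE_MAP = {
    "Book": "Book",
    "Book Section": "Book Section",
    "Journal Article": "Journal Article",
    "Conference Proceedings": "Book Section",
    "Electronic Source": "Journal Article",
    "Thesis": "Book",
}


def normalize_pub_type(pub_type):
    """Map an Endnote pub_type onto the reduced set; reject unknown values."""
    try:
        return PUB_TYPE_MAP[pub_type.strip()]
    except KeyError:
        raise ValueError("unexpected pub_type: %r" % pub_type)


def is_in_press(label):
    """True when the label column marks the record as 'in press'."""
    return label.strip().lower() == "in press"


def choose_abstract(existing, imported):
    """Keep the abstract already in the citation table unless it is blank."""
    if existing and existing.strip():
        return existing
    return imported


def doi_is_bare(doi):
    """True when the DOI has no URL prefix, i.e. it starts with '10.'."""
    return doi.strip().startswith("10.")
```

Raising on an unknown pub_type makes unexpected values visible during the
import, and is_in_press keeps the label searchable for later clean-up.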
>
> _______________________________________________
> Treebase-devel mailing list
> Tre...@li...
> https://lists.sourceforge.net/lists/listinfo/treebase-devel
>
>

-- 
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading
RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com