From: Rutger V. <rut...@gm...> - 2010-02-03 14:37:29
Data import is not handled by Perl scripts but by Java programs. Step 1 is
handled by org.cipres.treebase.util.AuxiliaryDataImporter. Steps 2 and 3 are
handled by org.cipres.treebase.util.CitationDataImporter. Both are standalone
Java programs (i.e. with a main() method). They, and related programs, are
located in treebase-core/src/main/java/org/cipres/treebase/util/*, which also
contains a barebones package.html documentation file. These programs aren't
yet documented on the wiki.

On Tue, Feb 2, 2010 at 11:05 PM, William Piel <wil...@ya...> wrote:

> On Feb 2, 2010, at 3:23 PM, Vladimir Gapeyev wrote:
>
> I just went through ... the instructions at
> https://sourceforge.net/apps/mediawiki/treebase/index.php?title=DataDumps
>
> It should have been described there, I guess Mark J. D. forgot to add that
> step. Can this be inferred from code that Mark has written? (e.g. perhaps
> in his perl directory). Or do we need to contact Mark again?
>
> I'll go over the basics as follows:
>
> Step 1- after the dump.txt has been processed, what you have is a set of
> study records and a set of author records (plus all the other tables: trees,
> matrices, etc). Each study record links to a citation record where (1) the
> entire citation is stored as a single string in the title field, and (2) the
> abstract is stored in the abstract field. Additionally, for each citation
> record there is a set of author records in which author first names are
> written out in full and author emails are included where available (I guess
> this would be in the "person" table) -- unfortunately the order of the
> authors is not known (or does not reflect the real order of the names in the
> associated paper).
>
> Step 2- what the citations_utf8.zip file contains is a tab-separated text of
> all the citation data in much more granular form: i.e. different columns
> contain different fields -- title, journal, page numbers, etc.
> So the easy part is to update the citation table, so that instead of
> storing the entire citation in one field, all the bits have been parsed out
> into separate fields. Since one column in the citations_utf8.zip file
> contains the legacy study_id, and a field in the study table also contains
> the legacy study_id, you can use the matching between these two in order to
> know which citation record to update with which row of data. That's the
> easy part.
>
> Step 3- the difficult part is to update the citation_author and
> citation_editor tables with the correct order of the authors (these are
> bridge tables between the person table and the citation table).
>
> The citations_utf8.zip file contains a column that lists the primary authors
> separated with semicolons, like so:
>
>   "Lapp, H.; Piel, W. H.; Gapeyev, V."
>
> while the persons table has the following records, in no particular order,
> for example:
>
>   1  William H. Piel   wil...@ya...
>   2  Hilmar Lapp       hl...@ne...
>   3  Vladimir Gapeyev  vla...@du...
>
> What's needed is a script that separates the string of abbreviated names
> ("Lapp, H.; Piel, W. H.; Gapeyev, V.") using the semicolon, clips out the
> last name (i.e. the beginning part to the comma), learns the order of the
> names, and then uses that to reorder the full names + email addresses by
> updating the citation_author bridge table. Likewise, the secondary authors
> column in citations_utf8.zip needs to be used to reorder the related records
> in the citation_editor bridge table. Unfortunately, it is not uncommon for
> two authors to have the same last name (e.g. the husband and wife team,
> Barbara and Mike Wingfield, have tons of records in TreeBASE), so in those
> cases you need to match the last names plus the first initial in order to
> know how to reorder the citation_author and citation_editor tables.
>
> Some caveats:
>
> 1. it may be, in fact, that Mark designed his dump.txt parser to divine the
>    author order from the full citation string, in which case the
>    citations_utf8.zip is trivial (you just stop at step 2 because the
>    authors are already in the correct order). But I'm going to guess that he
>    used the citations_utf8.zip for the reordering of authors.
>
> 2. the person table is supposed to be a "one" table -- meaning each person
>    gets one unique record. Unfortunately, the person table does not seem to
>    store the legacy author_id from TreeBASE1, so there is no obvious way to
>    ensure that new publications of existing authors don't create duplicate
>    person records. Please make sure that somewhere in Mark's migration
>    scripts, the author names are somehow matched. For example, if the first,
>    last, and email fields match, they must be the same author.
>
> 3. I seem to remember that Mark's scripts created twice as many authors per
>    publication (i.e. all authors were duplicated). This may have been fixed
>    by running another script, instead of fixing the original bug. So we need
>    to beware of this.
>
> I will list the meaning of the columns in citations_utf8.zip below.
>
> regards,
>
> Bill
>
> Here are the columns for citations_utf8.zip:
>
> 1. pub_type
>    The choices are: Book, Book Section, Conference Proceedings, Electronic
>    Source, Journal Article, Thesis
>    (these are standard Endnote categories -- I think we use fewer ones, so
>    we should treat "Conference Proceedings" as "Book Section," "Electronic
>    Source" as "Journal Article", and "Thesis" as "Book" -- or something like
>    that)
>
> 2. author
>    These are the primary authors, listed like so: "Aanen, D. K.; Kuyper, T.
>    W.; Boekhout, T.; Hoekstra, R. F."
>
> 3. year
>    All are given a year, even those that are "in press"
>
> 4. title
>    Primary title. This field comes with its own punctuation at the end.
>
> 5. s_author
>    Secondary authors (e.g. book editors). Same format as authors above.
>
> 6. s_title
>    Secondary title. For Journal Articles, this column holds the journal name
>    (punctuation not included). For Book Section, this holds the title of the
>    book (punctuation included)
>
> 7. place_pub
>    Only for books and book sections
>
> 8. publisher
>    Only for books and book sections
>
> 9. volume
>    Update the citation table with this.
>
> 10. num_of_vols
>    This is an Endnote field that we don't use (no data in this column)
>
> 11. number
>    Update the citation table with this. (Same as the "issue" number)
>
> 12. pages
>    Update the citation table with this.
>
> 13. section
>    This is an Endnote field that we don't use (no data in this column)
>
> 14. edition
>    Likewise, no data in this column
>
> 15. isbn
>    Likewise, no data in this column
>
> 16. label
>    This contains either nothing or "in press". The "in press" label means
>    that the volume, number, and pages data are missing -- unfortunately over
>    1,000 records have this problem. Let's preserve this so that it can be
>    searched on. Later, when we have some work-study students, we can have
>    them search for "in press", look up the full citation, and update the
>    records accordingly.
>
> 17. keywords
>    Needs updating in the citation table.
>
> 18. abstract
>    Probably not needed if the abstract field in the citation table already
>    contains text. But if it doesn't, best to update with this version.
>
> 19. study_id
>    This is the legacy ID needed to match these rows with the correct study
>    record
>
> 20. url
>    If not empty, it contains the correct prefix (e.g. "http://") as needed.
>
> 21. doi
>    Does not contain a "http://" prefix -- i.e. it starts with 10. (etc).
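
For illustration only, the row parsing and author matching described in Steps
2 and 3 above could be sketched roughly as follows. This is Python, not the
Java of CitationDataImporter, and it is not taken from that code. The names
COLUMNS, parse_row, name_key and order_persons are made up for this sketch, as
is the (person_id, first_name, last_name) shape for person records. It also
assumes the tab-separated file has no header row.

```python
# Column order as listed in the message above; assumes no header row.
COLUMNS = [
    "pub_type", "author", "year", "title", "s_author", "s_title",
    "place_pub", "publisher", "volume", "num_of_vols", "number", "pages",
    "section", "edition", "isbn", "label", "keywords", "abstract",
    "study_id", "url", "doi",
]


def parse_row(line):
    """Split one tab-separated line into a dict keyed by column name."""
    return dict(zip(COLUMNS, line.rstrip("\r\n").split("\t")))


def name_key(abbreviated):
    """'Piel, W. H.' -> ('piel', 'w'): last name plus first initial."""
    last, _, initials = abbreviated.partition(",")
    return last.strip().lower(), initials.strip()[:1].lower()


def order_persons(author_field, persons):
    """Return person ids in the order given by an author or s_author value.

    persons is a list of (person_id, first_name, last_name) tuples, a
    hypothetical shape for rows read from the person table. A name with no
    matching person yields None so it can be reviewed by hand.
    """
    by_key = {}
    for person_id, first, last in persons:
        key = (last.strip().lower(), first.strip()[:1].lower())
        by_key.setdefault(key, []).append(person_id)

    ordered = []
    for chunk in author_field.split(";"):
        if not chunk.strip():
            continue  # skip empty pieces such as a trailing semicolon
        candidates = by_key.get(name_key(chunk), [])
        ordered.append(candidates.pop(0) if candidates else None)
    return ordered


persons = [(1, "William H.", "Piel"), (2, "Hilmar", "Lapp"),
           (3, "Vladimir", "Gapeyev")]
print(order_persons("Lapp, H.; Piel, W. H.; Gapeyev, V.", persons))
# prints [2, 1, 3]
```

Matching on last name plus first initial is what separates cases like the two
Wingfields. The positions in the returned list would then be written to
citation_author (or citation_editor for s_author); the column that stores the
position is not named in this thread, so that update is left out here.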
> _______________________________________________
> Treebase-devel mailing list
> Tre...@li...
> https://lists.sourceforge.net/lists/listinfo/treebase-devel

--
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com