Re: [Treebase-devel] Citation metadata for 2009 additions

On Feb 2, 2010, at 3:23 PM, Vladimir Gapeyev wrote:

> I just went through ... the instructions at https://sourceforge.net/apps/mediawiki/treebase/index.php?title=DataDumps

It should have been described there; I guess Mark J. D. forgot to add that step. Can this be inferred from the code that Mark has written (e.g. perhaps in his perl directory)? Or do we need to contact Mark again?

I'll go over the basics as follows:

Step 1- after the dump.txt has been processed, what you have is a set of study records and a set of author records (plus all the other tables: trees, matrices, etc). Each study record links to a citation record where (1) the entire citation is stored as a single string in the title field, and (2) the abstract is stored in the abstract field. Additionally, for each citation record there is a set of author records in which author first names are written out in full and author emails are included where available (I guess this would be in the "person" table) -- unfortunately the order of the authors is not known (or does not reflect the real order of the names in the associated paper). 

Step 2- what the citations_utf8.zip file contains is a tab-separated text of all the citation data in much more granular form: i.e. different columns contain different fields -- title, journal, page numbers, etc. So the easy part is to update the citation table, so that instead of storing the entire citation in one field, all the bits are parsed out into separate fields. Since one column in the citations_utf8.zip file contains the legacy study_id, and a field in the study table also contains the legacy study_id, you can match on these two to know which citation record to update with which row of data. That's the easy part.
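As a rough sketch of Step 2 in Python (this is not Mark's code; the function name is made up, and it assumes the unzipped file has no header row and uses the column order listed at the end of this message), the rows can be indexed by legacy study_id before updating the citation table:

```python
import csv

# Column order as listed at the end of this message.
# Assumption: the file has no header row.
COLUMNS = [
    "pub_type", "author", "year", "title", "s_author", "s_title",
    "place_pub", "publisher", "volume", "num_of_vols", "number", "pages",
    "section", "edition", "isbn", "label", "keywords", "abstract",
    "study_id", "url", "doi",
]

def index_by_study_id(lines):
    """Return {legacy study_id: row dict} from tab-separated lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = {}
    for values in reader:
        # Pad short rows so every column name is present.
        values = values + [""] * (len(COLUMNS) - len(values))
        row = dict(zip(COLUMNS, values))
        if row["study_id"]:
            rows[row["study_id"]] = row
    return rows
```

Each study record's legacy study_id can then be looked up in the returned dict to find the row that should update its citation record.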

Step 3- the difficult part is to update the citation_author and citation_editor tables with the correct order of the authors (these are bridge tables between the person table and the citation table). The citations_utf8.zip file contains a column that lists the primary authors separated by semicolons, like so:

"Lapp, H.; Piel, W. H.; Gapeyev, V."  

while the person table has the following records, in no particular order, for example:

1   William H.    Piel      wil...@ya...
2   Hilmar        Lapp      hl...@ne...
3   Vladimir      Gapeyev   vla...@du...

What's needed is a script that splits the string of abbreviated names ("Lapp, H.; Piel, W. H.; Gapeyev, V.") on the semicolons, clips out each last name (i.e. the part before the comma), learns the order of the names, and then uses that to reorder the full names + email addresses by updating the citation_author bridge table. Likewise, the secondary authors column in citations_utf8.zip needs to be used to reorder the related records in the citation_editor bridge table. Unfortunately, it is not uncommon for two authors to have the same last name (e.g. the husband and wife team, Barbara and Mike Wingfield, have tons of records in TreeBASE), so in those cases you need to match the last name plus the first initial in order to know how to reorder the citation_author and citation_editor tables.
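A minimal sketch of that reordering logic in Python, to make the matching rule concrete. The function names and the (person_id, first_name, last_name) tuples are hypothetical stand-ins for whatever the person and bridge tables actually provide; writing the new order back to citation_author or citation_editor is left out:

```python
def parse_author_list(author_field):
    """Split "Lapp, H.; Piel, W. H." into [("Lapp", "H"), ("Piel", "W")]."""
    parsed = []
    for chunk in author_field.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        last, _, initials = chunk.partition(",")
        initials = initials.strip()
        first_initial = initials[0].upper() if initials else ""
        parsed.append((last.strip(), first_initial))
    return parsed

def order_person_ids(author_field, persons):
    """persons: (person_id, first_name, last_name) tuples for one citation.

    Returns (ordered_ids, unmatched_ids).
    """
    remaining = list(persons)
    ordered = []
    for last, initial in parse_author_list(author_field):
        candidates = [p for p in remaining if p[2].lower() == last.lower()]
        # Shared last name: narrow down by first initial.
        if len(candidates) > 1 and initial:
            narrowed = [p for p in candidates if p[1][:1].upper() == initial]
            candidates = narrowed or candidates
        if candidates:
            # If still ambiguous, this simply takes the first candidate.
            match = candidates[0]
            remaining.remove(match)
            ordered.append(match[0])
    return ordered, [p[0] for p in remaining]
```

With the three example person records above and the string "Lapp, H.; Piel, W. H.; Gapeyev, V.", this yields the person IDs in the order 2, 1, 3. Anything left in the unmatched list probably deserves a manual look.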

Some caveats: 

1. it may be, in fact, that Mark designed his dump.txt parser to divine the author order from the full citation string, in which case using citations_utf8.zip is trivial (you just stop at step 2 because the authors are already in the correct order). But I'm going to guess that he used citations_utf8.zip for the reordering of authors.

2. the person table is supposed to be a "one" table -- meaning each person gets one unique record. Unfortunately, the person table does not seem to store the legacy author_id from TreeBASE1, so there is no obvious way to ensure that new publications of existing authors don't create duplicate person records. Please make sure that somewhere in Mark's migration scripts, the author names are somehow matched. For example, if the first, last, and email fields match, they must be the same author.

3. I seem to remember that Mark's scripts created twice as many authors per publication (i.e. all authors were duplicated). This may have been fixed by running another script, instead of fixing the original bug. So we need to beware of this. 
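For caveats 2 and 3, a quick way to check for duplicate person records is to group them on first name, last name, and email. The sketch below is only an illustration: the function names and tuple layout are hypothetical, and comparing case-insensitively with surrounding whitespace trimmed is my assumption, not something Mark's scripts are known to do. Note that two records with the same name and no email will also be reported as duplicates, which may need review by hand.

```python
def person_key(first, last, email):
    """Normalized (first, last, email) key used to compare person records."""
    return (
        first.strip().lower(),
        last.strip().lower(),
        (email or "").strip().lower(),
    )

def find_duplicates(persons):
    """persons: iterable of (person_id, first, last, email).

    Returns {kept_id: [duplicate_ids]}, keeping the first record seen.
    """
    seen = {}
    duplicates = {}
    for person_id, first, last, email in persons:
        key = person_key(first, last, email)
        if key in seen:
            duplicates.setdefault(seen[key], []).append(person_id)
        else:
            seen[key] = person_id
    return duplicates
```

An empty result would suggest the person table really is a "one" table; a non-empty one lists which IDs would need to be merged in the bridge tables.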

I will list the meaning of the columns in citations_utf8.zip below.

regards,

Bill

Here are the columns for citations_utf8.zip:

1. pub_type

The choices are: Book, Book Section, Conference Proceedings, Electronic Source, Journal Article, Thesis

(these are standard Endnote categories -- I think we use fewer ones, so we should treat "Conference Proceedings" as "Book Section," "Electronic Source" as "Journal Article", and "Thesis" as "Book" -- or something like that)

2. author

These are the primary authors, listed like so: "Aanen, D. K.; Kuyper, T. W.; Boekhout, T.; Hoekstra, R. F."

3. year

All are given a year, even those that are "in press"

4. title

Primary title. This field comes with its own punctuation at the end.

5. s_author

Secondary authors (e.g. book editors). Same format as authors above.

6. s_title

Secondary title. For Journal Articles, this column holds the journal name (punctuation not included). For Book Sections, this holds the title of the book (punctuation included).

7. place_pub

Only for books and book sections

8. publisher

Only for books and book sections

9. volume

Update the citation table with this.

10. num_of_vols

This is an Endnote field that we don't use (no data in this column)

11. number

Update the citation table with this. (Same as the "issue" number)

12. pages

Update the citation table with this.

13. section

This is an Endnote field that we don't use (no data in this column)

14. edition

Likewise, no data in this column

15. isbn

Likewise, no data in this column

16. label

This contains either nothing or "in press". The "in press" label means that the volume, number, and pages data are missing -- unfortunately over 1,000 records have this problem. Let's preserve this so that it can be searched on. Later, when we have some work-study students, we can have them search for "in press", look up the full citation, and update the records accordingly.

17. keywords

Needs updating in the citation table.

18. abstract

Probably not needed if the abstract field in the citation table already contains text. But if it doesn't, best to update with this version.

19. study_id

This is the legacy ID needed to match these rows with the correct study record

20. url

If not empty, it contains the correct prefix (e.g. "http://") as needed.

21. doi

Does not contain a "http://" prefix -- i.e. it starts with 10. (etc).
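To tie a few of the column notes together, here is a small Python sketch of per-row cleanup. The pub_type mapping is just the tentative one suggested under column 1 ("or something like that"), and the in_press and doi_looks_valid keys are hypothetical helpers, not fields in the citation table:

```python
# Tentative mapping of Endnote categories, per the note under column 1.
PUB_TYPE_MAP = {
    "Book": "Book",
    "Book Section": "Book Section",
    "Journal Article": "Journal Article",
    "Conference Proceedings": "Book Section",
    "Electronic Source": "Journal Article",
    "Thesis": "Book",
}

def normalize_row(row):
    """row: dict keyed by the column names listed above."""
    out = dict(row)
    pub_type = row.get("pub_type", "").strip()
    # Unknown categories are passed through unchanged.
    out["pub_type"] = PUB_TYPE_MAP.get(pub_type, pub_type)
    # The label itself is preserved; this flag is only a convenience.
    out["in_press"] = row.get("label", "").strip().lower() == "in press"
    # DOIs are expected without an "http://" prefix, starting with "10.".
    doi = row.get("doi", "").strip()
    out["doi_looks_valid"] = (not doi) or doi.startswith("10.")
    return out
```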