From: Vladimir G. <vla...@du...> - 2010-11-04 18:14:54
[I am cc'ing this thread to treebase-devel, per Hilmar's request]

I will add to OAI record creation a keyword tokenizer that uses "," and
";" as delimiters.

Here is another issue: Lots of entries in Treebase contain "in press" as
the keyword, which Kevin's code on the Dryad side does not accept as
such. Should this elimination of "in press" remain on the Dryad side?
I.e., is "in press" a valid keyword from Treebase point of view? If not,
should this be fixed by erasing "in press" values from the DB or by
filtering them out of OAI records sent out?

--Vladimir

On Nov 3, 2010, at 5:10 PM, William Piel wrote:

> The solution of splitting the string using either commas or
> semicolons looks fine to me. (And why should it be a problem if the
> string has a mix of the two? Seems to me that it should work by
> splitting on either.) It's not impossible that some authors will use
> other non-standard delimiters, such as the long hyphen (" — "), the
> double hyphen ("--"), and the bullet (" • "), but there's only so many
> options for us to accommodate.
>
> bp
>
> On Nov 3, 2010, at 4:58 PM, Hilmar Lapp wrote:
>
>> I think Bill Piel will need to at least chime in here, and possibly
>> others. Would you mind posting this to the treebase-devel list?
>>
>> -hilmar
>>
>> On Nov 3, 2010, at 2:56 PM, Vladimir Gapeyev wrote:
>>
>>> I've put this request for clarification in SF tracker, please
>>> advise. --Vladimir
>>>
>>> Begin forwarded message:
>>>
>>>> From: "SourceForge.net" <no...@so...>
>>>> Date: November 3, 2010 2:49:37 PM EDT
>>>> To: no...@so...
>>>> Subject: [Treebase-guts] [ treebase-Bugs-3079602 ] OAI records
>>>> contain all subjects in a single field
>>>>
>>>> Initial Comment:
>>>> In the OAI records, each <dc:subject> field contains many
>>>> keywords, separated by commas, like this:
>>>>
>>>> <dc:subject>
>>>> Ascomycota, Pezizomycotina, Dothideomyceta, fungal evolution,
>>>> lichens, multigene phylogeny, phylogenomics, plant pathogens,
>>>> saprobes, Tree of Life
>>>> </dc:subject>
>>>>
>>>> It is best practice to put each keyword into a separate
>>>> <dc:subject> field. This allows harvesting systems (like Dryad)
>>>> to accurately separate the keywords, and not worry about keywords
>>>> that may contain commas.
>>>>
>>>> ----------------------------------------------------------------------
>>>>
>>>>> Comment By: Vladimir Gapeyev (vgapeyev)
>>>> Date: 2010-11-03 14:49
>>>>
>>>> Message:
>>>> This is a request for clarification.
>>>>
>>>> Treebase UI offers a single field to enter keywords, text from
>>>> which is stored in a single field in the database. From the data
>>>> in treebase-dev I see that users used ',' or ';' to separate
>>>> multiple keywords.
>>>>
>>>> Here is what I can do: Get Kevin's keyword-splitting code and
>>>> place it on Treebase side, modifying if necessary to work with
>>>> both ';' and ','. This would not work nicely if the user has a
>>>> fancy to use comma-containing keywords separated by semicolons,
>>>> or the other way around.
>>>>
>>>> Please confirm that this is what is needed.
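
For reference, the tokenizer discussed in this thread could look roughly like the sketch below. It is written in Python purely for illustration; it is not Kevin's Dryad code or the actual Treebase implementation. The names split_keywords, to_dc_subjects, and NON_KEYWORDS are hypothetical. Dropping "in press" is shown only as one of the options raised above, since that question is still open.

```python
import re
from xml.sax.saxutils import escape

# Delimiters agreed on in the thread: comma and semicolon.
DELIMITERS = re.compile(r"[,;]")

# Hypothetical stop list. Whether "in press" should be filtered on the
# Treebase side at all is the open question, so treat this as an assumption.
NON_KEYWORDS = {"in press"}


def split_keywords(raw):
    """Split the single keyword field on ',' or ';' into clean keywords."""
    keywords = []
    for part in DELIMITERS.split(raw):
        # Collapse runs of whitespace (including line breaks) and trim.
        keyword = " ".join(part.split())
        if keyword and keyword.lower() not in NON_KEYWORDS:
            keywords.append(keyword)
    return keywords


def to_dc_subjects(raw):
    """Render one <dc:subject> element per keyword, XML-escaping the text."""
    return "\n".join(
        "<dc:subject>{}</dc:subject>".format(escape(keyword))
        for keyword in split_keywords(raw)
    )
```

As noted in the original comment, splitting on both characters means a keyword that itself contains a comma or semicolon would be broken apart; this sketch does not try to detect that case.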