[VuFind-Tech] Proposed changes to VuFindIndexer.getFormat, Part 2

Okay, here are my proposed changes.

  https://gist.github.com/1149060

It's quite lengthy, but here is why I think it addresses the flaws I mentioned in the previous email.

1. It is much more complete.

As you can see in the enumerations at the top of the file, there are many more format designations than in the current getFormat method.

This goes essentially two levels deep for material types and content types.  MARC, amazingly, has even more detailed formats, at least for some material types.  For example, you can divide books into 'encyclopedias', 'dictionaries', 'yearbooks', and so on.  And music is even more detailed.  But I had to stop somewhere, so I didn't parse these out.  The code also picks up secondary content types (from the 006).

2. It distinguishes between content type and media/carrier type.

There is a getMediaTypes method that essentially parses the 007, plus a few other fields.  The getContentTypes method does the same for the leader/008/006 and a few data fields.   Each returns all available values.
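To give a feel for what "parses the 007" means, here is a minimal sketch, not the code from the gist.  It assumes the 007 fields have already been pulled out of the record as plain strings, and it only looks at position 00 (category of material).  The class name, the labels, and the choice of which codes to include are all made up for illustration; the real method covers far more.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class MediaTypeSketch {

    // Maps 007/00 (category of material) to an illustrative label.
    // Only a handful of codes are shown; the labels are hypothetical.
    static String categoryOf(String field007) {
        if (field007 == null || field007.isEmpty()) {
            return null;
        }
        switch (Character.toLowerCase(field007.charAt(0))) {
            case 'a': return "Map";
            case 'c': return "ElectronicResource";
            case 'h': return "Microform";
            case 'm': return "MotionPicture";
            case 's': return "SoundRecording";
            case 'v': return "VideoRecording";
            default:  return null; // code not covered by this sketch
        }
    }

    // Returns every recognized value, in record order, without duplicates.
    static Set<String> getMediaTypes(List<String> all007Fields) {
        Set<String> result = new LinkedHashSet<>();
        for (String field : all007Fields) {
            String category = categoryOf(field);
            if (category != null) {
                result.add(category);
            }
        }
        return result;
    }
}
```

Returning a set rather than a single value is what lets a higher-level function decide for itself which of the values matter.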

If I've done my job well, then these two 'lower-level' functions should not need to be customized by libraries.  (If there is something missing or wrong here, it should be corrected in the distro.)

Instead, my thought here is that you could have 'higher-level' functions, or BeanShell scripts, that utilize, combine, or otherwise customize these values for the actual indexing.

The getPrimaryContentTypePlusOnline method is an example of this.  It takes just the first content type from the getContentTypes set, and then also checks if the item is online.  It then combines the content type 'Book' and the media type 'Online' into a single combined type called 'EBook'.
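In rough outline, the combining step looks something like the sketch below.  This is a simplified illustration rather than the method in the gist: it uses plain strings instead of enums to stay short, the "Unknown" fallback is an assumption of mine, and so is the decision to leave every content type other than 'Book' unchanged when the item is online.

```java
import java.util.Set;

final class CombinedFormatSketch {

    // Takes the first content type and folds in the online flag.
    // contentTypes is expected to preserve record order (e.g. a LinkedHashSet).
    static String primaryContentTypePlusOnline(Set<String> contentTypes,
                                               boolean isOnline) {
        if (contentTypes.isEmpty()) {
            return "Unknown"; // hypothetical fallback, not from the gist
        }
        String primary = contentTypes.iterator().next();
        if (isOnline && primary.equals("Book")) {
            return "EBook";
        }
        // Assumption: other content types are passed through as-is.
        return primary;
    }
}
```

A library that wants, say, a separate 'Online Map' format would change only this small function and leave the lower-level parsing alone.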

But this is just one of many different examples of how you might do this.  I think this allows for a great deal of flexibility without people having to localize or re-write a ton of code.

3. It makes a best guess attempt at determining if this is an online resource.

The SolrMarc indexer already includes a getFullTextUrls method that checks to see if the record has a full-text link.  It is a 'best guess' since many MARC records infamously contain links to tables of contents and other information *about* the item without consistently marking them as such.
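To show why this can only ever be a guess, here is a small sketch of the kind of heuristic involved.  It is not SolrMarc's actual getFullTextUrls logic.  The Link856 class is a hypothetical stand-in for an 856 field, and the list of note phrases is an assumption chosen for illustration.  The only MARC fact it relies on is that a second indicator of 2 on the 856 means 'related resource'.

```java
import java.util.Locale;

final class OnlineGuessSketch {

    // Minimal stand-in for an 856 field: the second indicator plus any
    // note text (for example from $3 or $z). Not a marc4j class.
    static final class Link856 {
        final char indicator2;
        final String note;

        Link856(char indicator2, String note) {
            this.indicator2 = indicator2;
            this.note = (note == null) ? "" : note;
        }
    }

    // Best guess: reject links explicitly marked as related resources,
    // and links whose note says they describe the item rather than
    // deliver it. Everything else is assumed to be full text.
    static boolean looksLikeFullText(Link856 link) {
        if (link.indicator2 == '2') {
            return false;
        }
        String note = link.note.toLowerCase(Locale.ROOT);
        return !(note.contains("table of contents")
                || note.contains("publisher description")
                || note.contains("contributor biographical"));
    }
}
```

A record with an unmarked table-of-contents link and no note at all would still slip through, which is exactly the weakness described above.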

But, all things considered, I think this is far preferable to the current indexer, which makes no such effort, at least for format.

So there it is, at least as far as the basic issues are concerned.

There are, IMO, some other (minor) improvements over the current getFormat method.

One of the other complaints I have with the current getFormat function is that it could be much better commented.  I essentially took the MARC standards documentation at loc.gov, cut and pasted the relevant portions into my file, and then wrote the code around that.  So hopefully it's well commented in a way that corresponds with the documentation online.

Also, following the Blacklight indexer, I used enums for the format values rather than just strings.  That way, I couldn't accidentally mistype one of the formats.
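For anyone who hasn't used Java enums this way, the pattern is roughly as follows.  The constant names and labels here are illustrative only and are not copied from the gist.

```java
final class FormatEnumSketch {

    // Each constant carries the label that ends up in the index.
    enum ContentType {
        BOOK("Book"),
        MUSICAL_SCORE("Musical Score"),
        MAP("Map");

        private final String label;

        ContentType(String label) {
            this.label = label;
        }

        @Override
        public String toString() {
            return label;
        }
    }

    // Converts the enum to the string value written to the index.
    static String indexValue(ContentType type) {
        return type.toString();
    }
}
```

Writing ContentType.BOK anywhere in the code is a compile error, whereas the string "Bok" would quietly end up as a new facet value.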

Comments, criticisms, questions all welcome.

Except from Demian.  Go get some sleep first. ;-)

--Dave

==================
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu