Re: [VuFind-Tech] searching CJK and other non-latin languages

I'm copying this to solrmarc-tech, since other SolrMarc users may have further input.

Also, there's a little bit of earlier discussion here that might be of interest:

http://groups.google.com/group/solrmarc-tech/browse_thread/thread/b27f2b68adf7a9db

In any case, at a glance, your approach sounds like a reasonable start as long as you can find a way to modify the templates without too much complication (this may get easier in VuFind 2).

If you want to start playing with the new 3.6 features, and assuming that they are available in a nightly build, it's not too difficult to upgrade Solr in VuFind.  You can look at r4673 of the trunk for an example -- basically you just replace files in solr/lib and solr/jetty/webapps with newer versions and edit the various solrconfig.xml files to reflect the current Lucene version in the <luceneMatchVersion> setting.  Restart VuFind, reindex, and you're ready to go.
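
As a rough sketch of the solrconfig.xml step, assuming every core's config lives somewhere under a single solr directory, something like the following Python could update the setting in one pass. The function names and the directory layout are illustrative, not part of VuFind, and the value you pass (presumably LUCENE_36 for a 3.6 build) should be checked against the example solrconfig.xml shipped with the Solr version you install.

```python
import re
from pathlib import Path

# Matches the whole <luceneMatchVersion> element so that only its value is replaced.
_MATCH_VERSION = re.compile(
    r"(<luceneMatchVersion>)(.*?)(</luceneMatchVersion>)", re.DOTALL
)


def set_lucene_match_version(xml_text, version):
    """Return xml_text with every <luceneMatchVersion> value set to version."""
    return _MATCH_VERSION.sub(
        lambda m: m.group(1) + version + m.group(3), xml_text
    )


def update_configs(solr_home, version):
    """Rewrite each solrconfig.xml found under solr_home (hypothetical layout).

    Returns the list of files that were actually changed.
    """
    changed = []
    for path in sorted(Path(solr_home).rglob("solrconfig.xml")):
        original = path.read_text(encoding="utf-8")
        updated = set_lucene_match_version(original, version)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

Calling update_configs("solr", "LUCENE_36") from the VuFind root would then touch only the files whose value differs; back up the configs first, since the files are rewritten in place.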

I'm sure robust CJK support would be of interest to others, so if you find an approach that works well without making life more complicated for people who don't need it, it would definitely be worth adding to the trunk.  I'd recommend using the 3.6 code as the target -- no point in figuring out something that you know will be deprecated and having to do it over, and I suspect that 3.6 is not too far off.  It won't make it into VuFind 1.3, but it might justify a 1.3.1 release.
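
For anyone unfamiliar with the bigram idea behind the CJK filters mentioned in the quoted message below, here is a small Python illustration. It is a simplified sketch of overlapping-bigram tokenization, not Lucene's implementation: the character ranges are an assumed subset of the CJK blocks, and the handling of Latin words (lowercased whole tokens) is just a stand-in for whatever the rest of the analysis chain would do.

```python
from itertools import groupby


def is_cjk(ch):
    """Rough CJK test covering a few common Unicode blocks (assumed subset)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def _kind(ch):
    # CJK must be tested first, because str.isalnum() is also true for ideographs.
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def cjk_bigrams(text):
    """Split text into overlapping bigrams for CJK runs and whole words otherwise."""
    tokens = []
    for kind, chars in groupby(text, key=_kind):
        run = "".join(chars)
        if kind == "word":
            tokens.append(run.lower())
        elif kind == "cjk":
            if len(run) == 1:
                tokens.append(run)  # a lone character stays a unigram
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens
```

With this sketch, cjk_bigrams("中国文学 History") yields the tokens 中国, 国文, 文学 and history, which shows how bigrams and Latin terms could sit in the same field.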

- Demian

> -----Original Message-----
> From: Tod Olson [mailto:to...@uc...]
> Sent: Tuesday, January 10, 2012 8:48 AM
> To: Demian Katz
> Cc: Tod Olson; vuf...@li...
> Subject: Re: [VuFind-Tech] searching CJK and other non-latin languages
> 
> A bit more investigation shows that in Solr 3.4, there are CJKAnalyzer
> and CJKTokenizer, presumably to do what their names suggest.  However,
> they seem not to be good enough, and, if I understand correctly, they
> will be deprecated in 3.6 (and 4.0) in favor of CJKBigramFilter and
> CJKWidthFilter. Details and patch are at:
> 
> 	https://issues.apache.org/jira/browse/LUCENE-2906
> 
> But it certainly seems like Solr 3.4 has some stuff in it to at least
> make a start at CJK retrieval. I don't think either handles the folding
> of Traditional and Simplified Chinese.
> 
> So then the question is, what's the best way to start fiddling with
> this in VuFind? Here's a possible scenario, I'd love some feedback.
> (And I hope I understand the solrmarc functions correctly):
> 
> 1. For the core fields:
> 	a. Arrange for title_vernacular and other _vernacular variants of
> the core fields, to allow for vernacular display with the core,
> probably using solrmarc's getLinkedField().
> 
> 	b. Add the _vernacular to the indexes
> 
> 2. For the non-core fields with possible 880s:
> 
> 	a. populate them with getLinkedFieldCombined() to get the
> vernacular in together with the latin
> 
> 	b. edit templates to include non-latin, just pull from the record
> for display.
> 
> Then the data will be put into the indexes, and one can start
> integrating CJKAnalyzer and CJKTokenizer into the indexing, or add the
> CJK*Filters, if working with 3.6+.
> 
> Does that sound reasonable? Ideas for a better approach? And is this of
> general interest to the VuFind community?
> 
> -Tod
> 
> Tod Olson <to...@uc...>
> Systems Librarian
> University of Chicago Library
> On Dec 21, 2011, at 9:24 AM, Demian Katz wrote:
> 
> 
> > I feel like I've seen some recent discussion on the solr-user list
> > about conditional analysis chains that allow you to set up your schema
> > to process text differently based on things like source language.  I'm
> > not sure if this functionality exists in a released version or is part
> > of their trunk development, but it might be worth searching the
> > archives or asking on the list to find out the current state of the art
> > -- I suspect this could be very helpful if/when it actually exists!
> >
> > (And sorry for the rather vague suggestion -- I haven't been
> > following the topic too closely).
> >
> > - Demian
> >
> >> -----Original Message-----
> >> From: Tod Olson [mailto:to...@uc...]
> >> Sent: Wednesday, December 21, 2011 10:17 AM
> >> To: vuf...@li...
> >> Subject: [VuFind-Tech] searching CJK and other non-latin languages
> >>
> >> VuFind tech,
> >>
> >> How are people handling the indexing of non-latin scripts? I'm
> >> thinking specifically of the CJK writing systems, but also of Hebrew,
> >> Arabic, Cyrillic, and others.
> >>
> >> A discussion about a year ago spoke of putting the vernacular into
> >> its own index, but I'm hoping for something more integrated into the
> >> regular indexes. Maybe bigrams for CJK data, but then it's unclear
> >> how to combine with latin indexes.
> >>
> >> Is there a best practice among the VuFind sites?
> >>
> >> -Tod
> >>
> >>
> >> Tod Olson <to...@uc...>
> >> Systems Librarian
> >> University of Chicago Library
> >>
> >> _______________________________________________
> >> Vufind-tech mailing list
> >> Vuf...@li...
> >> https://lists.sourceforge.net/lists/listinfo/vufind-tech
> 
>