
1.) My understanding of how ListSets is supposed to work is that it gives you a list of subsets of the collection which can be harvested separately. VuFind's harvester supports harvesting an individual set using the set setting in oai.ini. However, if you are harvesting the base URL without any additional parameters, you should get the content of all sets, and I don't think there's any benefit to using the sets... Or are you asking if we can somehow embed the set information into the harvested records so that it can be used for things like faceting? If that is the case, it is not currently supported, though you could probably manipulate the harvest tool to do it somehow. The results could be messy, though!
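For concreteness, here is a minimal oai.ini sketch for harvesting a single set. The section name and URL are placeholders, and the key names follow my recollection of the harvester's conventions, so verify them against your own copy of oai.ini:

```ini
; Hypothetical section: harvest only the "New York State Maps" set.
; The url value is a placeholder; set uses the setSpec from ListSets.
[NYStateMaps]
url = http://example.edu/oai/request
metadataPrefix = oai_dc
set = p3006coll6
```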

2.) It looks like your diacritics problem has to do with the way entity encoding is being handled somewhere along the line. My guess is that content is getting double-encoded somewhere, so that by the time the Solr document is created at the end of the process, it contains "&#233;" instead of "é", and this is causing the results you are seeing. It may be worthwhile to look at the raw ContentDM output document and the post-XSLT translated Solr document and see if you can figure out where the problem is being introduced. As I said, I'm not too familiar with all of the details of XSLT, but no matter what the problem is, I'm sure there's a way to either double-decode or avoid double-encoding as necessary to get the desired end result.
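To make the suspected failure mode concrete, here is a small Python illustration (my own sketch, not the actual VuFind/XSLT pipeline): one extra escaping pass turns é into the literal text &#233; in the final document, and a second decode pass undoes it.

```python
# Illustration only: simulate entity encoding applied twice, then
# recover the original text by decoding twice.
import html
from xml.sax.saxutils import escape

original = "Chaussegro de Léry"
once = original.replace("é", "&#233;")  # first pass: char -> numeric entity
twice = escape(once)                    # second pass re-escapes the "&"

print(twice)                                # Chaussegro de L&amp;#233;ry
print(html.unescape(twice))                 # Chaussegro de L&#233;ry (still wrong)
print(html.unescape(html.unescape(twice)))  # Chaussegro de Léry (fixed)
```

The same logic applies regardless of where the extra pass happens: either find and remove the second encoding step, or add a matching extra decoding step at the end.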

3.) The latest version of import-xsl.php that I sent you with my previous email (along with the ojs.xsl file) includes a map_string function which does exactly what you want. Take a look at the handling of dc:language for an example. Note that any translation map files you specify need to be kept in your import/translation_maps directory. I am attaching my diglib_lang_map.properties file as an example.
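For what it's worth, a translation map is just a properties file of raw value = normalized value pairs. The sketch below is built from the values in your question and is only my guess at what a language/format map might contain; compare it against the attached diglib_lang_map.properties:

```properties
# Hypothetical translation map entries (raw input = normalized output).
EN = English
Eng = English
en = English
Maps = Map
map = Map
```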


I hope this helps - let me know if you need more details on anything!


- Demian


From: fapeng@notes.cc.sunysb.edu [mailto:fapeng@notes.cc.sunysb.edu]
Sent: Thursday, September 30, 2010 12:45 PM
To: Demian Katz
Cc: vufind-tech@lists.sourceforge.net
Subject: Harvesting -- ContentDM


Harvesting function works, but I encountered a couple of problems.

1. The default harvesting URL uses ListRecords, but there is no setName (collection) info in its output.

Default URL: [url]?verb=ListRecords&metadataPrefix=oai_dc
URL containing setName info: [url]?verb=ListSets

for example:
<set>
  <setSpec>p3006coll6</setSpec>
  <setName>New York State Maps</setName>
</set>


My question is: how can we merge the info from these two URLs into a corresponding metadata record?
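One possible direction, sketched here as standalone Python rather than anything VuFind currently supports: ListSets pairs each setSpec with a setName (element names per the OAI-PMH 2.0 spec), so a harvester could fetch ListSets once, build a lookup table, and attach the collection name to each harvested record.

```python
# Sketch: build a setSpec -> setName lookup from a ListSets response.
# The XML below is a trimmed, hypothetical response fragment.
import xml.etree.ElementTree as ET

listsets_xml = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set>
      <setSpec>p3006coll6</setSpec>
      <setName>New York State Maps</setName>
    </set>
  </ListSets>
</OAI-PMH>"""

NS = "{http://www.openarchives.org/OAI/2.0/}"
root = ET.fromstring(listsets_xml)
set_names = {s.find(NS + "setSpec").text: s.find(NS + "setName").text
             for s in root.iter(NS + "set")}
print(set_names)  # {'p3006coll6': 'New York State Maps'}
```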


2. How can diacritics be handled correctly in OAI-PMH?

Author in ContentDM: Chaussegro de Léry, Gaspard-Joseph, 1682-1756
Author in VuFind: Chaussegro de L&#233;ry, Gaspard-Joseph, 1682-1756


3. Language and Format

Is it possible to use mapping files (as with MARC records) to clean up messy metadata input?

EN = English
Eng = English
en = English
....

Maps = Map
map = Map
...


Thanks


************
Fang Peng
Library Information System/DoIT
Stony Brook University
************************
