From: SourceForge.net <no...@so...> - 2010-07-23 19:10:49
Bugs item #3032847, was opened at 2010-07-21 19:15
Message generated for change (Comment added) made by sfrgpiel
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=1126676&aid=3032847&group_id=248804

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: data
Group: None
>Status: Closed
Priority: 7
Private: No
Submitted By: Kevin S. Clarke (ksclarke)
Assigned to: youjun guo (youjun)
Summary: OAI provider returns invalid XML for 12 records

Initial Comment:
Twelve of the oai_dc records returned from the OAI provider are not valid XML because they contain a Unicode character (0x1a) that is invalid for XML.

The twelve records are:

TreeBASE.org/study/TB2:s955 An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
TreeBASE.org/study/TB2:s1119 An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
TreeBASE.org/study/TB2:s1226 An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
TreeBASE.org/study/TB2:s1641 An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
TreeBASE.org/study/TB2:s1731 An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
TreeBASE.org/study/TB2:s1779 An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
TreeBASE.org/study/TB2:s1816 An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
TreeBASE.org/study/TB2:s1862 An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
TreeBASE.org/study/TB2:s1945 An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
TreeBASE.org/study/TB2:s2028 An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
TreeBASE.org/study/TB2:s2146 An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
TreeBASE.org/study/TB2:s2248 An invalid XML character (Unicode: 0x1a) was found in the element content of the document.

Perhaps strip the character before it is returned by the OAI provider?

----------------------------------------------------------------------

>Comment By: William Piel (sfrgpiel)
Date: 2010-07-23 15:10

Message:
I've edited each record to remove the badly formed characters. They were artifacts created during migration, when character sets were not consistent.

----------------------------------------------------------------------

Comment By: Kevin S. Clarke (ksclarke)
Date: 2010-07-22 08:08

Message:
My mistake... the output in the browser does have the encoding. XOM must strip it if it's the default UTF-8.

----------------------------------------------------------------------

Comment By: youjun guo (youjun)
Date: 2010-07-22 07:48

Message:
The Velocity template I am using to create the OAI-PMH response in TreeBASE, ../vmFiles/head.vm, looks like this (every response should include this file at the beginning):

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">

----------------------------------------------------------------------

Comment By: Kevin S. Clarke (ksclarke)
Date: 2010-07-21 22:40

Message:
The XML coming out doesn't have an explicit encoding set:

<?xml version="1.0"?>

so if the text in the XML isn't UTF-8 that could be the problem... not sure what character set is used in the database.

----------------------------------------------------------------------

Comment By: Hilmar Lapp (hlapp)
Date: 2010-07-21 22:17

Message:
See also http://www.w3schools.com/XML/xml_encoding.asp

----------------------------------------------------------------------

Comment By: Hilmar Lapp (hlapp)
Date: 2010-07-21 22:16

Message:
Youjun - are you setting the document encoding expressly to UTF-8? I.e., do you have

<?xml version="1.0" encoding="UTF-8"?>

as the first line? Also, the encoding the character is in *must* match the encoding given in the <?xml?> line.

----------------------------------------------------------------------

Comment By: youjun guo (youjun)
Date: 2010-07-21 21:52

Message:
This problem is caused by some foreign-language letters that exist in the TreeBASE tables, especially in the lastname or firstName fields of the person table. We discussed this before. Maybe we need to use some other charset instead of UTF-8.

----------------------------------------------------------------------
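
The suggestion in the initial comment, stripping the invalid character before the OAI provider returns the record, could be sketched roughly as below. This is only an illustration in Python, not the TreeBASE code (which is not shown in this thread); the function name strip_invalid_xml_chars and the sample value are hypothetical. The character class follows the "Char" production of the XML 1.0 specification, which excludes 0x1a.

```python
import re

# Matches every character that the XML 1.0 "Char" production does NOT allow.
# Allowed: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return `text` with characters forbidden in XML 1.0 removed.

    Hypothetical helper for illustration; not part of TreeBASE.
    """
    return _INVALID_XML_CHARS.sub("", text)


# Hypothetical field value containing the 0x1a character from the report.
dirty = "Mu\x1aller"
print(strip_invalid_xml_chars(dirty))  # prints "Muller"
```

Note that a filter like this only makes the output well-formed; it silently drops whatever the 0x1a stood for. Correcting the stored records, as was done to close this item, addresses the cause rather than the symptom.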