From: Christiaan H. <cmh...@gm...> - 2010-09-09 09:09:59
On Sep 9, 2010, at 3:01, Adam R. Maxwell wrote:

>
> Indeed, serious thinking in American popular religion must be the most valuable commodity on earth. It certainly seems to be the scarcest. -- Os Guinness, "The Gravedigger File"
>
> On Sep 8, 2010, at 5:08 PM, Christiaan Hofman wrote:
>
>>
>> On Sep 9, 2010, at 1:57, Maxwell, Adam R wrote:
>>
>>>
>>> On Sep 8, 2010, at 16:44, Christiaan Hofman wrote:
>>>
>>>> And remember that the whole point of this cleaning is to correct invalid input from the user.
>>>
>>> It's to correct invalid input from online sources, and DOI in particular. Normalizing it by unescaping and then escaping is the only way to fix those. With your change, you may as well remove that method entirely.
>>
>> It's not doing nothing. What else would be invalid, apart from possibly containing characters that should be escaped?
>
> Standalone % and # characters are valid in DOIs.
>
> http://www.doi.org/handbook_2000/enumeration.html#2.5

Then the old version did not do that right either, as it left # intact. It did bad things to valid URLs, though, which is worse. I don't see how one can know whether there is a % that should be escaped, and unescaping first is clearly wrong. Therefore I think it's better to leave them alone, especially as they will probably be very rare anyway. Perhaps we can just do the old version for strings we know are coming from a DOI (like a Doi field).

Christiaan
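
To make the trade-off concrete, here is a rough Python sketch of the two behaviours discussed above. It is not the actual cleaning method: the function names, the set of characters left unescaped, and the sample strings (including the made-up DOI) are all assumptions chosen for illustration.

```python
from urllib.parse import quote, unquote

# Characters treated as URL syntax and never escaped in this sketch.
# This set is an assumption, not the one the real cleaning method uses.
URL_SYNTAX = ":/?&=#"


def roundtrip_clean(s: str) -> str:
    """Unescape, then escape again (the normalization being debated)."""
    return quote(unquote(s), safe=URL_SYNTAX)


def escape_only_clean(s: str) -> str:
    """Escape unsafe characters but leave every existing '%' alone."""
    return quote(s, safe=URL_SYNTAX + "%")


def clean(s: str, from_doi_field: bool = False) -> str:
    """Hypothetical dispatcher: round-trip only for known DOI input."""
    return roundtrip_clean(s) if from_doi_field else escape_only_clean(s)


if __name__ == "__main__":
    url = "http://example.com/find?q=a%26b"  # %26 is an escaped '&' in a value
    doi = "10.1000/50%off"                   # made-up DOI with a standalone '%'

    print(roundtrip_clean(url))    # the escaped '&' turns into a bare separator
    print(escape_only_clean(url))  # the valid URL is left unchanged
    print(roundtrip_clean(doi))    # the lone '%' gets escaped
    print(escape_only_clean(doi))  # the lone '%' is left as-is
```

The round trip changes the meaning of an already valid URL, while escaping only leaves a standalone % in a DOI unescaped. Neither function can tell whether a sequence such as %20 in a DOI is a literal or an already escaped space, which is why restricting the round trip to input known to come from a DOI field looks like the safer compromise.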