The attached file contains a number of XML-encoded single and double quote characters. I have been trying to use the unesc feature of XMLStarlet to unescape them in order to facilitate a data comparison. Sometimes, instead of replacing a full "&quot;" or "&apos;" with " or ' respectively, the tool simply strips the ampersand. When running the attached file through "xml unesc", the last sentence starts with "Kerryapos;s confirmation" instead of "Kerry's confirmation".
POL-Kerry-Secretary-Of-State-Confirmation-5
I recognize the output of sending this document through unesc may well be invalid xml. Don't worry--I'm not assuming it will be valid xml. I'm just trying to remove a class of differences (xml encodings) between a set of files I'm comparing in order to unmask other differences.
unesc reads lines into a 4096-byte buffer, and it wasn't handling the case where an entity started at the 4095th byte of the line.
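To illustrate the failure mode described above, here is a minimal Python sketch. It is not XMLStarlet's actual C code: the function names, the entity table, and the way the naive version drops a trailing ampersand are all assumptions chosen to reproduce the reported symptom. It assumes a 4096-byte buffer holds 4095 payload bytes per read, so an entity starting at the 4095th byte is split across two reads.

```python
import re

# Hypothetical entity table; the five predefined XML entities.
ENTITIES = {"quot": '"', "apos": "'", "amp": "&", "lt": "<", "gt": ">"}
CHUNK = 4095  # assumed payload per read for a 4096-byte buffer
MAX_ENTITY_LEN = max(len(name) for name in ENTITIES) + 2  # '&' + name + ';'

_ENTITY_RE = re.compile(r"&(\w+);")


def unescape_chunk(chunk):
    """Replace known entities in one chunk; unknown ones are left as-is."""
    return _ENTITY_RE.sub(lambda m: ENTITIES.get(m.group(1), m.group(0)), chunk)


def unesc_naive(line, size=CHUNK):
    """Process each chunk in isolation (models the reported bug)."""
    out = []
    for i in range(0, len(line), size):
        chunk = line[i:i + size]
        if chunk.endswith("&"):
            # Stand-in for the faulty behavior: the '&' that could not be
            # completed inside this chunk is lost.
            chunk = chunk[:-1]
        out.append(unescape_chunk(chunk))
    return "".join(out)


def unesc_carry(line, size=CHUNK):
    """Carry an incomplete trailing entity over into the next chunk."""
    out = []
    carry = ""
    for i in range(0, len(line), size):
        chunk = carry + line[i:i + size]
        carry = ""
        amp = chunk.rfind("&")
        # Hold back a short, unterminated '&...' tail so it can be
        # completed by the next read.
        if amp != -1 and ";" not in chunk[amp:] and len(chunk) - amp < MAX_ENTITY_LEN:
            chunk, carry = chunk[:amp], chunk[amp:]
        out.append(unescape_chunk(chunk))
    out.append(carry)  # an unterminated tail at end of line is emitted verbatim
    return "".join(out)


# Build a line whose '&' is the 4095th character.
line = "x" * 4089 + "Kerry" + "&apos;s confirmation"
print(unesc_naive(line)[-24:])
print(unesc_carry(line)[-20:])
```

The carry-over approach is only one way to handle the boundary; reading the whole line before unescaping would avoid the split entirely, at the cost of unbounded memory for very long lines.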
Fixed in commit a9f8ec60a3510082bb8807d928805d17ce89222a.
Fixed in 1.5.0