MMAX2 GUI writes out invalid XML
In my input "words" XML file, some words contain characters that are special in XML (e.g., <, >, &), which I handle by wrapping the text of each word in a CDATA section. MMAX2 reads these in fine. However, if I modify the base data from within MMAX2, the words XML file that MMAX2 writes out neither escapes these characters nor wraps them in CDATA sections. The next time I open the project in MMAX2, reading the words file fails with XML parsing errors. MMAX2 should always write out valid XML.
I am currently fixing this. There is some simple logic in MMAX2Discourse.java that is supposed to handle it, but it's not correct: it checks whether the entire word is equal to "<", etc., which doesn't cover cases like "<40". The simplest solution is to wrap the word text in a CDATA section. Will post the patch soon.
The diff I created was noisy because my editor's formatting settings differ from the project's. Here is the relevant code (very simple):
fw.write("<word " + currentAttributes.trim() + ">");
Node childNode = currentWordNode.getFirstChild();
String childText = childNode.getNodeValue();
fw.write("<![CDATA[" + childText + "]]></word>\n");
No need to do all the if-else checking.
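One caveat with the CDATA approach: a word whose text contains the literal sequence "]]>" would end the CDATA section early and again produce invalid XML. Below is a minimal sketch of a guard against that, assuming the word text and the serialized attribute string are already available. The class CdataWordWriter and its methods are hypothetical names for illustration, not part of MMAX2.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper for illustration; not part of the MMAX2 code base.
class CdataWordWriter {

    // Wraps text in a CDATA section. A literal "]]>" would terminate the
    // section early, so it is split across two adjacent CDATA sections.
    static String wrapInCdata(String text) {
        String safe = (text == null) ? "" : text.replace("]]>", "]]]]><![CDATA[>");
        return "<![CDATA[" + safe + "]]>";
    }

    // Writes one <word> element. "attributes" is assumed to be the
    // already-serialized attribute string, as in the snippet above.
    static void writeWord(Writer out, String attributes, String text) throws IOException {
        out.write("<word " + attributes.trim() + ">" + wrapInCdata(text) + "</word>\n");
    }

    public static void main(String[] args) throws IOException {
        StringWriter sw = new StringWriter();
        writeWord(sw, "id=\"word_1\"", "<40");
        System.out.print(sw);
    }
}
```

An XML parser concatenates adjacent CDATA sections into the element's text, so the word reads back unchanged.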
I've attached a patch for the fix described in this ticket. The patch also changes the output so that markables are written sorted by markable ID, which is handy when keeping annotation data under version control.
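For readers who want the idea without opening the patch, here is a rough sketch of one way to sort by ID. It is not the code from the patch. It assumes IDs are strings with a numeric suffix, such as "markable_12", and orders them by that number so that "markable_2" comes before "markable_10". The class MarkableIdSort is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; not the code from the attached patch.
class MarkableIdSort {

    // Returns the trailing digits of an ID such as "markable_12" as a number,
    // or -1 if the ID does not end in digits (such IDs sort first).
    static long numericSuffix(String id) {
        int i = id.length();
        while (i > 0 && Character.isDigit(id.charAt(i - 1))) {
            i--;
        }
        return (i == id.length()) ? -1L : Long.parseLong(id.substring(i));
    }

    // Returns a sorted copy: by numeric suffix first, then by the full
    // string to break ties deterministically.
    static List<String> sortedIds(List<String> ids) {
        List<String> copy = new ArrayList<>(ids);
        copy.sort(Comparator.<String>comparingLong(MarkableIdSort::numericSuffix)
                .thenComparing(Comparator.<String>naturalOrder()));
        return copy;
    }
}
```

A stable, deterministic order keeps diffs between saved versions limited to the markables that actually changed.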