This request originates from https://sourceforge.net/p/cwb/bugs/80/: The desire is to add an XML print mode to CQP that always generates valid XML.
The DTD for this print mode is entirely unclear, so suggestions (in comments below) are very welcome.
Note to those not familiar with CQP print modes: their implementation is a horrible mess, so we are reluctant to add extensions and are very limited in what can be achieved. Moreover, the print modes only affect some CQP output (kwic concordances, frequency tables from group), but far from all of it.
About the DTD, I will use the same structure as the SGML print mode, but make it XML-compliant. For more ideas about the schema, the FreeLing output formats could be a useful resource.
A kwic concordance (where left and right context might not even contain complete tokens!) is very different from a list of sentences with pre-determined annotation as in the FreeLing output. I don't think we can learn much from it to help us address the challenges of kwic XML output.
SGML print mode is really badly broken if you display s-attributes in the concordance. It also includes them (and any p-attributes) as plain text in the tokens rather than in a way that allows them to be processed e.g. with XSLT.
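To make the difference concrete, here is a rough Python sketch (not CQP code, and not a proposal for the actual format). It contrasts a token whose annotations are glued on as plain text with the same token emitted as an XML node. The element name `token`, the attribute names `pos` and `lemma`, and the `/` separator are assumptions chosen purely for illustration; the real print modes may look different.

```python
import xml.etree.ElementTree as ET


def token_as_text(word, attrs, sep="/"):
    """Join word and annotation values with a separator (plain-text style).

    Nothing is escaped here, so '&' or '<' in the corpus data would end up
    verbatim in the output, and the separator is ambiguous if it also
    occurs inside a value.
    """
    return sep.join([word] + list(attrs.values()))


def token_as_node(word, attrs):
    """Emit the token as an element with one XML attribute per annotation.

    'token' is a hypothetical element name. ElementTree escapes special
    characters in both the text and the attribute values.
    """
    tok = ET.Element("token")
    for name, value in attrs.items():
        tok.set(name, value)
    tok.text = word
    return ET.tostring(tok, encoding="unicode")


# Hypothetical annotations for a single token
attrs = {"pos": "NP", "lemma": "AT&T"}
as_text = token_as_text("AT&T", attrs)
as_node = token_as_node("AT&T", attrs)
print(as_text)
print(as_node)

# The node form can be parsed again, e.g. by an XSLT processor or any XML parser
parsed = ET.fromstring(as_node)
print(parsed.text, parsed.get("pos"), parsed.get("lemma"))
```

The node form is the kind of output that XSLT can address directly (for instance by attribute name), while the plain-text form has to be split on the separator first.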
Yes, I had to find a workaround for the attribute separator. My suggestion about FreeLing was mainly about the possibility of displaying the token and its attributes as nodes instead of plain text. But it seems that this implies a lot of fixes, and that it is something that is going to be fixed in version 4. Is that the case? Is there any draft for the XML output?
No, there's not, because it's (a) really far in the future and (b) not going to be remotely difficult when we actually get there.