[Gramadoir-devel] Tagging punctuation?
From: Kevin D. <ke...@do...> - 2007-02-01 13:02:31
I don't know how doable/desirable this is, but I'm going to float it anyway!

My web-based frontend to the Welsh port of Gramadóir, Klebran, is using a database table to store the output, and make it easier to cross-reference with dictionary lookup, etc. This means that punctuation has to be abstracted from the stream and stored as well. There may be a better way to do this than the way I am doing it (there usually is!), but at the minute I am doing some regexing to tag the punctuation, move it around, and put it back again.

The moving around is necessary because when I echo the text from the online form into Gramadóir --xml, the only reliable sentence boundaries seem to be full stop, exclamation mark, and question mark. So an end-of-sentence glob of punctuation like '?") needs the ? moved to the end for segmentation and storage, and then moved back again for display.

The test version there at the minute worked OK with internal punctuation (eg mae'n <- mae yn) originally, but in my work to "improve" punctuation handling I must have accidentally broken it, because apostrophes no longer appear properly (breaking it was easy, because it was a bit ad hoc!).

So I've just spent the last few days redoing the punctuation handling to cure this, and make it a bit more logical. It's now a lot more robust, although there may still be some corner cases with long sequences of multiple punctuation marks, and I'll upload it shortly.

Having done this, though, and in view of the --api changes, it occurred to me that I may be being stupid (Ptolemaic epicycles come to mind).
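
For anyone following along, the move-and-restore step described above might look roughly like the sketch below. It is not Klebran's actual code: the function names and the rule "take the first . ! or ? in the trailing glob" are assumptions made purely for illustration.

```python
import re

SENTENCE_FINAL = ".!?"
# Trailing run of characters that are neither word characters nor whitespace,
# e.g. the ?") at the end of ("Faint sydd gen ti i fynd?")
TRAILING = re.compile(r"[^\w\s]+$")


def move_mark_to_end(sentence):
    """Move the first sentence-final mark in the trailing glob to the very end.

    Returns the rewritten sentence and the mark's original index inside the
    glob (or -1 if nothing was moved), so that it can be put back later.
    """
    m = TRAILING.search(sentence)
    if not m:
        return sentence, -1
    glob = m.group()
    for i, ch in enumerate(glob):
        if ch in SENTENCE_FINAL:
            moved = glob[:i] + glob[i + 1:] + ch
            return sentence[:m.start()] + moved, i
    return sentence, -1


def restore_mark(sentence, index):
    """Undo move_mark_to_end using the index it returned."""
    if index < 0:
        return sentence
    m = TRAILING.search(sentence)
    glob = m.group()
    mark, rest = glob[-1], glob[:-1]
    return sentence[:m.start()] + rest[:index] + mark + rest[index:]


original = '("Faint sydd gen ti i fynd?")'
moved, idx = move_mark_to_end(original)
print(moved)                     # ("Faint sydd gen ti i fynd")?
print(restore_mark(moved, idx))  # ("Faint sydd gen ti i fynd?")
```

Storing only the index keeps the database row small, but it does assume the glob is not altered between the two calls.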
Gramadóir --xml, of course, handles punctuation by just passing it through, eg:

("Faint sydd gen ti i fynd?")

becomes

<line> ("<N pl="n" gnd="m" m="1">Faint</N> <C>sydd</C> <S>gen ti</S> <S>i</S> <V m="1">fynd</V>?") </line>

I was wondering whether in fact there is a way of wrapping non-internal globs of punctuation in a tag. In other words, not internal stuff like the apostrophe in mae'n or the decimal point in £2.5m or the hyphen in llawn-amser, but the bits above like (" and ?") - I suppose any characters that are not already wrapped in a tag. All that would need to be noted is whether there was a space between the punctuation sequence and the tag, or not.

I could then lazily take these tags, store the contents, and display them without any further hassle. I could reflect on the fact that I have wasted three days, and resolve to ask first the next time!

I realise this would not be XML as we know it, Jim, but it might be useful in other circumstances - I don't know. If it's impossible or too complicated, fine - as I say, I've now got My Patented System working pretty well. But if it were doable without too much trouble, it would certainly mean one less area of Klebran to maintain.

-- 
Pob hwyl / Best wishes

Kevin Donnelly

www.kyfieithu.co.uk - KDE in Welsh
www.eurfa.org.uk - A free dictionary for Welsh
www.rhedadur.org.uk - Conjugating Welsh verbs
www.cymrux.org.uk - Welsh Linux on one CD
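
To make the request concrete, here is a rough post-processing sketch of the kind of wrapping asked about above, applied to the contents of a <line>. It is only an illustration and not anything Gramadóir provides: the PUNCT tag and its before/after attributes are invented names, and the sketch assumes the existing tags are flat (not nested) and that attribute values never contain ">".

```python
import re

# One already-tagged element, e.g. <V m="1">fynd</V> (assumed not nested).
ELEMENT = re.compile(r"<(\w+)\b[^>]*>.*?</\1>", re.S)


def wrap_gap(gap):
    """Wrap each run of non-space characters found outside any element.

    before/after record whether whitespace separated the run from its
    neighbour on that side ("1") or not ("0").
    """
    def repl(m):
        before = "1" if gap[:m.start()][-1:].isspace() else "0"
        after = "1" if gap[m.end():][:1].isspace() else "0"
        return f'<PUNCT before="{before}" after="{after}">{m.group()}</PUNCT>'

    return re.sub(r"\S+", repl, gap)


def tag_punctuation(line):
    """Copy elements through untouched and wrap whatever lies between them."""
    out, pos = [], 0
    for m in ELEMENT.finditer(line):
        out.append(wrap_gap(line[pos:m.start()]))
        out.append(m.group())
        pos = m.end()
    out.append(wrap_gap(line[pos:]))
    return "".join(out)


tagged = ('("<N pl="n" gnd="m" m="1">Faint</N> <C>sydd</C> '
          '<S>gen ti</S> <S>i</S> <V m="1">fynd</V>?")')
print(tag_punctuation(tagged))
```

Internal punctuation such as the apostrophe in mae'n is left alone here only because it already sits inside an element; anything Gramadóir leaves untagged would be wrapped, which matches the "any characters that are not already wrapped in a tag" reading.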