From: Lars M. I. <la...@gn...> - 2004-07-16 12:00:39
We've just gotten a news feed that looks like this:

<ARTIKKEL>
<TITTEL><![CDATA[Stalltips fra Warren]]></TITTEL>
<KATEGORI><![CDATA[Aksjetips]]></KATEGORI>
<DATO><![CDATA[16.07.04 10:56]]></DATO>
</ARTIKKEL>

This is, according to people who know XML, valid.

clocc doesn't seem to be able to parse this -- it just gives a
backtrace.

So here's a quick patch that reads a marked section and just returns
the text in the section.

Index: cllib-xml.lisp
===================================================================
RCS file: /home/cvs/backoffice/clocc/cllib-xml.lisp,v
retrieving revision 1.1
diff -c -r1.1 cllib-xml.lisp
*** cllib-xml.lisp 2 Jun 2004 13:59:54 -0000 1.1
--- cllib-xml.lisp 16 Jul 2004 11:52:16 -0000
***************
*** 745,767 ****
                    'read-xml stream last)
            (make-xml-decl :name name :args (nbutlast atts))))
        (#\!
!        (if (char= #\- (peek-char nil stream))
!            (let ((ch (progn (read-char stream) (read-char stream t nil t))))
!              (assert (char= #\- ch) (ch)
!                      "~s: cannot handle: <!-~c" 'read-xml ch)
!              (make-xml-comment :data (xml-read-comment stream)))
!            (let ((obj (read stream t nil t)))
!              (case obj
!                (xml-tags::entity (make-xml-comment
!                                   :data (xml-read-entity stream)))
!                ((xml-tags::doctype xml-tags::element xml-tags::attlist
!                  xml-tags::notation)
!                 (make-xml-misc :type obj :data
!                                (read-delimited-list #\> stream t)))
!                (t (warn "~s: what is `~s'? proceed, with fingers crossed..."
!                         'read-xml obj)
!                   (cons obj (xml-list-to-alist
!                              (read-delimited-list #\> stream t))))))))
        (t (unread-char ch stream) (xml-read-tag stream)))))
  
  ;; do not need `xml-list-to-alist' in <!DOCTYPE foo [...]>
--- 745,774 ----
                    'read-xml stream last)
            (make-xml-decl :name name :args (nbutlast atts))))
        (#\!
!        (cond
!          ((char= #\- (peek-char nil stream))
!           (let ((ch (progn (read-char stream) (read-char stream t nil t))))
!             (assert (char= #\- ch) (ch)
!                     "~s: cannot handle: <!-~c" 'read-xml ch)
!             (make-xml-comment :data (xml-read-comment stream))))
!          ((char= #\[ (peek-char nil stream))
!           (let ((section (read-section stream)))
!             (format t "Read section ~s~%" section)
!             (assert (eql (read-char stream nil nil) #\>))
!             (cadr section)))
!          (t
!           (let ((obj (read stream t nil t)))
!             (case obj
!               (xml-tags::entity (make-xml-comment
!                                  :data (xml-read-entity stream)))
!               ((xml-tags::doctype xml-tags::element xml-tags::attlist
!                 xml-tags::notation)
!                (make-xml-misc :type obj :data
!                               (read-delimited-list #\> stream t)))
!               (t (warn "~s: what is `~s'? proceed, with fingers crossed..."
!                        'read-xml obj)
!                  (cons obj (xml-list-to-alist
!                             (read-delimited-list #\> stream t)))))))))
        (t (unread-char ch stream) (xml-read-tag stream)))))
  
  ;; do not need `xml-list-to-alist' in <!DOCTYPE foo [...]>
***************
*** 786,791 ****
--- 793,818 ----
                (if (find (peek-char t stream t nil t) "&%<>" :test #'char=)
                    (read stream t nil t)
                    (values (xml-read-text stream "<&")))))))))
+ 
+ (defun read-section (stream)
+   (let ((brackets 0)
+         strings chars)
+     (loop for char = (read-char stream nil nil)
+           do
+           (progn
+             (if (or (eql char #\[)
+                     (eql char #\]))
+                 (progn
+                   (if (eql char #\[)
+                       (incf brackets)
+                       (decf brackets))
+                   (when chars
+                     (push (coerce (nreverse chars) 'string) strings)
+                     (setq chars nil)))
+                 (push char chars)))
+           while (and char
+                      (not (zerop brackets))))
+     (nreverse strings)))
  
  ;;;
  ;;; UI

--
(domestic pets only, the antidote for overdose, milk.)
la...@gn... * Lars Magne Ingebrigtsen
From: Sam S. <sd...@gn...> - 2004-07-16 15:33:53
> * Lars Magne Ingebrigtsen <yn...@ta...t> [2004-07-16 13:55:22 +0200]:
>
> We've just gotten a news feed that looks like this:
>
> <ARTIKKEL>
> <TITTEL><![CDATA[Stalltips fra Warren]]></TITTEL>
> <KATEGORI><![CDATA[Aksjetips]]></KATEGORI>
> <DATO><![CDATA[16.07.04 10:56]]></DATO>
> </ARTIKKEL>
>
> This is, according to people who know XML, valid.

indeed.

> clocc doesn't seem to be able to parse this -- it just gives a
> backtrace.

oops!

> So here's a quick patch that reads a marked section and just returns
> the text in the section.

Thanks.
For the future, I prefer "cvs diff -uw".
Please try the appended patch.

--
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
Those who don't know lisp are destined to reinvent it, poorly.

--- xml.lisp 26 Sep 2003 11:28:57 -0400 2.46
+++ xml.lisp 16 Jul 2004 11:16:21 -0400
@@ -745,12 +745,25 @@
                   'read-xml stream last)
           (make-xml-decl :name name :args (nbutlast atts))))
       (#\!
-       (if (char= #\- (peek-char nil stream))
+       (case (peek-char nil stream)
+         (#\-                     ; comment <!-- ... -->
            (let ((ch (progn (read-char stream) (read-char stream t nil t))))
              (assert (char= #\- ch) (ch)
                      "~s: cannot handle: <!-~c" 'read-xml ch)
-             (make-xml-comment :data (xml-read-comment stream)))
-           (let ((obj (read stream t nil t)))
+             (make-xml-comment :data (xml-read-comment stream))))
+         (#\[                     ; character data <![CDATA[ ... ]]>
+          (let ((res (make-array 20 :adjustable t :element-type 'character
+                                 :fill-pointer #.(length "[CDATA["))))
+            (assert (and (= (read-sequence res stream) 7)
+                         (string= res "[CDATA["))
+                    (res) "~s: cannot handle: <!~a" 'read-xml res)
+            (setf (fill-pointer res) 0)
+            (loop :for len = (vector-push-extend
+                              (read-char stream t nil t) res)
+              :until (and (>= len 3) (string= "]]>" res :start2 (- len 2)))
+              :finally (setf (fill-pointer res) (- len 2)))
+            res))
+         (t (let ((obj (read stream t nil t)))
              (case obj
                (xml-tags::entity (make-xml-comment
                                   :data (xml-read-entity stream)))
@@ -761,7 +774,7 @@
                (t (warn "~s: what is `~s'? proceed, with fingers crossed..."
                         'read-xml obj)
                   (cons obj (xml-list-to-alist
-                             (read-delimited-list #\> stream t))))))))
+                             (read-delimited-list #\> stream t)))))))))
       (t (unread-char ch stream) (xml-read-tag stream)))))
 
 ;; do not need `xml-list-to-alist' in <!DOCTYPE foo [...]>
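The CDATA branch in the patch above can be illustrated outside the reader as a standalone function. This is only a minimal sketch, not CLOCC code: the name read-cdata-section is hypothetical, and it assumes the stream is positioned just after "<!", as it is in the `#\!` branch of the reader. Unlike the patch, it checks the buffer length after each push rather than the index returned by vector-push-extend, so an empty section is also recognized.

```lisp
;; Hypothetical helper (not part of CLOCC): read the body of a CDATA
;; section from STREAM, assumed to be positioned just after "<!".
(defun read-cdata-section (stream)
  (let ((prefix (make-string 7))
        (res (make-array 20 :element-type 'character
                            :adjustable t :fill-pointer 0)))
    ;; The next seven characters must be the "[CDATA[" opener.
    (unless (and (= (read-sequence prefix stream) 7)
                 (string= prefix "[CDATA["))
      (error "~s: not a CDATA section: <!~a" 'read-cdata-section prefix))
    ;; Accumulate characters until the buffer ends with "]]>".
    ;; READ-CHAR signals END-OF-FILE if the section is never closed.
    (loop
      (vector-push-extend (read-char stream t nil) res)
      (let ((len (fill-pointer res)))
        (when (and (>= len 3)
                   (string= "]]>" res :start2 (- len 3)))
          ;; Drop the "]]>" terminator and return only the text.
          (setf (fill-pointer res) (- len 3))
          (return res))))))
```

For example, (with-input-from-string (s "[CDATA[Stalltips fra Warren]]></TITTEL>") (read-cdata-section s)) returns "Stalltips fra Warren" and leaves the stream at "</TITTEL>".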
From: Lars M. I. <la...@gn...> - 2004-07-17 10:21:16
Sam Steingold <sd...@gn...> writes:

> Please try the appended patch.

Works perfectly; thanks.

By the way, in a different feed we got stuff with &apos; entities in
them. This isn't defined in the entities.xml file, but it's supposed
to be a single quote. (The w3 page referred to in the entities.xml
file doesn't list this entity, but it's apparently a standard entity.
But I'm no XML expert. :-)

Here's a patch that adds the entity:

cvs diff: Diffing .
Index: entities.xml
===================================================================
RCS file: /home/cvs/backoffice/data/entities.xml,v
retrieving revision 1.1
diff -u -w -r1.1 entities.xml
--- entities.xml 14 Jul 2004 13:43:11 -0000 1.1
+++ entities.xml 17 Jul 2004 10:13:58 -0000
@@ -1,4 +1,5 @@
 <!-- http://www.w3.org/TR/WD-html40-970708/sgml/entities.html -->
+<!ENTITY apos CDATA "&#39;" -- single quote -->
 <!ENTITY nbsp CDATA "&#160;" -- no-break space -->
 <!ENTITY iexcl CDATA "&#161;" -- inverted exclamation mark -->
 <!ENTITY cent CDATA "&#162;" -- cent sign -->

--
(domestic pets only, the antidote for overdose, milk.)
la...@gn... * Lars Magne Ingebrigtsen
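For context on why the entity is missing from an HTML 4 entity list but still turns up in feeds: XML 1.0 predefines exactly five entities (amp, lt, gt, apos, quot), and apos is the one HTML 4 does not define. A minimal sketch of such a lookup follows; the table and function names are hypothetical and do not reflect how CLOCC stores the contents of entities.xml.

```lisp
;; Hypothetical table (not CLOCC's data structure): the five entities
;; predefined by XML 1.0, keyed by entity name.
(defparameter *xml-predefined-entities*
  '(("amp"  . #\&)
    ("lt"   . #\<)
    ("gt"   . #\>)
    ("apos" . #\')
    ("quot" . #\")))

(defun predefined-entity-char (name)
  "Return the character for the predefined entity NAME, or NIL if unknown."
  (cdr (assoc name *xml-predefined-entities* :test #'string=)))
```

Here (predefined-entity-char "apos") returns #\', while a name outside the five, such as "nbsp", returns NIL and would still need a definition from a file like entities.xml.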
From: Sam S. <sd...@gn...> - 2004-07-17 20:19:03
done

--
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
If a train station is a place where a train stops, what's a workstation?