[perl-xml-cvs] CVS: perl-xml-faq perl-xml-faq.xml,1.19,1.20

Update of /cvsroot/perl-xml/perl-xml-faq
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv13042

Modified Files:
	perl-xml-faq.xml 
Log Message:
- add Q&A re splitting huge files

Index: perl-xml-faq.xml
===================================================================
RCS file: /cvsroot/perl-xml/perl-xml-faq/perl-xml-faq.xml,v
retrieving revision 1.19
retrieving revision 1.20
diff -u -d -r1.19 -r1.20

--- perl-xml-faq.xml	11 Nov 2004 09:25:15 -0000	1.19
+++ perl-xml-faq.xml	11 Feb 2005 08:06:16 -0000	1.20
@@ -18,6 +18,7 @@
     <year>2002</year>
     <year>2003</year>
     <year>2004</year>
+    <year>2005</year>
     <holder>Grant McLean</holder>
   </copyright>
 
@@ -1177,19 +1178,16 @@
     </answer>
   </qandaentry>
 
-  <qandaentry id="utf8">
-    <question>
-      <para>What is UTF-8?</para>
-    </question>
-    <answer>
+  <qandaentry id="utf8"> <question> <para>What is UTF-8?</para> </question>
+  <answer>
     
       <para>Since Unicode supports character positions higher than 256, a
-      representation of those characters will obviously require more than
-      one 8-bit byte.  There is more than one system for representing
-      Unicode characters as byte sequences.  UTF-8 is one such system.  It
-      uses a variable number of bytes (1 to 6) to represent each character.
-      This means that the most common characters (ie: 7 bit ASCII) only
-      require one byte.</para>
+      representation of those characters will obviously require more than one
+      8-bit byte.  There is more than one system for representing Unicode
+      characters as byte sequences.  UTF-8 is one such system.  It uses a
+      variable number of bytes (from 1 to 4, according to RFC 3629) to represent
+      each character.  This means that the most common characters (i.e. 7-bit
+      ASCII) only require one byte.</para>
 
       <para>In UTF-8 encoded data, the most significant bit of each byte will
       be 0 for single byte characters and 1 for each byte of a multibyte
@@ -1212,8 +1210,6 @@
 2 byte character 110xxxxx 10xxxxxx
 3 byte character 1110xxxx 10xxxxxx 10xxxxxx
 4 byte character 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
-5 byte character 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
-6 byte character 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
 	]]></programlisting>
 
       <para>UTF-16 encoding is an alternative byte representation of Unicode
@@ -2022,7 +2018,23 @@
 
     </answer>
   </qandaentry>
-  
+
+  <qandaentry id="file_split">
+    <question>
+      <para>How can I split a huge XML file into smaller chunks?</para>
+    </question>
+    <answer>
+
+      <para>When your document is too large to slurp into memory, the DOM,
+      XPath and XSLT tools can't really help you.  You could write your own SAX
+      filter fairly easily, but Michel Rodriguez has written a <ulink
+      url="http://www.perlmonks.org/index.pl?node_id=429707">general
+      solution</ulink> so you don't have to.  You'll find it bundled with XML::Twig
+      from version 3.16.</para>
+
+    </answer>
+  </qandaentry>
+
 </qandadiv>
 
 <qandadiv id="xml_problems"><title>Common XML Problems</title>
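
Note on the UTF-8 hunk above: the byte-pattern table now stops at 4 bytes,
matching the limit the revised text attributes to RFC 3629.  The following is
a minimal sketch, not part of the FAQ or of this commit, that checks the lead
byte patterns from the table against Python's built-in UTF-8 encoder.  The
function name utf8_length and the sample characters are assumptions chosen
for illustration only.

```python
# Illustrative sketch only: utf8_length is a hypothetical helper, not FAQ code.
# It maps a lead byte to the sequence length given by the table in the diff
# (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx), limited to 4 bytes.

def utf8_length(lead_byte: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte."""
    if lead_byte & 0b10000000 == 0b00000000:
        return 1  # 0xxxxxxx: single byte (7-bit ASCII)
    if lead_byte & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx
    if lead_byte & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx
    if lead_byte & 0b11111000 == 0b11110000:
        return 4  # 11110xxx
    # 10xxxxxx is a continuation byte; 11111xxx would start the 5 and 6 byte
    # forms that this revision removes from the table.
    raise ValueError(f"not a valid UTF-8 lead byte: {lead_byte:#04x}")


if __name__ == "__main__":
    # Sample characters (assumed for illustration): A, e-acute, euro sign,
    # and one character outside the Basic Multilingual Plane.
    for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        encoded = ch.encode("utf-8")
        # The lead byte alone should predict the encoded length.
        assert utf8_length(encoded[0]) == len(encoded)
        bits = " ".join(f"{b:08b}" for b in encoded)
        print(f"U+{ord(ch):04X}: {len(encoded)} byte(s): {bits}")
```

Running the script prints one line per sample character with its encoded
length and bit pattern, which can be compared by eye with the table in the
diff.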