Update of /cvsroot/perl-xml/perl-xml-faq
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv13042
Modified Files:
perl-xml-faq.xml
Log Message:
- add Q&A re splitting huge files
Index: perl-xml-faq.xml
===================================================================
RCS file: /cvsroot/perl-xml/perl-xml-faq/perl-xml-faq.xml,v
retrieving revision 1.19
retrieving revision 1.20
diff -u -d -r1.19 -r1.20
--- perl-xml-faq.xml 11 Nov 2004 09:25:15 -0000 1.19
+++ perl-xml-faq.xml 11 Feb 2005 08:06:16 -0000 1.20
@@ -18,6 +18,7 @@
<year>2002</year>
<year>2003</year>
<year>2004</year>
+ <year>2005</year>
<holder>Grant McLean</holder>
</copyright>
@@ -1177,19 +1178,16 @@
</answer>
</qandaentry>
- <qandaentry id="utf8">
- <question>
- <para>What is UTF-8?</para>
- </question>
- <answer>
+ <qandaentry id="utf8"> <question> <para>What is UTF-8?</para> </question>
+ <answer>
<para>Since Unicode supports character positions higher than 256, a
- representation of those characters will obviously require more than
- one 8-bit byte. There is more than one system for representing
- Unicode characters as byte sequences. UTF-8 is one such system. It
- uses a variable number of bytes (1 to 6) to represent each character.
- This means that the most common characters (ie: 7 bit ASCII) only
- require one byte.</para>
+ representation of those characters will obviously require more than one
+ 8-bit byte. There is more than one system for representing Unicode
+ characters as byte sequences. UTF-8 is one such system. It uses a
+ variable number of bytes (from 1 to 4 according to RFC3629) to represent
+ each character. This means that the most common characters (i.e. 7-bit
+ ASCII) only require one byte.</para>
<para>In UTF-8 encoded data, the most significant bit of each byte will
be 0 for single byte characters and 1 for each byte of a multibyte
@@ -1212,8 +1210,6 @@
2 byte character 110xxxxx 10xxxxxx
3 byte character 1110xxxx 10xxxxxx 10xxxxxx
4 byte character 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
-5 byte character 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
-6 byte character 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
]]></programlisting>
<para>UTF-16 encoding is an alternative byte representation of Unicode
@@ -2022,7 +2018,23 @@
</answer>
</qandaentry>
-
+
+ <qandaentry id="file_split">
+ <question>
+ <para>How can I split a huge XML file into smaller chunks?</para>
+ </question>
+ <answer>
+
+ <para>When your document is too large to slurp into memory, the DOM,
+ XPath and XSLT tools can't really help you. You could write your own SAX
+ filter fairly easily, but Michel Rodriguez has written a <ulink
+ url="http://www.perlmonks.org/index.pl?node_id=429707">general
+ solution</ulink> so you don't have to. You'll find it bundled with XML::Twig
+ from version 3.16.</para>
+
+ </answer>
+ </qandaentry>
+
</qandadiv>
<qandadiv id="xml_problems"><title>Common XML Problems</title>
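Note on the UTF-8 hunk above: the change from "1 to 6" to "1 to 4" bytes follows from RFC 3629 capping code points at U+10FFFF. The sketch below is only an illustration of that rule and of the bit patterns in the programlisting; it is not part of the FAQ, and the helper names `utf8_length` and `bit_patterns` are made up for this example.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 (as limited by RFC 3629) needs for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point < 0x80:       # 0xxxxxxx
        return 1
    if code_point < 0x800:      # 110xxxxx 10xxxxxx
        return 2
    if code_point < 0x10000:    # 1110xxxx 10xxxxxx 10xxxxxx
        return 3
    return 4                    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


def bit_patterns(char: str) -> str:
    """Show the UTF-8 bytes of a single character as space-separated bit strings."""
    return " ".join(f"{byte:08b}" for byte in char.encode("utf-8"))


if __name__ == "__main__":
    # One sample character for each encoded length.
    for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X}  {utf8_length(ord(ch))} byte(s)  {bit_patterns(ch)}")
```

The leading bits of each printed byte should line up with the table in the programlisting: a single-byte character starts with 0, and every byte of a multibyte character starts with 1.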