From: Grant M. <gr...@us...> - 2005-02-11 08:06:27
Update of /cvsroot/perl-xml/perl-xml-faq
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv13042

Modified Files:
    perl-xml-faq.xml
Log Message:
- add Q&A re splitting huge files

Index: perl-xml-faq.xml
===================================================================
RCS file: /cvsroot/perl-xml/perl-xml-faq/perl-xml-faq.xml,v
retrieving revision 1.19
retrieving revision 1.20
diff -u -d -r1.19 -r1.20
--- perl-xml-faq.xml 11 Nov 2004 09:25:15 -0000 1.19
+++ perl-xml-faq.xml 11 Feb 2005 08:06:16 -0000 1.20
@@ -18,6 +18,7 @@
  <year>2002</year>
  <year>2003</year>
  <year>2004</year>
+ <year>2005</year>
  <holder>Grant McLean</holder>
  </copyright>
@@ -1177,19 +1178,16 @@
  </answer>
  </qandaentry>
- <qandaentry id="utf8">
- <question>
- <para>What is UTF-8?</para>
- </question>
- <answer>
+ <qandaentry id="utf8"> <question> <para>What is UTF-8?</para> </question>
+ <answer>
  <para>Since Unicode supports character positions higher than 256, a
- representation of those characters will obviously require more than
- one 8-bit byte. There is more than one system for representing
- Unicode characters as byte sequences. UTF-8 is one such system. It
- uses a variable number of bytes (1 to 6) to represent each character.
- This means that the most common characters (ie: 7 bit ASCII) only
- require one byte.</para>
+ representation of those characters will obviously require more than one
+ 8-bit byte. There is more than one system for representing Unicode
+ characters as byte sequences. UTF-8 is one such system. It uses a
+ variable number of bytes (from 1 to 4 according to RFC3629) to represent
+ each character. This means that the most common characters (ie: 7 bit
+ ASCII) only require one byte.</para>
  <para>In UTF-8 encoded data, the most significant bit of each byte
  will be 0 for single byte characters and 1 for each byte of a multibyte
@@ -1212,8 +1210,6 @@
  2 byte character 110xxxxx 10xxxxxx
  3 byte character 1110xxxx 10xxxxxx 10xxxxxx
  4 byte character 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
-5 byte character 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
-6 byte character 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
  ]]></programlisting>
  <para>UTF-16 encoding is an alternative byte representation of Unicode
@@ -2022,7 +2018,23 @@
  </answer>
  </qandaentry>
-
+
+ <qandaentry id="file_split">
+ <question>
+ <para>How can I split a huge XML file into smaller chunks</para>
+ </question>
+ <answer>
+
+ <para>When your document is too large to slurp into memory, the DOM,
+ XPath and XSLT tools can't really help you. You could write your own SAX
+ filter fairly easily, but Michel Rodriguez has written a <ulink
+ url="http://www.perlmonks.org/index.pl?node_id=429707">general
+ solution</ulink> so you don't have to. You'll find it bundled with XML::Twig
+ from version 3.16.</para>
+
+ </answer>
+ </qandaentry>
+
  </qandadiv>
  <qandadiv id="xml_problems"><title>Common XML Problems</title>
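
Note on the UTF-8 hunk: the 1-to-4-byte limit and the lead-byte patterns in the revised table can be checked with a short script. What follows is a minimal sketch in Python and is not part of the commit; the function names utf8_length and lead_byte_length are hypothetical, chosen only for illustration.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one code point.

    RFC 3629 restricts UTF-8 to U+0000..U+10FFFF, so the answer is 1 to 4.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point < 0x80:
        return 1  # 0xxxxxxx
    if code_point < 0x800:
        return 2  # 110xxxxx 10xxxxxx
    if code_point < 0x10000:
        return 3  # 1110xxxx 10xxxxxx 10xxxxxx
    return 4      # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


def lead_byte_length(first_byte: int) -> int:
    """Return the sequence length announced by the first byte of a character."""
    if first_byte < 0x80:
        return 1  # most significant bit is 0: single-byte character
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    # 10xxxxxx is a continuation byte; 111110xx and above are the old
    # 5- and 6-byte lead bytes that RFC 3629 no longer allows.
    raise ValueError("not a valid UTF-8 lead byte")


if __name__ == "__main__":
    # Compare both helpers against Python's own encoder.
    for ch in "A\u00e9\u20ac\U0001F600":
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X}", len(encoded),
              utf8_length(ord(ch)), lead_byte_length(encoded[0]))
```

Each line printed by the demo loop should show the same length three times: once from the encoder, once from the code point range, and once from the lead byte alone.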