q-lang-users Mailing List for Q - Equational Programming Language (Page 9)
Brought to you by:
agraef
From: Albert G. <Dr....@t-...> - 2008-01-21 23:54:31
|
John Cowan wrote:
> 1) Move the regex function (and possibly the low-level regex functions as well)
> back into qlib.c. Regular expressions are defined by Posix but aren't
> in any way system-dependent.

Yes, in hindsight it seems much more convenient to have this in the prelude. So I'm all for this change, but it's going to break existing code. :( Does anyone object to this?

> 2) Move the Q functions that are overridden by clib functions (documented
> in 12.20) into the examples. Currently they are loaded but then never used.

Hmm, these functions are there so that the library doesn't break if you want/need to remove clib from the prelude. That should still be possible, although I haven't checked it for a while. So I'd rather keep them where they are. Any specific reason why you want them removed, other than tidiness?

Albert

--
Dr. Albert Gr"af
Dept. of Music-Informatics, University of Mainz, Germany
Email: Dr....@t-..., ag...@mu...
WWW: http://www.musikinformatik.uni-mainz.de/ag
From: Albert G. <Dr....@t-...> - 2008-01-21 23:40:39
|
John Cowan wrote:
> The current structuring of chapters 10, 11, and 12 has more to do
> with where a function is defined and what language it is written in
> (Q or C) than with what it does, so the user ends up having to search
> all three chapters to find things.

That's true, but your proposed reorganization is a lot of work. ;-) I'll do it eventually, hopefully in time for the 8.0 release.

--
Dr. Albert Gr"af
Dept. of Music-Informatics, University of Mainz, Germany
Email: Dr....@t-..., ag...@mu...
WWW: http://www.musikinformatik.uni-mainz.de/ag
From: Albert G. <Dr....@t-...> - 2008-01-21 23:37:44
|
John Cowan wrote:
> Hmm, right. (I should have re-read the existing docs.) How about this?
>
> bytecopy FromByteString FromIndex ToByteString ToIndex Length

I already started rewriting the new get_xxx/put_xxx functions so that you can also read/write slices instead of just single elements. That has the advantage that you can use indices relative to the different int/float types instead of just bytes.

--
Dr. Albert Gr"af
Dept. of Music-Informatics, University of Mainz, Germany
Email: Dr....@t-..., ag...@mu...
WWW: http://www.musikinformatik.uni-mainz.de/ag
From: John C. <co...@cc...> - 2008-01-21 18:45:17
|
(These arose out of my suggested revision of the documentation.)

1) Move the regex function (and possibly the low-level regex functions as well) back into qlib.c. Regular expressions are defined by Posix but aren't in any way system-dependent.

2) Move the Q functions that are overridden by clib functions (documented in 12.20) into the examples. Currently they are loaded but then never used.

--
John Cowan co...@cc... http://www.ccil.org/~cowan
Unless it was by accident that I had offended someone, I never apologized. --Quentin Crisp
From: John C. <co...@cc...> - 2008-01-21 18:42:49
|
The current structuring of chapters 10, 11, and 12 has more to do with where a function is defined and what language it is written in (Q or C) than with what it does, so the user ends up having to search all three chapters to find things. This is particularly painful when using a printed manual, of course, and an index wouldn't help that much unless you know the name of what you are looking for.

Since most people load the whole standard prelude, the boundaries between modules are of secondary importance at best. Restructuring by function would make it possible to find all the functions for working with tuples in one place, with just an annotation to note which ones are built-in, part of tuple.q, or whatever, just as is done now in the Clib chapter.

I propose the following restructuring:

Chapter 10: The Standard Library: Values and Data Structures
    Predicates [11.5 + 10.8 is* + 11.1 eq, neq]
    Numeric Functions
    Basic Numeric Functions [10.1 + 10.2 random, seed + 11.1 abs, max, min, nums, numsby, sgn]
    Integer Functions [12.19]
    Floating-Point Functions [rest of 10.2 + 11.10]
    Rational Numbers [11.12]
    Complex Numbers [11.11]
    Sequence Functions
    Basic Sequence Functions [10.3 + 11.6]
    List Functions [rest of 11.1]
    Tuple Functions [11.2]
    String Functions [11.3 + 12.2]
    Byte Strings [12.3]
    References [12.14]
    Regular Expressions [12.18]
    Streams [11.8]
    Function Functions [11.1 curry, curry3, id, uncurry, uncurry3 + 10.8 flip]
    Lambda Expressions [10.6]
    Exceptions [10.7 + 11.14]
    Conversion Functions [10.4, perhaps some from other places]
    The Standard Type Library [11.7]

Chapter 11: The Standard Library: I/O
    Input/Output Functions
    Basic I/O [10.5 + 12.4]
    Formatted I/O [12.5]
    Files and Directories [12.6]
    Graphics [11.13]
    System Information [rest of 10.8 + 12.11]

Chapter 12: The Standard Library: Posix
    Manifest Constants [12.1]
    Process Control [12.7]
    Low-Level I/O [12.8]
    Terminal Operations [12.9]
    Readline Interface [12.10]
    Sockets [12.12]
    Threads [12.13]
    Time Functions [12.15]
    Internationalization [12.16]
    Filename Globbing [12.17]
    Option Parsing [11.4]

The special forms in 11.9 (cond.q) should be moved to Chapter 9.

I've probably overlooked some individual functions that should be moved.

--
John Cowan co...@cc... http://www.ccil.org/~cowan
Clear? Huh! Why a four-year-old child could understand this report. Run out and find me a four-year-old child. I can't make head or tail out of it. --Rufus T. Firefly on government reports
From: John C. <co...@cc...> - 2008-01-21 17:20:50
|
Albert Graef scripsit:

> John Cowan wrote:
> > What you've got looks good. For strings, I had something in mind like
> >
> > get_string Bytes EncodingName StartIndex EndIndex
> >
> > to decode a portion of a byte string into a string, and
> >
> > put_string Bytes EncodingName Index String
>
> Should the indices be byte offsets? Also, put_string would just
> overwrite the part of the string at the given offset, right?

Yes to both questions.

> In that
> case it should be easy to get that kind of functionality with existing
> routines, I just need to add a function to replace a slice of a byte
> string in-place.

Hmm, right. (I should have re-read the existing docs.) How about this?

    bytecopy FromByteString FromIndex ToByteString ToIndex Length

> > byte_string EncodingName String
>
> If I understand this correctly, bytestr already provides that
> functionality.

Yes, you're right. As I said, I should have re-read the existing docs.

--
John Cowan co...@cc... http://ccil.org/~cowan
If I have not seen as far as others, it is because giants were standing on my shoulders. --Hal Abelson
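The `bytecopy` proposal above can be illustrated with a small Python sketch. This is only one reading of the argument list `FromByteString FromIndex ToByteString ToIndex Length`: it assumes byte offsets, an in-place overwrite of the target, and a hard error when either slice runs out of range. None of that is confirmed by the thread beyond "byte offsets" and "overwrite", and the Python function is a hypothetical stand-in, not the clib implementation.

```python
def bytecopy(src: bytes, src_index: int, dst: bytearray, dst_index: int, length: int) -> None:
    """Overwrite dst[dst_index:dst_index+length] with src[src_index:src_index+length]."""
    if src_index < 0 or dst_index < 0 or length < 0:
        raise ValueError("indices and length must be non-negative")
    if src_index + length > len(src) or dst_index + length > len(dst):
        # Assumption: out-of-range slices are rejected rather than clipped.
        raise ValueError("slice out of range")
    dst[dst_index:dst_index + length] = src[src_index:src_index + length]


# The target keeps its size; only the addressed slice changes.
target = bytearray(b"hello, world")
bytecopy(b"WORLD!", 0, target, 7, 5)
print(target)
```

Because the target is modified in place and never resized, this covers the "replace a slice of a byte string in-place" case without allocating a new byte string.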
From: Albert G. <Dr....@t-...> - 2008-01-21 07:12:56
|
John Cowan wrote:
> What you've got looks good. For strings, I had something in mind like
>
> get_string Bytes EncodingName StartIndex EndIndex
>
> to decode a portion of a byte string into a string, and
>
> put_string Bytes EncodingName Index String

Should the indices be byte offsets? Also, put_string would just overwrite the part of the string at the given offset, right? In that case it should be easy to get that kind of functionality with existing routines, I just need to add a function to replace a slice of a byte string in-place.

> to encode a string into a portion of a byte string. These allow you to
> process character data wherever it might exist in a binary sequence.
> However, the latter produces an unpredictable number of bytes, and might
> need to be supplemented with a factory
>
> byte_string EncodingName String

If I understand this correctly, bytestr already provides that functionality. That is, you can use 'bytestr (S,CODESET)' to create a byte string in a given encoding from a string, and you can even do 'bytestr (S,CODESET,SIZE)' if you want the byte string to be truncated or zero-padded to fit a given size. Is that what you meant?

Albert

--
Dr. Albert Gr"af
Dept. of Music-Informatics, University of Mainz, Germany
Email: Dr....@t-..., ag...@mu...
WWW: http://www.musikinformatik.uni-mainz.de/ag
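The truncate-or-pad behaviour described for 'bytestr (S,CODESET,SIZE)' can be sketched in Python as follows. The function below is a hypothetical analogue written for illustration: it takes separate arguments instead of a Q tuple, and it assumes that truncation is done on the encoded bytes (so a multi-byte character may be cut in half), which the message does not specify.

```python
from typing import Optional


def bytestr(s: str, codeset: str = "utf-8", size: Optional[int] = None) -> bytes:
    """Encode s in the given codeset; optionally truncate or zero-pad to size bytes."""
    data = s.encode(codeset)
    if size is None:
        return data
    # Slice first (truncate), then pad on the right with NUL bytes.
    return data[:size].ljust(size, b"\0")


print(bytestr("Gr\u00e4f", "latin-1"))   # one byte per character in latin-1
print(bytestr("abc", "ascii", 5))        # zero-padded to 5 bytes
print(bytestr("abcdef", "ascii", 3))     # truncated to 3 bytes
```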
From: John C. <co...@cc...> - 2008-01-21 04:24:38
|
Albert Graef scripsit:

> John, I hope that this is what you had in mind. It still lacks the "...
> and strings using a specified character code" part, though. Considering
> that a byte string might actually contain character data in an arbitrary
> encoding, it's not clear to me how that should be done, could you
> elaborate please?

What you've got looks good. For strings, I had something in mind like

    get_string Bytes EncodingName StartIndex EndIndex

to decode a portion of a byte string into a string, and

    put_string Bytes EncodingName Index String

to encode a string into a portion of a byte string. These allow you to process character data wherever it might exist in a binary sequence. However, the latter produces an unpredictable number of bytes, and might need to be supplemented with a factory

    byte_string EncodingName String

--
John Cowan co...@cc... http://www.ccil.org/~cowan
You escaped them by the will-death and the Way of the Black Wheel. I could not. --Great-Souled Sam
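To make the `get_string`/`put_string` idea concrete, here is a rough Python sketch. It is an illustration under assumptions, not the proposed Q API: indices are taken to be byte offsets, `EndIndex` is treated as exclusive, and `put_string` returns the number of bytes written, which is one possible way to deal with the "unpredictable number of bytes" mentioned above.

```python
def get_string(data: bytearray, encoding: str, start: int, end: int) -> str:
    """Decode the bytes data[start:end] (end exclusive, an assumption) into a string."""
    return bytes(data[start:end]).decode(encoding)


def put_string(data: bytearray, encoding: str, index: int, s: str) -> int:
    """Encode s and overwrite data in place at byte offset index; return bytes written."""
    encoded = s.encode(encoding)
    if index < 0 or index + len(encoded) > len(data):
        raise ValueError("encoded string does not fit at this offset")
    data[index:index + len(encoded)] = encoded
    return len(encoded)


buf = bytearray(8)
written = put_string(buf, "utf-8", 2, "Gr\u00e4f")  # 4 characters, 5 bytes in UTF-8
print(written, buf.hex())
print(get_string(buf, "utf-8", 2, 2 + written))
```

The byte count depends on the encoding (the same string needs 4 bytes in latin-1 and 5 in UTF-8), which is why a caller cannot know in advance how much of the buffer a `put_string` will touch.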
From: Albert G. <Dr....@t-...> - 2008-01-20 20:28:00
|
Eddie Rucker wrote:
> A 64 bit port would be nice but except for arithmetic, I don't see 64 bits
> speeding up interpreter that much (correct me if I'm wrong).

No, you're right, but still it's essential to get Q working on 64 bit without any hitches asap. That's the next thing on my TODO list after porting Qt/Q to Qt4 (which is halfway done now). Yes, I promised this before, but I'm getting serious about it now. ;-)

> I would like to follow the same evaluation procedure against mzscheme (my other favorite interpreter) and see how Q compares.
> Mzscheme is not particularly fast either but is useful because of its design and the libraries (modules) provided.

Well, MzScheme seems to do fairly well in the Shootout, actually. I'm going to give it a try. (I already tried the latest version of UMB Scheme, but it doesn't appear to work on my SUSE Linux system.)

Albert

--
Dr. Albert Gr"af
Dept. of Music-Informatics, University of Mainz, Germany
Email: Dr....@t-..., ag...@mu...
WWW: http://www.musikinformatik.uni-mainz.de/ag
From: Eddie R. <er...@bm...> - 2008-01-20 18:06:45
|
Albert wrote:
> Unfortunately, I don't see any way to make the current interpreter go
> substantially faster without sacrificing some features which make Q the
> special kind of beast that it is now (unrestricted rewriting rules,
> runtime special forms and all that).

IMHO, the features are more important than speed to a point. As I've stated before, Q's performance is plenty for my needs thus far.

> and even that will need considerable effort, which I think is better
> expended towards module development, the 64 bit port and a bytecode ->
> native code compiler. (Note that the latter is a big project in itself,

Modules are most important thing to me right now. A 64 bit port would be nice but except for arithmetic, I don't see 64 bits speeding up interpreter that much (correct me if I'm wrong).

I would like to follow the same evaluation procedure against mzscheme (my other favorite interpreter) and see how Q compares. Mzscheme is not particularly fast either but is useful because of its design and the libraries (modules) provided.

Eddie
From: Albert G. <Dr....@t-...> - 2008-01-20 11:11:36
|
Hi Eddie,

I'm taking this to the list since I have no doubt that some here will be interested in this kind of stuff. :)

Background: I've started implementing some of the benchmarks of the Language Shootout (http://shootout.alioth.debian.org/), to see exactly where Q stands performance-wise. So far, I've been running some tests using the 'recursive' benchmark. This is just simple recursion and arithmetic, and is notoriously difficult for interpreted "dynamic" languages; e.g., Python is already some 280 times slower than C there. I've attached the Q version of this benchmark, in case anyone wants to give it a try.

The good news is that Hugs (the Haskell interpreter) is some 60% slower than Q in this benchmark, which confirms some earlier results; the bad news is that Python is still some 3.5 times faster. It goes without saying that I was hoping to see more favourable results, but that's just what they are, and I'm looking into ways to improve the situation. Unfortunately, I don't see any way to make the current interpreter go substantially faster without sacrificing some features which make Q the special kind of beast that it is now (unrestricted rewriting rules, runtime special forms and all that).

Just for the record, here are some of the results I got for the recursive benchmark with N=7 (user cpu times in seconds on an AMD 2500+ running Linux as reported by 'time', just running the corresponding interpreters directly on the source scripts, i.e. no prior bytecode compilation):

    Python    17.99   1.0
    Ruby      25.29   1.41
    Q         63.86   3.55
    Hugs     101.43   5.64

I'll publish a more complete set of results on the wiki asap.

>> Do you think it is spending a lot of time matching rules or
>> allocating and deallocating memory?
> Pattern matching and special forms, probably. There's some overhead
> there that can hardly be avoided without sacrificing maintainability
> and/or essential language features like lazy evaluation,
> call-by-pattern and views.

Ok, to get some hard figures, I've run the interpreter with the GNU profiler now. -O3 was used to compile the interpreter, to make maximum use of inlined functions. Here's what I get when running the recursive benchmark with N=7 (% of total computation time):

    30% pattern matcher (match()+matchx(), 223484636 calls)
    29% bytecode executor (evalu(), 30 calls)
    14% expression (de)allocation (qmfree()+x_alloc(), 319261739 calls)
    13% builtin activations (evalb_with_frame(), 184237907 calls)
    11% stack manipulations (252745447 calls)

The stack manipulations are mostly pushes of function symbols, applications, numbers and lhs values for this benchmark. The remaining 3% mostly go into the "real" computation (executing builtin arithmetic functions, etc.).

As I already optimized the hell out of the evalu() function, obviously the most promising candidate for further optimization is the pattern matcher. But I don't really see what can be improved about it (or the memory management and stack manipulation routines, for that matter), except that these routines could be made to use inlining as much as possible, to save a little C function call overhead.

These results are hardly surprising, except that maybe the pattern matcher uses a somewhat larger fraction than I expected, but that may be due to the special characteristics of the 'recursive' benchmark. I doubt that much more than 10-15% could be shaved off the running time, and even that will need considerable effort, which I think is better expended towards module development, the 64 bit port and a bytecode -> native code compiler. (Note that the latter is a big project in itself, so don't hold your breath for that yet.)

Comments?

Cheers,
Albert

--
Dr. Albert Gr"af
Dept. of Music-Informatics, University of Mainz, Germany
Email: Dr....@t-..., ag...@mu...
WWW: http://www.musikinformatik.uni-mainz.de/ag
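The figures in this message can be cross-checked with a few lines of arithmetic. The Python snippet below only recomputes numbers quoted above (the second column of the timing table as a ratio to the Python time, the Hugs versus Q comparison, and the share left over after the five profiled categories); the variable names are made up for the illustration.

```python
# User CPU times in seconds for the 'recursive' benchmark with N=7, as quoted above.
times = {"Python": 17.99, "Ruby": 25.29, "Q": 63.86, "Hugs": 101.43}

# Second column of the table: each time relative to the Python time.
base = times["Python"]
ratios = {name: round(t / base, 2) for name, t in times.items()}
print(ratios)

# "Hugs is some 60% slower than Q": ratio of the two running times.
hugs_vs_q = times["Hugs"] / times["Q"]
print(f"Hugs takes {hugs_vs_q:.2f}x the time of Q")

# Profile shares in percent; whatever is left goes into the "real" computation.
profile = {
    "pattern matcher": 30,
    "bytecode executor": 29,
    "expression (de)allocation": 14,
    "builtin activations": 13,
    "stack manipulations": 11,
}
remaining = 100 - sum(profile.values())
print(f"remaining: {remaining}%")
```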
From: Albert G. <Dr....@t-...> - 2008-01-18 12:32:43
|
John Cowan wrote:
> You should probably write a manual section, web page, and/or book chapter on
> "Q scripting", loosely defined as Q programs without any rewrite rules.

I take it that by "without any rewrite rules" you actually mean "basic stuff we usually do with scripting languages". ;-) Stuff like traversing directories, batch processing of text or images (using ImageMagick), basic web programming etc.

If anyone has some ideas which concrete examples should go in there (basic stuff, no 3 manyear projects please ;-), or maybe has some short but instructive Perl/Python/Ruby examples to be ported, I'll have a look, as time permits. As always, any help is appreciated. :)

Albert

--
Dr. Albert Gr"af
Dept. of Music-Informatics, University of Mainz, Germany
Email: Dr....@t-..., ag...@mu...
WWW: http://www.musikinformatik.uni-mainz.de/ag
From: Albert G. <Dr....@t-...> - 2008-01-18 12:11:27
|
John Cowan wrote:
>>> Following that, byte vectors (mutable byte strings) would also be
>>> a useful addition for dealing with large quantities of homogeneous
>>> data; these have to be done in C, but would be far more efficient
>>> than any alternative representation.
>> Yes, actually I've had something like this on my TODO list for a while,
>> in order to provide better support for numeric and signal processing
>> applications.
>
> The proposed R6RS Scheme bytevectors provide read-write access to
> any point in a bytevector as signed and unsigned {8,16,32}-bit values,
> single and double floats, and strings using a specified character code.

Ok, at long last I decided to give this a go. It's in cvs now. Here's the relevant blurb from qdoc.info:

Byte Strings as Mutable C Vectors
---------------------------------

As of Q 7.11, `clib' supports a number of additional operations which allow you to treat byte strings as mutable C vectors of signed/unsigned 8/16/32 bit integers or single/double precision floating point numbers. The following functions provide read/write access to the elements of such C vectors. Note that the given index argument `I' is interpreted relative to the corresponding element type. Thus, e.g., `get_int32 B I' returns the `I'th 32 bit integer rather than the integer at byte offset `I'.

NOTE: Integer arguments must fit into machine integers, otherwise these operations will fail. Integers passed for floating point arguments will be coerced to floating point values automatically.

    public extern get_int8 B I, get_int16 B I, get_int32 B I;
    public extern get_uint8 B I, get_uint16 B I, get_uint32 B I;
    public extern get_float B I, get_double B I;

    public extern put_int8 B I X, put_int16 B I X, put_int32 B I X;
    public extern put_uint8 B I X, put_uint16 B I X, put_uint32 B I X;
    public extern put_float B I X, put_double B I X;

Moreover, the following convenience functions are provided to convert between byte strings and lists of integer/floating point elements.

    public extern int8_list B, int16_list B, int32_list B;
    public extern uint8_list B, uint16_list B, uint32_list B;
    public extern float_list B, double_list B;

    public extern int8_vect Xs, int16_vect Xs, int32_vect Xs;
    public extern uint8_vect Xs, uint16_vect Xs, uint32_vect Xs;
    public extern float_vect Xs, double_vect Xs;

---

And a few examples:

    ==> def B = uint32_vect [100..110]

    ==> B
    <<ByteStr>>

    ==> uint32_list B
    [100,101,102,103,104,105,106,107,108,109,110]

    ==> get_uint32 B 1
    101

    ==> put_uint32 B 1 0xffffffff
    ()

    ==> uint32_list B
    [100,4294967295,102,103,104,105,106,107,108,109,110]

    ==> take 12 $ int8_list B
    [100,0,0,0,-1,-1,-1,-1,102,0,0,0]

    ==> float_vect [1..10]
    <<ByteStr>>

    ==> float_list _
    [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]

Please note that this is just the "bare bones" C interface, but building higher-level Q APIs on top of that should be a piece of cake now.

John, I hope that this is what you had in mind. It still lacks the "... and strings using a specified character code" part, though. Considering that a byte string might actually contain character data in an arbitrary encoding, it's not clear to me how that should be done, could you elaborate please?

Eddie, do you think that this will be good enough for the qcalc stats stuff and GSL interface we talked about a while ago? Maybe for efficiency there should be an additional function in your forthcoming CSV module to directly convert between numeric data in CSV format and C vectors as byte strings?

Cheers,
Albert

--
Dr. Albert Gr"af
Dept. of Music-Informatics, University of Mainz, Germany
Email: Dr....@t-..., ag...@mu...
WWW: http://www.musikinformatik.uni-mainz.de/ag
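For readers without a Q interpreter at hand, the element-relative indexing described in the blurb can be mimicked with Python's `struct` module. This is a sketch of the idea only, not the clib code: the Python functions reuse the Q names for readability, cover just the uint32 and int8 cases, and hard-code little-endian layout (`<`), which is an assumption chosen because it reproduces the `int8_list` output shown in the examples.

```python
import struct


def uint32_vect(xs):
    """Pack a list of integers into a mutable buffer of little-endian uint32 values."""
    return bytearray(struct.pack(f"<{len(xs)}I", *xs))


def uint32_list(b):
    """View the whole buffer as a list of uint32 values."""
    return list(struct.unpack(f"<{len(b) // 4}I", b))


def int8_list(b):
    """View the same buffer as a list of signed bytes."""
    return list(struct.unpack(f"<{len(b)}b", b))


def get_uint32(b, i):
    """Return the i-th 32-bit element, i.e. the value at byte offset 4*i."""
    return struct.unpack_from("<I", b, 4 * i)[0]


def put_uint32(b, i, x):
    """Overwrite the i-th 32-bit element in place."""
    struct.pack_into("<I", b, 4 * i, x)


b = uint32_vect(list(range(100, 111)))
second = get_uint32(b, 1)
put_uint32(b, 1, 0xFFFFFFFF)
print(second)
print(uint32_list(b))
print(int8_list(b)[:12])
```

The last line shows why the index is relative to the element type: element 1 of the uint32 view occupies bytes 4 to 7 of the int8 view, where the all-ones pattern reads as four times -1.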
From: John C. <co...@cc...> - 2008-01-18 07:14:51
|
Albert Graef scripsit:

> Welcome to Q, the better Perl. ;-) Seriously, Q does cover 100% of my
> scripting needs these days.

You should probably write a manual section, web page, and/or book chapter on "Q scripting", loosely defined as Q programs without any rewrite rules.

--
John Cowan co...@cc... http://www.ccil.org/~cowan
"But I am the real Strider, fortunately," he said, looking down at them with his face softened by a sudden smile. "I am Aragorn son of Arathorn, and if by life or death I can save you, I will." --LotR Book I Chapter 10
From: Albert G. <Dr....@t-...> - 2008-01-18 07:00:51
|
John Cowan wrote:
>> I get 2111 singlechar entities now. Does that sound right?
> That's what I get. I hope it's correct.

Ok, great. Otherwise we'll blame the W3C. :)

> That's a beautiful script you have there.

Welcome to Q, the better Perl. ;-) Seriously, Q does cover 100% of my scripting needs these days.

Albert

--
Dr. Albert Gr"af
Dept. of Music-Informatics, University of Mainz, Germany
Email: Dr....@t-..., ag...@mu...
WWW: http://www.musikinformatik.uni-mainz.de/ag
From: John C. <co...@cc...> - 2008-01-18 06:30:45
|
Albert Graef scripsit:

> I get 2111 singlechar entities now. Does that sound right?

That's what I get. I hope it's correct.

That's a beautiful script you have there.

--
John Cowan (who loves Asimov too) co...@cc... http://www.ccil.org/~cowan
When I'm stuck in something boring where reading would be impossible or rude, I often set up math problems for myself and solve them as a way to pass the time. --John Jenkins
From: John C. <co...@cc...> - 2008-01-18 06:24:58
|
Albert Graef scripsit:

> <!ENTITY DotDot " &#x020DC;" ><!--COMBINING FOUR DOTS ABOVE -->
>
> Is this really supposed to be a two-character combination?

Yes, it is. It is the character SPACE (U+0020) followed by a combining character, one which is nonspacing and normally sits above, below, left of, or right of another character called its base character. By convention, a nonspacing character placed on a SPACE character becomes the corresponding spacing character in appearance. (Unicode encodes both spacing and nonspacing versions of certain diacritics for backward compatibility; for example, there is both ^ and a COMBINING CIRCUMFLEX.)

> Because all I get from " \0x020DC" is a blank followed by the "four
> dots above" character.

That is either a font problem or a font rendering problem on your system, more probably the latter. Linux is considerably behind both Windows and OS X in getting basic i18n correct, although it provides more localizations (particularly into languages considered non-commercial by the others).

--
John Cowan http://www.ccil.org/~cowan co...@cc...
Evolutionary psychology is the theory that men are nothing but horn-dogs, and that women only want them for their money. --Susan McCarthy (adapted)
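The distinction between spacing and combining characters explained above can be inspected with Python's standard `unicodedata` module. The snippet is just an illustration of the Unicode properties involved; it says nothing about how any particular font or terminal renders the sequence.

```python
import unicodedata

dots = "\u20dc"  # the second character of the DotDot entity
print(unicodedata.name(dots))       # official Unicode character name
print(unicodedata.category(dots))   # general category; "M..." means a mark
print(unicodedata.combining(dots))  # nonzero combining class: attaches to a base

# The entity value is two code points: SPACE as the base, then the combining mark.
dotdot = " " + dots
print(len(dotdot), [f"U+{ord(c):04X}" for c in dotdot])

# Spacing and nonspacing versions of the circumflex are separate characters.
spacing = "^"
nonspacing = "\u0302"
print(unicodedata.name(spacing), "/", unicodedata.name(nonspacing))

# On a real base letter the combining form can compose into a single character.
composed = unicodedata.normalize("NFC", "e" + nonspacing)
print(composed, len(composed))
```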
From: Albert G. <Dr....@t-...> - 2008-01-18 02:43:35
|
John Cowan wrote:
> Sure. Note that & and < must be special-cased, because the definition of an
> entity may not contain an explicit & or <.

Ok, the corrected script is attached. I also updated cvs accordingly and uploaded a new tarball (in testing).

I get 2111 singlechar entities now. Does that sound right?

Albert

--
Dr. Albert Gr"af
Dept. of Music-Informatics, University of Mainz, Germany
Email: Dr....@t-..., ag...@mu...
WWW: http://www.musikinformatik.uni-mainz.de/ag
From: Albert G. <Dr....@t-...> - 2008-01-18 02:09:02
|
John Cowan wrote:
> Sure. Note that & and < must be special-cased, because the definition of an
> entity may not contain an explicit & or <.

Ah yes, thanks for pointing that out. I also noticed a few entities like the following:

    <!ENTITY DotDot " &#x020DC;" ><!--COMBINING FOUR DOTS ABOVE -->

Is this really supposed to be a two-character combination? Because all I get from " \0x020DC" is a blank followed by the "four dots above" character. It seems rather odd to define that as an entity, no?

Thanks,
Albert

--
Dr. Albert Gr"af
Dept. of Music-Informatics, University of Mainz, Germany
Email: Dr....@t-..., ag...@mu...
WWW: http://www.musikinformatik.uni-mainz.de/ag
From: John C. <co...@cc...> - 2008-01-18 00:52:58
|
Albert Graef scripsit:

> I've attached my Q script. It expects the w3centities.ent file in the
> current dir, output is written to w3centities.c. Could be interesting to
> compare the two scripts, if you're willing to share your Perl solution.

Sure. Note that & and < must be special-cased, because the definition of an entity may not contain an explicit & or <.

    #!/usr/bin/perl -w

    # Process W3 .ent file into tssl style
    # Sample input:
    # <!ENTITY AElig "&#x000C6;" ><!--LATIN CAPITAL LETTER AE -->
    # <entity name='AElig' codepoint='00C6'/>

    use strict;

    while (<>) {
        chomp;
        my ($entity, $name, $string) = split;
        next unless defined($entity);
        next unless $entity eq "<!ENTITY";   # reject cruft
        next if $name eq "%";                # sample declaration
        next unless length($string) == 11;   # reject non-singletons
        my $codepoint = substr($string, 4, 5);
        $codepoint = substr($codepoint, 1, 4) if substr($codepoint, 0, 1) eq "0";
        $codepoint = "0026" if $name eq "amp";
        $codepoint = "003C" if $name eq "lt";
        print " <entity name='$name' codepoint='$codepoint'/>\n";
        }

--
John Cowan http://www.ccil.org/~cowan co...@cc...
A mosquito cried out in his pain,
"A chemist has poisoned my brain!"
The cause of his sorrow
Was para-dichloro-
Diphenyltrichloroethane. (aka DDT)
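For comparison, the same filtering logic can be sketched in Python. This is not a line-by-line port: it assumes single-character entities are written as a five-digit hex reference such as `"&#x000C6;"` (which is what the Perl length check of 11 implies), it handles `amp` and `lt` by name before looking at the value, and the sample lines at the bottom are made up for the demonstration apart from the AElig and DotDot declarations quoted in this thread.

```python
import re

# Assumption: a single-character entity value looks like "&#xHHHHH;" including quotes.
SINGLE_CHAR = re.compile(r'"&#x([0-9A-Fa-f]{5});"')

# & and < cannot appear literally in an entity definition, so they are special-cased.
SPECIAL = {"amp": "0026", "lt": "003C"}


def entity_lines(lines):
    """Yield TagSoup-style <entity .../> lines for single-character entities."""
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] != "<!ENTITY":
            continue  # reject cruft
        name, value = fields[1], fields[2]
        if name == "%":
            continue  # parameter entity declaration
        if name in SPECIAL:
            codepoint = SPECIAL[name]
        else:
            match = SINGLE_CHAR.fullmatch(value)
            if match is None:
                continue  # reject non-singletons such as DotDot
            codepoint = match.group(1)
            if codepoint.startswith("0"):
                codepoint = codepoint[1:]  # 000C6 -> 00C6
        yield f" <entity name='{name}' codepoint='{codepoint}'/>"


sample = [
    '<!ENTITY AElig "&#x000C6;" ><!--LATIN CAPITAL LETTER AE -->',
    '<!ENTITY DotDot " &#x020DC;" ><!--COMBINING FOUR DOTS ABOVE -->',
    '<!ENTITY amp "&#38;#38;" ><!--AMPERSAND -->',
    "not an entity declaration",
]
result = list(entity_lines(sample))
print("\n".join(result))
```

DotDot drops out because splitting on whitespace leaves only the opening quote as the third field, which is the same reason the Perl length test rejects it.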
From: Albert G. <Dr....@t-...> - 2008-01-17 17:58:37
|
John Cowan wrote:
> I'll let you know, as I'll be updating TagSoup as well.

Great, many thanks!

> Just what I did, except that being in a hurry I wrote it in Perl.

I've attached my Q script. It expects the w3centities.ent file in the current dir, output is written to w3centities.c. Could be interesting to compare the two scripts, if you're willing to share your Perl solution.

Cheers,
Albert

--
Dr. Albert Gr"af
Dept. of Music-Informatics, University of Mainz, Germany
Email: Dr....@t-..., ag...@mu...
WWW: http://www.musikinformatik.uni-mainz.de/ag
From: John C. <co...@cc...> - 2008-01-17 09:35:55
|
Albert Graef scripsit:

> BTW, John, thanks for spotting this. That W3C draft just came out,
> what a lucky coincidence. ;-)

Indeed. Someone's blog pointed me to it, I'm not sure who, and then I incorporated it into the latest release of my TagSoup parser, a SAX parser written in Java that processes arbitrary HTML rather than XML. (plug: see http://tagsoup.info ).

> If you happen to keep an eye on this, it would be nice if you could
> let me know when the draft gets revised, so that the support in Q can
> be updated accordingly.

I'll let you know, as I'll be updating TagSoup as well.

> (I wrote a little Q script to generate the C code in src/w3centities.c
> automatically from the .ent file, which makes this easy. The script
> isn't included in the sources right now, but if anyone wants to have
> it, just let me know.)

Just what I did, except that being in a hurry I wrote it in Perl.

> Rob Hubbard wrote:
> > I'd strip the historical duplicates.
>
> I left them in. The full list of names is just some 15KB now, not a
> big deal even on embedded devices nowadays.
>
> > I think its okay for an entity to have more than one character.
>
> I only included the single-char entities for now. This simplifies the
> implementation, and is also consistent with the other escapes which
> all represent single Unicode characters. If this is a problem then
> please let me know.

I made the same decisions.

--
John Cowan http://www.ccil.org/~cowan co...@cc...
Please leave your values at the front desk. --sign in Paris hotel
Check your assumptions. In fact, check your assumptions at the door. --Cordelia Vorkosigan
From: Albert G. <Dr....@t-...> - 2008-01-17 08:32:41
|
This is unrelated, but I took the opportunity to also update the uchar properties table to the latest from ICU 3.8. (Note that this is used to implement the Unicode char type predicates like isalpha.) New tarball at http://sourceforge.net/project/showfiles.php?group_id=96881&package_id=188958 -- Dr. Albert Gr"af Dept. of Music-Informatics, University of Mainz, Germany Email: Dr....@t-..., ag...@mu... WWW: http://www.musikinformatik.uni-mainz.de/ag |
From: Albert G. <Dr....@t-...> - 2008-01-17 07:14:28
|
John Cowan wrote:
> Once you have stripped comments and entities with more than one character
> in them, you have a list of 2114 short, plausible names for 1509 useful
> Unicode characters.

This is what is implemented now.

BTW, John, thanks for spotting this. That W3C draft just came out, what a lucky coincidence. ;-) If you happen to keep an eye on this, it would be nice if you could let me know when the draft gets revised, so that the support in Q can be updated accordingly. (I wrote a little Q script to generate the C code in src/w3centities.c automatically from the .ent file, which makes this easy. The script isn't included in the sources right now, but if anyone wants to have it, just let me know.)

Rob Hubbard wrote:
> I'd strip the historical duplicates.

I left them in. The full list of names is just some 15KB now, not a big deal even on embedded devices nowadays.

> I think its okay for an entity to have more than one character.

I only included the single-char entities for now. This simplifies the implementation, and is also consistent with the other escapes which all represent single Unicode characters. If this is a problem then please let me know.

Albert

--
Dr. Albert Gr"af
Dept. of Music-Informatics, University of Mainz, Germany
Email: Dr....@t-..., ag...@mu...
WWW: http://www.musikinformatik.uni-mainz.de/ag
From: Albert G. <Dr....@t-...> - 2008-01-16 23:18:07
|
I'm resending this in latin1, so that it doesn't end up in junk mail folders. ;-) BTW, does anyone know why Thunderbird 1.5 converts messages sent as utf-8 to base64? That's rather inconvenient.

John Cowan wrote:
> In particular, the W3C has just released a draft set of unified
> character entities from XHTML, MathML, and the ISO sets: see the draft at
> http://www.w3.org/TR/2007/WD-xml-entity-names-20071214/ and the unified
> list at http://www.w3.org/2003/entities/2007/w3centities-f.ent .

Ok, this is in cvs now. I also made available a tarball (snapshot of current cvs) in testing:

http://sourceforge.net/project/showfiles.php?group_id=96881&package_id=188958

Here's the blurb from the manual:

As of version 7.11 and later, the interpreter also supports symbolic character escapes of the form `\&NAME;', where NAME is any of the XML single character entity names specified in the "XML Entity definitions for Characters", see `http://www.w3.org/TR/xml-entity-names/'. Note that, at the time of this writing, this is still a W3C working draft, so the supported entity names may be subject to change until the final specification comes out; the currently supported entities are described in the draft from 14 December 2007, see `http://www.w3.org/TR/2007/WD-xml-entity-names-20071214/'. Also note that multi-character entities are _not_ supported in this implementation.

Examples:

    ==> "Gr\&auml;f"
    "Gräf"

    ==> "Gr\&junk;f"
    ! Invalid character escape in string constant
    >>> "Gr\&junk;f"
           ^

    ==> puts "The letter \&phgr; is the 21st letter in the Greek alphabet.\n"
    The letter ? is the 21st letter in the Greek alphabet.
    ()

Enjoy, and please let me know if there's anything that doesn't appear to work right.

Cheers,
Albert

--
Dr. Albert Gr"af
Dept. of Music-Informatics, University of Mainz, Germany
Email: Dr....@t-..., ag...@mu...
WWW: http://www.musikinformatik.uni-mainz.de/ag
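The behaviour described in the blurb (known single-character names expand, unknown names and multi-character entities are rejected) can be approximated in Python as shown below. This is a sketch, not the interpreter's implementation: it uses the HTML5 entity table from Python's standard `html.entities` module as a stand-in for the W3C list, so the set of accepted names differs from Q's (for example, ISO names like `phgr` are not in that table), and the function name `expand_escapes` is made up for the illustration.

```python
import re
from html.entities import html5  # maps names such as "auml;" to their characters

# Matches the \&NAME; escape form described in the manual blurb.
ESCAPE = re.compile(r"\\&([A-Za-z][A-Za-z0-9]*);")


def expand_escapes(text: str) -> str:
    """Replace each \\&NAME; escape by its character; reject unknown or multi-char names."""
    def replace(match):
        value = html5.get(match.group(1) + ";")
        if value is None or len(value) != 1:
            raise ValueError(f"invalid character escape: {match.group(0)}")
        return value

    return ESCAPE.sub(replace, text)


print(expand_escapes(r"Gr\&auml;f"))

for bad in (r"Gr\&junk;f", r"\&fjlig;"):  # unknown name, then a two-character entity
    try:
        expand_escapes(bad)
    except ValueError as err:
        print(err)
```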