From: Colin P. A. <co...@co...> - 2007-02-10 14:13:59
>>>>> "Colin" == Colin Paul Adams <co...@co...> writes:

>>>>> "Eric" == Eric Bezault <er...@go...> writes:

    Eric> UTF-8 byte representation? Even if it's a multibyte, you can
    Eric> replace [\i:] by (multibyte1|...|multibyten|:) and likewise
    Eric> for similar regexp constructs.

    Colin> While that is possible, I think the resultant string for
    Colin> \i, \c and similar properties will be of the order of 20KB.
    Colin> I don't know how the regular expression engine works, but
    Colin> if it needs to compare a space character (for instance)
    Colin> with each of 20000 characters in order to reject a test, I
    Colin> think it will be far too inefficient.

I wrote a test program to measure this. For XML 1.1, the equivalent to
\c is 3830417 bytes long. This is definitely too big, so something is
wrong with the test program. Can anyone see where the fault is (I know
it's not the most efficient way of doing it):

    class TEST

    inherit

        UC_UNICODE_FACTORY

        UC_UNICODE_CONSTANTS

        XM_UNICODE_CHARACTERS_1_1

        KL_IMPORTED_STRING_ROUTINES

    create

        make

    feature {NONE} -- Initialization

        make is
                -- Test byte count of equivalent regexp to [\c]+.
            local
                i: INTEGER
                l_regexp: STRING
            do
                from
                    l_regexp := ""
                    i := 1
                until
                    i > maximum_unicode_character_code
                loop
                    if is_name_char (i) then
                        l_regexp := STRING_.appended_string (l_regexp, new_unicode_string_filled_code (i, 1))
                    end
                    i := i + 1
                end
                print (utf8.to_utf8 (l_regexp).count.out + "%N")
            end

    end

--
Colin Adams
Preston Lancashire
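One way to cross-check the 3830417 figure without going through the Eiffel classes at all is to add up the UTF-8 lengths straight from the ranges in the XML 1.1 NameChar production. The sketch below does that in Python. It assumes that is_name_char implements exactly that production; the range table and the helper names (utf8_length, count_by_length, total_bytes) are written out here for illustration and are not taken from the Gobo sources.

```python
# Code point ranges (inclusive) of the XML 1.1 NameChar production,
# i.e. NameStartChar plus "-", ".", 0-9, #xB7, #x300-#x36F and #x203F-#x2040.
NAME_CHAR_RANGES = [
    (0x2D, 0x2E),        # "-" and "."
    (0x30, 0x39),        # 0-9
    (0x3A, 0x3A),        # ":"
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # "_"
    (0x61, 0x7A),        # a-z
    (0xB7, 0xB7),
    (0xC0, 0xD6),
    (0xD8, 0xF6),
    (0xF8, 0x2FF),
    (0x300, 0x36F),
    (0x370, 0x37D),
    (0x37F, 0x1FFF),
    (0x200C, 0x200D),
    (0x203F, 0x2040),
    (0x2070, 0x218F),
    (0x2C00, 0x2FEF),
    (0x3001, 0xD7FF),
    (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]


def utf8_length(code):
    """Number of bytes UTF-8 needs to encode the code point `code`."""
    if code < 0x80:
        return 1
    if code < 0x800:
        return 2
    if code < 0x10000:
        return 3
    return 4


def count_by_length(ranges):
    """Map each UTF-8 length (1..4) to the number of code points using it."""
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    for first, last in ranges:
        for code in range(first, last + 1):
            counts[utf8_length(code)] += 1
    return counts


def total_bytes(ranges):
    """Total UTF-8 byte count of one occurrence of every code point."""
    return sum(length * n for length, n in count_by_length(ranges).items())


if __name__ == "__main__":
    print(count_by_length(NAME_CHAR_RANGES))
    print(total_bytes(NAME_CHAR_RANGES))  # 3830417
```

Under that assumption the total comes out at 3830417 as well, so the figure looks consistent with the production rather than with a fault in the test program. Almost all of it comes from the single range [#x10000-#xEFFFF]: 917504 code points at 4 bytes each is 3670016 bytes, while everything in the BMP contributes only 160401 bytes.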