Re: get-charset-range

Sam wrote:
> we have a problem with get-charset-range:
> on mingw during bootstrap people have been complaining of the error:
> (ERROR_INVALID_HANDLE): The handle is invalid.
> during load of subtypep.lisp.
> this has been traced down to iconv_range().
> this patch fixes the problem:
> 
> --- stream.d	19 May 2005 19:05:33 -0400	1.519
> +++ stream.d	27 May 2005 14:26:56 -0400	
> @@ -4335,7 +4335,11 @@
>            var size_t res = iconv(cd,&inptr,&insize,&outptr,&outsize);
>            if (res == (size_t)(-1)) {
>              var int my_errno = OS_errno;
> -            if (my_errno == EILSEQ) { /* invalid input? */
> +            if (my_errno == EILSEQ
> +               #ifdef WIN32_NATIVE
> +                || my_errno == ERROR_INVALID_HANDLE
> +               #endif
> +                ) { /* invalid input? */
>                end_system_call();
>                # ch not encodable -> finish the interval
>                if (have_i1_i2) {

This patch is nonsense. It replaces one piece of broken code with another.

I explained twice already: iconv() is a function specified in POSIX, and
it sets its error value in 'errno', not in 'GetLastError()'. The current
code is broken in two ways:
  1. It uses OS_errno where it should use errno,
  2. It uses OS_error() for signalling the error, where it should use
     ANSIC_error() or POSIX_error().
I'm now committing a patch that fixes both of these.
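
To illustrate the errno point, here is a minimal standalone sketch in plain
C. It is not the code from stream.d. The helper name try_convert() and its
return convention are made up for this example, and it assumes a
POSIX-style iconv() whose second argument is 'char **' (some libiconv
builds declare it 'const char **'). The only error source consulted is
errno, saved right after the call so that iconv_close() cannot clobber it.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Converts LEN bytes of UTF-8 input to CHARSET.
   Returns  1 if the conversion succeeded,
            0 if iconv() failed with errno == EILSEQ,
           -1 on any other failure (including iconv_open() failure). */
static int try_convert (const char *charset, const char *input, size_t len)
{
  iconv_t cd = iconv_open(charset, "UTF-8");
  if (cd == (iconv_t)(-1))
    return -1;

  char outbuf[16];
  char *inptr = (char *)input;   /* iconv() does not modify the input bytes */
  size_t insize = len;
  char *outptr = outbuf;
  size_t outsize = sizeof(outbuf);

  size_t res = iconv(cd, &inptr, &insize, &outptr, &outsize);
  /* Save errno immediately: this is where POSIX says the error code is.
     GetLastError() is not involved, even on Windows. */
  int saved_errno = errno;
  iconv_close(cd);

  if (res == (size_t)(-1))
    return (saved_errno == EILSEQ) ? 0 : -1;
  return 1;
}
```

With this sketch, try_convert("ASCII", "A", 1) should report success, and a
lone 0xFF byte, which is never valid UTF-8, should be reported through
EILSEQ.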

Now, if it still doesn't work for you or Yaroslav Kavenchuk, you must be
using an iconv() library that is not POSIX compliant or that was built with
incompatible flags. After more than two months of hassles, I don't want to
hear any more about it. It's not worth the time, since clisp already
supports the most important encodings natively, without iconv(). So I'm
adding a "#undef HAVE_ICONV" in win32.d.

> OTOH, the whole notion of precomputing these ranges is suspect.

?! It's the principle of ahead-of-time compilation.

> there appears to be no real reason to pre-compute the cache during
> bootstrap:
> con:
> 3. risk of error for no gain: most people do not need this cache.

On the contrary: this is a PRO. It allows us to guarantee that the binding
to iconv() doesn't lead to a clisp that crashes. It allows us to add the
right #ifdefs with __GLIBC_MINOR__ without waiting for 10 users to report
crashes.

> thus I propose the additional patch:
> 
> --- subtypep.lisp	09 Feb 2005 09:39:53 -0500	1.12
> +++ subtypep.lisp	27 May 2005 14:50:13 -0400	
> @@ -1170,9 +1170,6 @@
>        type)))
>  ;; Conversion of an encoding to a list of intervals.
>  #+UNICODE
> -(let ((table (make-hash-table :key-type '(or string symbol) :value-type 'simple-string
> -                              :test 'stablehash-equal :warn-if-needs-rehash-after-gc t)))
> -  ;; cache: charset name -> list of intervals #(start1 end1 ... startm endm)
>    #| ; Now in C and much more efficient.
>    (defun charset-range (encoding start end)
>      (setq start (char-code start))
> @@ -1192,19 +1189,16 @@
>    ;; Return the definition range of a character set. If necessary, compute it
>    ;; and store it in the cache.
>    (defun get-charset-range (charset &optional maxintervals)
> +  (let ((table #.(make-hash-table :key-type '(or string symbol)
> +                                  :value-type 'simple-string
> +                                  :test 'stablehash-equal
> +                                  :warn-if-needs-rehash-after-gc t)))
> +    ;; cache: charset name -> list of intervals #(start1 end1 ... startm endm)
>      (or (gethash charset table)
>          (setf (gethash charset table)
>                (charset-range (make-encoding :charset charset)
>                               (code-char 0) (code-char (1- char-code-limit))
> -                             maxintervals))))
> -  ;; Fill the cache, but cache only the results with small lists of intervals.
> -  ;; Some iconv based encodings have large lists of intervals (up to 5844
> -  ;; intervals for ISO-2022-JP-2) which are rarely used and not worth caching.
> -  (do-external-symbols (sym (find-package "CHARSET"))
> -    (let* ((charset (encoding-charset (symbol-value sym)))
> -           (computed-range (get-charset-range charset 100))
> -           (intervals (/ (length computed-range) 2)))
> -      (when (>= intervals 100) (remhash charset table)))))
> +                             maxintervals)))))
>  #| ;; Older code for a special case.

Don't do this. This is bad.
  1. It removes a build-time verification, leading to possible crashes at
     runtime.
  2. #. and #, are BAD. Never use them if you can avoid them.

Bruno