[clisp-list] (make-encoding [...] :translit) status & alternatives?

TL;DR - Can you point me to Lisp code that does, or even some table
that has, transliterations to ASCII for some useful set of common
non-ASCII characters, especially those in ISO-8859-1 (Latin-1)?

WHAT I'M REALLY DOING:

I'm extracting descriptive statistics from hundreds of files
containing English text encoded in ISO-8859-1 (Latin-1).  Almost all
of this is ASCII.  For obscure technical reasons this is done by a
process whose *ERROR-OUTPUT* must be ASCII.  I occasionally need to
report errors processing some line, & if the line isn't all ASCII,
describing the line causes the error reporting process to error out.

WHAT I TRIED SO FAR:

I Googled it, of course!  I typed a question into an ordinary
Google search box and got a fascinating answer in the AI Overview.

==================== query in Google search box =====================^
How can I change the output-error-action
for an encoding in Common Lisp?
=============== start of complete Google AI overview ================|
To change the output-error-action for an encoding in Common Lisp,
you need to use the EXT:MAKE-ENCODING function, which is an
implementation-specific extension primarily used in CLISP.  This
allows you to specify how streams handle characters that cannot be
represented by the chosen character set during output operations.

The :output-error-action keyword argument
controls what happens when an output error occurs:

:error - Signals an error if an unencodable character is output (the default).

:ignore - Ignores the character (it is dropped from the output).

:translit - Translates/transliterates the
character to a close equivalent if possible.

Or you can provide a specific replacement character.

Example Usage

Here is how you define a new encoding that ignores unencodable characters or
falls back to transliteration:

--------------------------------
;; Define an encoding that drops unencodable characters
(defvar *my-encoding*
  (ext:make-encoding :charset 'charset:iso-8859-1
                     :line-terminator :unix
                     :output-error-action :ignore))

;; Define an encoding that transliterates characters if possible
(defvar *my-translit-encoding*
  (ext:make-encoding :charset 'charset:iso-8859-1
                     :line-terminator :unix
                     :output-error-action :translit))
--------------------------------

Applying the Encoding

To use this new encoding, you pass the resulting encoding object
as the :external-format argument when opening a file or stream:

--------------------------------
(with-open-file (stream "output.txt"
                        :direction :output
                        :if-exists :supersede
                        :external-format *my-encoding*)
  (write-string "Hello, world!" stream))
--------------------------------
Use code with caution.

Modifying Global System Settings

If you want to change the encoding settings for the standard global I/O
streams in CLISP, you can bind or modify the corresponding variables using
EXT:LETF or EXT:LETF* (as these are typically system symbol-macros):

--------------------------------
(ext:letf ((*pathname-encoding* *my-encoding*))
  (print "Your output goes here"))
--------------------------------
Use code with caution.

For more exact details on supported charsets and their
behaviors, consult the official CLISP Encodings Documentation.
================ end of complete Google AI overview =================$

"Fascinating is a word I use for the unexpected."  -- Spock, STTOS 1x17

Why this was fascinating:
- I didn't mention CLISP in my prompt.
  - How did Google AI know I was using CLISP?  Spooky.
- The answer told me *everything* I needed
  in order to code exactly what I wanted.
- The answer told me about :TRANSLIT, which I hadn't
  known about because it wasn't in the impnotes.  Hmm,
  I thought, maybe I'll submit a documentation patch.
- When I happily coded what I needed, it didn't work,
  because CLISP didn't know about :TRANSLIT either.

OK, I thought, this is Debian Bookworm (oldstable) so
it's CLISP v2.49.93+, and I guess Debian Trixie (stable)
must have it.  ...Nope, Trixie has the same version.

Sigh.  How about Fedora?  That has v2.49.95+,
but that still doesn't know about :TRANSLIT.

Heavier sigh.  I need to install and build HEAD?!

But then I found this:
======== from <https://sourceforge.net/p/clisp/feature-requests/51/> =========^
#51 add :output-error-action :translit to make-encoding
   Milestone: None
   Status: open
   Owner: Bruno Haible
   Labels: None
   Priority: 5
   Updated: 2017-04-14
   Created: 2017-04-14
   Creator: Sam Steingold
   Private: No

It would be nice if make-encoding accepted :output-error-action
:translit, using the //TRANSLIT suffix on platforms
which support it (at least GNU libc & GNU libiconv).

See
  - http://clisp.org/impnotes/encoding.html
  - https://sourceforge.net/p/clisp/mailman/message/35787517/
  - https://www.gnu.org/software/libiconv/
====== from <https://sourceforge.net/p/clisp/mailman/message/35787517/> ======|
Re: make-encoding error-action translit?
From: Bruno H. <br...@cl...> - 2017-04-13 23:35:51

Hi Sam,

> Why doesn't make-encoding support
> :input-error-action/:output-error-action value :translit?

You are right, make-encoding :output-error-action :translit
would make a lot of sense when implemented through the
//TRANSLIT suffix (available in GNU libc and GNU libiconv).

make-encoding :input-error-action :translit would be nonsense.

Why clisp doesn't support it?  Purely for historical reasons: Because
clisp's make-encoding function was designed before GNU libiconv and
before I realized how useful the //TRANSLIT suffix in glibc is.

Bruno
==============================================================================$
And that was 9 years ago, & it's still "Status: Open".

So I have questions.

If it is the case that CLISP does not now support and has not
ever supported :TRANSLIT, whence came Google AI's astonishingly
lucid confabulation (a better term than "hallucination")?

If there *is* some Lisp (especially a CLISP) out there that supports
:TRANSLIT, where is it, where's it documented, how can I get it, etc?
And is its documentation a plausible source of Google AI's answer?

Back in this universe:

Wouldn't it be easier for me to just wrap my error
output functions with something handling encoding
errors, so I can do transliterations by hand?

Here's my error output code:
=====================================================================^
(defmacro show (&rest forms)
  `(handler-case
       (progn ,@(loop for form in forms
                      collect `(format *error-output*
                                       "~&~S: ~S~%"
                                       ',form
                                       ,form)))
     (error (e)
       (format *error-output* "~&E: ~S" e))))

(defvar *debug-level* 0
  "How much debugging information should be output")

(defmacro dbg (&rest forms)
  (let ((cutoff (if (realp (car forms))
                  (pop forms)
                  1)))
    `(when (>= *debug-level* ,cutoff)
       ,(if (stringp (car forms))
          `(handler-case
               (format *error-output* ,@forms)
             (error (e)
               (format *error-output* "~&E: ~S" e)))
          `(show ,@forms)))))
=====================================================================$

That allows, for example,
  (show x y z)
to produce 3 lines of output,
  X: [value of X]
etc; or something like
  (dbg 2 "~&Checkpoint 8")
to output "Checkpoint 8" only if at least -DD was on the command line, while
  (dbg "~&Checkpoint 8")
outputs "Checkpoint 8" only if at least -D was on the command line.

And they can be combined:
  (dbg 3 x y z)
does (show x y z) only if at least -DDD was on the command line.

So, for example, running
  (when t
    (setf x (code-char 222))
    (show x)
    (handler-case
        (princ x)
      (error (e)
        (format *error-output* "~&E: ~S" e)
        (when nil
          (format *error-output* "~&E: ~A" e))))
    (bye 66))
gives output
  --------------------------------
  X: #\LATIN_CAPITAL_LETTER_THORN
  E: #<SIMPLE-CHARSET-TYPE-ERROR #x00001000003276D1>
  --------------------------------
which is correct.

But changing the "when nil" above to "when t" gives
  --------------------------------
  X: #\LATIN_CAPITAL_LETTER_THORN
  E: #<SIMPLE-CHARSET-TYPE-ERROR #x00001000003276D1>
  E:
  PRINC: Character #\u00DE cannot be represented in the character set CHARSET:ASCII
  --------------------------------
because the final FORMAT tried to write a
non-ASCII character to an ASCII stream.
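
One way to sidestep that, as a minimal sketch: build the message in
memory first (no encoding is involved when formatting to a string),
escape whatever isn't ASCII, and only then write it to the stream.
ASCII-ESCAPE and ASCII-FORMAT are names I made up for illustration;
nothing here is CLISP-specific, it only assumes that CHAR-CODE
returns Unicode code points.

```lisp
;; Hypothetical helpers, not part of CLISP or any library.

(defun ascii-escape (string)
  "Return a copy of STRING with each non-ASCII character replaced by \\uXXXX."
  (with-output-to-string (out)
    (loop for ch across string
          do (if (< (char-code ch) 128)
                 (write-char ch out)
                 ;; Four hex digits, zero-padded, e.g. \u00DE for thorn.
                 (format out "\\u~4,'0X" (char-code ch))))))

(defun ascii-format (stream control &rest args)
  "Like FORMAT to STREAM, but the text is built in a string first and
escaped to ASCII before anything is written to STREAM."
  (write-string (ascii-escape (apply #'format nil control args)) stream)
  nil)
```

With this, (ascii-format *error-output* "E: ~A" e) should only ever
write ASCII.  One wrinkle: a leading ~& is evaluated against the
in-memory string, not the real stream, so call FRESH-LINE on the
stream yourself if you rely on it.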

I'm planning to hook (format "~A") of ISO-8859-1 characters so
that when a charset error occurs, I can go through the string
character-by-character, transliterating to some appropriate
ASCII string, such as "u:" for "u" with an umlaut.  Because
the text is mostly ASCII, that will only rarely happen,
so the impact on efficiency-of-output will be minimal.
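
Absent anything better, here is a rough sketch of the shape I have in
mind.  The table is my own ad hoc choice of accent-stripping
replacements, not taken from any standard or from iconv; entries can
be swapped for "u:"-style ones to taste.  *LATIN1-TO-ASCII*,
TRANSLITERATE-CHAR and TRANSLITERATE-STRING are made-up names, and
code points are used instead of literal characters so that the
source file itself stays ASCII.

```lisp
;; Hypothetical table: ASCII stand-ins for code points 160..255 (#xA0..#xFF),
;; eight entries per row, indexed by (- code 160).  Ad hoc choices.
(defparameter *latin1-to-ascii*
  #(" "   "!"   "c"   "GBP" "CUR" "JPY" "|"   "Sec."  ; #xA0-#xA7
    "\""  "(C)" "a"   "<<"  "~"   "-"   "(R)" "-"     ; #xA8-#xAF
    "deg" "+/-" "2"   "3"   "'"   "u"   "P"   "."     ; #xB0-#xB7
    ","   "1"   "o"   ">>"  "1/4" "1/2" "3/4" "?"     ; #xB8-#xBF
    "A"   "A"   "A"   "A"   "A"   "A"   "AE"  "C"     ; #xC0-#xC7
    "E"   "E"   "E"   "E"   "I"   "I"   "I"   "I"     ; #xC8-#xCF
    "D"   "N"   "O"   "O"   "O"   "O"   "O"   "x"     ; #xD0-#xD7
    "O"   "U"   "U"   "U"   "U"   "Y"   "TH"  "ss"    ; #xD8-#xDF
    "a"   "a"   "a"   "a"   "a"   "a"   "ae"  "c"     ; #xE0-#xE7
    "e"   "e"   "e"   "e"   "i"   "i"   "i"   "i"     ; #xE8-#xEF
    "d"   "n"   "o"   "o"   "o"   "o"   "o"   "/"     ; #xF0-#xF7
    "o"   "u"   "u"   "u"   "u"   "y"   "th"  "y")    ; #xF8-#xFF
  "ASCII replacement strings for the high half of ISO-8859-1.")

(defun transliterate-char (ch)
  "Return an ASCII string standing in for the character CH."
  (let ((code (char-code ch)))
    (cond ((< code 128) (string ch))             ; already ASCII
          ((<= 160 code 255)                     ; printable Latin-1 high half
           (aref *latin1-to-ascii* (- code 160)))
          (t (format nil "\\x~2,'0X" code)))))   ; C1 controls, non-Latin-1

(defun transliterate-string (string)
  "Return STRING with every character passed through TRANSLITERATE-CHAR."
  (with-output-to-string (out)
    (loop for ch across string
          do (write-string (transliterate-char ch) out))))
```

For example, (transliterate-string (string (code-char 222))) gives
"TH" with this table.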

Can anybody offer code that already does that transliteration,
or even just a pointer to a convenient table of transliterations
to ASCII of the characters in the high half of ISO-8859-1?