From: Will M. <se...@es...> - 2026-05-26 20:38:58
TL;DR - Can you point me to Lisp code that does, or even some table
that has, transliterations to ASCII for some useful set of common
non-ASCII characters, especially those in ISO-8859-1 (Latin-1)?

WHAT I'M REALLY DOING:
I'm extracting descriptive statistics from hundreds of files
containing English text encoded in ISO-8859-1 (Latin-1). Almost all
of this is ASCII. For obscure technical reasons this is done by a
process whose *ERROR-OUTPUT* must be ASCII. I occasionally need to
report an error in processing some line, & if that line isn't all ASCII,
printing it makes the error reporting itself signal an error.

WHAT I TRIED SO FAR:
I Googled it, of course! I typed a question into an ordinary
Google search box and got a fascinating answer in the AI Overview.
==================== query in Google search box =====================^
How can I change the output-error-action
for an encoding in Common Lisp?
=============== start of complete Google AI overview ================|
To change the output-error-action for an encoding in Common Lisp,
you need to use the EXT:MAKE-ENCODING function, which is an
implementation-specific extension primarily used in CLISP. This
allows you to specify how streams handle characters that cannot be
represented by the chosen character set during output operations.
The :output-error-action keyword argument
controls what happens when an output error occurs:
:error - Signals an error if an unencodable character is output (the default).
:ignore - Ignores the character (it is dropped from the output).
:translit - Translates/transliterates the
character to a close equivalent if possible.
Or you can provide a specific replacement character.
Example Usage
Here is how you define a new encoding that ignores unencodable characters or
falls back to transliteration:
--------------------------------
;; Define an encoding that drops unencodable characters
(defvar *my-encoding*
  (ext:make-encoding :charset 'charset:iso-8859-1
                     :line-terminator :unix
                     :output-error-action :ignore))
;; Define an encoding that transliterates characters if possible
(defvar *my-translit-encoding*
  (ext:make-encoding :charset 'charset:iso-8859-1
                     :line-terminator :unix
                     :output-error-action :translit))
--------------------------------
Applying the Encoding
To use this new encoding, you pass the resulting encoding object
as the :external-format argument when opening a file or stream:
--------------------------------
(with-open-file (stream "output.txt"
                        :direction :output
                        :if-exists :supersede
                        :external-format *my-encoding*)
  (write-string "Hello, world!" stream))
--------------------------------
Use code with caution.
Modifying Global System Settings
If you want to change the encoding settings for the standard global I/O
streams in CLISP, you can bind or modify the corresponding variables using
EXT:LETF or EXT:LETF* (as these are typically system symbol-macros):
(ext:letf ((*pathname-encoding* *my-encoding*))
  (print "Your output goes here"))
--------------------------------
Use code with caution.
For more exact details on supported charsets and their
behaviors, consult the official CLISP Encodings Documentation.
================ end of complete Google AI overview =================$
"Fascinating is a word I use for the unexpected." -- Spock, STTOS 1x17
Why this was fascinating:
- I didn't mention CLISP in my prompt.
- How did Google AI know I was using CLISP? Spooky.
- The answer told me *everything* I needed
in order to code exactly what I wanted.
- The answer told me about :TRANSLIT, which I hadn't
known about because it wasn't in the impnotes. Hmm,
I thought, maybe I'll submit a documentation patch.
- When I happily coded what I needed, it didn't work,
because CLISP didn't know about :TRANSLIT either.
OK, I thought, this is Debian Bookworm (oldstable) so
it's CLISP v2.49.93+, and I guess Debian Trixie (stable)
must have it. ...Nope, Trixie has the same version.
Sigh. How about Fedora? That has v2.49.95+,
but that still doesn't know about :TRANSLIT.
Heavier sigh. I need to install and build HEAD?!
But then I found this:
======== from <https://sourceforge.net/p/clisp/feature-requests/51/> =========^
#51 add :output-error-action :translit to make-encoding
Milestone: None
Status: open
Owner: Bruno Haible
Labels: None
Priority: 5
Updated: 2017-04-14
Created: 2017-04-14
Creator: Sam Steingold
Private: No
It would be nice if make-encoding accepted :output-error-action
:translit, using the //TRANSLIT suffix on platforms
which support it (at least GNU libc & GNU libiconv).
See
- http://clisp.org/impnotes/encoding.html
- https://sourceforge.net/p/clisp/mailman/message/35787517/
- https://www.gnu.org/software/libiconv/
====== from <https://sourceforge.net/p/clisp/mailman/message/35787517/> ======|
Re: make-encoding error-action translit?
From: Bruno H. <br...@cl...> - 2017-04-13 23:35:51
Hi Sam,
> Why doesn't make-encoding support
> :input-error-action/:output-error-action value :translit?
You are right, make-encoding :output-error-action :translit
would make a lot of sense when implemented through the
//TRANSLIT suffix (available in GNU libc and GNU libiconv).
make-encoding :input-error-action :translit would be nonsense.
Why clisp doesn't support it? Purely for historical reasons: Because
clisp's make-encoding function was designed before GNU libiconv and
before I realized how useful the //TRANSLIT suffix in glibc is.
Bruno
==============================================================================$
And that was 9 years ago, & it's still "Status: Open".
So I have questions.
If it is the case that CLISP does not now support and has not
ever supported :TRANSLIT, whence came Google AI's astonishingly
lucid confabulation (a better term than "hallucination")?
If there *is* some Lisp (especially a CLISP) out there that supports
:TRANSLIT, where is it, where's it documented, how can I get it, etc?
And is its documentation a plausible source of Google AI's answer?
Back in this universe:
Wouldn't it be easier for me to just wrap my error
output functions with something handling encoding
errors, so I can do transliterations by hand?
Here's my error output code:
=====================================================================^
(defmacro show (&rest forms)
  `(handler-case
       (progn ,@(loop for form in forms
                      collect `(format *error-output*
                                       "~&~S: ~S~%"
                                       ',form
                                       ,form)))
     (error (e)
       (format *error-output* "~&E: ~S" e))))

(defvar *debug-level* 0
  "How much debugging information should be output")

(defmacro dbg (&rest forms)
  (let ((cutoff (if (realp (car forms))
                    (pop forms)
                    1)))
    `(when (>= *debug-level* ,cutoff)
       ,(if (stringp (car forms))
            `(handler-case
                 (format *error-output* ,@forms)
               (error (e)
                 (format *error-output* "~&E: ~S" e)))
            `(show ,@forms)))))
=====================================================================$
That allows, for example,
(show x y z)
to produce 3 lines of output,
X: [value of X]
etc; or something like
(dbg 2 "~&Checkpoint 8")
to output "Checkpoint 8" only if -DD was on the command line, while
(dbg "~&Checkpoint 8")
outputs "Checkpoint 8" only if -D was on the command line.
And they can be combined:
(dbg 3 x y z)
does (show x y z) only if -DDD was on the command line.
So, for example, running
(when t
  (setf x (code-char 222))
  (show x)
  (handler-case
      (princ x)
    (error (e)
      (format *error-output* "~&E: ~S" e)
      (when nil
        (format *error-output* "~&E: ~A" e))))
  (bye 66))
gives output
--------------------------------
X: #\LATIN_CAPITAL_LETTER_THORN
E: #<SIMPLE-CHARSET-TYPE-ERROR #x00001000003276D1>
--------------------------------
which is correct.
But changing the "when nil" above to "when t" gives
--------------------------------
X: #\LATIN_CAPITAL_LETTER_THORN
E: #<SIMPLE-CHARSET-TYPE-ERROR #x00001000003276D1>
E:
PRINC: Character #\u00DE cannot be represented in the character set CHARSET:ASCII
--------------------------------
because the final FORMAT tried to write a
non-ASCII character to an ASCII stream.
I'm planning to hook (format "~A") of ISO-8859-1 characters so
that when a charset error occurs, I can go through the string
character-by-character, transliterating to some appropriate
ASCII string, such as "u:" for "u" with an umlaut. Because
the text is mostly ASCII, that will only rarely happen,
so the impact on efficiency-of-output will be minimal.
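A minimal sketch of that fallback, in portable Common Lisp. All the
names in it (*translit-table*, ascii-char-p, transliterate, safe-format)
are hypothetical, and the table holds only a handful of placeholder
entries, not a complete Latin-1 mapping. One difference from the plan
above: instead of catching the charset error, it formats to a string
first and checks for non-ASCII characters before writing anything, which
avoids leaving half a line on the stream when the error hits.

```lisp
;; Hypothetical names throughout. The table is a tiny placeholder,
;; NOT a complete or authoritative Latin-1 transliteration table.
(defparameter *translit-table*
  (let ((table (make-hash-table)))
    (loop for (code ascii) in '((196 "A:") (214 "O:") (220 "U:")
                                (228 "a:") (246 "o:") (252 "u:")
                                (222 "TH") (254 "th") (223 "ss"))
          do (setf (gethash (code-char code) table) ascii))
    table)
  "Maps a few Latin-1 characters to ASCII replacement strings.")

(defun ascii-char-p (char)
  "True if CHAR has a character code below 128."
  (< (char-code char) 128))

(defun transliterate (string)
  "Return a copy of STRING with every non-ASCII character replaced by
its table entry, or by <U+XXXX> when the table has no entry for it."
  (with-output-to-string (out)
    (loop for char across string
          do (cond ((ascii-char-p char)
                    (write-char char out))
                   ((gethash char *translit-table*)
                    (write-string (gethash char *translit-table*) out))
                   (t
                    (format out "<U+~4,'0X>" (char-code char)))))))

(defun safe-format (stream control &rest args)
  "Like FORMAT to STREAM, but build the text as a string first and
transliterate it only if it contains a non-ASCII character."
  (let ((text (apply #'format nil control args)))
    (write-string (if (every #'ascii-char-p text)
                      text
                      (transliterate text))
                  stream)
    nil))
```

With this, a call such as (safe-format *error-output* "E: ~A~%" e)
could stand in for the plain FORMAT calls in SHOW and DBG. Because the
text is built in an intermediate string, a leading ~& is evaluated
against that string rather than against the real stream, so a
(fresh-line stream) would have to be issued separately if needed.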
Can anybody offer code that already does that transliteration,
or even just a pointer to a convenient table of transliterations
to ASCII of the characters in the high half of ISO-8859-1?