[Win32forth-cvs] win32forth/src/lib EscapedStrings.f,NONE,1.1

Update of /cvsroot/win32forth/win32forth/src/lib
In directory 23jxhf1.ch3.sourceforge.com:/tmp/cvs-serv32732

Added Files:
	EscapedStrings.f 
Log Message:
Support for Escaped Strings S\" added

--- NEW FILE: EscapedStrings.f ---
\ RfD: Escaped Strings S\"
\ 19 July 2007, Stephen Pelc
\
\ 20070719 Modified ambiguous condition
\           Added ambiguous conditions to definition of S\"
\           Added test cases
\           Corrected Reference Implementation
\ 20070712 Redrafted non-normative portions.
\ 20060822 Updated solution section.
\ 20060821 First draft.
\
\
\ Rationale
\ =========
\
\
\ Problem
\ -------
\ The word S" 6.1.2165 is the primary word for generating strings.
\ In more complex applications, it suffers from several deficiencies:
\ 1) the S" string can only contain printable characters,
\ 2) the S" string cannot contain the '"' character,
\ 3) the S" string cannot be used with wide characters as discussed
\     in the Forth 200x internationalisation and XCHAR proposals.
\
\
\ Current practice
\ ----------------
\ At least SwiftForth, Gforth and VFX Forth support S\" with very
\ similar operations. S\" behaves like S", but uses the '\' character
\ as an escape character for the entry of characters that cannot be
\ used with S".
\
\
\ This technique is widespread in languages other than Forth.
\
\
\ It has benefit in areas such as
\
\
\ 1) construction of multiline strings for display by operating
\     system services,
\ 2) construction of HTTP headers,
\ 3) generation of GSM modem and Telnet control strings.
\
\
\ The majority of current Forth systems contain code, either in the
\ kernel or in application code, that assumes char=byte=au. To avoid
\ breaking existing code, we have to live with this practice.
\
\
\ The following list describes what is currently available in the
\ surveyed Forth systems that support escaped strings.
\
\
\ \a      BEL (alert, ASCII 7)
\ \b      BS (backspace, ASCII 8)
\ \e      ESC (not in C99, ASCII 27)
\ \f      FF (form feed, ASCII 12)
\ \l      LF (ASCII 10)
\ \m      CR/LF pair (ASCII 13, 10) - for HTML etc.
\ \n      newline - CRLF for Windows/DOS, LF for Unices
\ \q      double-quote (ASCII 34)
\ \r      CR (ASCII 13)
\ \t      HT (tab, ASCII 9)
\ \v      VT (ASCII 11)
\ \z      NUL (ASCII 0)
\ \"      "
\ \[0-7]+ Octal numerical character value, finishes at the
\          first non-octal character
\ \x[0-9a-f]+  Hex numerical character value, finishes at the
\          first non-hex character
\ \\      backslash itself
\ \       before any other character represents that character
\
\
\ Considerations
\ --------------
\ We are trying to integrate several issues:
\
\
\ 1) no/least code breakage
\ 2) minimal standards changes
\ 3) variable width character sets
\ 4) small system functionality
\
\
\ Item 1) is about the common char=byte=au assumption.
\ Item 2) includes the use of COUNT to step through memory and the
\          impact of char in the file word sets.
\ Item 3) has to rationalise a fixed width serial/comms channel
\          with 1..4 byte characters, e.g. UTF-8
\ Item 4) should enable 16 bit systems to handle UTF-8 and UTF-32.
\
\
\ The basis of the current approach is to use the terminology of
\ primitive characters and extended characters. A primitive character
\ (called a pchar here) is a fixed-width unit handled by EMIT and
\ friends as well as C@, C! and friends. A pchar corresponds to the
\ current ANS definition of a character. Characters that may be
\ wider than a pchar are called "extended characters" or xchars.
\ The xchars are an integer multiple of pchars. An xchar consists
\ of one or more primitive characters and represents the encoding
\ for a "display unit". A string is represented by caddr/len
\ in terms of primitive characters.
\
\
\ The consequences of this are:
\
\
\ 1) No existing code is broken.
\ 2) Most systems have only one keyboard and only one screen/display
\     unit, but may have several additional comms channels. The
\     impact of a keyboard driver having to convert Chinese or Russian
\     characters into a (say) UTF-8 sequence is minimal compared to
\     handling the key stroke sequences. Similarly on display.
\ 3) Comms channels and files work as expected.
\ 4) 16-bit embedded systems can handle all character widths as they
\     are described as strings.
\ 5) No conflict arises with the XCHARs proposal.
\
\
\ Multiple encodings can be handled if they share a common primitive
\ character size - nearly all encodings are described in terms of
\ octets, e.g. TCP/IP, UTF-8, UTF-16, UTF-32, ...
\
\
\ Approach
\ --------
\ This proposal does not require systems to handle xchars, and does
\ not disenfranchise those that do.
\
\
\ S\" is used like S" but treats the '\' character specially. One
\ or more characters after the '\' indicate what is substituted.
\ Three of the escapes in current practice cause parsing and
\ readability problems: the octal form, the hexadecimal form, and
\ the rule for a '\' before any other character. As far as I know,
\ requiring characters to come in 8 bit units will not upset any
\ systems. Systems with characters of fewer than 8 bits are
\ non-compliant, and I know of no 7 bit CPUs. All current systems
\ use character units of 8 bits or more.
\
\
\ Of observed current practice, the following two numeric forms are
\ problematic.
\
\
\ \[0-7]+ Octal numerical character value, finishes at the
\          first non-octal character
\
\
\ \x[0-9a-f]+  Hex numerical character value, finishes at the
\          first non-hex character
\
\
\ Why do we need two representations, both of variable length?
\ This proposal selects the hexadecimal representation, requiring
\ two hex digits. A consequence of this is that xchars must be
\ represented as a sequence of pchars. Although initially seen as a
\ problem by some people, it avoids at least the following problems:
\
\
\ 1) Endian issues when transmitting an xchar, e.g. big-endian host
\     to little-endian comms channel
\
\
\ 2) Issues when an xchar is larger than a cell, e.g. UTF-32 on
\     a 16 bit system.
\
\
\ 3) Ambiguity in distinguishing the end of the number from a
\     following character such as '0' or 'A'.
\
\
\ At least one system (Gforth) already supports UTF-8 as its native
\ character set, and one system (JaxForth) used UTF-16. These systems
\ are not affected.
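\
\
\ As a non-normative illustration of the two-digit rule (assuming
\ 8 bit pchars and UTF-8 as the encoding, neither of which this
\ proposal requires), the xchar U+20AC is entered as the three
\ pchars of its UTF-8 encoding, and a digit that follows a hex
\ escape is an ordinary character:
\
\
\     S\" \xE2\x82\xAC"   \ -- c-addr 3 ; pchars $E2 $82 $AC
\     S\" \x1F0"          \ -- c-addr 2 ; pchars $1F $30, not $1F0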
\
\
\ \       before any other character represents that character
\
\
\ This is an unnecessary general case, and so is not mandated. By
\ making it an ambiguous condition, we do not disenfranchise
\ existing implementations, and leave the way open for future
\ extensions.
\
\
\ Proposal
\ ========
\
\
\ 6.2.xxxx S\"
\ s-slash-quote CORE EXT
\
\
\ Interpretation:
\     Interpretation semantics for this word are undefined.
\
\
\ Compilation: ( "ccc<quote>" -- )
\     Parse ccc delimited by " (double-quote), using the translation
\     rules below. Append the run-time semantics given below to the
\     current definition.
\
\
\ Translation rules:
\     Characters are processed one at a time and appended to the
\     compiled string. If the character is a '\' character it is
\     processed by parsing and substituting one or more characters
\     as follows:
\
\
\     \a      BEL (alert, ASCII 7)
\     \b      BS (backspace, ASCII 8)
\     \e      ESC (not in C99, ASCII 27)
\     \f      FF (form feed, ASCII 12)
\     \l      LF (ASCII 10)
\     \m      CR/LF pair (ASCII 13, 10)
\     \n      implementation dependent newline, e.g. CR/LF, LF, or LF/CR.
\     \q      double-quote (ASCII 34)
\     \r      CR (ASCII 13)
\     \t      HT (tab, ASCII 9)
\     \v      VT (ASCII 11)
\     \z      NUL (ASCII 0)
\     \"      "
\     \xAB    A and B are hexadecimal numerical characters. The resulting
\             character is the conversion of these two characters. An
\             ambiguous condition exists if \x is not followed by two
\             hexadecimal characters.
\     \\      backslash itself
\     \       An ambiguous condition exists if a \ is placed before any
\             character other than those defined in 6.2.xxxx S\".
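\
\
\ A non-normative sketch of these rules in use. The word names .HDR
\ and .SAY are hypothetical and not part of the proposal:
\
\
\     : .HDR  ( -- )  S\" Content-Type: text/plain\m\m" TYPE ;
\     : .SAY  ( -- )  S\" He said \qhello\q\x21" TYPE ;
\
\
\ .HDR sends the header text followed by two CR/LF pairs. .SAY
\ displays the text: He said "hello"!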
\
\
\ Run-time: ( -- c-addr u )
\     Return c-addr and u describing a string consisting of the translation
\     of the characters ccc. A program shall not alter the returned string.
\
\
\ See: 3.4.1 Parsing, 6.2.0855 C" , 11.6.1.2165 S" , A.6.1.2165 S"
\
\
\ Labelling
\ =========
\ Ambiguous conditions occur:
\    If \x is not followed by two hexadecimal characters.
\    If a \ is placed before any character other than those defined
\    in 6.2.xxxx S\".
\
\
\ Reference Implementation
\ ========================
\ Taken from the VFX Forth source tree and modified to remove most
\ implementation dependencies. Assumes the use of the # and $ numeric
\ prefixes to indicate decimal and hexadecimal respectively.
\
\
\ Another implementation (with some deviations) can be found at
\ http://b2.complang.tuwien.ac.at/cgi-bin/viewcvs.cgi/*checkout*/gforth...

\ Reference Implementation modified for Win32Forth by Dirk Busch

anew -EscapedStrings.f

INTERNAL

: $,            \ caddr len --
\ *G Lay the string into the dictionary at *\fo{HERE}, reserve
\ ** space for it and *\fo{ALIGN} the dictionary.
   dup >r
   here place
   r> 1 chars + allot
   align
;

: addchar       \ char string --
\ *G Add the character to the end of the counted string.
   tuck count + c!
   1 swap c+!
;

: append        \ c-addr u $dest --
\ *G Add the string described by C-ADDR U to the counted string at
\ ** $DEST. The strings must not overlap.
   >r
   tuck  r@ count +  swap cmove          \ add source to end
   r> c+!                                \ add length to count
;

: extract2H     \ caddr len -- caddr' len' u
\ *G Extract a two-digit hex number from the start of the
\ ** string, returning the remaining string and the
\ ** converted number. *\fo{BASE} is preserved.
   base @ >r  hex
   0 0 2over drop 2 >number 2drop drop
   >r  2 /string r>
   r> base !
;
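
\ Worked example for extract2H, as a sketch with a hypothetical
\ input: the text left after a \x prefix has been removed is "1Fa".
\     S" 1Fa" extract2H   \ -- caddr+2 1 $1F
\ The remaining string is "a" and the converted value is $1F.
\ The word does not check that two valid hex digits were present.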

create EscapeTable      \ -- addr
\ *G Table of translations for \a..\z.
   7 c,         \ \a
   8 c,         \ \b
   char c c,    \ \c
   char d c,    \ \d
   #27 c,       \ \e
   #12 c,       \ \f
   char g c,    \ \g
   char h c,    \ \h
   char i c,    \ \i
   char j c,    \ \j
   char k c,    \ \k
   #10 c,       \ \l
   char m c,    \ \m
   #10 c,       \ \n (LF placeholder; addEscape handles \n via crlf$)
   char o c,    \ \o
   char p c,    \ \p
   char " c,    \ \q
   #13 c,       \ \r
   char s c,    \ \s
   9 c,         \ \t
   char u c,    \ \u
   #11 c,       \ \v
   char w c,    \ \w
   char x c,    \ \x
   char y c,    \ \y
   0 c,         \ \z

: addEscape     \ caddr len dest -- caddr' len'
\ *G Add an escape sequence to the counted string at dest,
\ ** returning the remaining string.
   over 0=                              \ zero length check
   if  drop  exit  endif
   >r                                        \ -- caddr len ; R: -- dest
   over c@ [char] x = if                        \ hex number?
     1 /string extract2H r> addchar  exit
   endif
   over c@ [char] m = if                        \ CR/LF pair?
     1 /string  #13 r@ addchar  #10 r> addchar  exit
   endif
   over c@ [char] n = if                        \ system newline?
     1 /string  crlf$ count r> append  exit
   endif
   over c@ [char] a [char] z 1+ within if
     over c@ [char] a - EscapeTable + c@  r> addchar
   else
     over c@ r> addchar
   endif
   1 /string
;

: parse\"  \ caddr len dest -- caddr' len'
\ *G Parses a string up to an unescaped '"', translating '\'
\ ** escapes to characters much as C does. The
\ ** translated string is a counted string at *\i{dest}.
\ ** The supported escapes (case sensitive) are:
\ *D \a      BEL (alert)
\ *D \b      BS (backspace)
\ *D \e      ESC (not in C99)
\ *D \f      FF (form feed)
\ *D \l      LF (ASCII 10)
\ *D \m      CR/LF pair - for HTML etc.
\ *D \n      newline - CRLF for Windows/DOS, LF for Unices
\ *D \q      double-quote
\ *D \r      CR (ASCII 13)
\ *D \t      HT (tab)
\ *D \v      VT
\ *D \z      NUL (ASCII 0)
\ *D \"      "
\ *D \xAB    Two char Hex numerical character value
\ *D \\      backslash itself
\ *D \       before any other character represents that character
   dup >r  0 swap c!                 \ zero destination
   begin                                        \ -- caddr len ; R: -- dest
     dup
    while
     over c@ [char] " <>                     \ check for terminator
    while
     over c@ [char] \ = if              \ deal with escapes
       1 /string r@ addEscape
     else                               \ normal character
       over c@ r@ addchar  1 /string
     endif
   repeat then
   dup                                  \ step over terminating "
   if  1 /string  endif
   r> drop
;

: readEscaped   \ "string" -- caddr
\ *G Parses an escaped string from the input stream according to
\ ** the rules of *\fo{parse\"} above, returning the address
\ ** of the translated counted string in *\fo{PAD}.
   source >in @ /string tuck         \ -- len caddr len
   pad parse\" nip
    - >in +!
   pad
;

EXTERNAL

: S\"              \ "string" -- caddr u
\ *G As *\fo{S"}, but translates escaped characters using
\ ** *\fo{parse\"} above.
   readEscaped count  state @ if
     compile (s") $,
   then
; IMMEDIATE
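
\ Usage sketch for the definition above (GREETING is a hypothetical
\ name). When compiling, the translated string is laid down in the
\ dictionary by $, . When interpreting, which the RfD leaves
\ undefined, this implementation returns the transient string built
\ at PAD, so use it before PAD is overwritten.
\     : GREETING  ( -- c-addr u )  S\" Hello\tWorld" ;  \ compiled copy
\     S\" Hello\tWorld" TYPE                            \ transient, in PAD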

MODULE

\s
\ Test Cases
\ ==========

HEX

: { ; immediate
: -> cr .s RESET-STACKS [char] } parse type ;

( The same tests as for S" )


{ : GC5 S\" XY" ; -> }
{ GC5 SWAP DROP -> 2 }
{ GC5 DROP DUP C@ SWAP CHAR+ C@ -> 58 59 }


( The following are inspired by the Gforth test suite )


{ S\" " SWAP DROP -> 0 }


{ S\" \a" SWAP C@ -> 1 07 } \ BEL Bell
{ S\" \b" SWAP C@ -> 1 08 } \ BS  Backspace
{ S\" \e" SWAP C@ -> 1 1B } \ ESC Escape
{ S\" \f" SWAP C@ -> 1 0C } \ FF  Formfeed
{ S\" \l" SWAP C@ -> 1 0A } \ LF  Linefeed
{ S\" \q" SWAP C@ -> 1 22 } \ "   Double Quote
{ S\" \r" SWAP C@ -> 1 0D } \ CR  Carriage Return
{ S\" \t" SWAP C@ -> 1 09 } \ TAB Horizontal Tab
{ S\" \v" SWAP C@ -> 1 0B } \ VT  Vertical Tab
{ S\" \z" SWAP C@ -> 1 00 } \ NUL No Character
{ S\" \"" SWAP C@ -> 1 22 } \ "   Double Quote
{ S\" \\" SWAP C@ -> 1 5C } \ \   Back Slash


{ S\" \n" 2DROP -> }                                \ System dependent
{ S\" \m" SWAP DUP C@ SWAP CHAR+ C@ -> 2 0D 0A }    \ CR/LF pair
{ S\" \x1Fa" SWAP DUP C@ SWAP CHAR+ C@ -> 2 1F 61 } \ Specified Char


{ S\" S\\\" \\a\"" EVALUATE SWAP C@ -> 1 7 }
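

( Additional sketch cases, not part of the RfD test set. HEX is still in effect. )


{ S\" \x41\x42" SWAP DUP C@ SWAP CHAR+ C@ -> 2 41 42 } \ Adjacent hex escapes
{ S\" a\tb" SWAP CHAR+ C@ -> 3 09 }                    \ Escape between plain chars
{ S\" \x1F0" SWAP CHAR+ C@ -> 2 30 }                   \ Digit after \x1F is literal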

decimal