---------- Forwarded message ----------
From: Edward K. Ream <edreamleo@gmail.com>
Date: Thu, Apr 18, 2013 at 7:37 PM
Subject: Re: [Docutils-develop] Consistency check in utils/punctuation_chars.py fails for delimiters on Windows 7 with Python 2.
To: Guenter Milde <milde@users.sf.net>


On Thu, Apr 18, 2013 at 6:15 PM, Guenter Milde <milde@users.sf.net> wrote:

> Thanks for your analysis,

> GŁnter

And my apologies for getting your name wrong. As you can see above, Google Gmail has trouble with Unicode :-)


> Which one, the newly generated or the hardcoded set?

The result is for compare(delimiters, d), with delimiters and d defined as in punctuation_chars.py. As the first line shows::

    434 366

There are 434 chars in delimiters, and 366 in d. If I read the code correctly, d is the result of punctuation_samples, and delimiters holds the hard-coded constants.
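
A comparison of this kind can be sketched roughly as follows. The compare helper here is a hypothetical stand-in, not the function from my script: it assumes both arguments are plain unicode strings and reports their lengths plus the characters found in only one of them.

```python
def compare(s1, s2):
    """Return (len(s1), len(s2), only_in_s1, only_in_s2) for two strings."""
    set1, set2 = set(s1), set(s2)
    only1 = sorted(set1 - set2)  # characters missing from s2
    only2 = sorted(set2 - set1)  # characters missing from s1
    return len(s1), len(s2), only1, only2


# Tiny illustrative inputs; the real strings come from punctuation_chars.py.
n1, n2, only1, only2 = compare(u'-/:\\', u'-/:')
print('%d %d' % (n1, n2))
for c in only1:
    print('only in first: %r (%d)' % (c, ord(c)))
```

Reporting the code points rather than the characters themselves avoids depending on what the console can display.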

Simple changes to the script yield this very fast way to compute the delimiters:

import sys

ch = unichr if sys.version_info < (3,) else chr
delimiters = ''.join([
    ch(45), # hyphen-minus
    ch(47), # solidus
    ch(58), # colon
    ch(92), # reverse solidus
    ch(161), # inverted exclamation mark
    ch(183), # middle dot
    ch(191), # inverted question mark
    ch(894), # greek question mark
    ch(903), # greek ano teleia
    ch(1370), # armenian apostrophe
    ch(1371), # armenian emphasis mark
    ch(1372), # armenian exclamation mark
    ch(1373), # armenian comma
    ch(1374), # armenian question mark
    ch(1375), # armenian abbreviation mark
    ch(1417), # armenian full stop
    ch(1418), # armenian hyphen
    ch(1470), # hebrew punctuation maqaf
    ch(1472), # hebrew punctuation paseq
    ch(1475), # hebrew punctuation sof pasuq
    ch(1478), # hebrew punctuation nun hafukha
    ch(1523), # hebrew punctuation geresh
    ch(1524), # hebrew punctuation gershayim
    ch(1545), # arabic-indic per mille sign
    ch(1546), # arabic-indic per ten thousand sign
    ch(1548), # arabic comma
    ch(1549), # arabic date separator
    ch(1563), # arabic semicolon
    ch(1566), # arabic triple dot punctuation mark
    ch(1567), # arabic question mark
    ch(1642), # arabic percent sign
    ch(1643), # arabic decimal separator
    ch(1644), # arabic thousands separator
    ch(1645), # arabic five pointed star
    ch(1748), # arabic full stop
    ch(1792), # syriac end of paragraph
    ch(1793), # syriac supralinear full stop
    ch(1794), # syriac sublinear full stop
    ch(1795), # syriac supralinear colon
    ch(1796), # syriac sublinear colon
    ch(1797), # syriac horizontal colon
    ch(1798), # syriac colon skewed left
    ch(1799), # syriac colon skewed right
    ch(1800), # syriac supralinear colon skewed left
    ch(1801), # syriac sublinear colon skewed right
    ch(1802), # syriac contraction
    ch(1803), # syriac harklean obelus
    ch(1804), # syriac harklean metobelus
    ch(1805), # syriac harklean asteriscus
    ch(2039), # nko symbol gbakurunen
    ch(2040), # nko comma
    ch(2041), # nko exclamation mark
    ch(2096), # samaritan punctuation nequdaa
    ch(2097), # samaritan punctuation afsaaq
    ch(2098), # samaritan punctuation anged
    ch(2099), # samaritan punctuation bau
    ch(2100), # samaritan punctuation atmaau
    ch(2101), # samaritan punctuation shiyyaalaa
    ch(2102), # samaritan abbreviation mark
    ch(2103), # samaritan punctuation melodic qitsa
    ch(2104), # samaritan punctuation ziqaa
    ch(2105), # samaritan punctuation qitsa
    ch(2106), # samaritan punctuation zaef
    ch(2107), # samaritan punctuation turu
    ch(2108), # samaritan punctuation arkaanu
    ch(2109), # samaritan punctuation sof mashfaat
    ch(2110), # samaritan punctuation annaau
    ch(2404), # devanagari danda
    ch(2405), # devanagari double danda
    ch(2416), # devanagari abbreviation sign
    ch(3572), # sinhala punctuation kunddaliya
    ch(3663), # thai character fongman
    ch(3674), # thai character angkhankhu
    ch(3675), # thai character khomut
    ch(3844), # tibetan mark initial yig mgo mdun ma
    ch(3845), # tibetan mark closing yig mgo sgab ma
    ch(3846), # tibetan mark caret yig mgo phur shad ma
    ch(3847), # tibetan mark yig mgo tsheg shad ma
    ch(3848), # tibetan mark sbrul shad
    ch(3849), # tibetan mark bskur yig mgo
    ch(3850), # tibetan mark bka- shog yig mgo
    ch(3851), # tibetan mark intersyllabic tsheg
    ch(3852), # tibetan mark delimiter tsheg bstar
    ch(3853), # tibetan mark shad
    ch(3854), # tibetan mark nyis shad
    ch(3855), # tibetan mark tsheg shad
    ch(3856), # tibetan mark nyis tsheg shad
    ch(3857), # tibetan mark rin chen spungs shad
    ch(3858), # tibetan mark rgya gram shad
    ch(3973), # tibetan mark paluta
    ch(4048), # tibetan mark bska- shog gi mgo rgyan
    ch(4049), # tibetan mark mnyam yig gi mgo rgyan
    ch(4050), # tibetan mark nyis tsheg
    ch(4051), # tibetan mark initial brda rnying yig mgo mdun ma
    ch(4052), # tibetan mark closing brda rnying yig mgo sgab ma
    ch(4170), # myanmar sign little section
    ch(4171), # myanmar sign section
    ch(4172), # myanmar symbol locative
    ch(4173), # myanmar symbol completed
    ch(4174), # myanmar symbol aforementioned
    ch(4175), # myanmar symbol genitive
    ch(4347), # georgian paragraph separator
    ch(4961), # ethiopic wordspace
    ch(4962), # ethiopic full stop
    ch(4963), # ethiopic comma
    ch(4964), # ethiopic semicolon
    ch(4965), # ethiopic colon
    ch(4966), # ethiopic preface colon
    ch(4967), # ethiopic question mark
    ch(4968), # ethiopic paragraph separator
    ch(5120), # canadian syllabics hyphen
    ch(5741), # canadian syllabics chi sign
    ch(5742), # canadian syllabics full stop
    ch(5867), # runic single punctuation
    ch(5868), # runic multiple punctuation
    ch(5869), # runic cross punctuation
    ch(5941), # philippine single punctuation
    ch(5942), # philippine double punctuation
    ch(6100), # khmer sign khan
    ch(6101), # khmer sign bariyoosan
    ch(6102), # khmer sign camnuc pii kuuh
    ch(6104), # khmer sign beyyal
    ch(6105), # khmer sign phnaek muan
    ch(6106), # khmer sign koomuut
    ch(6144), # mongolian birga
    ch(6145), # mongolian ellipsis
    ch(6146), # mongolian comma
    ch(6147), # mongolian full stop
    ch(6148), # mongolian colon
    ch(6149), # mongolian four dots
    ch(6150), # mongolian todo soft hyphen
    ch(6151), # mongolian sibe syllable boundary marker
    ch(6152), # mongolian manchu comma
    ch(6153), # mongolian manchu full stop
    ch(6154), # mongolian nirugu
    ch(6468), # limbu exclamation mark
    ch(6469), # limbu question mark
    ch(6622), # new tai lue sign lae
    ch(6623), # new tai lue sign laev
    ch(6686), # buginese pallawa
    ch(6687), # buginese end of section
    ch(6816), # tai tham sign wiang
    ch(6817), # tai tham sign wiangwaak
    ch(6818), # tai tham sign sawan
    ch(6819), # tai tham sign keow
    ch(6820), # tai tham sign hoy
    ch(6821), # tai tham sign dokmai
    ch(6822), # tai tham sign reversed rotated rana
    ch(6824), # tai tham sign kaan
    ch(6825), # tai tham sign kaankuu
    ch(6826), # tai tham sign satkaan
    ch(6827), # tai tham sign satkaankuu
    ch(6828), # tai tham sign hang
    ch(6829), # tai tham sign caang
    ch(7002), # balinese panti
    ch(7003), # balinese pamada
    ch(7004), # balinese windu
    ch(7005), # balinese carik pamungkah
    ch(7006), # balinese carik siki
    ch(7007), # balinese carik pareren
    ch(7008), # balinese pameneng
    ch(7227), # lepcha punctuation ta-rol
    ch(7228), # lepcha punctuation nyet thyoom ta-rol
    ch(7229), # lepcha punctuation cer-wa
    ch(7230), # lepcha punctuation tshook cer-wa
    ch(7231), # lepcha punctuation tshook
    ch(7294), # ol chiki punctuation mucaad
    ch(7295), # ol chiki punctuation double mucaad
    ch(7379), # vedic sign nihshvasa
    ch(8208), # hyphen
    ch(8209), # non-breaking hyphen
    ch(8210), # figure dash
    ch(8211), # en dash
    ch(8212), # em dash
    ch(8213), # horizontal bar
    ch(8214), # double vertical line
    ch(8215), # double low line
    ch(8224), # dagger
    ch(8225), # double dagger
    ch(8226), # bullet
    ch(8227), # triangular bullet
    ch(8228), # one dot leader
    ch(8229), # two dot leader
    ch(8230), # horizontal ellipsis
    ch(8231), # hyphenation point
    ch(8240), # per mille sign
    ch(8241), # per ten thousand sign
    ch(8242), # prime
    ch(8243), # double prime
    ch(8244), # triple prime
    ch(8245), # reversed prime
    ch(8246), # reversed double prime
    ch(8247), # reversed triple prime
    ch(8248), # caret
    ch(8251), # reference mark
    ch(8252), # double exclamation mark
    ch(8253), # interrobang
    ch(8254), # overline
    ch(8257), # caret insertion point
    ch(8258), # asterism
    ch(8259), # hyphen bullet
    ch(8263), # double question mark
    ch(8264), # question exclamation mark
    ch(8265), # exclamation question mark
    ch(8266), # tironian sign et
    ch(8267), # reversed pilcrow sign
    ch(8268), # black leftwards bullet
    ch(8269), # black rightwards bullet
    ch(8270), # low asterisk
    ch(8271), # reversed semicolon
    ch(8272), # close up
    ch(8273), # two asterisks aligned vertically
    ch(8275), # swung dash
    ch(8277), # flower punctuation mark
    ch(8278), # three dot punctuation
    ch(8279), # quadruple prime
    ch(8280), # four dot punctuation
    ch(8281), # five dot punctuation
    ch(8282), # two dot punctuation
    ch(8283), # four dot mark
    ch(8284), # dotted cross
    ch(8285), # tricolon
    ch(8286), # vertical four dots
    ch(11513), # coptic old nubian full stop
    ch(11514), # coptic old nubian direct question mark
    ch(11515), # coptic old nubian indirect question mark
    ch(11516), # coptic old nubian verse divider
    ch(11518), # coptic full stop
    ch(11519), # coptic morphological divider
    ch(11776), # right angle substitution marker
    ch(11777), # right angle dotted substitution marker
    ch(11782), # raised interpolation marker
    ch(11783), # raised dotted interpolation marker
    ch(11784), # dotted transposition marker
    ch(11787), # raised square
    ch(11790), # editorial coronis
    ch(11791), # paragraphos
    ch(11792), # forked paragraphos
    ch(11793), # reversed forked paragraphos
    ch(11794), # hypodiastole
    ch(11795), # dotted obelos
    ch(11796), # downwards ancora
    ch(11797), # upwards ancora
    ch(11798), # dotted right-pointing angle
    ch(11799), # double oblique hyphen
    ch(11800), # inverted interrobang
    ch(11801), # palm branch
    ch(11802), # hyphen with diaeresis
    ch(11803), # tilde with ring above
    ch(11806), # tilde with dot above
    ch(11807), # tilde with dot below
    ch(11818), # two dots over one dot punctuation
    ch(11819), # one dot over two dots punctuation
    ch(11820), # squared four dot punctuation
    ch(11821), # five dot mark
    ch(11822), # reversed question mark
    ch(11824), # ring point
    ch(11825), # word separator middle dot
    ch(12289), # ideographic comma
    ch(12290), # ideographic full stop
    ch(12291), # ditto mark
    ch(12316), # wave dash
    ch(12336), # wavy dash
    ch(12349), # part alternation mark
    ch(12448), # katakana-hiragana double hyphen
    ch(12539), # katakana middle dot
    ch(42238), # lisu punctuation comma
    ch(42239), # lisu punctuation full stop
    ch(42509), # vai comma
    ch(42510), # vai full stop
    ch(42511), # vai question mark
    ch(42611), # slavonic asterisk
    ch(42622), # cyrillic kavyka
    ch(42738), # bamum njaemli
    ch(42739), # bamum full stop
    ch(42740), # bamum colon
    ch(42741), # bamum comma
    ch(42742), # bamum semicolon
    ch(42743), # bamum question mark
    ch(43124), # phags-pa single head mark
    ch(43125), # phags-pa double head mark
    ch(43126), # phags-pa mark shad
    ch(43127), # phags-pa mark double shad
    ch(43214), # saurashtra danda
    ch(43215), # saurashtra double danda
    ch(43256), # devanagari sign pushpika
    ch(43257), # devanagari gap filler
    ch(43258), # devanagari caret
    ch(43310), # kayah li sign cwi
    ch(43311), # kayah li sign shya
    ch(43359), # rejang section mark
    ch(43457), # javanese left rerenggan
    ch(43458), # javanese right rerenggan
    ch(43459), # javanese pada andap
    ch(43460), # javanese pada madya
    ch(43461), # javanese pada luhur
    ch(43462), # javanese pada windu
    ch(43463), # javanese pada pangkat
    ch(43464), # javanese pada lingsa
    ch(43465), # javanese pada lungsi
    ch(43466), # javanese pada adeg
    ch(43467), # javanese pada adeg adeg
    ch(43468), # javanese pada piseleh
    ch(43469), # javanese turned pada piseleh
    ch(43486), # javanese pada tirta tumetes
    ch(43487), # javanese pada isen-isen
    ch(43612), # cham punctuation spiral
    ch(43613), # cham punctuation danda
    ch(43614), # cham punctuation double danda
    ch(43615), # cham punctuation triple danda
    ch(43742), # tai viet symbol ho hoi
    ch(43743), # tai viet symbol koi koi
    ch(44011), # meetei mayek cheikhei
    ch(65040), # presentation form for vertical comma
    ch(65041), # presentation form for vertical ideographic comma
    ch(65042), # presentation form for vertical ideographic full stop
    ch(65043), # presentation form for vertical colon
    ch(65044), # presentation form for vertical semicolon
    ch(65045), # presentation form for vertical exclamation mark
    ch(65046), # presentation form for vertical question mark
    ch(65049), # presentation form for vertical horizontal ellipsis
    ch(65072), # presentation form for vertical two dot leader
    ch(65073), # presentation form for vertical em dash
    ch(65074), # presentation form for vertical en dash
    ch(65093), # sesame dot
    ch(65094), # white sesame dot
    ch(65097), # dashed overline
    ch(65098), # centreline overline
    ch(65099), # wavy overline
    ch(65100), # double wavy overline
    ch(65104), # small comma
    ch(65105), # small ideographic comma
    ch(65106), # small full stop
    ch(65108), # small semicolon
    ch(65109), # small colon
    ch(65110), # small question mark
    ch(65111), # small exclamation mark
    ch(65112), # small em dash
    ch(65119), # small number sign
    ch(65120), # small ampersand
    ch(65121), # small asterisk
    ch(65123), # small hyphen-minus
    ch(65128), # small reverse solidus
    ch(65130), # small percent sign
    ch(65131), # small commercial at
    ch(65281), # fullwidth exclamation mark
    ch(65282), # fullwidth quotation mark
    ch(65283), # fullwidth number sign
    ch(65285), # fullwidth percent sign
    ch(65286), # fullwidth ampersand
    ch(65287), # fullwidth apostrophe
    ch(65290), # fullwidth asterisk
    ch(65292), # fullwidth comma
    ch(65293), # fullwidth hyphen-minus
    ch(65294), # fullwidth full stop
    ch(65295), # fullwidth solidus
    ch(65306), # fullwidth colon
    ch(65307), # fullwidth semicolon
    ch(65311), # fullwidth question mark
    ch(65312), # fullwidth commercial at
    ch(65340), # fullwidth reverse solidus
    ch(65377), # halfwidth ideographic full stop
    ch(65380), # halfwidth ideographic comma
    ch(65381), # halfwidth katakana middle dot
])
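
For what it's worth, source lines in this "ch(n), # name" shape can be generated rather than typed. What follows is only a sketch under assumptions: the delimiter_lines helper is hypothetical, and it simply emits every BMP character in the Unicode categories Pd and Po. That is a superset of the list above (it also yields chars such as "!" that the list leaves out), and the output depends on the Unicode database shipped with the running Python.

```python
import sys
import unicodedata

ch = unichr if sys.version_info < (3,) else chr


def delimiter_lines(categories=('Pd', 'Po'), limit=0x10000):
    """Yield 'ch(n), # name' source lines for code points below limit
    whose Unicode general category is one of the given categories."""
    for n in range(limit):
        c = ch(n)
        if unicodedata.category(c) in categories:
            # Fall back to 'unnamed' for code points without a name.
            name = unicodedata.name(c, 'unnamed').lower()
            yield '    ch(%d), # %s' % (n, name)


lines = list(delimiter_lines())
print('%d candidate lines' % len(lines))
```

Writing the code points as integers keeps the generated source pure ASCII, so it reads the same on Python 2 and 3 regardless of the file's encoding.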

Having docutils compute *all* Unicode constants this way would be an important step towards creating a common code base that runs unchanged on either Python 2 or 3. In particular, the ur'string' constants must go. Alas, Python 3.3 accepts u'string' constants again, but it still rejects ur'string'.
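
To make the ur'string' point concrete, here is a minimal sketch of one replacement. The dash pattern is a made-up example, not a pattern taken from docutils: the character class is built from code points instead of being written as a raw unicode literal.

```python
import re
import sys

ch = unichr if sys.version_info < (3,) else chr

# Python 2 only (a syntax error on Python 3):
#     dashes = ur'[\u2010-\u2015]'
# Portable: assemble the same character class from code points.
dashes = u'[%s-%s]' % (ch(0x2010), ch(0x2015))
pattern = re.compile(dashes)

# ch(0x2013) is an en dash, which lies inside the range.
print(pattern.search(ch(0x2013)) is not None)
```

The same source then compiles unchanged on both Python 2 and Python 3.3.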

A common code base for docutils would be extremely useful for Leo. It would also seem useful for docutils itself: it would eliminate the 2to3 gymnastics presently required, and should ease future maintenance.

Leo is presently using a kludged-up port of docutils that runs unchanged on either 2 or 3. (This is only a fallback when the user hasn't installed a proper docutils.) It is just good enough to run docutils on Leo's help strings, but many unit tests fail on both 2 and 3. The next step will be a completely new port, one that ensures that unit tests pass on both 2 and 3 at each stage of the port. I'll let you know when that happens, and I'll tell you in detail how I did it, if you are interested.

Edward
------------------------------------------------------------------------------
Edward K. Ream email: edreamleo@gmail.com
Leo: http://webpages.charter.net/edreamleo/front.html
------------------------------------------------------------------------------