(trivial) Need a standard way to replace MakeIdentifier (especially in pyxbgen)
I have an alternative MakeIdentifier for my personal use (and only for Japanese).
For us, non-ASCII characters are NOT meaningless, so it does us no good to have them all reduced to emptystring_xxx.
My alternative MakeIdentifier is as follows (the details are not important):
# alter_make_identifier
import re
_AllAsciiMatch_re = re.compile(r'^[ -~]+$')
_UnderscoreSubstitute_re = re.compile(r'[- .]')
_NonIdentifier_re = re.compile(r'[^a-zA-Z0-9_]')
_PrefixUnderscore_re = re.compile(r'^_+')
_PrefixDigit_re = re.compile(r'^\d+')
_CamelCase_re = re.compile(r'_\w')
import MeCab, romkan
_tagger1 = MeCab.Tagger("-Owakati")  # splits text into words
_tagger2 = MeCab.Tagger("-Oyomi")    # converts words to their readings

def MakeIdentifier (s, camel_case=False):
    s = unicode(s)
    if _AllAsciiMatch_re.match(s): # all ascii
        s = _PrefixUnderscore_re.sub('', _NonIdentifier_re.sub('',_UnderscoreSubstitute_re.sub('_', s)))
    else:
        # segment, take the readings, then romanize each word
        result = _tagger2.parse(_tagger1.parse(s.encode('utf-8')))
        s = "_".join(map(romkan.to_roma, result.decode('utf-8').split(" ")))
        s = _PrefixUnderscore_re.sub('', _NonIdentifier_re.sub('',_UnderscoreSubstitute_re.sub('_', s)))
    if camel_case:
        s = _CamelCase_re.sub(lambda _m: _m.group(0)[1].upper(), s)
    if _PrefixDigit_re.match(s):
        s = 'n' + s
    if 0 == len(s):
        s = 'emptyString'
    return s
(This MakeIdentifier converts identifiers made up of kanji characters into ASCII.)
I can use it by copying pyxbgen locally and editing it like this:
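To make the problem concrete, here is a minimal sketch of only the ASCII branch of the listing above, with no MeCab or romkan dependency. It is written for Python 3 as an assumption (the listing itself is Python 2), and `make_identifier_ascii` is a name introduced just for this illustration. Feeding it a kanji-only name shows how such names collapse to 'emptyString' when no transliteration step is available.

```python
import re

# Same patterns as in the listing above; only the ASCII path is reproduced.
_UnderscoreSubstitute_re = re.compile(r'[- .]')
_NonIdentifier_re = re.compile(r'[^a-zA-Z0-9_]')
_PrefixUnderscore_re = re.compile(r'^_+')
_PrefixDigit_re = re.compile(r'^\d+')
_CamelCase_re = re.compile(r'_\w')

def make_identifier_ascii(s, camel_case=False):
    """Hypothetical helper: the ASCII-only substitution steps, in order."""
    s = _UnderscoreSubstitute_re.sub('_', s)   # '-', ' ', '.' become '_'
    s = _NonIdentifier_re.sub('', s)           # drop everything else
    s = _PrefixUnderscore_re.sub('', s)        # no leading underscores
    if camel_case:
        s = _CamelCase_re.sub(lambda m: m.group(0)[1].upper(), s)
    if _PrefixDigit_re.match(s):
        s = 'n' + s                            # identifiers cannot start with a digit
    if len(s) == 0:
        s = 'emptyString'
    return s

print(make_identifier_ascii('rail-road.center line'))        # rail_road_center_line
print(make_identifier_ascii('rail-road.center line', True))  # railRoadCenterLine
print(make_identifier_ascii('25000'))                        # n25000
print(make_identifier_ascii('鉄道'))                          # emptyString
```

Every kanji character is removed by the non-identifier pattern, so distinct Japanese names all end up as the same placeholder.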
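As a rough illustration of this kind of local override (a sketch only, not the actual edit made to pyxbgen), one can rebind a module-level function before the generator code runs. Everything named below is hypothetical: `fake_utility` stands in for whichever module defines MakeIdentifier, `generate_binding_name` stands in for the code that calls it, and the tiny lookup table stands in for real transliteration. The sketch assumes Python 3.

```python
import types

# Hypothetical stand-in for the library module that defines MakeIdentifier.
fake_utility = types.ModuleType("fake_utility")

def _stock_make_identifier(s, camel_case=False):
    # Placeholder for the stock behaviour: keep ASCII letters, digits and '_'.
    cleaned = "".join(
        ch for ch in s if ord(ch) < 128 and (ch.isalnum() or ch == "_")
    )
    return cleaned or "emptyString"

fake_utility.MakeIdentifier = _stock_make_identifier

def generate_binding_name(name):
    # Hypothetical caller: looks the function up through the module at call
    # time, so a later rebinding of the attribute is picked up.
    return fake_utility.MakeIdentifier(name)

def make_identifier_jp(s, camel_case=False):
    # Hypothetical replacement; a real one would transliterate the text.
    table = {"駅": "eki"}
    return table.get(s, _stock_make_identifier(s, camel_case))

before = generate_binding_name("駅")               # stock function is used
fake_utility.MakeIdentifier = make_identifier_jp   # the override
after = generate_binding_name("駅")                # replacement is used

print(before, after)  # emptyString eki
```

An override like this only takes effect where callers go through the module attribute; code that did `from module import MakeIdentifier` keeps the old function, which is one reason a supported hook is preferable to patching a copy.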
I would like a standard way to replace MakeIdentifier without copying pyxbgen.
(But I have no idea how to do it.)
OK, that shouldn't be too difficult. I'll try to get it into 1.1.4 sometime in the next couple weeks.
commit 6c81ed77dd03c0f528708db070d84cc8237a3b43
Author: Peter A. Bigot <pabigot@…>
Date: Thu Jun 14 13:13:48 2012 -0500
commit 5c4f4c5c3d80209540c612e82c6d7e762b3d7b70
Author: Peter A. Bigot <pabigot@…>
Date: Thu Jun 14 14:24:46 2012 -0500
:000000 100644 0000000... 463e27c... A examples/unicode_jp/README.txt
:000000 100644 0000000... f434a60... A examples/unicode_jp/check.py
:000000 100644 0000000... 20c5d51... A examples/unicode_jp/data/euc-jp/FG-GML-13-RailCL25000-20080331-0001.xml
:000000 100644 0000000... 5b03078... A examples/unicode_jp/data/euc-jp/FGD_GMLSchema.xsd
:000000 100644 0000000... ba2cb6d... A examples/unicode_jp/data/iso-2022-jp/FG-GML-13-RailCL25000-20080331-0001.xml
:000000 100644 0000000... 17cf8be... A examples/unicode_jp/data/iso-2022-jp/FGD_GMLSchema.xsd
:000000 100644 0000000... aeca7bf... A examples/unicode_jp/data/readme.txt
:000000 100644 0000000... f23a71a... A examples/unicode_jp/data/shift_jis/FG-GML-13-RailCL25000-20080331-0001.xml
:000000 100644 0000000... ff0d3d6... A examples/unicode_jp/data/shift_jis/FGD_GMLSchema-ss.jpg
:000000 100644 0000000... fa6461a... A examples/unicode_jp/data/shift_jis/FGD_GMLSchema.xsd
:000000 100644 0000000... 714eb82... A examples/unicode_jp/data/shift_jis/readme.txt
:000000 100644 0000000... 3da0866... A examples/unicode_jp/data/utf-8/FG-GML-13-RailCL25000-20080331-0001.xml
:000000 100644 0000000... db8a20a... A examples/unicode_jp/data/utf-8/FGD_GMLSchema.xsd
:000000 100755 0000000... fbd00e0... A examples/unicode_jp/pyxbgen_jp
:000000 100755 0000000... f64001b... A examples/unicode_jp/test.sh
I was surprised that you also looked into MeCab and romkan.
All tests passed with no problems.
Thank you, great job!
Replying to hhsprings:
Your transliteration code was very interesting, and it only took 15 minutes to install the packages and find romkan.py. I think it makes a much more impressive demo. It also makes clear that Python3 support and Unicode identifiers are very important for a usable system.
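For reference, a small sketch of what Unicode identifiers mean in Python 3, where names may contain many non-ASCII characters (PEP 3131). The sample names are made up for illustration and are not taken from the example schema.

```python
# Sample names chosen only for illustration.
names = ["鉄道", "eki", "25000", "rail-road"]

# str.isidentifier() reports whether a string is a valid Python 3 identifier.
valid = {name: name.isidentifier() for name in names}

for name, ok in valid.items():
    print(name, ok)
```

With native Unicode identifiers the kanji name could be used directly, so romanization would become an option rather than a requirement.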
Thank you, great job!
Dou itashimashite. This was a great test case that brought back memories (I took a semester of Japanese after grad school fifteen years ago, but never used it and have almost forgotten everything).
By the way, I currently can credit you only as "hhsprings" in the example README; if you would like credit under your real name, please send me email through sourceforge. I would not include your email address, just your name, and that only if you wish.