I've also sent this to the mailing list, although this is probably a better place for it:
I'm using formencode, which uses pyDNS to look up mx records to
validate e-mail addresses entered on web forms. pyDNS yields a
traceback when trying to mxlookup a unicode string.
I've also submitted this to formencode, but I'm pretty sure it's a
pyDNS problem. formencode bug:
http://sourceforge.net/tracker/index.php?func=detail&aid=2126902&group_id=91231&atid=596416
Here's how to reproduce:
>>> import DNS.Base
>>> DNS.Base.ParseResolvConf()
>>> from DNS.lazy import mxlookup
>>> mxlookup('gmx.at')
[(10, 'mx0.gmx.de'), (10, 'mx0.gmx.net')]
>>> mxlookup(u'gmx.at')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/var/lib/python-support/python2.4/DNS/lazy.py", line 26, in mxlookup
    a = Base.DnsRequest(name, qtype = 'mx').req().answers
  File "/var/lib/python-support/python2.4/DNS/Base.py", line 191, in req
    m.addQuestion(qname, qtype, Class.IN)
  File "/var/lib/python-support/python2.4/DNS/Lib.py", line 466, in addQuestion
    self.addname(qname)
  File "/var/lib/python-support/python2.4/DNS/Lib.py", line 133, in addname
    self.buf = self.buf + buf
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbb in position 0: ordinal not in range(128)
The 0xbb byte is not allowed. There are several proposals for internationalization:
http://www.isoc.org/pubpolpillar/docs/i18n-dns-chronology.pdf
If you want the Microsoft solution, simply change "ascii" to "utf8". I did not commit that to CVS because Microsoft did not address how the case-insensitive nature of DNS lookups would map to Unicode. The only standard way of encoding non-ascii characters in DNS is currently IDNA. You should apply IDNA before feeding the 7-bit result to pydns.
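As a rough sketch of that last step, the application can convert the name with Python's built-in 'idna' codec before handing it to pydns. The helper name to_idna_ascii is made up for illustration, and the snippet uses Python 3 syntax (the session above is Python 2, where calling .encode('idna') on a unicode string behaves similarly).

```python
def to_idna_ascii(name):
    """Return the 7-bit IDNA form of a domain name as text.

    Hypothetical helper, not part of pydns.
    """
    # The stdlib 'idna' codec splits the name on dots and converts each
    # non-ASCII label to its "xn--" (Punycode) form; ASCII labels pass through.
    return name.encode('idna').decode('ascii')


if __name__ == '__main__':
    print(to_idna_ascii('gmx.at'))
    print(to_idna_ascii('b\u00fccher.example'))
```

The returned value contains only ASCII characters, so it could then be passed to mxlookup like any other plain string.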
Should we add a custom exception that attempts to explain the issue?
What about trying to encode the unicode string as ASCII (i.e. no special characters that need IDNA encoding) and, if that succeeds, using the result? If it fails, throw an exception saying that IDNA encoding is not currently supported (until it is, at which point you simply do the IDNA encoding there).
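A minimal sketch of that suggestion, assuming a made-up exception class NonAsciiNameError and a made-up helper require_ascii (neither exists in pydns). It is written in Python 3 syntax for illustration only.

```python
class NonAsciiNameError(ValueError):
    """Hypothetical exception explaining why a name was rejected."""


def require_ascii(name):
    """Return the name as ASCII bytes, or raise a descriptive error."""
    try:
        # Succeeds for unicode strings that contain only ASCII characters,
        # such as u'gmx.at' in the report above.
        return name.encode('ascii')
    except UnicodeEncodeError:
        raise NonAsciiNameError(
            '%r contains non-ASCII characters; IDNA encoding is not '
            'supported here, so encode the name before the lookup' % (name,))


if __name__ == '__main__':
    print(require_ascii('gmx.at'))
```

With this shape, an ASCII-only unicode name is accepted unchanged, while anything else fails with a message that points the caller at IDNA instead of a bare codec error.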
The problem is that mxlookup doesn't work even on strings that are unicode, but consist only of ascii characters (like 'gmx.at' in the example given).
The example given works in CVS:
>>> import DNS.Base
>>> DNS.Base.ParseResolvConf()
>>> from DNS.lazy import mxlookup
>>> mxlookup('gmx.at')
[(10, 'mx0.gmx.de'), (10, 'mx0.gmx.net')]
>>> mxlookup(u'gmx.at')
[(10, 'mx0.gmx.de'), (10, 'mx0.gmx.net')]
The diff from release 2.3.3 is:
*** DNS/Lib.py 22 May 2007 20:27:40 -0000 1.11.2.3
--- DNS/Lib.py 17 Sep 2008 17:35:14 -0000 1.11.2.5
***************
*** 94,99 ****
--- 94,100 ----
list = []
for label in string.splitfields(name, '.'):
if label:
+ label = label.encode('ascii')
if len(label) > 63:
raise PackError, 'label too long'
list.append(label)
To flesh out the M$ solution, I would delay encoding the labels until after case folding (and hope unicode case folding is good enough), and then check for long labels *after* encoding to utf8.
The IDNA solution is *not* transparent (an IDNA encoded label is also a perfectly legal ascii label by design). Therefore, it should not be implemented in pydns. It is appropriate only at the application layer. The M$ proposal is reasonable and I would support it (unless they've patented it), provided the details of case folding are worked out.
My two cents on 8-bit DNS is that if the first byte of the label is non-ascii, it should be a type code to select the encoding for the remainder of the label. That provides an escape hatch. UTF8 already provides a BOM, u'\ufeff', which should be used to mark utf8 encoded labels.
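To make the idea concrete, here is a small sketch of encoding a single label with a BOM marker and checking the 63-byte limit only after encoding. The names encode_label, LabelTooLong and the utf8 flag are assumptions for illustration, it does not attempt the case folding question, and it uses Python 3 syntax.

```python
MAX_LABEL = 63      # DNS limit on the encoded length of one label
BOM = '\ufeff'      # byte order mark, used here as the "this is utf8" flag


class LabelTooLong(Exception):
    """Hypothetical error for labels over 63 bytes once encoded."""


def encode_label(label, utf8=False):
    """Encode one label; non-ASCII labels get a BOM prefix when utf8 is set."""
    try:
        data = label.encode('ascii')
    except UnicodeEncodeError:
        if not utf8:
            raise
        # The BOM encodes to the bytes EF BB BF, so the first byte of the
        # label is non-ASCII and marks the rest as utf8.
        data = (BOM + label).encode('utf-8')
    # Check the length of the encoded bytes, not the character count:
    # a short unicode label can grow past the limit once encoded.
    if len(data) > MAX_LABEL:
        raise LabelTooLong('label too long')
    return data


if __name__ == '__main__':
    print(encode_label('gmx'))
    print(encode_label('b\u00fccher', utf8=True))
```

Note that a label of 40 accented characters is under 63 characters but over 63 bytes after encoding, which is why the length check has to come last.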
Here is a patch to implement an M$-inspired version of UTF8 DNS, enabled with DNS.UTF8 = True. UTF8 encoded 8-bit labels are flagged with a BOM. I haven't checked whether M$ bothers to flag UTF8 records in their implementation. This patch crosses its fingers and hopes that the unicode .upper() method in python happens to match the case folding that will eventually be standardized. If not, an appropriate function can be substituted.
diff -c -r1.11.2.5 Lib.py
*** DNS/Lib.py 17 Sep 2008 17:35:14 -0000 1.11.2.5
--- DNS/Lib.py 25 Sep 2008 16:40:18 -0000
***************
*** 29,37 ****
--- 29,40 ----
import Class
import Opcode
import Status
+ import DNS
from Base import DNSError
+ UTF8 = False
+
class UnpackError(DNSError): pass
class PackError(DNSError): pass
***************
*** 93,103 ****
# Redundant dots are ignored.
list = []
for label in string.splitfields(name, '.'):
! if label:
! label = label.encode('ascii')
! if len(label) > 63:
! raise PackError, 'label too long'
! list.append(label)
keys = []
for i in range(len(list)):
key = string.upper(string.joinfields(list[i:], '.'))
--- 96,104 ----
# Redundant dots are ignored.
list = []
for label in string.splitfields(name, '.'):
! if not label:
! raise PackError, 'empty label'
! list.append(label)
keys = []
for i in range(len(list)):
key = string.upper(string.joinfields(list[i:], '.'))
***************
*** 115,121 ****
--- 116,130 ----
index = []
for j in range(i):
label = list[j]
+ try:
+ label = label.encode('ascii')
+ except UnicodeEncodeError:
+ if not DNS.UTF8:
+ raise
+ label = (u'\ufeff'+label).encode('utf8')
n = len(label)
+ if n > 63:
+ raise PackError, 'label too long'
if offset + len(buf) < 0x3FFF:
index.append((keys[j], offset + len(buf)))
else:
pydns-2.3.3-3 defaults to IDNA encoding for non-ascii chars. There is an option (selected by setting DNS.LABEL_UTF8 = True) to use what I think is the M$ scheme. Otherwise, DNS.LABEL_ENCODING defaults to 'idna', and can be set to whatever else is desired (although idna and the M$ scheme are the current standard and de facto methods respectively). The r234 tag in CVS has this change.
OT: I am testing enhanced TCP timeout code that would come into play for responses larger than a TCP segment.