#7 Cannot mxlookup unicode strings

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-08-18
Created: 2008-09-24
Creator: Thomas Perl
Private: No

I've also sent this to the mailing list, although here would probably be a better spot:

I'm using formencode, which uses pyDNS to look up mx records to
validate e-mail addresses entered on web forms. pyDNS yields a
traceback when trying to mxlookup a unicode string.

I've also submitted this to formencode, but I'm pretty sure it's a
pyDNS problem. formencode bug:

http://sourceforge.net/tracker/index.php?func=detail&aid=2126902&group_id=91231&atid=596416

Here's how to reproduce:

>>> import DNS.Base
>>> DNS.Base.ParseResolvConf()
>>> from DNS.lazy import mxlookup
>>> mxlookup('gmx.at')
[(10, 'mx0.gmx.de'), (10, 'mx0.gmx.net')]
>>> mxlookup(u'gmx.at')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/var/lib/python-support/python2.4/DNS/lazy.py", line 26, in mxlookup
    a = Base.DnsRequest(name, qtype = 'mx').req().answers
  File "/var/lib/python-support/python2.4/DNS/Base.py", line 191, in req
    m.addQuestion(qname, qtype, Class.IN)
  File "/var/lib/python-support/python2.4/DNS/Lib.py", line 466, in addQuestion
    self.addname(qname)
  File "/var/lib/python-support/python2.4/DNS/Lib.py", line 133, in addname
    self.buf = self.buf + buf
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbb in position 0: ordinal not in range(128)

Discussion

  • Stuart D. Gathman

    The 0xbb byte is not allowed. There are several proposals for internationalization:

    http://www.isoc.org/pubpolpillar/docs/i18n-dns-chronology.pdf

    If you want the Microsoft solution, simply change "ascii" to "utf8". I did not commit that to CVS because Microsoft did not address how the case-insensitive nature of DNS lookups would map to Unicode. The only standard way of encoding non-ascii characters in DNS is currently IDNA. You should apply IDNA before feeding the 7-bit result to pydns.
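
As a rough illustration of applying IDNA at the application layer, the sketch below uses Python's built-in `idna` codec to turn a Unicode name into its 7-bit form. It is written for Python 3 (the session above is Python 2.4), and the helper name `to_idna` is made up for this example; it is not part of pydns.

```python
# Python 3 sketch: convert a Unicode domain name to its 7-bit IDNA form
# before handing it to a resolver that only accepts ASCII names.

def to_idna(name):
    """Return the ASCII (IDNA) form of a domain name as a str."""
    # The built-in 'idna' codec works label by label and leaves
    # labels that are already pure ASCII unchanged.
    return name.encode("idna").decode("ascii")


print(to_idna("gmx.at"))          # already ASCII, so it comes back unchanged
print(to_idna("bücher.example"))  # the non-ASCII label becomes an xn-- label
```

The resulting string contains only ASCII characters, so it can be passed to a lookup function such as `mxlookup` without triggering the encoding error.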

     
  • Stuart D. Gathman

    Should we add a custom exception that attempts to explain the issue?

     
  • Thomas Perl

    Thomas Perl - 2008-09-25

    What about trying to encode the unicode string to ASCII (i.e. no special characters that need IDNA encoding) and using the result if that succeeds? If it fails, throw an exception saying that IDNA encoding is not currently supported (until it is, at which point you simply do the IDNA encoding there).

    The problem is that mxlookup doesn't work even on strings that are unicode, but consist only of ascii characters (like 'gmx.at' in the example given).
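
The suggestion above could look roughly like the following Python 3 sketch. The exception class `UnsupportedNameError` and the function `ascii_name` are hypothetical names chosen for illustration; nothing here is taken from pydns itself.

```python
# Python 3 sketch: accept text that is pure ASCII, and fail with an
# explanatory error for anything that would need IDNA encoding.

class UnsupportedNameError(ValueError):
    """Hypothetical exception for names that are not plain ASCII."""


def ascii_name(name):
    """Return the name as ASCII bytes, or raise UnsupportedNameError."""
    try:
        return name.encode("ascii")
    except UnicodeEncodeError:
        # Replace the bare codec error with a message that says what to do.
        raise UnsupportedNameError(
            "%r contains non-ASCII characters; apply IDNA encoding first"
            % (name,)
        ) from None


print(ascii_name("gmx.at"))  # text made only of ASCII characters is accepted
```

With this approach a name such as `'gmx.at'` held in a unicode string works, while a name with non-ASCII characters produces an error that points at the missing IDNA step.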

     
  • Stuart D. Gathman

    The example given works in CVS:
    >>> import DNS.Base
    >>> DNS.Base.ParseResolvConf()
    >>> from DNS.lazy import mxlookup
    >>> mxlookup('gmx.at')
    [(10, 'mx0.gmx.de'), (10, 'mx0.gmx.net')]
    >>> mxlookup(u'gmx.at')
    [(10, 'mx0.gmx.de'), (10, 'mx0.gmx.net')]

    The diff from release 2.3.3 is:
    *** DNS/Lib.py 22 May 2007 20:27:40 -0000 1.11.2.3
    --- DNS/Lib.py 17 Sep 2008 17:35:14 -0000 1.11.2.5
    ***************
    *** 94,99 ****
    --- 94,100 ----
              list = []
              for label in string.splitfields(name, '.'):
                  if label:
    +                 label = label.encode('ascii')
                      if len(label) > 63:
                          raise PackError, 'label too long'
                      list.append(label)

    To flesh out the M$ solution, I would delay encoding the labels until after case folding (and hope unicode case folding is good enough), and then check for long labels *after* encoding to utf8.

    The IDNA solution is *not* transparent (an IDNA encoded label is also a perfectly legal ascii label by design). Therefore, it should not be implemented in pydns. It is appropriate only at the application layer. The M$ proposal is reasonable and I would support it (unless they've patented it), provided the details of case folding are worked out.
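
To make the ordering concrete, here is a small Python 3 sketch of "fold case while the name is still text, then encode to UTF-8, then check the length". The function name `fold_and_encode` and the constant are invented for this example, and `str.upper()` is only a stand-in for whatever case folding ends up being specified.

```python
# Python 3 sketch: case folding happens on text, the 63-byte limit is
# checked on the encoded bytes.

MAX_LABEL_BYTES = 63  # DNS limit for a single label


def fold_and_encode(name):
    """Return (comparison_key, wire_labels) for a dotted name."""
    labels = [label for label in name.split(".") if label]
    # Fold case before encoding; upper() is an assumption, not a standard.
    key = ".".join(labels).upper()
    wire = []
    for label in labels:
        raw = label.encode("utf8")
        # Check the length after encoding: one character may be several bytes.
        if len(raw) > MAX_LABEL_BYTES:
            raise ValueError("label too long: %d bytes" % len(raw))
        wire.append(raw)
    return key, wire


key, wire = fold_and_encode("Bücher.Example")
```

A label of 40 `ü` characters is 40 characters long but 80 bytes in UTF-8, which is why the check belongs after encoding. Case folding is also less simple than it looks: in Python 3, `"straße".upper()` gives `"STRASSE"`, so the folded name can differ in length from the original.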

     
  • Stuart D. Gathman

    My two cents on 8-bit DNS is that if the first byte of the label is non-ascii, it should be a type code to select the encoding for the remainder of the label. That provides an escape hatch. Unicode already provides a BOM, U+FEFF, which should be used to mark utf8 encoded labels.

     
  • Stuart D. Gathman

    A patch to implement a M$ inspired version of UTF8 DNS enabled with DNS.UTF8 = True. UTF8 encoded 8-bit labels are flagged with BOM. I haven't seen whether M$ bothers to flag UTF8 records in their implementation. This patch crosses its digits and hopes that the unicode .upper() method in python happens to match the case folding that will eventually be standardized. If not, an appropriate function can be substituted.

    diff -c -r1.11.2.5 Lib.py
    *** DNS/Lib.py 17 Sep 2008 17:35:14 -0000 1.11.2.5
    --- DNS/Lib.py 25 Sep 2008 16:40:18 -0000
    ***************
    *** 29,37 ****
    --- 29,40 ----
      import Class
      import Opcode
      import Status
    + import DNS

      from Base import DNSError

    + UTF8 = False
    +
      class UnpackError(DNSError): pass
      class PackError(DNSError): pass

    ***************
    *** 93,103 ****
              # Redundant dots are ignored.
              list = []
              for label in string.splitfields(name, '.'):
    !             if label:
    !                 label = label.encode('ascii')
    !                 if len(label) > 63:
    !                     raise PackError, 'label too long'
    !                 list.append(label)
              keys = []
              for i in range(len(list)):
                  key = string.upper(string.joinfields(list[i:], '.'))
    --- 96,104 ----
              # Redundant dots are ignored.
              list = []
              for label in string.splitfields(name, '.'):
    !             if not label:
    !                 raise PackError, 'empty label'
    !             list.append(label)
              keys = []
              for i in range(len(list)):
                  key = string.upper(string.joinfields(list[i:], '.'))
    ***************
    *** 115,121 ****
    --- 116,130 ----
              index = []
              for j in range(i):
                  label = list[j]
    +             try:
    +                 label = label.encode('ascii')
    +             except UnicodeEncodeError:
    +                 if not DNS.UTF8:
    +                     raise
    +                 label = ('\ufeff'+label).encode('utf8')
                  n = len(label)
    +             if n > 63:
    +                 raise PackError, 'label too long'
                  if offset + len(buf) < 0x3FFF:
                      index.append((keys[j], offset + len(buf)))
                  else:
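
The per-label logic of the patch can be tried in isolation with the Python 3 sketch below. The function `encode_label` and its `allow_utf8` parameter are hypothetical stand-ins for the patch's `DNS.UTF8` switch, and a plain `ValueError` replaces `PackError` so the example is self-contained.

```python
# Python 3 sketch of the patch's label handling: ASCII first, then
# BOM-flagged UTF-8 if enabled, with the length check on the final bytes.

BOM = "\ufeff"  # U+FEFF, encodes to the three bytes EF BB BF in UTF-8


def encode_label(label, allow_utf8=False):
    """Encode one DNS label, flagging UTF-8 labels with a leading BOM."""
    try:
        raw = label.encode("ascii")
    except UnicodeEncodeError:
        if not allow_utf8:
            raise  # same behaviour as the patch when the switch is off
        raw = (BOM + label).encode("utf8")
    if len(raw) > 63:
        raise ValueError("label too long")
    return raw


print(encode_label("gmx"))
print(encode_label("bücher", allow_utf8=True))
```

Two details are worth noting. The BOM costs three of the 63 bytes available to a label. And in Python 2 source, the BOM literal would have to be a unicode literal (`u'\ufeff'`) for the `\u` escape to be interpreted; in a plain byte string literal the escape is kept as six literal characters.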

     
  • Stuart D. Gathman

    pydns-2.3.3-3 defaults to IDNA encoding for non-ascii chars. There is an option (selected by setting DNS.LABEL_UTF8 = True) to use what I think is the M$ scheme. Otherwise, DNS.LABEL_ENCODING defaults to 'idna', and can be set to whatever else is desired (although idna and the M$ scheme are the current standard and de facto methods respectively). The r234 tag in CVS has this change.

    OT: I am testing enhanced TCP timeout code that would come into play for responses larger than a TCP segment.

     
