tv_grab_br_uol HTML Parser Error

Help
2005-04-30
2013-04-17
  • Marcos Lenharo
    2005-04-30

    Hi All,

    I'm trying to use tv_grab_br_uol, but I get the following error (maybe the structure of the URL has changed?):
    ------------------------------------------------------------------------------------------------------------
    Entre a localização do arquivo de configuração (padrão='/home/lenharo/.xmltv/tv_grab_br_uol.conf'):
    INFO: reloading channels from source...
    ERRO: Recebi exceção: <HTMLParser.HTMLParseError instance at 0xb78aeeec>
    Traceback (most recent call last):
      File "/usr/lib/python2.3/site-packages/tvgrab/urlutils.py", line 268, in get_htmlstructure
        parser.close()
      File "/usr/lib/python2.3/HTMLParser.py", line 112, in close
        self.goahead(1)
      File "/usr/lib/python2.3/HTMLParser.py", line 164, in goahead
        self.error("EOF in middle of construct")
      File "/usr/lib/python2.3/HTMLParser.py", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParseError: EOF in middle of construct, at line 1, column 35722
    ERRO: Recebi exceção: <HTMLParser.HTMLParseError instance at 0xb78d578c>
    Traceback (most recent call last):
      File "/usr/lib/python2.3/site-packages/tvgrab/urlutils.py", line 268, in get_htmlstructure
        parser.close()
      File "/usr/lib/python2.3/HTMLParser.py", line 112, in close
        self.goahead(1)
      File "/usr/lib/python2.3/HTMLParser.py", line 164, in goahead
        self.error("EOF in middle of construct")
      File "/usr/lib/python2.3/HTMLParser.py", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParseError: EOF in middle of construct, at line 1, column 35722
    ERRO: Recebi exceção: <HTMLParser.HTMLParseError instance at 0xb78e2dcc>
    Traceback (most recent call last):
      File "/usr/lib/python2.3/site-packages/tvgrab/urlutils.py", line 268, in get_htmlstructure
        parser.close()
      File "/usr/lib/python2.3/HTMLParser.py", line 112, in close
        self.goahead(1)
      File "/usr/lib/python2.3/HTMLParser.py", line 164, in goahead
        self.error("EOF in middle of construct")
      File "/usr/lib/python2.3/HTMLParser.py", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParseError: EOF in middle of construct, at line 1, column 35722
    ERRO: Erro de análise: http://tudonoar.uol.com.br/tudonoar/gradeProgramacao.aspx não pode ser analisado corretamente.
    Traceback (most recent call last):
      File "/usr/lib/python2.3/site-packages/tvgrab/grab.py", line 344, in get_htmlstruct
        keep_empty_tags = self.keep_empty_tags )
      File "/usr/lib/python2.3/site-packages/tvgrab/urlutils.py", line 301, in get_urlparsed
        raise pe
    ParseError: <unprintable instance object>
    ERRO: Abortando...
    Traceback (most recent call last):
      File "/usr/lib/python2.3/site-packages/tvgrab/grab.py", line 344, in get_htmlstruct
        keep_empty_tags = self.keep_empty_tags )
      File "/usr/lib/python2.3/site-packages/tvgrab/urlutils.py", line 301, in get_urlparsed
        raise pe
    ParseError: <unprintable instance object>
    Traceback (most recent call last):
      File "./tv_grab_br_uol", line 642, in ?
        Grabber( sys.argv )
      File "/usr/lib/python2.3/site-packages/tvgrab/grab.py", line 437, in __init__
        self.read_conf()
      File "/usr/lib/python2.3/site-packages/tvgrab/grab.py", line 472, in read_conf
        self.config()
      File "/usr/lib/python2.3/site-packages/tvgrab/grab.py", line 544, in config
        channels = self.get_conf_channels()
      File "/usr/lib/python2.3/site-packages/tvgrab/grab.py", line 677, in get_conf_channels
        self._reload_channels()
      File "/usr/lib/python2.3/site-packages/tvgrab/grab.py", line 708, in _reload_channels
        map( assemble_channel, self.get_channels() ) )
      File "./tv_grab_br_uol", line 294, in get_channels
        url          = self.get_url_channels( date )
      File "./tv_grab_br_uol", line 250, in get_url_channels
        html_struct  = self.get_htmlstruct( self.url[ "channels" ].url )
      File "/usr/lib/python2.3/site-packages/tvgrab/grab.py", line 349, in get_htmlstruct
        self.dump_file( e.contents )
      File "/usr/lib/python2.3/site-packages/tvgrab/grab.py", line 377, in dump_file
        outfile.write( contents )
    UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 3225: ordinal not in range(128)
    ------------------------------------------------------------------------------------------------------------

    Could someone help me?

    best regards
    Marcos Lenharo

     
    • Marcos Lenharo
      2005-05-01

      I found it!

      Some program descriptions are inside a <td title="something"> tag, but sometimes the generated HTML has nested quotes, as in <td title="something "special" ">, and that line makes HTMLParser crash.
      The solution was to replace every <td title="..."> start tag with a plain <td>.
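
The substitution can be tried on its own before touching the grabber. This is only a minimal sketch: the pattern is the one used in the patch, but the helper name strip_td_title and the sample row are made up for illustration, and it is written for a current Python rather than the 2.3 interpreter in the traceback.

```python
import re

# Same pattern as in the patch: match a <td title=...> start tag up to the
# first ">" so that the whole tag, broken quotes included, can be replaced.
TD_TITLE = re.compile(r"<td title=[^>]*>")


def strip_td_title(html):
    """Replace each <td title=...> start tag with a bare <td> (hypothetical helper)."""
    return TD_TITLE.sub("<td>", html)


# Hypothetical sample row with nested quotes inside the title attribute.
sample = '<tr><td title="something "special" ">Program</td></tr>'
cleaned = strip_td_title(sample)
print(cleaned)
```

Because the pattern stops at the first ">", a title that itself contains ">" would only be partly removed; the pages seen so far did not seem to need more than this.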

      I also changed the character encoding to "ISO-8859-1" so that words like "Título" are decoded properly.
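
To see why the charset matters, here is a small sketch in current Python. The sample bytes are an assumption (the word "Título" as it would be sent in ISO-8859-1), not something captured from the site. Decoding them as UTF-8 gives the U+FFFD replacement character, which is the same character the ascii codec complains about at the end of the traceback.

```python
# Hypothetical sample: "Título" encoded as ISO-8859-1 (0xED is "í").
raw = b"T\xedtulo"

# Decoding with the right charset recovers the accented letter.
as_latin1 = raw.decode("iso-8859-1")

# Decoding as UTF-8 cannot interpret the lone 0xED byte; with
# errors="replace" it becomes U+FFFD instead of raising an exception.
as_utf8 = raw.decode("utf-8", errors="replace")

# ascii() keeps the output printable on any terminal encoding.
print(ascii(as_latin1))
print(ascii(as_utf8))

# Writing the damaged text through an ASCII codec fails, as in the traceback.
try:
    as_utf8.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```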

      The following patch does the trick:

      cat tv_grab_br_uol_charset_tdtitle.patch
      --- /usr/bin/tv_grab_br_uol     2005-04-28 21:27:01.000000000 -0300
      +++ tv_grab_br_uol      2005-05-01 15:27:00.000000000 -0300
      @@ -134,7 +134,7 @@

           __conf_version__     = '1'

      -    page_charset = "utf-8"
      +    page_charset = "ISO-8859-1"

           # URL to provider
           _base_url = "http://tudonoar.uol.com.br/tudonoar/"
      @@ -216,6 +216,7 @@
               ( re.compile( "<table width='620' border='1'></tr>" ), "<table width='620' border='1'>" ),
               ( re.compile( "<table border='1'></tr>" ), "<table border='1'>" ),
               ( re.compile( "</a> </td>" ), "</td>" ),
      +        ( re.compile( "<td title=[^>]*>"),"<td>" ),
               ]

       
    • gsanse
      2005-09-21

      I am having the same problem here, but I don't know how to use the patch you posted. Could you please explain it in more detail, so that a newbie like me, or others who might be having the same problem, can use it? Thank you very much.

      gsanse

       
      • Marcos Lenharo
        2005-09-22

        Hi,

        To patch your code, do the following:
        - Change to the directory where the tv_grab_br_uol program is installed:
            cd /path/to/program
        - Apply the patch using the patch program:
            patch < /path/to/patch/tv_grab_br_uol.regexp.patch

        Before doing that, save the following code in a file named tv_grab_br_uol.regexp.patch:

        diff -u orig/tv_grab_br_uol new/tv_grab_br_uol
        --- orig/tv_grab_br_uol 2004-11-22 22:28:07.000000000 -0200
        +++ new/tv_grab_br_uol  2005-09-22 00:18:58.000000000 -0300
        @@ -33,6 +33,9 @@
        #

        import sys
        +reload(sys)
        +sys.setdefaultencoding('iso-8859-1')
        +del sys.setdefaultencoding
        import string
        import re
        import copy
        @@ -49,7 +52,6 @@
        from tvgrab.output import red, green, blue, turquoise, yellow, purple, bold

        -
        class Grabber ( Grab_C ):
             """TV Grab Brazil (source - http://tudonoar.uol.com.br/)

        @@ -134,7 +136,7 @@

             __conf_version__     = '1'

        -    page_charset = "utf-8"
        +    page_charset = "ISO-8859-1"

             # URL to provider
             _base_url = "http://tudonoar.uol.com.br/tudonoar/"
        @@ -188,6 +190,8 @@
                 re.compile( "<!-- *[^->]* *-->", re.I ),
                 # Start Tags (with args):
                 re.compile( "<(b|br|tbody|font|img) +[^>]*>", re.I ),
        +        # Start Tags (with args):
        +        re.compile( "DEd.write\([^)]*\)" ),
                 # Start Tags (without args):
                 re.compile( "<(b|br|tbody|font|img)>", re.I ),
                 # End Tags:
        @@ -216,6 +220,7 @@
                 ( re.compile( "<table width='620' border='1'></tr>" ), "<table width='620' border='1'>" ),
                 ( re.compile( "<table border='1'></tr>" ), "<table border='1'>" ),
                 ( re.compile( "</a> </td>" ), "</td>" ),
        +        ( re.compile( "<td title=[^>]*>"),"<td>" ),
                 ]

        I think that's all.

        Let me know if you have further questions.

        Cheers,
        Marcos Lenharo

         
    • gsanse
      2005-09-26

      Lenharo,

      Thank you very much for your answer. I followed your instructions, but I just could not make the patch work. I am not sure what I am doing wrong, but when I try to patch, I either get an error message back or the terminal just freezes and I have to stop the patch utility. In the end I edited the file manually with the changes listed in the patch, and it worked perfectly.

      For the newbies like me out there who are not familiar with patch files: lines that start with + are lines that need to be added, and lines that start with - are lines that must be deleted from the file. For example, the line

      +reload(sys)

      means that reload(sys) must be added to the file after the line

      import sys
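
For anyone who prefers to see that rule as code, here is a rough sketch in current Python. The function name split_hunk is made up, and it only handles the body of one hunk (the lines after an @@ header), not the ---/+++ file headers. It assumes unchanged context lines start with a space, as they do in a real patch file.

```python
def split_hunk(hunk_lines):
    """Return (old_lines, new_lines) for the body of one unified-diff hunk.

    Hypothetical helper for illustration; the real patch program also
    checks line numbers and surrounding context.
    """
    old_lines, new_lines = [], []
    for line in hunk_lines:
        marker, text = line[:1], line[1:]
        if marker == "-":
            # Present only in the original file: must be deleted.
            old_lines.append(text)
        elif marker == "+":
            # Present only in the patched file: must be added.
            new_lines.append(text)
        else:
            # Context line: unchanged, appears in both versions.
            old_lines.append(text)
            new_lines.append(text)
    return old_lines, new_lines


# The first hunk of the patch, with the leading space on context lines.
hunk = [
    " import sys",
    "+reload(sys)",
    "+sys.setdefaultencoding('iso-8859-1')",
    "+del sys.setdefaultencoding",
    " import string",
]
old, new = split_hunk(hunk)
print("\n".join(new))
```

The "new" list is what the file should contain at that spot after editing by hand.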

      I am trying to use the script with mythtv. I created a symbolic link from one of the standard mythtv grabbers (like tv_grab_au) to the tv_grab_br_uol script and it seems to be working.

      Special thanks to Chris Ottrey and Fabio Yamamoto for the help provided.

      Regards,

      Glaucio