#36 tv_grab_br_uol return with errors

Milestone: operation

Status: open

Owner: Gustavo Sverzut Barbieri

Labels: br_uol (3)

Priority: 5

Updated: 2006-12-12

Created: 2006-12-12

Creator: Anonymous

Private: No

This always happens when I try to run this grabber:

.......................................................
Traceback (most recent call last):
File "/usr/lib/python2.4/site-packages/tvgrab/urlutils.py", line 266, in get_htmlstructure
parser.feed( contents )
File "/usr/lib/python2.4/site-packages/tvgrab/customizedparser.py", line 243, in feed
raise pe
ParseError: <unprintable instance object>
ERRO: Recebi exceção: <HTMLParser.HTMLParseError instance at 0xb794488c>
Traceback (most recent call last):
File "/usr/lib/python2.4/site-packages/tvgrab/urlutils.py", line 266, in get_htmlstructure
parser.feed( contents )
File "/usr/lib/python2.4/site-packages/tvgrab/customizedparser.py", line 243, in feed
raise pe
ParseError: <unprintable instance object>
ERRO: Erro de análise: http://tudonoar.uol.com.br/tudonoar/gradeProgramacao.aspx não pode ser analisado corretamente.
Traceback (most recent call last):
File "/usr/lib/python2.4/site-packages/tvgrab/grab.py", line 344, in get_htmlstruct
keep_empty_tags = self.keep_empty_tags )
File "/usr/lib/python2.4/site-packages/tvgrab/urlutils.py", line 301, in get_urlparsed
raise pe
ParseError: <unprintable instance object>
ERRO: Abortando...
Traceback (most recent call last):
File "/usr/lib/python2.4/site-packages/tvgrab/grab.py", line 344, in get_htmlstruct
keep_empty_tags = self.keep_empty_tags )
File "/usr/lib/python2.4/site-packages/tvgrab/urlutils.py", line 301, in get_urlparsed
raise pe
ParseError: <unprintable instance object>
Traceback (most recent call last):
File "/usr/bin/tv_grab_br_uol", line 642, in ?
Grabber( sys.argv )
File "/usr/lib/python2.4/site-packages/tvgrab/grab.py", line 437, in __init__
self.read_conf()
File "/usr/lib/python2.4/site-packages/tvgrab/grab.py", line 472, in read_conf
self.config()
File "/usr/lib/python2.4/site-packages/tvgrab/grab.py", line 544, in config
channels = self.get_conf_channels()
File "/usr/lib/python2.4/site-packages/tvgrab/grab.py", line 677, in get_conf_channels
self._reload_channels()
File "/usr/lib/python2.4/site-packages/tvgrab/grab.py", line 708, in _reload_channels
map( assemble_channel, self.get_channels() ) )
File "/usr/bin/tv_grab_br_uol", line 294, in get_channels
url = self.get_url_channels( date )
File "/usr/bin/tv_grab_br_uol", line 250, in get_url_channels
html_struct = self.get_htmlstruct( self.url[ "channels" ].url )
File "/usr/lib/python2.4/site-packages/tvgrab/grab.py", line 349, in get_htmlstruct
self.dump_file( e.contents )
File "/usr/lib/python2.4/site-packages/tvgrab/grab.py", line 375, in dump_file
outfile = open( outfile, "w" )
IOError: [Errno 13] Permissão negada: '/usr/bin/tv_grab_br_uol-FJKzW2.html'
-------------------------------------------------------
pytvgrab-lib version: 0.5.1
tv_grab_br_uol version: 0.6.0
Python : 2.4.4
Linux : Gentoo 2006.1

Discussion

Gustavo Sverzut Barbieri - 2006-12-12

Logged In: YES
user_id=511989
Originator: NO

This grabber is not being updated anymore.

I have no plans to support it atm, but if you really are interested I can help you with it.

Nobody/Anonymous - 2006-12-12

Logged In: NO

I'm a programmer, but I don't know anything about Python or the functions in the grabber.
Is there any documentation about this? I really want to make progress on the grabber, because Brazilians
have no support for this. The only one I found is for Windows :( Can you recommend any other grabber
that works?

Thanks

Gustavo Sverzut Barbieri - 2006-12-12

Logged In: YES
user_id=511989
Originator: NO

I know we Brazilians don't have many grabbers... That's why I started this project together with Chris. I just stopped maintaining it because UOL's HTML is horrible and I don't use xmltv anymore.

I can introduce you to Python and pytvgrab basics if you want. You don't need to change much, just update the cleanup regexps and understand the basic HTML structure.

Nobody/Anonymous - 2006-12-13

Logged In: NO

I know the basics of HTML. My idea is to learn more about the functions of the grabber, and then change the site from UOL to www.tvmagazine.com.br.
Can I find some info in pytvgrab-lib?

Thanks

Gustavo Sverzut Barbieri - 2006-12-13

Logged In: YES
user_id=511989
Originator: NO

Ok, all you _MUST_ know is HTML basics and regular expressions.

pytvgrab-lib basically provides two kinds of basic grabbers you may use:
- regular expression based (re2)
- xml tree based (customized parser)

The first approach (re2) is used by Chris and his grabbers (Australian and some others), while the second is used by me and my Brazilian grabber.

Chris has developed re2, a layer atop Python's "re" (regular expression) module that supports multiple and hierarchical matching. Basically you write one big regular expression that has named subexpressions. These names then go into a hierarchy of hash tables (dicts, or associative arrays) holding lists of the results that match.
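To give a rough idea of what hierarchical matching means, here is a minimal sketch using only the standard "re" module. It is not re2's API (re2 does this with a single big expression); the markup in SAMPLE, the two patterns and the function name match_hierarchy are all made up for illustration, and it is written for current Python 3 rather than the Python 2.4 from the report.

```python
import re

# Hypothetical listing markup, invented for this sketch.
SAMPLE = (
    "<channel name='Globo'><prog time='20:00'>News</prog>"
    "<prog time='21:00'>Movie</prog></channel>"
    "<channel name='SBT'><prog time='20:30'>Show</prog></channel>"
)

# Outer pattern: one match per channel block, with named groups.
CHANNEL_RE = re.compile(
    r"<channel name='(?P<name>[^']*)'>(?P<body>.*?)</channel>", re.S
)
# Inner pattern: applied to the body captured by the outer pattern.
PROG_RE = re.compile(r"<prog time='(?P<time>[^']*)'>(?P<title>[^<]*)</prog>")


def match_hierarchy(text):
    """Return a list of dicts, each holding a list of programme dicts."""
    channels = []
    for ch in CHANNEL_RE.finditer(text):
        progs = [m.groupdict() for m in PROG_RE.finditer(ch.group("body"))]
        channels.append({"name": ch.group("name"), "programmes": progs})
    return channels
```

Calling match_hierarchy(SAMPLE) gives one dict per channel, each with its own list of time/title dicts, which is the "hierarchy of hash tables with lists of results" idea in miniature.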

My approach is a bit more conventional, but requires a bit more code. You clean your HTML with plain regular expressions (I call them re_clean) to make it less cumbersome: removing CSS and JavaScript, fixing tags that are known to be broken, and so on. Then you feed this buffer to CustomizedParser, which knows which tags you're interested in and removes the useless ones. This way you end up with a simple tree that you can access without trouble.

With my approach, if you have something like <a><c>some</c> thing</a> and are interested only in the <a> tag, then you will get this tree as the result: <a>some thing</a>. Usually you will just keep <a>, <tr>, <td> and <table>.
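The clean-then-filter idea can be sketched with the standard library alone. This is not CustomizedParser itself: the names RE_CLEAN, clean_html, TagFilter and simplify are assumptions for this example, it is written for current Python 3, and it returns a simplified string instead of building a tree. Attributes are dropped to keep it short.

```python
import re
from html.parser import HTMLParser

# Cleanup patterns in the spirit of "re_clean": drop scripts and styles.
RE_CLEAN = [
    re.compile(r"<script\b.*?</script>", re.S | re.I),
    re.compile(r"<style\b.*?</style>", re.S | re.I),
]


def clean_html(text):
    """Apply every cleanup pattern before parsing."""
    for pattern in RE_CLEAN:
        text = pattern.sub("", text)
    return text


class TagFilter(HTMLParser):
    """Re-emit only the tags listed in `keep`; text is always kept."""

    def __init__(self, keep):
        super().__init__()
        self.keep = set(keep)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)


def simplify(text, keep):
    """Clean the HTML, then strip every tag that is not in `keep`."""
    parser = TagFilter(keep)
    parser.feed(clean_html(text))
    parser.close()
    return "".join(parser.out)
```

With this sketch, simplify("<a><c>some</c> thing</a>", ["a"]) returns "<a>some thing</a>", matching the example above.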

Also, CustomizedParser has some debug utilities: you can dump the resulting HTML and open it in a browser to understand the structure that builds the page.

I don't know tvmagazine; I need to find some spare time to dig into its HTML.
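As a side note on dumping: the last traceback in the report ends with dump_file failing to write next to the script in /usr/bin. A minimal sketch of dumping to the system temp directory instead could look like this. The function name dump_html and the prefix are assumptions, not part of pytvgrab-lib, and the code targets current Python 3.

```python
import os
import tempfile


def dump_html(contents, prefix="tv_grab-"):
    """Write `contents` to a new file in the temp directory; return its path."""
    # mkstemp creates the file in a location the current user can write to.
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=".html")
    with os.fdopen(fd, "w", encoding="utf-8") as outfile:
        outfile.write(contents)
    return path
```

The returned path can then be opened by hand in a browser, or passed to webbrowser.open() from the standard library.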
