From: Ed A. <ep...@us...> - 2002-01-31 16:08:04
Update of /cvsroot/xmltv/xmltv
In directory usw-pr-cvs1:/tmp/cvs-serv23236

Added Files:
	tv_extractinfo_en
Log Message:
Added tv_extractinfo_en, which reads English-language programme
descriptions and attempts to sniff out information which could better
be stored in machine-readable form.

This is mostly code which used to live in the old
scrapped_getlistings_uk_ananova in the attic/ directory, I've just
ported it to the new data structures and tidied it up.

This sort of regular expression matching works well on the long
detailed descriptions Ananova provides. It's not so good on the North
American listings because they have shorter descriptions. But it did
manage to extract the names of quiz show hosts.

--- NEW FILE: tv_extractinfo_en ---
#!/usr/bin/perl -w
#
# tv_extractinfo_en
#
# Look at programme descriptions and other text, and extract
# information from the textual descriptions into subelements of
# <programme>. This tv_extractinfo handles English-language
# descriptions.
#
# It also attempts to split multipart programmes into their
# constituents, by looking for a description that seems to contain
# lots of times and titles. But this depends on the description
# following the particular style used by Ananova. If I find more
# examples of listings with multipart programmes it can be extended.
#
# -- Ed Avis, ep...@do..., 2002-01-31
# $Id: tv_extractinfo_en,v 1.1 2002/01/31 15:39:31 epaepa Exp $
#

[...1409 lines suppressed...]
    }
}

# More debugging aids.
sub cst( $ ) {
    my $p = shift;
    croak "prog $p->{title}->[0]->[0] has bogus stop time"
      if exists $p->{stop} and $p->{stop} eq 'boogus FIXME XXX';
}

sub no_shared_scalars( $ ) {
    my %seen;
    foreach my $h (@{$_[0]}) {
        foreach my $k (keys %$h) {
            my $ref = \ ($h->{$k});
            my $addr = "$ref";
            $seen{$addr}++ && die "scalar $addr seen twice";
        }
    }
}