Thread: [phpodpworld-users] Suggestion on tools/extract.pl
From: Howard L. <hl...@gm...> - 2009-01-18 13:16:25

#!/usr/bin/perl
#
# This file is part of phpODPWorld and released under GNU GPL.
#
# $Id: extract.pl,v 1.3 2006/03/24 21:49:59 hansfn Exp $
#

use strict;

if ($#ARGV < 1){
    print "Usage: extract.pl rdffile category...\n";
    print "(The RDF file should already be uncompressed\n";
    print "and the category name must NOT end with a slash.)\n";
    exit;
}

my $rdffile = $ARGV[0];
if (! -e $rdffile ) {
    die("RDF file ($rdffile) doesn't exist\n");
}

# Determine type of RDF file (based on filename) - "structure" or "content"
my $type;
if ($rdffile =~ /structure/i) {
    $type = "structure";
} elsif ($rdffile =~ /content/i) {
    $type = "content";
} else {
    $type = "unknown";
}

# Creating an array for the categories and sort it
my @array = ();
my $category;
my $ptr = 1;
while ( $ARGV[$ptr] ) {
    $category = "Top/$ARGV[$ptr]";
    push (@array, $category);
    $ptr++;
}
@array = sort { $a cmp $b }(@array);

# The main parsing starts here
my $line;
my $tmpcat;
my $cat = shift(@array);
my $match = 0;
my $key_cat;
open(RDFFILE, $rdffile) or die "Can't open RDF file ($rdffile) for reading: $!";
while(<RDFFILE>) {
    $line = $_;
    if ($line =~ m/<Topic r:id="/) {
        $tmpcat = $line;
        chomp $tmpcat;
        $tmpcat =~ s/<Topic r:id="(.*)">/$1/;
        #print "$tmpcat, $cat\n";
        while (substr($tmpcat, 0, (length($cat)-1)) gt $cat) {
            #print "Switching ($tmpcat) ($cat)\n";
            $match = 0;
            $cat = shift(@array);
        }
        if ($tmpcat =~ m/$cat/) {
            if ($match == 0) {
                print "Parsing category: $cat\n";
                $match = 1;
                print OUTFILE '</RDF>';
                close(OUTFILE);
                # Removing slashes from category since it will be used in the filename
                $key_cat = $cat;
                $key_cat =~ s#Top/##;
                $key_cat =~ s#/#_#g;
                my $outfile = "$key_cat-$type.rdf.u8";
                open(OUTFILE, ">$outfile") or die "Can't open extracted RDF file ($outfile) for writing: $!";
                print OUTFILE '<?xml version="1.0" encoding="UTF-8" ?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf">
';
            }
            $match = 1;
        }
    }
    if ($match == 1) {
        # Write line to file
        print OUTFILE $line;
    } else {
        # Skip this line
    }
}
close(RDFFILE);
print OUTFILE '</RDF>';
close(OUTFILE);
exit;

From: Hans F. N. <Han...@hi...> - 2009-04-03 23:17:10

* Howard Lee <hl...@gm...> [2009-01-18]:
> Dear all,
>
> I find it quite time consuming when extracting multiple categories from
> tools/extract.pl, because the full RDF files will need to be parsed from the
> beginning.
>
> I have modified the extract.pl so that it can handle multiple categories
> from the same command line. It also does not depend on DMOZ-ParseRDF-0.14
> now. The script has been attached, and hope somebody may find it useful.

I did reply immediately to Howard that I found this script very
interesting. However, I didn't find time to test it before now ...
Unfortunately, it doesn't work as intended:

  ./extract.pl structure.rdf.u8 World

produces a file World-structure.rdf.u8 that contains more
categories outside World than inside:

  # grep 'Topic r:id' World-structure.rdf.u8 | grep -v 'r:id="Top/World' | wc -l
  405753
  # grep 'Topic r:id' World-structure.rdf.u8 | grep 'r:id="Top/World' | wc -l
  229470

This also causes it to run slower than the current solution. I don't
have time to debug the script, so unless Howard produces a bug-fixed
version nothing will change. (On my old computer extracting World takes
one minute and 15 seconds - more than quick enough for me.)

Regards,
Hans - who is working on a new release.

PS! Please add "use warnings;" to the script ;-)

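
For anyone who wants to repeat the check above without grep, here is a minimal Perl sketch of the same count. `count_topics` is a hypothetical helper, not part of phpODPWorld. Unlike the grep pattern, it treats only the category itself and its descendants as "inside", so a sibling whose name merely starts with the same letters would count as "outside".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of phpODPWorld): counts the
# <Topic r:id="..."> lines that fall inside and outside a category.
sub count_topics {
    my ($fh, $category) = @_;
    my $prefix = "Top/$category";
    my ($inside, $outside) = (0, 0);
    while (my $line = <$fh>) {
        next unless $line =~ /<Topic r:id="([^"]*)"/;
        my $id = $1;
        # Inside: the category itself or any of its descendants
        if ($id eq $prefix or index($id, "$prefix/") == 0) {
            $inside++;
        } else {
            $outside++;
        }
    }
    return ($inside, $outside);
}

# Usage: perl count_topics.pl World-structure.rdf.u8 World
if (@ARGV == 2) {
    my ($file, $category) = @ARGV;
    open(my $fh, '<', $file) or die "Can't open $file for reading: $!";
    my ($inside, $outside) = count_topics($fh, $category);
    close($fh);
    print "inside: $inside, outside: $outside\n";
}
```

A correctly extracted file should report zero topics outside the requested category.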
From: Howard L. <hl...@gm...> - 2010-06-13 11:08:26

#!/usr/bin/perl
#
# This file is part of phpODPWorld and released under GNU GPL.
#
# $Id: extract.pl,v 1.3 2006/03/24 21:49:59 hansfn Exp $
#

use strict;
use warnings;

if ($#ARGV < 1) {
    print "Usage: extract.pl rdffile category [...]\n\n";
    print "rdffile Specifies the RDF file for parsing\n";
    print " (can be text format or compressed by gzip)\n";
    print "category Specifies the category to extract\n";
    print " (separate multiple categories with space)\n\n";
    print "e.g. extract.pl structure.rdf.u8.gz World/Norsk Regional/Europe/Norway\n";
    exit;
}

my $rdffile = $ARGV[0];
if (! -e $rdffile ) {
    die("RDF file ($rdffile) doesn't exist\n");
}

# Determine type of RDF file (based on filename) - "structure" or "content"
my $type;
if ($rdffile =~ /structure/i) {
    $type = "structure";
} elsif ($rdffile =~ /content/i) {
    $type = "content";
} else {
    $type = "unknown";
}

# Creating an array for the categories
my $i = 0;
my @categories = ();
while ( $ARGV[$i+1] ) {
    push (@categories, "$ARGV[$i+1]");
    $i++;
}
my $catsize = $i;
my @catfh = ();
my @catmatch = ();

# Open files for each category
for ($i = 0; $i < $catsize; $i++) {
    $catmatch[$i] = 0;
    my $safecategory = $categories[$i];
    $safecategory =~ s#/#_#g;
    my $outfile = "$safecategory-$type.rdf.u8";
    open ($catfh[$i], ">$outfile") or die "Can't open RDF file ($outfile) for writing: $!";
    print {$catfh[$i]} '<?xml version="1.0" encoding="UTF-8" ?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf">
';
}

# The main parsing starts here
my $line;
my $catread;
$rdffile =~ s/(.*\.gz)$/gzip -dc $1|/;
open(RDFFILE, $rdffile) or die "Can't open RDF file ($rdffile) for reading: $!";
while(<RDFFILE>) {
    $line = $_;
    # Check for Topic lines and compare
    if ($line =~ m/<Topic r:id="/) {
        $catread = $line;
        $catread =~ s/^\s+<Topic r:id="(.*)">/$1/;
        # print "Current category: $catread\n";
        for ($i = 0; $i < $catsize; $i++) {
            $catmatch[$i] = ($catread =~ m#^Top/$categories[$i]#) ? 1 : 0;
        }
    }
    # Write line to file if matched
    for ($i = 0; $i < $catsize; $i++) {
        print {$catfh[$i]} $line if ($catmatch[$i] == 1);
    }
}
close(RDFFILE);

# Close files for each category
for ($i = 0; $i < $catsize; $i++) {
    print {$catfh[$i]} '</RDF>';
    close ($catfh[$i]);
}
exit;

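
A note on the matching line in the script above: `m#^Top/$categories[$i]#` interpolates the category name as a regular expression and anchors only the start. A category name containing regex metacharacters, or one that is a prefix of a sibling's name, may therefore match more than intended. The following is a minimal sketch of a stricter test, assuming the topic ID has already been stripped of the surrounding tag and the trailing newline. `topic_in_category` is a hypothetical helper name, and the category paths in the usage lines are purely illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of phpODPWorld): returns 1 if $topic_id
# is the category itself or one of its descendants, 0 otherwise.
# $topic_id is expected without the surrounding tag or trailing newline.
sub topic_in_category {
    my ($topic_id, $category) = @_;
    # \Q...\E quotes regex metacharacters in the category name;
    # (?:/|\z) requires a path boundary right after the category.
    return $topic_id =~ m{^Top/\Q$category\E(?:/|\z)} ? 1 : 0;
}

# Illustrative category paths only
print topic_in_category("Top/World/Norsk", "World"), "\n";   # prints 1
print topic_in_category("Top/World", "World"), "\n";         # prints 1
print topic_in_category("Top/Worldwide", "World"), "\n";     # prints 0
```

Whether the stricter boundary matters in practice depends on the category names actually present in the RDF dump.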
From: Howard L. <hl...@gm...> - 2010-06-13 11:16:54

I'm not sure why the script was posted without the message. Anyway, here it
is again.

It's been a long while since I wrote to the mailing list.

I have rewritten a portion of tools/extract.pl, which has been attached. It is
based on version 3.0 of phpODPWorld. The following features have been
added; I hope someone finds them useful.

1. The source RDF file can be in text or gzipped format
2. Multiple categories can be entered for extraction in a single command
   line
3. The script does not require DMOZ-ParseRDF-0.14 to be installed

Regards,
Howard

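
For feature 1, the posted script rewrites the filename into a `gzip -dc ...|` pipe before calling `open`. As an alternative, here is a minimal sketch that reads both plain and gzipped files without an external gzip binary. It assumes the IO::Uncompress::Gunzip module is installed (it is bundled with recent Perl releases). `open_rdf` is a hypothetical helper name, not part of phpODPWorld.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper (not part of phpODPWorld): returns a read handle
# for an RDF file that is either plain text or gzip-compressed.
sub open_rdf {
    my ($path) = @_;
    # Transparent => 1 (the module default) lets input that is not
    # gzip data pass through unchanged.
    my $fh = IO::Uncompress::Gunzip->new($path, Transparent => 1)
        or die "Can't open RDF file ($path) for reading: $GunzipError\n";
    return $fh;
}

# Usage: perl open_rdf.pl structure.rdf.u8.gz
if (@ARGV == 1) {
    my $fh = open_rdf($ARGV[0]);
    my $topics = 0;
    while (defined(my $line = $fh->getline)) {
        $topics++ if $line =~ /<Topic r:id="/;
    }
    $fh->close;
    print "$topics Topic lines\n";
}
```

This sketch makes no claim about speed relative to piping through gzip; that would need to be measured on the real RDF files.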
From: Hans F. N. <Han...@hi...> - 2010-06-14 11:33:02

Thx, Howard, for contributing again. Some quick comments:

0) phpODPWorld is not dead ;-) I plan a new release this summer.

1) Your script is slower than using DMOZ-ParseRDF (which now is an
   integral part of phpODPWorld) so I'll probably not use it.

2) Extracting multiple categories in one RDF file could be useful, but
   not very. If many users request it, I'll add it. Personally I need
   (and prefer) separate RDF files.

3) If you need to re-run the extraction, it's better to unzip the file
   once instead of having the script do the unzipping every single
   time it runs.

This came out very negative, I guess. Sorry about that, but I hope you
don't mind that much (as long as phpODPWorld still serves your needs).

Next time you want to contribute, please base it on the current code
in the SVN repository - see
http://sourceforge.net/projects/phpodpworld/develop
or directly at
http://phpodpworld.svn.sourceforge.net/viewvc/phpodpworld/trunk/phpodpworld/tools/

Regards,
Hans

PS! phpODPWorld 3.0 is still not released ;-)

* Howard Lee <hl...@gm...> [2010-06-13]:
> It's been a long while since I wrote to the mailing list.
> I have rewritten a portion of tools/extract.pl, which has been attached. It is
> based on version 3.0 of phpODPWorld. The following features have been
> added; I hope someone finds them useful.
> 1. The source RDF file can be in text or gzipped format
> 2. Multiple categories can be entered for extraction in a single command
> line
> 3. The script does not require DMOZ-ParseRDF-0.14 to be installed
> Regards,
> Howard
>
> On Sat, Apr 4, 2009 at 6:46 AM, Hans F. Nordhaug <Han...@hi...> wrote:
>
> > I did reply immediately to Howard that I found this script very
> > interesting. However, I didn't find time to test it before now ...
> > Unfortunately, it doesn't work as intended:
> >
> > ./extract.pl structure.rdf.u8 World
> >
> > produces a file World-structure.rdf.u8 that contains more
> > categories outside World than inside:
> >
> > # grep 'Topic r:id' World-structure.rdf.u8 | grep -v 'r:id="Top/World' | wc -l
> > 405753
> > # grep 'Topic r:id' World-structure.rdf.u8 | grep 'r:id="Top/World' | wc -l
> > 229470
> >
> > This also causes it to run slower than the current solution. I don't
> > have time to debug the script, so unless Howard produces a bug-fixed
> > version nothing will change. (On my old computer extracting World takes
> > one minute and 15 seconds - more than quick enough for me.)
> >
> > Regards,
> > Hans - who is working on a new release.
> >
> > PS! Please add "use warnings;" to the script ;-)

From: Howard L. <hl...@gm...> - 2010-06-20 15:34:44

Hi Hans,

I appreciate the response. The RDF source files are quite large, so it
should save time to reduce the need to re-read them. I need to
extract multiple DMOZ categories, and the script I posted was faster for me. I'm
sorry that it is slower in your case. After all, I'm not a full-time
programmer, so it may not be up to standard.

I'll take a look at the SVN repository, and see if I can contribute to it.

Regards,
Howard

From: Hans F. N. <Han...@hi...> - 2010-06-23 07:57:50

* Howard Lee <hl...@gm...> [2010-06-20]:
> Hi Hans,
>
> I appreciate the response. The RDF source files are quite large, so it
> should save time to reduce the need to re-read them. I need to
> extract multiple DMOZ categories, and the script I posted was faster for me. I'm
> sorry that it is slower in your case. After all, I'm not a full-time
> programmer, so it may not be up to standard.

Oh, your Perl programming skills are probably as good as (or better than)
mine. Anyway, what made your code slower was that you dropped the
usage of the DMOZ parser. (OK, I didn't test a lot but it seemed so.)

The addition of support for extraction of multiple categories in one
run is a nice feature that I'll probably add.

Hans