Re: [phpodpworld-users] Suggestion on tools/extract.pl
From: Hans F. N. <Han...@hi...> - 2009-04-03 23:17:10
* Howard Lee <hl...@gm...> [2009-01-18]:
> Dear all,
>
> I find it quite time consuming when extracting multiple categories from
> tools/extract.pl, because the full RDF files will need to be parsed from the
> beginning.
>
> I have modified the extract.pl so that it can handle multiple categories
> from the same command line. It also does not depend on DMOZ-ParseRDF-0.14
> now. The script has been attached, and hope somebody may find it useful.

I did reply immediately to Howard that I found this script very interesting. However, I didn't find time to test it until now ... Unfortunately, it doesn't work as intended:

  ./extract.pl structure.rdf.u8 World

produces a file World-structure.rdf.u8 that contains more categories outside World than inside:

  # grep 'Topic r:id' World-structure.rdf.u8 | grep -v 'r:id="Top/World' | wc -l
  405753
  # grep 'Topic r:id' World-structure.rdf.u8 | grep 'r:id="Top/World' | wc -l
  229470

This also causes it to run slower than the current solution. I don't have time to debug the script, so unless Howard produces a bug-fixed version, nothing will change. (On my old computer, extracting World takes one minute and 15 seconds - more than quick enough for me.)

Regards,
Hans - who is working on a new release.

PS! Please add "use warnings;" to the script ;-)
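
For anyone who wants to repeat the check on their own extract, the two grep pipelines can be wrapped in a small helper. This is only a sketch: the function name count_topics is made up for illustration, and it assumes (as the greps in this message do) that each category appears on its own line as a 'Topic r:id="Top/..."' entry. Like the original greps, it matches on the prefix only, so a category named e.g. Top/Worldwide would also be counted as inside World.

```shell
#!/bin/sh
# count_topics FILE CATEGORY  (hypothetical helper, not part of phpODPWorld)
# Prints how many Topic entries in FILE lie inside Top/CATEGORY
# and how many lie outside it.
count_topics() {
    file=$1
    prefix="r:id=\"Top/$2"

    # All Topic entries in the extracted file.
    total=$(grep -c 'Topic r:id' "$file")

    # Topic entries whose id starts with Top/CATEGORY
    # (-F: treat the prefix as a fixed string, not a regex).
    inside=$(grep 'Topic r:id' "$file" | grep -c -F "$prefix")

    echo "inside=$inside outside=$((total - inside))"
}
```

Usage would be something like "count_topics World-structure.rdf.u8 World". For a correct extract one would expect the outside count to be zero, or at least very small.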