Re: [phpodpworld-users] Suggestion on tools/extract.pl
      From: Hans F. N. <Han...@hi...> - 2009-04-03 23:17:10
      
* Howard Lee <hl...@gm...> [2009-01-18]:
> Dear all,
> 
> I find it quite time consuming when extracting multiple categories from
> tools/extract.pl, because the full RDF files will need to be parsed from the
> beginning.
> 
> I have modified the extract.pl so that it can handle multiple categories
> from the same command line. It also does not depend on DMOZ-ParseRDF-0.14
> now. The script has been attached, and hope somebody may find it useful.
I did reply immediately to Howard that I found this script very
interesting. However, I didn't find time to test it before now ...
Unfortunately, it doesn't work as intended:
    ./extract.pl structure.rdf.u8 World
produces a file World-structure.rdf.u8 that contains more 
categories outside World than inside:
# grep 'Topic r:id' World-structure.rdf.u8 | grep -v 'r:id="Top/World' | wc -l
  405753
# grep 'Topic r:id' World-structure.rdf.u8 | grep  'r:id="Top/World' | wc -l
  229470
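
The two greps above each read the whole output file. A minimal sketch of the same check in a single pass, assuming a POSIX awk is available (count_topics is a hypothetical helper name, not something shipped in tools/):

```shell
#!/bin/sh
# Hypothetical helper: count Topic entries inside and outside a category
# prefix in one pass over the file. Uses the same substring matching as
# the two grep pipelines above.
count_topics() {
    file=$1
    prefix=$2    # e.g. Top/World
    awk -v p="r:id=\"$prefix" '
        /Topic r:id/ {
            # index() is a plain substring test, like grep with a fixed string
            if (index($0, p)) inside++; else outside++
        }
        END { printf "inside=%d outside=%d\n", inside, outside }
    ' "$file"
}
```

Called as `count_topics World-structure.rdf.u8 Top/World`, it should report the same two counts as the grep pipelines.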
Writing all those extra categories also makes it run slower than the
current solution. I don't have time to debug the script, so unless Howard
produces a bug-fixed version nothing will change. (On my old computer,
extracting World takes one minute and 15 seconds - more than quick enough
for me.)
Regards,
Hans - who is working on a new release.
PS! Please add "use warnings;" to the script ;-)