Re: [phpodpworld-users] Suggestion on tools/extract.pl
From: Hans F. N. <Han...@hi...> - 2010-06-14 11:33:02
Thx, Howard, for contributing again. Some quick comments:

0) phpODPWorld is not dead ;-) I plan a new release this summer.
1) Your script is slower than using DMOZ-ParseRDF (which is now an
   integral part of phpODPWorld), so I'll probably not use it.
2) Extracting multiple categories in one RDF file could be useful, but
   not very. If many users request it, I'll add it. Personally I need
   (and prefer) separate RDF files.
3) If you need to re-run the extraction, it's better to unzip the file
   once instead of having the script do the unzipping every single
   time it runs.

This came out very negative, I guess. Sorry about that, but I hope you
don't mind too much (as long as phpODPWorld still serves your needs).

Next time you want to contribute, please base it on the current code
in the SVN repository - see
http://sourceforge.net/projects/phpodpworld/develop
or directly at
http://phpodpworld.svn.sourceforge.net/viewvc/phpodpworld/trunk/phpodpworld/tools/

Regards,
Hans

PS! phpODPWorld 3.0 is still not released ;-)

* Howard Lee <hl...@gm...> [2010-06-13]:
> It's been a long while since I wrote to the mailing list.
> I have rewritten a portion of tools/extract.pl, which is attached. It is
> based on version 3.0 of phpODPWorld. The following features have been
> added; I hope someone may find them useful.
> 1. The source RDF file can be in text or gzipped format
> 2. Multiple categories can be entered for extraction in a single command
> line
> 3. The script does not require DMOZ-ParseRDF-0.14 to be installed
> Regards,
> Howard
>
> On Sat, Apr 4, 2009 at 6:46 AM, Hans F. Nordhaug <Han...@hi...> wrote:
>
> > I did reply immediately to Howard that I found this script very
> > interesting. However, I didn't find time to test it before now ...
> > Unfortunately, it doesn't work as intended:
> >
> >   ./extract.pl structure.rdf.u8 World
> >
> > produces a file World-structure.rdf.u8 that contains more
> > categories outside World than inside:
> >
> >   # grep 'Topic r:id' World-structure.rdf.u8 | grep -v 'r:id="Top/World' | wc -l
> >   405753
> >   # grep 'Topic r:id' World-structure.rdf.u8 | grep 'r:id="Top/World' | wc -l
> >   229470
> >
> > This also causes it to run slower than the current solution. I don't
> > have time to debug the script, so unless Howard produces a bug-fixed
> > version nothing will change. (On my old computer extracting World takes
> > one minute and 15 seconds - more than quick enough for me.)
> >
> > Regards,
> > Hans - who is working on a new release.
> >
> > PS! Please add "use warnings;" to the script ;-)
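
The grep check quoted in this thread can also be scripted, which makes it easy to repeat after each fix to the extraction script. What follows is only a rough sketch in Python and is not part of phpODPWorld or DMOZ-ParseRDF. The function names open_rdf and count_topics are made up for illustration, and the sketch assumes that every category starts on a line containing <Topic r:id="...">, as the grep commands imply. It accepts both plain and gzipped input, so the file does not have to be unpacked just for the check.

```python
import gzip
import re

# Matches the opening tag of a category, e.g. <Topic r:id="Top/World/Norsk">.
# Assumption: one such tag per line, as the grep check in the thread implies.
TOPIC_ID = re.compile(r'<Topic r:id="([^"]*)"')


def open_rdf(path):
    """Open an RDF dump as text, whether it is plain or gzip-compressed."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":  # first two bytes of every gzip file
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def count_topics(lines, prefix="Top/World"):
    """Return (inside, outside): topic ids that do / do not start with prefix."""
    inside = outside = 0
    for line in lines:
        match = TOPIC_ID.search(line)
        if match is None:
            continue  # not a category line
        if match.group(1).startswith(prefix):
            inside += 1
        else:
            outside += 1
    return inside, outside
```

To use it, pass the opened file to the counter, for example count_topics(open_rdf("World-structure.rdf.u8")). A correct extraction of World should report zero categories outside the prefix.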