First, let me thank everyone for the effort, and especially David
for his amazing work.
I know David released a new Java version using Hadoop, but I think
I'll stick with the old DB format for a while. So, I have a few
questions regarding the Perl extraction :) (I'll be referring to the
Perl multilingual files):
- during extractCoreSummariesFromDump, the article text is traversed line
by line (line 499) using a regex when the page is a disambiguation page,
and is then re-traversed link by link (line 579). But during the
line-by-line traversal, two conditions may cut the process short: a
See Also section or the translations have been reached. In this case, the
link-by-link traversal continues from where the line-by-line one stopped.
Question: is that intended? Or should the text always be re-traversed
in full for the links?
- anchor extraction: anchors are cleaned up when saved in the hashtable
(line 640 of extractWikipediaData.pl), and then at line 758 they are
cleaned again when written to the anchor file. Shouldn't this be done
only once?
- anchor saving:
+ should multiple quotes in anchors be removed?
+ should HTML elements be stripped? (like <small>)
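
To make the anchor questions concrete, here is a rough sketch of the kind of cleanup I have in mind. It is written in Python for brevity and is not the logic in extractWikipediaData.pl; the function name clean_anchor and the exact rules (strip HTML-style tags, drop runs of two or more single quotes, collapse whitespace) are my own assumptions. On the sample below, a second pass changes nothing, which is why cleaning once when saving would seem sufficient.

```python
import re


def clean_anchor(text):
    """Hypothetical anchor cleanup (an assumption, not the original Perl code)."""
    # Drop HTML-style tags such as <small> and </small>, keeping the inner text.
    text = re.sub(r"</?[A-Za-z][^>]*>", "", text)
    # Remove runs of two or more single quotes (wiki bold/italic markup);
    # a lone apostrophe, as in O'Brien, is left alone.
    text = re.sub(r"'{2,}", "", text)
    # Collapse repeated whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


raw = "<small>'''Perl'''  (language)</small>"
once = clean_anchor(raw)    # cleaned when stored in the hashtable
twice = clean_anchor(once)  # cleaned again when written to the anchor file
print(once)
print(once == twice)
```

If the rules really are stable like this, the second cleanup at output time only costs time; if they are not, the anchors in the hashtable and in the file could differ, which seems worse.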
Cheers and happy holidays to all of you,