Iban HATCHONDO - 2010-12-28

Hi all,

First, let me thank everyone for the effort, and especially David for
his amazing work.

I know David released a new Java version using Hadoop, but I think
I'll stick with the old DB format for a while. So, I have a few
questions regarding the Perl extraction :) (I'll be referring to the
Perl multilingual files)

- during extractCoreSummariesFromDump, the article text is traversed
line by line (line 499), using a regex to detect disambiguation pages,
and is then re-traversed link by link (line 579). But during the line-
by-line traversal, two clauses may cut the process short: the See Also
section or the translations have been reached. In that case, the
subsequent link-by-link traversal continues from where the prior one
stopped.

Question: should it be like that? Or should the links be
systematically re-traversed from the beginning of the article?
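To make the question concrete, here is a minimal Python sketch of the two-pass structure as I understand it. The real script is Perl, and the section markers and function names here are my own illustrative assumptions, not taken from the source:

```python
import re

def summarize(text):
    """Hypothetical re-expression of the two-pass traversal in
    extractCoreSummariesFromDump (illustrative only)."""
    lines = text.split("\n")
    stop = len(lines)
    for i, line in enumerate(lines):
        # Early-exit clauses: a "See also" heading or an inter-language
        # (translation) link has been reached.
        if re.match(r"==\s*See also\s*==", line) or re.match(r"\[\[[a-z]{2}:", line):
            stop = i
            break
    # Second pass, link by link. As written it resumes at `stop` rather
    # than restarting at line 0, so links that appear before the cutoff
    # are skipped in this pass -- the behaviour my question is about.
    remainder = "\n".join(lines[stop:])
    links = re.findall(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", remainder)
    return stop, links
```

With an article like `"intro [[A]]\n== See also ==\n[[B]]"`, the link `[[A]]` never reaches the second pass, which is what made me wonder whether the resume-from-stop behaviour is intentional.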

- anchor extraction: anchors are cleaned up when saved into the
  hashtable (line 640 of extractWikipediaData.pl), and then at line
  758 they are cleaned again when output to the anchor file. Shouldn't
  this be done only once?

- anchor saving:
  + should multiple quotes be removed from anchors?
  + should HTML elements be stripped (like <small>)?
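For clarity, these are the two clean-up steps I mean, sketched in Python (whether the Perl script should apply them is exactly the open question; the regexes are my own approximations of the wiki markup involved):

```python
import re

def normalize_anchor(anchor):
    # Remove runs of repeated quotes: ''italic'' / '''bold''' wiki markup.
    anchor = re.sub(r"'{2,}", "", anchor)
    # Strip HTML elements such as <small>...</small>.
    anchor = re.sub(r"<[^>]+>", "", anchor)
    return anchor.strip()
```

Without these steps, "'''Paris'''" and "Paris" end up as two distinct anchors for the same target, which skews the anchor statistics.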

Cheers and happy holidays to all of you,