Iban HATCHONDO - 2010-12-28

Hi all,

First, let me thank everyone for the effort, and especially David
for his amazing work.

I know David released a new Java version using Hadoop, but I think
I'll stick with the old DB format for a while. So, I have a few
questions regarding the Perl extract :) (I'll be referring to the
Perl multilingual files):

- during extractCoreSummariesFromDump, the article text is traversed line
by line (line 499), using a regex when the page is a disambiguation page,
and is then re-traversed link by link (line 579). But the line-by-line
traversal has two clauses that may cut it short: the See Also section or
the translations have been reached. In that case, the next step, the
link-by-link traversal, continues from where the prior one stopped.

Question: should it be like that, or should the links be
systematically re-traversed from the beginning?
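If I read the flow correctly, the effect is roughly what this little Python sketch shows (the names and regexes below are mine for illustration, not the script's): links that occur before the early-exit point never reach the second pass.

```python
import re

# Hypothetical stand-in for the script's early-exit test; pattern is illustrative only.
SECTION_END = re.compile(r"==\s*See also\s*==", re.IGNORECASE)

def first_pass(lines):
    """Scan line by line; return the index where scanning stopped early."""
    for i, line in enumerate(lines):
        if SECTION_END.search(line):
            return i  # early exit: See Also (or translations) reached
        # ... per-line regex processing would happen here ...
    return len(lines)

def second_pass(lines, start):
    """Re-traverse link by link, continuing from where the first pass stopped."""
    links = []
    for line in lines[start:]:
        links.extend(re.findall(r"\[\[([^\]|]+)", line))
    return links

article = ["Some text with a [[Body link]].", "== See also ==", "* [[Other article]]"]
stop = first_pass(article)
print(second_pass(article, stop))  # [[Body link]] is never collected
```

Restarting the second pass at index 0 instead of `stop` would pick up [[Body link]] as well, which is why I am asking whether the resume-from-stop behaviour is intentional.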

- anchor extraction: anchors are cleaned up when saved in the hashtable
  (line 640 of extractWikipediaData.pl), and then cleaned up again at
  line 758 when written out to the anchor file. Shouldn't this be done
  only once?
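A side note on the double cleanup: if the cleanup happens to be idempotent, the second pass only wastes time rather than changing the output. A quick property check (Python, with a stand-in cleanup function of my own, not the script's actual code) would look like:

```python
import re

def clean(anchor):
    """Stand-in for the script's anchor cleanup (illustrative only)."""
    return re.sub(r"'{2,}", "", anchor).strip()

a = "  ''some anchor''  "
once = clean(a)
twice = clean(clean(a))
assert once == twice  # idempotent: a second cleanup pass changes nothing
print(once)
```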

- anchor saving:
  + should multiple quotes be removed from anchors?
  + should HTML elements be stripped (like <small>)?
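To make the two questions above concrete, here is a Python sketch of the cleanup I have in mind; the function name and regexes are my guess at the desired behaviour, not what the script currently does:

```python
import re

def clean_anchor(anchor):
    """Hypothetical anchor cleanup: strip wiki quote markup and HTML elements.
    Illustrative only; not the actual code from extractWikipediaData.pl."""
    anchor = re.sub(r"'{2,}", "", anchor)          # remove runs of quotes: '' and '''
    anchor = re.sub(r"</?\w+[^>]*>", "", anchor)   # strip HTML tags like <small> and </small>
    return anchor.strip()

print(clean_anchor("'''Neil Young''' <small>(musician)</small>"))
```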

Cheers and happy holidays to all of you,
Iban.