Question regarding the old Perl process


    Hi all,

    First, let me thank everyone for the effort, and especially David
    for his amazing work.

    I know David released a new Java version using Hadoop, but I think
    I'll stick with the old DB format for a while. So, I have a few
    questions regarding the Perl extraction :) (I'll be referring to
    the Perl multilingual files).

    - during extractCoreSummariesFromDump, the article text is
    traversed line by line (line 499) using a regex when the page is a
    disambiguation page, and is then re-traversed link by link (line
    579). But two clauses may cut the line-by-line traversal short:
    when the "See also" section or the translations have been reached.
    In that case, the subsequent link-by-link traversal continues from
    where the prior one stopped.

    Question: is that intentional, or should the text systematically be
    re-traversed from the start for the links?
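    To make the question concrete, here is a rough sketch (in Python,
    since I don't have the Perl handy; all names and regexes are my own
    invention, not the actual code) of the two-pass behaviour I mean:

```python
import re

def scan_lines(text):
    """Pass 1 (sketch): walk the text line by line, exiting early at a
    'See also' heading or at an interlanguage (translation) link.
    Returns the character offset where the scan stopped."""
    offset = 0
    for line in text.splitlines(keepends=True):
        if re.match(r"==\s*See also\s*==", line) or re.match(r"\[\[[a-z]{2}:", line):
            break  # early-exit clause: the rest of the page is not scanned
        offset += len(line)
    return offset

def extract_links(text, start=0):
    """Pass 2 (sketch): link-by-link traversal. The question is whether
    this should start at 0 or resume at the offset from pass 1."""
    return re.findall(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", text[start:])

text = "A [[cat]] page.\n==See also==\n* [[dog]]\n"
stop = scan_lines(text)
print(extract_links(text, stop))  # resuming after the early exit misses [[cat]]
print(extract_links(text, 0))     # a full re-traversal sees both links
```

    So depending on where pass 2 starts, links before the early-exit
    point are either seen once or not at all, which is what I'd like to
    confirm.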

    - anchor extraction: anchors are cleaned up when saved in the
      hashtable (line 640), and then cleaned again (line 758) when
      output to the anchor file. Shouldn't this be done only once?

    - anchor saving:
      + should multiple quotes (the '' / ''' wiki markup) be removed
        from anchors?
      + should HTML elements (like <small>) be stripped?
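    For clarity, the kind of cleanup I have in mind would be something
    like this (again a Python sketch of my own, not the actual Perl
    code):

```python
import re

def clean_anchor(anchor):
    """Hypothetical anchor cleanup: strip wiki quote markup and HTML
    elements, keeping the visible text."""
    # remove runs of two or more quotes, i.e. '' (italic) and ''' (bold)
    anchor = re.sub(r"'{2,}", "", anchor)
    # remove HTML tags such as <small>...</small>, keeping their content
    anchor = re.sub(r"</?\w+[^>]*>", "", anchor)
    return anchor.strip()

print(clean_anchor("'''bold''' anchor"))         # -> "bold anchor"
print(clean_anchor("<small>tiny</small> text"))  # -> "tiny text"
```

    If both kinds of markup are meant to be stripped, doing it in one
    place (either at save or at output time) would also answer my
    previous question.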

    Cheers and happy holidays to all of you,