Question regarding old perl process

2010-12-28
2013-05-30
  • Iban HATCHONDO
    2010-12-28

    Hi all,

    First, let me thank everyone for the effort, and especially David
    for his amazing work.

    I know David released a new Java version using Hadoop, but I think
    I'll stick with the old DB format for a while. So, I have a few
    questions regarding the Perl extraction :) (I'll be referring to the
    Perl multilingual files)

    - during extractCoreSummariesFromDump, the article text is traversed
    line by line (line 499), using a regex when the page is a
    disambiguation page, and is then re-traversed link by link (line 579).
    But during the line-by-line traversal, two clauses may cut the process
    short: the "See also" section or the translations have been reached.
    In that case, the subsequent link-by-link traversal continues from
    where the prior one stopped.

    Question: is this intentional, or should the links always be
    systematically re-traversed from the beginning?
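    To make the behaviour I mean concrete, here is a minimal sketch (the
    sub name, structure and regexes are hypothetical, not the actual
    extractWikipediaData.pl code): the line-by-line pass stops at a
    "See also" heading, and the link-by-link pass then only scans the
    remaining text instead of restarting from the top.

```perl
use strict;
use warnings;

# Hypothetical sketch of the two traversals; not the real script.
sub traverse_article {
    my ($text) = @_;
    my @lines = split /\n/, $text;
    my $stop = scalar @lines;    # index where the first pass stopped
    for my $i (0 .. $#lines) {
        if ($lines[$i] =~ /^==\s*See also\s*==/i) {
            $stop = $i;          # first pass cut short here
            last;
        }
        # ... line-by-line processing of $lines[$i] ...
    }
    # Second pass: link by link, but only over the remaining lines,
    # i.e. it continues from where the first pass stopped, so links
    # appearing *before* "See also" are never seen by this pass.
    my $rest  = join "\n", @lines[$stop .. $#lines];
    my @links = ($rest =~ /\[\[([^\]|]+)(?:\|[^\]]*)?\]\]/g);
    return \@links;
}
```

    With this sketch, the link [[A]] in the intro is skipped, which is
    exactly the situation my question is about.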

    - anchor extraction: anchors are cleaned up when saved into the
      hashtable (line 640 of extractWikipediaData.pl), and then cleaned
      again at line 758 when written to the anchor file. Shouldn't this
      be done only once?

    - anchor saving:
      + should multiple quotes (wiki bold/italic markup) be removed from
        anchors?
      + should HTML elements (like <small>) be stripped?
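    To be clear about what I mean by the last two points, here is a
    hypothetical cleanup helper (an assumption of mine, not taken from
    the script): it strips wiki emphasis quotes and simple HTML tags
    from an anchor string.

```perl
use strict;
use warnings;

# Hypothetical anchor cleanup, assuming the two rules asked about above.
sub clean_anchor {
    my ($anchor) = @_;
    $anchor =~ s/'{2,}//g;          # drop wiki bold/italic quotes ('' and ''')
    $anchor =~ s/<\/?\w+[^>]*>//g;  # strip simple HTML tags such as <small>
    $anchor =~ s/^\s+|\s+$//g;      # trim surrounding whitespace
    return $anchor;
}
```

    So "'''Bold''' <small>note</small>" would become "Bold note".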

    Cheers and happy holidays to all of you,
    Iban.