Indexing a Mixed Collection

  • Magen

    Magen - 2013-07-17


    I want to combine an English collection and an Arabic collection and index them as one collection using Lemur-4.12. Because the collection contains Arabic text, I set the "docFormat" parameter to "arabic". The English documents contain only English letters and numbers, so converting their encoding from "us-ascii" to "cp1256" does not change them. However, "BuildIndex" throws the exception below (I think when it starts parsing the English documents):

    Exception [code = 4294967292]
    ../src/BuildIndex.cpp(247): Could not parse file
    ../src/Keyfile.cpp(103): Caught an internal error while getting record for key:
    Program aborted due to exception

    How should I fix it?
    Thank you in advance.

    Last edit: Magen 2013-07-17
  • Magen

    Magen - 2013-07-17

    The error does not occur at the start of the English documents. The indexer processes some English documents and stops at this document:

    BERLIN, Nov 22 (AFP) - Angela Merkel made history Tuesday when she was elected German chancellor, becoming the first woman, the first from the former communist east and the youngest person to lead Europe's biggest economy. She received an overwhelming majority -- 397 of the 611 valid ballots -- in a vote in the Bundestag lower house of parliament.
    It came two months after her Christian Democrats (CDU) narrowly won a general election against outgoing leader Gerhard Schroeder's Social Democrats (SPD).
    "Mr. Speaker, I accept the election," said Merkel, addressing parliamentary speaker Norbert Lammert.
    Schroeder, 61, was the first to congratulate her.
    The 51-year-old Merkel, dressed in an elegant black trouser suit, was to be sworn in later Tuesday. She cracked a small smile as the results were announced and appeared to fight back tears as the deputies applauded her.
    The conservative faces a daunting task, particularly as she has been forced to head up an unwieldy left-right "grand coalition" with her political rivals.
    The election's outcome underlined Germans' rejection of her more radical economic reform plans in favor of one leavened with the SPD's traditional leftist policies on the labor market and the social welfare system.
    Merkel has set a goal of returning Germany to the top three countries in Europe for economic growth within 10 years and slashing the 11-percent unemployment rate during her four-year term.
    "The goal is more jobs," she said Friday at the ceremonial signing of the coalition agreement titled "Together for Germany -- with Courage and Humanity".
    "In four years, people must be able to say they are doing a bit better than they were."
    The statement was typical for the determined but self-effacing pastor's daughter who lacks the charisma and occasional flamboyance of her predecessor Schroeder.
    Merkel, a trained physicist, did not begin her political career until after the Berlin Wall fell in 1989, leading many observers to brand her an "outsider" who may nevertheless be better able to transcend the often clubby world of German politics.
    She has undergone an astounding transformation since serving in the cabinet of her mentor Helmut Kohl, who gave her the affectionate but condescending nickname "the girl".
    Merkel rocketed to the top of the party in 2000 after publicly calling for Kohl's ouster -- a brazen move that made her several powerful enemies.
    Her biographer, Gerd Langguth, told public broadcaster Bayerischer Rundfunk Tuesday that Merkel is frequently underestimated.
    "She always wanted to show the old guys at the CDU that she could do it," he said.
    "She is incredibly hard-working and disciplined. Her parents always told her: 'Angela, you must be better than all the rest' ...and this being-better-than-the-others has marked her whole career."
    Despite frequent comparisons with Britain's Margaret Thatcher, the constraints of the power-sharing government -- Germany's first since the 1960s -- will keep her from forging a radical path akin to that of the "Iron Lady".
    Rather, she will be bound by the compromises hammered out in the coalition agreement, as well as by the demands of her "equal partner", the Social Democrats.
    Her vice-chancellor, Franz Muentefering of the SPD, said the center-left team had been pleasantly surprised by the dealings so far with Merkel, who was described by some participants as consensus-oriented and even charming.
    "We three pledged that we will develop good, constructive policies together," he said Monday, referring to himself, Merkel and new SPD chief Matthias Platzeck.
    Merkel is to be sworn in as chancellor at 1300 GMT, followed by the 16 government ministers at 1500 GMT.
    Schroeder, who finally relented in a bitter power struggle with Merkel after the September election, will formally hand over the chancellery in central Berlin at 1600 GMT, with an inaugural cabinet meeting scheduled an hour later.
    Merkel will embark Wednesday on a trip to Paris and then Brussels followed by London on Thursday, in three brief get-acquainted visits with Germany's closest European allies.
    Schroeder, for his part, plans to relinquish his seat in parliament Wednesday and has said he wants to return to practicing law.

    I added a cout statement to the "put" function in Keyfile.cpp to print the "key" and "value".
    The indexer proceeds up to:
    "in key file 22 value is 0xbff6fc08
    in key file east value is 0xbff6fc08
    in key file 397 value is 0xbff6fc08
    in key file 611 value is 0xbff6fc08
    in key file cdu value is 0xbff6fc08
    in key file parliamentary value is 0xbff6fc08
    in key file norbert value is 0xbff6fc08
    in key file lammert value is 0xbff6fc08
    in key file 61 value is 0xbff6fc08
    in key file congratulate value is 0xbff6fc08
    in key file 51 value is 0xbff6fc08
    in key file trouser value is 0xbff6fc08
    in key file cracked value is 0xbff6fc08
    in key file smile value is 0xbff6fc08
    in key file unwieldy value is 0xbff6fc08
    in key file leavened value is 0xbff6fc08
    in key file value is 0xbff6fc08"

    I do not know what is wrong with the rest of the document after "leavened".

    Thank you for your time.

  • David Fisher

    David Fisher - 2013-07-17

    It appears that the term "SPD's" is being mishandled by the ArabicParser.

    Note that the Lemur Toolkit text handler chain has been deprecated as of version 4.12, and will not be receiving updates.

  • Magen

    Magen - 2013-07-17

    Thank you for your reply.

    Yes, that is correct. When I change "SPD's" to "SPD 's", the indexer gets past this document. However, there are many such cases in the collection.

    Is there any way that I can fix it in the Lemur code?

    Thank you.

  • David Fisher

    David Fisher - 2013-07-19

    In ArabicParser.l change line 216 (inside the case ACRONYM2:) to

     for (c = Arabictext; *c != '\'' && *c != '\0' ; c++, len++); 

    and recompile the application.

    The error is that the match on 's' was matching the first character of the acronym, causing an empty term to be inserted.

  • Magen

    Magen - 2013-07-20

    Thank you so much for your help.

