
What are .arpabo files and are they needed?

Help
Mark
2008-11-11
2012-09-22
  • Mark

    Mark - 2008-11-11

    I'm using Pocketsphinx inside FreeSwitch and unintentionally ran into a problem with a "pizza ordering" demo from the FreeSwitch wiki.

    The "pizza" grammar files and folders that the FreeSwitch wiki linked to for download had a problem. The "pizza ordering" dialogue ran up to a point, then FreeSwitch crashed. This happened when the dialogue switched to a folder where the .lm file was empty (i.e. no text in the file). The associated .arpabo file had the content that the .lm file should have had, but this didn't make a difference. On the other hand, folders that had text in their .lm files had nothing in their .arpabo files, but there seemed to be no problems with those parts of the demo's dialogue.

    I used LMtool (http://www.speech.cs.cmu.edu/tools/lmtool.html) to generate .lm files from the .corpus files in this demo and it generated the exact same content that was found in either the .lm or .arpabo files.

    When I copied the content of the .arpabo files into the associated empty .lm files to give me a complete set, the demo seemed to work fine.
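
The manual fix described above can be sketched in Python. This is only an illustration, assuming each folder holds an .lm file and a same-named .arpabo sibling; the function name `fill_empty_lm_files` is hypothetical and not part of FreeSwitch or PocketSphinx.

```python
from pathlib import Path
import shutil


def fill_empty_lm_files(root):
    """Copy each non-empty .arpabo file over an empty or missing sibling .lm file.

    Returns the list of .lm paths that were written.
    """
    repaired = []
    for arpabo in sorted(Path(root).rglob("*.arpabo")):
        # Assumption: the sibling has the same name with the .lm extension.
        lm = arpabo.with_suffix(".lm")
        lm_is_empty = (not lm.exists()) or lm.stat().st_size == 0
        if lm_is_empty and arpabo.stat().st_size > 0:
            shutil.copyfile(arpabo, lm)
            repaired.append(lm)
    return repaired
```

Calling `fill_empty_lm_files("grammar")` on a hypothetical `grammar` directory would leave .lm files that already have content untouched.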

    However, I'm not sure that this was because of the way I'm dialing into or using FreeSwitch. So I'm wondering what these .arpabo files are and whether they are needed in PocketSphinx.

    Mark.

     
    • Nickolay V. Shmyrev

      ARPA is the format for language model files used in the Sphinx decoders. An ARPA language model can have either the .arpabo or the .lm extension; it doesn't actually matter.

      About the FreeSwitch grammars, they seem to be OK. Each folder has both .lm and .arpabo files, and they are identical. They were just generated with lmtool. So I wonder whether the empty file appeared because of an unpacking issue.
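
Since the extension does not matter, one way to tell whether a file is a usable ARPA model is to look at its content. The sketch below is a rough check, not what the decoder itself does: it only looks for the `\data\` header and the `\end\` marker that ARPA text files contain. The name `looks_like_arpa` is made up for this example.

```python
def looks_like_arpa(path):
    """Return True if the file contains the ARPA header and end marker lines."""
    with open(path, encoding="utf-8", errors="replace") as f:
        lines = [line.strip() for line in f]
    # An ARPA text model has a line reading \data\ before the n-gram counts
    # and a line reading \end\ after the last n-gram section.
    return "\\data\\" in lines and "\\end\\" in lines
```

An empty .lm file fails this check, while the same model text passes whether it is saved as .lm or .arpabo.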

       
      • Mark

        Mark - 2008-11-11

        I thought it was unpacking as well.

        So I tried a couple of unpackers, IZarc and WinZip 12. They gave me slightly different results. In WinZip 12 the lm files were all missing, but in IZarc some lm files were empty. Another difference was that IZarc unpacked files to their respective folders and WinZip 12 didn't, but this may have been a settings issue. Currently they unpack with the right content in the lm files, but they didn't a few days ago, so a member of the FreeSwitch team must have fixed the problem after I let them know.

        Anyway, this all got me curious about these file extensions. Why are there two extensions (lm and arpabo) if it doesn't matter? It seems to matter, since the "pizza demo" only worked once the lm files were corrected.

        Thanks

        Mark

         
        • Nickolay V. Shmyrev

          It's just an unimportant issue with the online lmtool, which sends you both files in an archive where the .arpabo file is a link to the .lm file. Not all Windows archivers support that. You can report this to the lmtool author to get it fixed.
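
To see whether an archive contains link entries that an unpacker might mishandle, the entries can be listed before extracting. This is a minimal sketch that assumes the download is a tar archive (plain or compressed); that is an assumption, not something stated in this thread. `list_links` is a hypothetical helper name; the `tarfile` calls are from the Python standard library.

```python
import tarfile


def list_links(archive_path):
    """Return (entry name, link target) pairs for link entries in a tar archive."""
    with tarfile.open(archive_path) as tar:
        return [
            (member.name, member.linkname)
            for member in tar.getmembers()
            # issym() is a symbolic link, islnk() is a hard link
            if member.issym() or member.islnk()
        ]
```

Any entry reported here is one that may come out empty or missing with an archiver that does not support links.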

           
          • Mark

            Mark - 2008-11-11

            I have no problem with arpabo files not being needed.

            Another item I would appreciate clarification on is found in FreeSwitch's Mod pocketsphinx wiki.

            http://wiki.freeswitch.org/wiki/Mod_pocketsphinx

            It comes up in the section "Building your own grammar files" and it has to do with a Perl script (quick_lm.pl) used to construct the grammar files.
            As input, it takes .sent files that have the delimiters <s> and </s> around each sentence, like

            <s> THIS IS SENTENCE NUMBER ONE </s>
            <s> THIS IS SENTENCE NUMBER TWO </s>

            and gives a .sent.arpabo file. But .corpus files don't use these delimiters.

            The lmtool asks for .corpus files but will it work for .sent files?

            What's the purpose of .sent files if one can use .corpus files?

            Thanks.

            Mark.

             
            • Nickolay V. Shmyrev

              Lmtool doesn't insert <s> and </s>. You need to insert them yourself, with awk for example. Lmtool should work with .sent files.

              .corpus is a temporary file that is not used at all; there is no sense in including it in the package or generating it with lmtool. Just ignore it.
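
The reply suggests awk for inserting the delimiters; the same step can be sketched in Python. This is an illustration only, assuming one sentence per line in the input; `to_sent_lines` is a hypothetical name.

```python
def to_sent_lines(corpus_text):
    """Wrap each non-blank line in <s> ... </s>, one sentence per line."""
    out = []
    for raw in corpus_text.splitlines():
        words = raw.split()
        if not words:
            continue  # skip blank lines
        if words[0] == "<s>" and words[-1] == "</s>":
            # Already delimited: only normalize the spacing.
            out.append(" ".join(words))
        else:
            out.append("<s> " + " ".join(words) + " </s>")
    return "".join(line + "\n" for line in out)
```

For example, `to_sent_lines("THIS IS SENTENCE NUMBER ONE\n")` returns `"<s> THIS IS SENTENCE NUMBER ONE </s>\n"`, which matches the .sent layout quoted earlier in the thread.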

               
              • Mark

                Mark - 2008-11-12

                Got it, sentence corpus files are .sent files.
                That helps clear things up.

                Thanks.

                 
