[code] Support for maff format (Mozilla archive format) 1/3

andrew goh
2020-02-27
2020-02-28
  • andrew goh

    andrew goh - 2020-02-27

    This is a code contribution that adds support for maff (Mozilla Archive Format).

    https://en.wikipedia.org/wiki/Mozilla_Archive_Format

    A maff file is a zip archive in which the contents of a saved web page are stored.

    http://maf.mozdev.org/maff-specification.html
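
    As a rough illustration of that layout (this is not the attached MaffParser.java), here is a minimal sketch that opens a maff file with java.util.zip and collects the text of every index.html or index.htm entry. The class name MaffIndexReader, its method names, and the UTF-8 decoding are assumptions made only for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper: reads the main page(s) out of a maff (zip) archive. */
public final class MaffIndexReader {

    private MaffIndexReader() {}

    /** Returns the text of every entry named index.html or index.htm. */
    public static List<String> readIndexPages(String archivePath) throws IOException {
        List<String> pages = new ArrayList<String>();
        try (ZipFile zip = new ZipFile(archivePath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory() || !isIndexPage(entry.getName())) {
                    continue; // skip folders, images, index.rdf, etc.
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    // Assumption: the page is UTF-8; a real parser should detect the charset.
                    pages.add(new String(readAll(in), StandardCharsets.UTF_8));
                }
            }
        }
        return pages;
    }

    /** True if the last path segment of the zip entry name is index.html or index.htm. */
    public static boolean isIndexPage(String entryName) {
        String name = entryName.substring(entryName.lastIndexOf('/') + 1)
                .toLowerCase(Locale.ROOT);
        return name.equals("index.html") || name.equals("index.htm");
    }

    /** Reads the whole stream into a byte array. */
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }
}
```

    Calling MaffIndexReader.readIndexPages("lorem-ipsum.maff") would return one string per saved page found in the archive; the HTML text could then be handed to whatever HTML text extraction is already in place.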

    A few files are added or edited:

    This is the code for the parser itself:
    src/net/sourceforge/docfetcher/model/parse/MaffParser.java

    There are also some edits in these files:
    src/net/sourceforge/docfetcher/enums/Msg.java

            filetype_html ("HTML (html, htm, ..)", Comments.filetype),
    +       filetype_maf ("Mozilla Archive Format (maf, maff)", Comments.filetype),
            filetype_odt ("OpenOffice.org Writer (odt, ott)", Comments.filetype),
    

    src/net/sourceforge/docfetcher/model/parse/ParseService.java

                    new EpubParser(),
                    new ChmParser(),
    +               new MaffParser(),
    
                    new OpenOfficeWriterParser(),
                    new OpenOfficeCalcParser(),
    

    src/net/sourceforge/docfetcher/model/parse/ParseServiceTest.java

    public final class ParseServiceTest {
                    .add(TestFiles.lorem_ipsum_abw_gz.get(), new AbiWordParser())
                    .add(TestFiles.lorem_ipsum_docx.get(), new MSWord2007Parser())
                    .add(TestFiles.lorem_ipsum_html.get(), new HtmlParser())
    +               .add(TestFiles.lorem_ipsum_maf.get(), new MaffParser())
                    .add(TestFiles.lorem_ipsum_odt.get(), new OpenOfficeWriterParser())
                    .add(TestFiles.lorem_ipsum_pdf.get(), new PdfParser())
                    .add(TestFiles.lorem_ipsum_rtf.get(), new RtfParser())
    
    
                            MSOffice2007Parser.MSExcel2007Parser.class,
                            MSOffice2007Parser.MSPowerPoint2007Parser.class,
                            MSOffice2007Parser.MSWord2007Parser.class,
    +                       MaffParser.class,
                            OpenOfficeParser.OpenOfficeCalcParser.class,
                            OpenOfficeParser.OpenOfficeDrawParser.class,
                            OpenOfficeParser.OpenOfficeImpressParser.class
    
                            MSOffice2007Parser.MSExcel2007Parser.class,
                            MSOffice2007Parser.MSPowerPoint2007Parser.class,
    +                       MaffParser.class,
                            OpenOfficeParser.OpenOfficeCalcParser.class,
                            OpenOfficeParser.OpenOfficeDrawParser.class,                       
    

    ^ I had to add MaffParser to the test cases for MIME type overlap, because a maff file is a zip archive (an application type) that also has a MIME type of HTML. DocFetcher still seems to detect these files correctly based on the file name extension ".maf" or ".maff".
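
    To make the extension-based detection concrete, here is a small sketch of a case-insensitive check for the two extensions mentioned above. It is only an illustration: the class MaffExtensionCheck and its method are hypothetical names, not DocFetcher's actual detection code.

```java
import java.util.Locale;

/** Hypothetical helper: decides by file name alone whether a file looks like maff. */
public final class MaffExtensionCheck {

    private MaffExtensionCheck() {}

    /** True if the file name ends with ".maf" or ".maff", ignoring case. */
    public static boolean hasMaffExtension(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".maff") || lower.endsWith(".maf");
    }
}
```

    With a check like this, "lorem-ipsum.maff" and "PAGE.MAF" would be routed to the maff parser, while "archive.zip" and "page.html" would not, regardless of what the MIME type detection reports.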

    An example test file was added and the reference to it updated:
    dev/test-files/lorem-ipsum/lorem-ipsum.maff
    src/net/sourceforge/docfetcher/TestFiles.java

    The source code is in the attached zip archive. This is the only post: earlier I wasn't aware of the attachment feature, so there won't be separate follow-up threads 2/3 and 3/3.

     

    Last edit: andrew goh 2020-02-27
  • andrew goh

    andrew goh - 2020-02-27

    I ran it to index a significant volume of mixed files, so I'd guess it works rather well.
    maff is still supported by this browser extension, and there are other extensions that support it as well:
    https://github.com/danny0838/webscrapbook

     

    Last edit: andrew goh 2020-02-27
  • Nam-Quang Tran

    Nam-Quang Tran - 2020-02-28

    Hi,

    really sorry, but at present I simply don't have any time to review and integrate code contributions.

    Best regards
    q:-) <= Quang

     
  • andrew goh

    andrew goh - 2020-02-28

    No worries, no hurry. Those who need it can use the code themselves.
    This solved a difficult problem for me: each maf zip archive is now treated as a single file rather than as a folder, so it is indexed under its own name instead of under its internal content, which always has the file name index.html.

     
