public final class ParseServiceTest {
.add(TestFiles.lorem_ipsum_abw_gz.get(), new AbiWordParser())
.add(TestFiles.lorem_ipsum_docx.get(), new MSWord2007Parser())
.add(TestFiles.lorem_ipsum_html.get(), new HtmlParser())
+ .add(TestFiles.lorem_ipsum_maf.get(), new MaffParser())
.add(TestFiles.lorem_ipsum_odt.get(), new OpenOfficeWriterParser())
.add(TestFiles.lorem_ipsum_pdf.get(), new PdfParser())
.add(TestFiles.lorem_ipsum_rtf.get(), new RtfParser())
MSOffice2007Parser.MSExcel2007Parser.class,
MSOffice2007Parser.MSPowerPoint2007Parser.class,
MSOffice2007Parser.MSWord2007Parser.class,
+ MaffParser.class,
OpenOfficeParser.OpenOfficeCalcParser.class,
OpenOfficeParser.OpenOfficeDrawParser.class,
OpenOfficeParser.OpenOfficeImpressParser.class
MSOffice2007Parser.MSExcel2007Parser.class,
MSOffice2007Parser.MSPowerPoint2007Parser.class,
+ MaffParser.class,
OpenOfficeParser.OpenOfficeCalcParser.class,
OpenOfficeParser.OpenOfficeDrawParser.class,
Last edit: andrew goh 2020-02-27
This is a code contribution to support MAFF (Mozilla Archive Format):
https://en.wikipedia.org/wiki/Mozilla_Archive_Format
A MAFF file is a zip file with the contents of the saved web page stored inside it.
http://maf.mozdev.org/maff-specification.html
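To illustrate the "zip file with the page inside" idea, here is a minimal sketch that opens a MAFF file with the standard java.util.zip classes and pulls out every index.html entry. This is not the attached MaffParser (its code isn't shown in this thread); the class name MaffSketch, the method readIndexPages and the UTF-8 decoding are all assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration only; not part of DocFetcher.
public final class MaffSketch {

    /** Returns the text of every index.html entry found in the archive. */
    public static List<String> readIndexPages(Path maffFile) throws IOException {
        List<String> pages = new ArrayList<>();
        try (ZipFile zip = new ZipFile(maffFile.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                String name = entry.getName();
                // Accept index.html at the root or inside any folder of the archive.
                if (name.equals("index.html") || name.endsWith("/index.html")) {
                    try (InputStream in = zip.getInputStream(entry)) {
                        // Assumption: the page is UTF-8; a real parser would detect the charset.
                        pages.add(new String(readAll(in), StandardCharsets.UTF_8));
                    }
                }
            }
        }
        return pages;
    }

    // Reads the whole stream into memory (kept manual so it also works on Java 7/8).
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }
}
```

The returned strings are raw HTML, so in a real parser they would still have to go through an HTML-to-text step before indexing.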
A few files were added or edited.
This is the code for the parser itself:
src/net/sourceforge/docfetcher/model/parse/MaffParser.java
There are some edits in these files:
src/net/sourceforge/docfetcher/enums/Msg.java
src/net/sourceforge/docfetcher/model/parse/ParseService.java
src/net/sourceforge/docfetcher/model/parse/ParseServiceTest.java
^ I had to add MaffParser as a test case for MIME type overlap, as MAFF is a zip application type and also has a MIME type of HTML. It seems DocFetcher is still able to detect these files correctly based on the file name extension ".maf" or ".maff".
An example test file, plus an update to the reference to it:
dev/test-files/lorem-ipsum/lorem-ipsum.maff
src/net/sourceforge/docfetcher/TestFiles.java
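The extension-based detection mentioned in the ParseServiceTest note above can be sketched roughly like this. It only shows the idea of checking the file name before looking at the zip or HTML content; the class and method names are made up for the example and are not DocFetcher's actual API.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of DocFetcher.
public final class MaffExtensionCheck {

    /** Returns true if the file name ends in ".maf" or ".maff", ignoring case. */
    public static boolean hasMaffExtension(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".maf") || lower.endsWith(".maff");
    }
}
```

A check like this would have to run before any content-based detection, since a content-based check alone would just see an ordinary zip file.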
The source code is in the attached zip archive. This is the only post; previously I wasn't aware of the attachment feature, so there are no separate follow-up 2/3 and 3/3 threads for this thread.
I run it to index a large mix of files, so I'd guess it should work rather well.
MAFF is still supported by this browser extension, and there are other extensions supporting it as well:
https://github.com/danny0838/webscrapbook
Last edit: andrew goh 2020-02-27
Hi,
really sorry, but at present I simply don't have any time to review and integrate code contributions.
Best regards
q:-) <= Quang
No worries, no hurry. Those who need it can use the code themselves in the meantime.
This solved a difficult problem for me: each MAFF zip archive is now treated as a single file rather than as a folder whose internal contents get indexed, which would always show up under the file name index.html.