
#14 On file upload use parser registered for the given file extension

in-progress
None
bug
major
server
Unassigned
2014-06-25
2014-06-25
Kevin Black
No

On file upload use parser registered for the given file extension.

Where no parser has been registered for a particular file extension, use the generic parser (the current English text parser). This mechanism can be used in conjunction with, or alongside, the mechanism specified in Ticket SF #13 [Trac #430].
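
The lookup-with-fallback rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ParserLookupSketch`, its nested `Parser` interface, and the parser names are hypothetical stand-ins, not the project's actual classes.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ParserLookupSketch {

    /** Hypothetical stand-in for the project's parser type. */
    interface Parser {
        String name();
    }

    private final Map<String, Parser> parsersByExtension = new HashMap<>();
    private final Parser genericParser;

    ParserLookupSketch(Parser genericParser) {
        this.genericParser = genericParser;
    }

    /** Registers a parser for one extension (stored lower-case, without the dot). */
    void register(String extension, Parser parser) {
        parsersByExtension.put(extension.toLowerCase(Locale.ROOT), parser);
    }

    /** Returns the parser registered for the file's extension, else the generic parser. */
    Parser parserFor(String filename) {
        int dot = filename.lastIndexOf('.');
        // No dot, or a trailing dot, means there is no usable extension.
        if (dot < 0 || dot == filename.length() - 1) {
            return genericParser;
        }
        String extension = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return parsersByExtension.getOrDefault(extension, genericParser);
    }

    public static void main(String[] args) {
        ParserLookupSketch lookup = new ParserLookupSketch(() -> "english-text");
        lookup.register("syr", () -> "syr-object");
        System.out.println(lookup.parserFor("01GEN.syr").name());
        System.out.println(lookup.parserFor("notes.txt").name());
    }
}
```

Matching on a lower-cased extension is a design choice of this sketch; the ticket does not say whether extension matching should be case-sensitive.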

Trac #432.

Discussion

  • Kevin Black

    Kevin Black - 2014-06-25

    Commit 281 (CCF) Ticket #432. Added FileExtensionUtils and tests for extracting the file extension from a filename. At some point we may want to use org.apache.commons.io.FilenameUtils instead of this custom utility, but probably not until we convert the project to use Maven.

    Commit 282 (CCF) Ticket #432. Added ResourceParserRegistry which will initialize registered ResourceParsers the first time they are requested and will thereafter return that already initialized parser. Updated DocumentUploaderServlet to use the ResourceParserRegistry instead of ResourceParsers directly. Will need to make the ResourceParser.init method refresh the resource parsers so that paragraphs and sentences are correctly numbered.

    Commit 283 (CCF) Ticket #432. BasicResourceParser: updated section headers and moved the getRawResourceValue method into the new "Support" section.

    Commit 284 (CCF) Ticket #432. Added Section headers to all of the ResourceParsers that extend BasicResourceParser. Added a TODO comment to the SentenceResourceParser to indicate that the default regex is not being used, despite the comment to that effect. Corrected the exception logging report in TokenResourceParser to refer to "tokenizer model" instead of "sentence detector model".

    Commit 285 (CCF) Ticket #432. This resolves the issue introduced in commit 282 where paragraphs and sentences were numbered incorrectly. Added "specializedInit" method to BasicResourceParser which the Paragraph and Sentence ResourceParsers override to reset their index counter which is used when generating paragraph and sentence resource values.

    Commit 286 (CCF) Ticket #432. ResourceParserRegistry: Updated javadoc and section comment.

    Commit 287 (CCF) Ticket #432. Added ResourceParserBuilder interface along with its internal Parameters class. Renamed ResourceParsers to ResourceParserBuilders and reworked the englishTextResourceParser method into an englishTextResourceParserBuilder object. Updated ResourceParserRegistry to properly invoke the englishTextResourceParserBuilder object.

    Commit 288 (CCF) Ticket #432. Changed ResourceParserRegistry public and internal interfaces to require ResourceParserBuilder.Parameter objects instead of separate DataProviderClient and ServletContext objects. Updated DocumentUploaderServlet to call the updated interface properly.

    Commit 289 (CCF) Ticket #432. Reworked the internal ResourceParserFetcher hierarchy in ResourceParserRegistry such that the fetchers store a builder and invoke the builder to fetch/materialize the parser when requested for the first time. It is possible to extend the hierarchy to have an AlwaysFetcher fetcher, or a fetcher that releases an inactive parser after a time (say, if it becomes too expensive to hold it all in memory).
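
The fetcher idea from commit 289 might look something like the sketch below. The real ResourceParserFetcher hierarchy is internal to ResourceParserRegistry and is not shown in this ticket, so the `Fetcher`, `LazyFetcher`, and `AlwaysFetcher` types here are hypothetical, and a plain `Supplier` stands in for ResourceParserBuilder.

```java
import java.util.function.Supplier;

public class LazyFetcherSketch {

    /** Hypothetical fetcher contract: hands out a parser on request. */
    interface Fetcher<P> {
        P fetch();
    }

    /** Builds the parser on the first request and reuses it afterwards. */
    static final class LazyFetcher<P> implements Fetcher<P> {
        private final Supplier<P> builder;
        private P parser;

        LazyFetcher(Supplier<P> builder) {
            this.builder = builder;
        }

        @Override
        public synchronized P fetch() {
            if (parser == null) {
                parser = builder.get(); // materialize on first use only
            }
            return parser;
        }
    }

    /** Builds a fresh parser on every request (the "AlwaysFetcher" idea). */
    static final class AlwaysFetcher<P> implements Fetcher<P> {
        private final Supplier<P> builder;

        AlwaysFetcher(Supplier<P> builder) {
            this.builder = builder;
        }

        @Override
        public P fetch() {
            return builder.get();
        }
    }

    public static void main(String[] args) {
        int[] builds = {0}; // counts how often the builder actually runs
        Fetcher<String> lazy = new LazyFetcher<>(() -> "parser#" + (++builds[0]));
        System.out.println(lazy.fetch());
        System.out.println(lazy.fetch());
        System.out.println("builder invocations: " + builds[0]);
    }
}
```

A fetcher that releases an inactive parser after a time could follow the same contract by recording the last fetch time and clearing its cached parser once a timeout has passed.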

    Commit 290 (CCF) Ticket #432. Refined javadoc for ResourceParserRegistry.

    Commit 329 (CCF) Ticket #432. Updated ResourceParserRegistry.getParserByFileExtension(String, ResourceParserBuilder.Parameters) to use FileExtensionUtils.getPotentialExtensions and then find the ResourceParserFetcher for the most specific extension possible. Tested by parsing a file named "01GEN.sample.syr" which was still parsed as a SYR Object resource.
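
As a rough illustration of the "most specific extension" lookup in commit 329, the sketch below lists candidate extensions from longest to shortest and picks the first one that has a registration. It assumes that FileExtensionUtils.getPotentialExtensions behaves this way; the actual method is not shown in this ticket, and `ExtensionMatchSketch` with its methods is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExtensionMatchSketch {

    /** Candidate extensions for a filename, most specific (longest) first. */
    static List<String> potentialExtensions(String filename) {
        List<String> candidates = new ArrayList<>();
        int dot = filename.indexOf('.');
        // Each dot that is not the last character starts another candidate.
        while (dot >= 0 && dot < filename.length() - 1) {
            candidates.add(filename.substring(dot + 1));
            dot = filename.indexOf('.', dot + 1);
        }
        return candidates;
    }

    /** Name of the parser for the most specific registered extension, else the fallback. */
    static String parserNameFor(String filename, Map<String, String> registered, String fallback) {
        for (String extension : potentialExtensions(filename)) {
            String parserName = registered.get(extension);
            if (parserName != null) {
                return parserName;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Map<String, String> registered = new HashMap<>();
        registered.put("syr", "SYR Object");
        System.out.println(potentialExtensions("01GEN.sample.syr"));
        System.out.println(parserNameFor("01GEN.sample.syr", registered, "English text"));
    }
}
```

With only "syr" registered, "sample.syr" finds no match and the lookup continues to "syr", which is consistent with the test file above still being parsed as a SYR Object resource.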

     
  • Kevin Black

    Kevin Black - 2014-06-25
    • status: new --> in-progress
     
