5 - Extending the normaliser

Allan Cunliffe

This step extends our basic normaliser to do something mildly useful. Building on the previous steps, we will parse foo files and write out each part within a part tag. The opening tag will be "foo", which is already done. We also have to handle the escape character.

Alter the normaliser parse method

The only part of the normaliser that needs to change is the parse method. First, let's open the XenaInputSource and get an InputStream that we can read bytes from, then turn it into a character stream. Since we know that the character set will be ASCII or UTF-8, we don't need to worry about guessing it. If we were required to do this, a number of helper methods exist within Xena for the purpose. Initially, let's just write out every character we get into the content of the opening foo element.

        public void parse(InputSource input) throws IOException, SAXException {
                ContentHandler contentHandler = getContentHandler();
                AttributesImpl openingAttribute = new AttributesImpl();
                contentHandler.startElement(FOO_URI, FOO_OPENING_ELEMENT_LOCAL_NAME,
                                                        FOO_OPENING_ELEMENT_QUALIFIED_NAME,
                                                        openingAttribute);
                BufferedReader reader = new BufferedReader(input.getCharacterStream());
                // Keep the result of read() as an int: once it is cast to char,
                // the comparison with -1 (end of stream) can never succeed.
                int nextCharVal;
                while ((nextCharVal = reader.read()) != -1) {
                        char[] newCharArray = {(char) nextCharVal};
                        contentHandler.characters(newCharArray, 0, 1);
                }
                contentHandler.endElement(FOO_URI, FOO_OPENING_ELEMENT_LOCAL_NAME,
                                                        FOO_OPENING_ELEMENT_QUALIFIED_NAME);
        }
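The read loop can be tried outside Xena. The following is a minimal sketch, assuming UTF-8 input supplied as a plain InputStream; the class name ReadLoopSketch and the method readAll are illustrative only and are not part of Xena. It shows the byte stream being wrapped in a character stream with an explicit character set, and the value returned by read() being kept as an int so that the end-of-stream value -1 can be detected before casting to char.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReadLoopSketch {

    // Reads every character from the byte stream, decoding it as UTF-8.
    // (Hypothetical helper, standing in for the loop inside parse().)
    static String readAll(InputStream in) throws IOException {
        StringBuilder seen = new StringBuilder();
        try (Reader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int value; // int, not char: -1 signals end of stream
            while ((value = reader.read()) != -1) {
                seen.append((char) value); // safe to cast once -1 is ruled out
            }
        }
        return seen.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "~beginFoo~caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes)));
    }
}
```

Casting the result of read() to char before the comparison turns -1 into the character '\uffff', which is never equal to -1, so a loop written that way would not terminate at end of stream.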

So far so good, but it doesn't do a great deal toward getting our Foo contents into the desired schema. So, what is next? The first thing to do is to skip the opening "~beginFoo". This magic number is already defined as a static final string in the FooGuesser, so we can use its length. We can update our parse method as follows:

        public void parse(InputSource input) throws IOException, SAXException {
                ContentHandler contentHandler = getContentHandler();
                AttributesImpl openingAttribute = new AttributesImpl();
                contentHandler.startElement(FOO_URI, FOO_OPENING_ELEMENT_LOCAL_NAME,
                                                        FOO_OPENING_ELEMENT_QUALIFIED_NAME,
                                                        openingAttribute);
                BufferedReader reader = new BufferedReader(input.getCharacterStream());
                // Skip the magic number. FOO_MAGIC is a String, so length() is a method call;
                // the int result widens to the long that skip() expects.
                reader.skip(FooGuesser.FOO_MAGIC.length());
                // Keep the result of read() as an int so that -1 (end of stream) is detectable.
                int nextCharVal;
                while ((nextCharVal = reader.read()) != -1) {
                        char[] newCharArray = {(char) nextCharVal};
                        contentHandler.characters(newCharArray, 0, 1);
                }
                contentHandler.endElement(FOO_URI, FOO_OPENING_ELEMENT_LOCAL_NAME,
                                                        FOO_OPENING_ELEMENT_QUALIFIED_NAME);
        }
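One detail worth knowing about skipping the magic number: Reader.skip returns the number of characters actually skipped, which may be fewer than requested. For a small local file this rarely matters, but a more defensive version might loop until the whole magic number has been consumed. The following is a sketch only; SkipSketch, skipFully and the MAGIC constant are hypothetical names standing in for the normaliser and FooGuesser.FOO_MAGIC.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class SkipSketch {

    // Stand-in for FooGuesser.FOO_MAGIC (assumed here to be this String).
    static final String MAGIC = "~beginFoo";

    // Skips exactly count characters, or returns false if the stream ends first.
    static boolean skipFully(Reader reader, long count) throws IOException {
        long remaining = count;
        while (remaining > 0) {
            long skipped = reader.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else if (reader.read() == -1) {
                return false; // end of stream before the magic number was consumed
            } else {
                remaining--; // read() consumed one character on our behalf
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Reader reader = new StringReader("~beginFoo~first part");
        boolean skipped = skipFully(reader, MAGIC.length());
        System.out.println("skipped magic number: " + skipped);
        System.out.println("next character: " + (char) reader.read());
    }
}
```

Returning a boolean lets the caller decide what to do with a file that is shorter than its own magic number, rather than silently carrying on.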

Add support for checking the character

Now we will check each character: if it is a '~', we close the current part tag and open a new one. To keep the XML well-formed, we only close an element if one is actually open. We also need a final close element when we have no more input (as long as we have processed at least one Foo part).

        public static final String FOO_PART_ELEMENT_LOCAL_NAME = "part";
        public static final String FOO_PART_ELEMENT_QUALIFIED_NAME = "foo:part";

        public void parse(InputSource input) throws IOException, SAXException {
                ContentHandler contentHandler = getContentHandler();
                AttributesImpl openingAttribute = new AttributesImpl();
                AttributesImpl partAttribute = new AttributesImpl();
                contentHandler.startElement(FOO_URI, FOO_OPENING_ELEMENT_LOCAL_NAME,
                                                        FOO_OPENING_ELEMENT_QUALIFIED_NAME,
                                                        openingAttribute);
                BufferedReader reader = new BufferedReader(input.getCharacterStream());
                // Skip the magic number, as in the previous version.
                reader.skip(FooGuesser.FOO_MAGIC.length());
                int nextCharVal;
                boolean startedElement = false;
                while ((nextCharVal = reader.read()) != -1) {
                        char currentChar = (char) nextCharVal;
                        if (currentChar == '~') {
                                // Don't close the element if we haven't already started one!
                                if (startedElement) {
                                        contentHandler.endElement(FOO_URI, FOO_PART_ELEMENT_LOCAL_NAME,
                                                                        FOO_PART_ELEMENT_QUALIFIED_NAME);
                                }
                                // Open the next part; it stays open until the next '~' or end of input.
                                contentHandler.startElement(FOO_URI,
                                                                        FOO_PART_ELEMENT_LOCAL_NAME,
                                                                        FOO_PART_ELEMENT_QUALIFIED_NAME,
                                                                        partAttribute);
                                startedElement = true;
                        } else {
                                char[] newCharArray = {currentChar};
                                contentHandler.characters(newCharArray, 0, 1);
                        }
                }
                // Don't close the element if we haven't already started one!
                if (startedElement) {
                        contentHandler.endElement(FOO_URI, FOO_PART_ELEMENT_LOCAL_NAME,
                                                        FOO_PART_ELEMENT_QUALIFIED_NAME);
                }
                contentHandler.endElement(FOO_URI, FOO_OPENING_ELEMENT_LOCAL_NAME,
                                                        FOO_OPENING_ELEMENT_QUALIFIED_NAME);
        }

Test the normaliser

After running the normaliser tester, the output now looks like this:

#java -cp ../../../xena/xena.jar:foo.jar au.gov.naa.digipres.xena.demo.foo.test.NormaliseTester
/home/dpuser/workspace/plugin-howto/03_basic_normaliser_part_ii/foo_plugin/dist
Here is the best guess returned by Xena:
Guess... type: Foo
possible: Unknown
dataMatch:True
magicNumber: True
extensionMatch: True
mimeMatch: Unknown
certain: Unknown
priority: Default
-----------------------------------------
Here are the results of the normalisation:
Normalisation successful.
The input source name file:/home/dpuser/workspace/plugin-howto/03_basic_normaliser_part_ii/foo_plugin/dist/../../../data/example_file.foo
normalised to: example_file.foo_Foo.xena
with normaliser: "Foo"
to the folder: /home/dpuser/workspace/plugin-howto/03_basic_normaliser_part_ii/foo_plugin/dist
and the Xena id is: file:/../../../data/example_file.foo
-----------------------------------------

And here are the contents of our normalised file:

<xena>
        <meta_data>
                <meta_data_wrapper_name>Default Package Wrapper</meta_data_wrapper_name>
                <normaliser_name>au.gov.naa.digipres.xena.demo.foo.FooNormaliser</normaliser_name>
                <input_source_uri>file:/../../../data/example_file.foo</input_source_uri>
        </meta_data>
        <content>
                <foo:data xmlns:foo="http://preservation.naa.gov.au/foo/0.1">
                        <foo:part>this is the first part of the foo file</foo:part>
                        <foo:part>this is the second part. \</foo:part>
                        <foo:part>this is still the second part as we used the escape character.</foo:part>
                </foo:data>
        </content>
</xena>
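To inspect the events produced by the part-splitting logic without running the full Xena tester, one option is to feed them into the JDK's own SAX serialiser. The sketch below is an assumption-laden stand-in, not the Xena normaliser: the class PartEventsSketch and the method toXml are hypothetical, the element names and namespace URI are copied from the output above, and the input is passed as a String with the magic number already removed.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

public class PartEventsSketch {

    static final String FOO_URI = "http://preservation.naa.gov.au/foo/0.1";

    // Splits body on '~' and returns the resulting SAX events serialised as XML.
    static String toXml(String body) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        AttributesImpl noAttributes = new AttributesImpl();
        handler.startDocument();
        handler.startPrefixMapping("foo", FOO_URI);
        handler.startElement(FOO_URI, "data", "foo:data", noAttributes);
        boolean startedElement = false;
        for (char c : body.toCharArray()) {
            if (c == '~') {
                // Close the previous part, if any, then open the next one.
                if (startedElement) {
                    handler.endElement(FOO_URI, "part", "foo:part");
                }
                handler.startElement(FOO_URI, "part", "foo:part", noAttributes);
                startedElement = true;
            } else {
                handler.characters(new char[] {c}, 0, 1);
            }
        }
        if (startedElement) {
            handler.endElement(FOO_URI, "part", "foo:part");
        }
        handler.endElement(FOO_URI, "data", "foo:data");
        handler.endPrefixMapping("foo");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml("~first part~second part"));
    }
}
```

If the start and end events are not balanced, the serialised text will show it straight away, which makes this a quick check before wiring the logic into the normaliser.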

Account for the escape character

So we can see that we are now successfully capturing the contents of the original foo file. However, we are not taking the escape character into account: the '\' is written out and the '~' that follows it still starts a new part. This is a fairly simple modification - the only catch is to check for end of stream when we get the character following the escape. The only code that changes is the while loop within the parse method, so for brevity everything else is omitted. The code is written to maximise clarity rather than efficiency. The rule is to print the character following the escape if it is a '~' or a '\', and to ignore the escape otherwise. Either way the character after the escape is written out unchanged, so the code simply writes out whatever follows the escape, if there is anything.

                while ((nextCharVal = reader.read()) != -1) {
                        char currentChar = (char) nextCharVal;
                        if (currentChar == '~') {
                                // Close the previous part (if any) before opening the next one.
                                if (startedElement) {
                                        contentHandler.endElement(FOO_URI, FOO_PART_ELEMENT_LOCAL_NAME,
                                                FOO_PART_ELEMENT_QUALIFIED_NAME);
                                }
                                contentHandler.startElement(FOO_URI, FOO_PART_ELEMENT_LOCAL_NAME,
                                        FOO_PART_ELEMENT_QUALIFIED_NAME,
                                        partAttribute);
                                startedElement = true;
                        } else if (currentChar == '\\') {
                                // Write out whatever follows the escape, unless the stream has ended.
                                int escapedCharVal = reader.read();
                                if (escapedCharVal == -1) {
                                        break;
                                }
                                char[] escapedCharArray = {(char) escapedCharVal};
                                contentHandler.characters(escapedCharArray, 0, 1);
                        } else {
                                char[] newCharArray = {currentChar};
                                contentHandler.characters(newCharArray, 0, 1);
                        }
                }

Here are the contents of the normalised file after the above changes were made, showing that we are now taking escaped characters into account:

<xena>
        <meta_data>
                <meta_data_wrapper_name>Default Package Wrapper</meta_data_wrapper_name>
                <normaliser_name>au.gov.naa.digipres.xena.demo.foo.FooNormaliser</normaliser_name>
                <input_source_uri>file:/../../../data/example_file.foo</input_source_uri>
        </meta_data>
        <content>
                <foo:data xmlns:foo="http://preservation.naa.gov.au/foo/0.1">
                        <foo:part>this is the first part of the foo file</foo:part>
                        <foo:part>this is the second part. ~this is still the second part as we used the escape character.</foo:part>
                </foo:data>
        </content>
</xena>
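The escape rule can also be exercised on its own, away from SAX and Xena. The following is a minimal sketch under stated assumptions: FooEscapeSketch and splitParts are hypothetical names, the input has already had its magic number removed, and any text appearing before the first '~' is simply dropped (the normaliser itself would write it into the opening element instead).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class FooEscapeSketch {

    // Splits the stream into parts on '~', treating '\' as an escape character.
    static List<String> splitParts(Reader reader) throws IOException {
        List<String> parts = new ArrayList<>();
        StringBuilder current = null; // null until the first '~' opens a part
        int value;
        while ((value = reader.read()) != -1) {
            char c = (char) value;
            if (c == '~') {
                // Finish the open part, if any, and start a new one.
                if (current != null) {
                    parts.add(current.toString());
                }
                current = new StringBuilder();
            } else {
                if (c == '\\') {
                    // Take the following character literally, if there is one.
                    value = reader.read();
                    if (value == -1) {
                        break;
                    }
                    c = (char) value;
                }
                if (current != null) {
                    current.append(c);
                }
            }
        }
        // Final close, as long as at least one part was started.
        if (current != null) {
            parts.add(current.toString());
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        Reader reader = new StringReader("~first part~second part. \\~still the second part");
        for (String part : splitParts(reader)) {
            System.out.println("part: " + part);
        }
    }
}
```

Keeping the splitting logic in a small method like this makes edge cases, such as an escape character at the very end of the file, easy to check before they reach the ContentHandler.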

Related

Wiki: Main_Page