Example 3 - Preservation and String Table Options

This example shows how to preserve information beyond standard elements and attributes, and how to set limits on the String Table.

What Example 3 Demonstrates

Nagasena always preserves the integrity of XML content (values in elements and attributes). Non-essential information, such as comments, processing instructions, namespace declarations, and DTDs, is omitted by default to achieve maximum compression. This information is most often used to make XML files more human-readable. EXI files are usually generated, transferred, and decoded by software, so they rarely need to be edited directly.

However, if it is important for your use case to preserve information beyond the logical document content, you can still use Nagasena to achieve significant size reduction. You have the option of retaining some or all of these values when you encode an EXI file.

Example 3 adds checkboxes for the Boolean choices and text fields for setting value limits. You can adjust these settings, then encode and decode your own XML files to see how each setting affects the compression of the encoded file and the readability of the decoded file.

These are the options introduced in this example:

Preserve Comments

Retains comments in the standard form <!-- comment text -->.

Preserve Processing Instructions

Retains processing instructions (PIs) in the standard form <?target content?>.

Preserve DTD

Retains Document Type Declaration and Entity Reference information.

Preserve Namespaces

Retains namespace declarations. Also retains prefixes used in elements and attributes.

Preserve Lexical Values

By default, Nagasena replaces the text strings representing information such as numbers and dates with a more efficient equivalent. For example, the number 65000 could be represented as a 2-byte integer rather than a 5-byte character string. Preserve Lexical Values retains the string values rather than allowing Nagasena to convert them to numerical equivalents.
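
The size difference described above can be sketched in plain Java. This is only an illustration of the idea, not Nagasena's actual encoding: the class LexicalSizeSketch and its methods are hypothetical, and the fixed 2-byte packing is an assumption made to mirror the 65000 example. The canonical method also hints at why you might want Preserve Lexical Values: converting text to a number and back can drop details such as leading zeros.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; none of these names belong to Nagasena.
class LexicalSizeSketch {

    // Size of the number when kept as its original text (one byte per ASCII digit).
    static int lexicalSize(String text) {
        return text.getBytes(StandardCharsets.US_ASCII).length;
    }

    // Packs a value in the range 0..65535 into a 2-byte unsigned integer.
    static byte[] packUnsigned16(int value) {
        if (value < 0 || value > 0xFFFF) {
            throw new IllegalArgumentException("value out of 16-bit range: " + value);
        }
        return ByteBuffer.allocate(2).putShort((short) value).array();
    }

    // Reverses packUnsigned16.
    static int unpackUnsigned16(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getShort() & 0xFFFF;
    }

    // Text as it would look after a round trip through an integer type.
    static String canonical(String text) {
        return Integer.toString(Integer.parseInt(text));
    }

    public static void main(String[] args) {
        String text = "65000";
        byte[] packed = packUnsigned16(Integer.parseInt(text));
        System.out.println("bytes as text:    " + lexicalSize(text));
        System.out.println("bytes as integer: " + packed.length);
        System.out.println("value recovered:  " + unpackUnsigned16(packed));
        // A leading zero survives only if the lexical form is preserved.
        System.out.println("065000 becomes:   " + canonical("065000"));
    }
}
```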

Preserve Whitespace

Retains spaces, tabs, and line feeds to maintain the original indentation.

Element/Attribute Value Block Size

Sets the maximum number of values (not characters!) for elements and attributes in the EXI encoded file. Block Size is used to enable streaming, to some extent, when the compression option is selected. The default block size is intentionally set very high (1,000,000). This requires a large buffer to read and process all of the information in a single block. Machines that have limited memory might need to use a smaller block size to support compression.
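
To picture what a block size limit does, the following sketch splits a list of values into consecutive blocks of at most blockSize values each, so that only one block needs to be held in memory at a time. It is a simplified illustration under that assumption; ValueBlockSketch and toBlocks are hypothetical names and do not reflect how Nagasena buffers values internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of Nagasena.
class ValueBlockSketch {

    // Splits values into consecutive blocks holding at most blockSize values each.
    static List<List<String>> toBlocks(List<String> values, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<List<String>> blocks = new ArrayList<>();
        for (int start = 0; start < values.size(); start += blockSize) {
            int end = Math.min(start + blockSize, values.size());
            blocks.add(new ArrayList<>(values.subList(start, end)));
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a", "b", "c", "d", "e");
        // The limit counts values, not characters.
        for (List<String> block : toBlocks(values, 2)) {
            System.out.println(block);
        }
    }
}
```

A smaller block size means less data buffered per block, at the likely cost of less effective compression, because each block is compressed with less context.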

String Table Max Value Length

The first time a string appears in a file during processing, Nagasena automatically adds the string to a table for reuse. This can greatly improve performance and file compression when there are redundant data values. However, if a data set has large strings with unique values, it might not make sense for those strings to go into the table. By setting a maximum length for strings in the table, you can limit the entries to shorter strings that are more likely to repeat in the data set.

The default value is -1, which means strings of any length are added to the string table.

String Table Max Value Partitions

A partition is an entry in the string table. Setting a value limits the number of strings that can be created in the string table. Once the table is filled, the partitions are reused in round-robin fashion as new values are added. This can help prevent machines with lower capacity from exceeding their memory resources.

The default value is -1, which means that there is no limit to the number of strings that can be added to the string table.
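
The two string table limits can be sketched together as a small bounded table. This is a minimal illustration of the behavior described above (skip values that are too long, reuse entries round-robin once the table is full), assuming -1 means "no limit" for both settings. BoundedStringTableSketch is a hypothetical class, not Nagasena's implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not part of Nagasena.
class BoundedStringTableSketch {
    private final int maxLength; // -1 means strings of any length are stored
    private final int capacity;  // -1 means no limit on the number of entries
    private final Map<String, Integer> slotByValue = new HashMap<>();
    private final Map<Integer, String> valueBySlot = new HashMap<>();
    private int nextSlot = 0;

    BoundedStringTableSketch(int maxLength, int capacity) {
        this.maxLength = maxLength;
        this.capacity = capacity;
    }

    // Returns the slot of a stored value, or -1 if the value is not in the table.
    int lookup(String value) {
        Integer slot = slotByValue.get(value);
        return slot == null ? -1 : slot;
    }

    // Stores a value and returns its slot, or -1 if the value is not stored.
    int add(String value) {
        if (maxLength >= 0 && value.length() > maxLength) {
            return -1; // too long to be worth keeping for reuse
        }
        if (capacity == 0) {
            return -1; // a capacity of zero disables the table
        }
        Integer existing = slotByValue.get(value);
        if (existing != null) {
            return existing;
        }
        int slot = nextSlot;
        if (capacity > 0) {
            // Table is bounded: overwrite the oldest slot in round-robin order.
            String evicted = valueBySlot.get(slot);
            if (evicted != null) {
                slotByValue.remove(evicted);
            }
            nextSlot = (nextSlot + 1) % capacity;
        } else {
            nextSlot++;
        }
        valueBySlot.put(slot, value);
        slotByValue.put(value, slot);
        return slot;
    }
}
```

With a maximum length of 3 and a capacity of 2, for example, a third short string overwrites the first entry, and a long string is never stored at all.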

How to Use Example 3

To install and run Example 3:

  1. Download and expand OpenEXI_Example3.zip. This zip archive contains the compiled example application classes and Java source code. Expanding the file creates a directory named "OpenEXI_Example3".
  2. From the command line, change into the "OpenEXI_Example3" directory.
  3. Enter the command:
    java -jar OpenEXI_Example3.jar

To encode an XML file to EXI:

  1. Click Browse... to select an XML file to encode. The selected file name appears in the Source File field. A suggested name is displayed in the Destination File field, but you can edit the location or file name according to your needs.
  2. Use the radio buttons to choose an alignment type (byte-aligned documents are the easiest to examine with a text editor, if you want to peek inside).
  3. Select checkboxes to try different preservation settings.
  4. Enter a new integer value for Element/Attribute Block Size.
  5. Enter new integer values for String Table Max Value Length and String Table Max Value Partitions.
  6. Click Encode.

To decode an EXI file to XML:

  1. Click Browse... to select an EXI file to decode. The selected file name appears in the Source File field. A suggested name is displayed in the Destination File field, but you can edit the location or file name according to your needs.
  2. Set the Alignment, preservation checkboxes, and integer values to the same values used to encode the file.

    Note: You might notice in the code that there is no option for preserving whitespace in the EXIReader. If whitespace is preserved when the file is encoded, no setting is required when the file is decoded.
  3. Click Decode.

Code Highlights

Complete, commented source code is included in the src directory in OpenEXI_Example3.zip. This section highlights the important updates in each iteration as the examples build on one another.

EncodeEXI

The method signature is much longer now that so many options are available. In practice, you will hard-code the values and set only the options you need for your particular processing requirements.

    public void encodeEXI(
        String sourceFile, 
        String destinationFile,
        String alignment,
        Boolean preserveComments,
        Boolean preservePIs,
        Boolean preserveDTD,
        Boolean preserveNamespace,
        Boolean preserveLexicalValues,
        Boolean preserveWhitespace,
        int blockSize,
        int maxValueLength,
        int maxValuePartitions
    ) 

Preservation of comments, processing instructions, DTD information, and namespace declarations is handled with GrammarOptions settings. Each of the addCM, addPI, addDTD, and addNS methods takes the current options value and returns it with a corresponding numerical value added. The combined value indicates which options are selected. In practice, you can compute the final value yourself and set the options variable directly if your settings are not likely to change.

        if (preserveComments) options = GrammarOptions.addCM(options);
        if (preservePIs) options = GrammarOptions.addPI(options);
        if (preserveDTD) options = GrammarOptions.addDTD(options);
        if (preserveNamespace) options = GrammarOptions.addNS(options);
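
A common way to build a single combinable options value like this is a set of bit flags. The sketch below shows that general technique. Whether GrammarOptions works exactly this way internally is an assumption, and every constant here is hypothetical apart from the starting value of 2, which this tutorial gives for DEFAULT_OPTIONS.

```java
// Hypothetical illustration of combinable option flags; not the real GrammarOptions class.
class OptionFlagsSketch {
    static final int DEFAULT = 0x02;    // matches the DEFAULT_OPTIONS value quoted in this tutorial
    static final int COMMENTS = 0x04;   // assumed value, for illustration only
    static final int PIS = 0x08;        // assumed value, for illustration only
    static final int DTD = 0x10;        // assumed value, for illustration only
    static final int NAMESPACES = 0x20; // assumed value, for illustration only

    // Returns options with the given flag switched on; adding a flag twice has no further effect.
    static int add(int options, int flag) {
        return options | flag;
    }

    // Reports whether the given flag is switched on.
    static boolean has(int options, int flag) {
        return (options & flag) != 0;
    }

    public static void main(String[] args) {
        int options = DEFAULT;
        options = add(options, COMMENTS);
        options = add(options, NAMESPACES);
        System.out.println("combined value: " + options);
        System.out.println("comments kept:  " + has(options, COMMENTS));
        System.out.println("DTD kept:       " + has(options, DTD));
    }
}
```

Because each flag occupies its own bit, the combined value can be precomputed and hard-coded once your settings are fixed.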

Boolean values for preserving whitespace and lexical content are set directly in the Transmogrifier.

        transmogrifier.setPreserveLexicalValues(preserveLexicalValues);
        transmogrifier.setPreserveWhitespaces(preserveWhitespace);

For integer values, the code tests to see if they are different from the defaults. If so, it passes the new value to the Transmogrifier.

        // Set the number of elements and attributes to be processed as a block.
        if (blockSize!=1000000) transmogrifier.setBlockSize(blockSize);
            
        // Set the maximum length for values stored in the String Table for reuse.
        if (maxValueLength>-1) transmogrifier.setValueMaxLength(maxValueLength);
            
        // Set the maximum number of values stored in the String Table.
        if (maxValuePartitions >-1) 
            transmogrifier.setValuePartitionCapacity(maxValuePartitions);

The options variable was set to the value GrammarOptions.DEFAULT_OPTIONS (equal to 2) in the previous examples. With the additional options set here, the values will range from 2-255, depending on your selections.

            grammarCache = new GrammarCache((EXISchema)null, options);
            transmogrifier.setGrammarCache(grammarCache);

GrammarOptions, combined with the values set directly in the Transmogrifier, give you control over the content that is preserved or omitted from the final encoded file.

DecodeEXI

The method signature for decodeEXI has expanded similarly to the encodeEXI method. However, there is no need for the preserve whitespace option: if the whitespace is preserved when encoded, EXIReader will retain it when decoding the file.

    public void decodeEXI(
            String sourceFile, 
            String destinationFile,
            String alignment,
            Boolean preserveComments,
            Boolean preservePIs,
            Boolean preserveDTD,
            Boolean preserveNamespace,
            Boolean preserveLexicalValues,
            int blockSize,
            int maxValueLength,
            int maxValuePartitions
    )

Set the values in the EXIReader as they were set for the Transmogrifier during encoding. Again, there is no setting required for preserving whitespace in EXIReader.

        // Set preservation preferences in Grammar Options
        if (preserveComments) options = GrammarOptions.addCM(options);
        if (preservePIs) options = GrammarOptions.addPI(options);
        if (preserveDTD) options = GrammarOptions.addDTD(options);
        if (preserveNamespace) options = GrammarOptions.addNS(options);

        // Set preservation preferences handled directly in the EXIReader.
        reader.setPreserveLexicalValues(preserveLexicalValues);
            
        // Set the number of elements processed as a block.
        if (blockSize!=1000000) reader.setBlockSize(blockSize);
            
        // Set the maximum length for values stored in the String Table for reuse.
        if (maxValueLength>-1) reader.setValueMaxLength(maxValueLength);  
            
        // Set the maximum number of values stored in the String Table.
        if (maxValuePartitions >-1) 
            reader.setValuePartitionCapacity(maxValuePartitions);

Pass the options variable to the GrammarCache, set the GrammarCache for the EXIReader, and the file is ready to be decoded according to your settings.

            grammarCache = new GrammarCache((EXISchema)null, options);
            reader.setGrammarCache(grammarCache);

This example demonstrated how to set preservation options to retain information beyond the logical content of an XML file. Example 4 shows how to use a schema to enable still greater document compression.


Updated August 23, 2013.
Tutorial by Dennis Dawson with Takuki Kamiya of Fujitsu Laboratories of America.