Compressing XML—Part I, Writing WBXML

Wireless Binary XML (WBXML) is a compact representation of XML and is part of the presentation logic in Wireless Application Protocol (WAP). WBXML significantly improves the efficiency of transmitting XML over narrow-bandwidth networks, where data size is of paramount importance.

Although WBXML was originally meant for wireless networks, it’s suitable for more than wireless applications. Conventional web applications can also take advantage of WBXML’s ability to carry the same information as that of its XML counterpart, but in a reduced size. With the advent of tools that can work directly with WBXML, there is no need to convert WBXML to XML before using it. You can generate SAX events directly from WBXML and even load a WBXML tree (DOM) in memory without converting it to XML.

WBXML is particularly useful in Electronic Data Interchange applications, where applications exchange huge volumes of XML data over the Internet. As an example, consider a database synchronization scenario, where proprietary formats are already converging to an open XML based grammar called “SyncML”. SyncML supports all the familiar database synchronization operations (such as add, delete, copy, search, etc.) and requires only the simple exchange of SyncML files over the Internet. After interested parties have successfully exchanged SyncML files, back-end tools automatically synchronize their data stores. Hence database synchronization becomes a simple exchange of XML files. Unfortunately, SyncML files are often large, which leads back to WBXML—you can improve the efficiency of SyncML file exchanges by reducing their size, and you can do that by representing them in WBXML format.

Generating WBXML Format from XML
Listing 1 contains a Wireless Markup Language (WML) file. WML is an XML based markup language similar to HTML, but optimized to serve as a presentation layer in the small, monochrome screens typically associated with WAP devices. I’ll show you how to convert WML to WBXML. Listing 2 shows the WBXML format resulting from the conversion process and Table 1 (see the last page of this article) gives descriptive notes for each byte in Listing 2.

Most of the WBXML format is not human-readable; it consists of raw (non-textual) bytes, or octets, which this article writes in hexadecimal (hex) form. Two hex digits represent one byte. For example, the binary number 0100 1000 is 0x48 in hex (0x is the standard prefix for hex numbers; the digit 4 represents 0100 and the digit 8 represents 1000). As another example, the binary number 1100 1110 0111 1101 is 0xCE 0x7D in hex.
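The conversion above can be checked with a few lines of Python. This is only an illustrative sketch; the function name byte_to_hex is made up for this example and is not part of any WBXML tool.

```python
def byte_to_hex(bits: str) -> str:
    """Convert binary digits such as '0100 1000' into 0x-prefixed hex bytes."""
    digits = bits.replace(" ", "")
    if len(digits) % 8 != 0:
        raise ValueError("expected a whole number of 8-bit bytes")
    # Split into groups of eight bits and parse each group as one byte.
    octets = [int(digits[i:i + 8], 2) for i in range(0, len(digits), 8)]
    return " ".join(f"0x{octet:02X}" for octet in octets)


print(byte_to_hex("0100 1000"))            # 0x48
print(byte_to_hex("1100 1110 0111 1101"))  # 0xCE 0x7D
```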

The first byte of a WBXML file represents the version of the WBXML specification used in the file. The version uses a zero-based syntax for the major number: 0x03 means version 1.3 and 0x13 means version 2.3. This example uses WBXML version 1.3, so the first byte of the WBXML file is 0x03. Version 1.3 is the latest version and belongs to the WAP 2.0 release (see the June 2001 release on WAP Forum’s web site).
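One reading that matches both values quoted above is that the upper four bits hold the major version minus one and the lower four bits hold the minor version. The sketch below assumes that layout; decode_version is a hypothetical helper name.

```python
def decode_version(version_byte: int) -> str:
    """Decode a WBXML version byte, assuming major-1 in the high nibble."""
    major = (version_byte >> 4) + 1   # zero-based major number
    minor = version_byte & 0x0F       # minor number is stored as-is
    return f"{major}.{minor}"


print(decode_version(0x03))  # 1.3
print(decode_version(0x13))  # 2.3
```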

The sequence of bytes following the version number represents the Document Type Definition (DTD) of the XML file that you want to transform. There are two ways of doing this. First, you can include a well-known public ID for the DTD. If no public ID is available, you can include the DTD in its string form. Listing 2 uses the first method (0x04 for WML, byte 2 of Table 1).

Byte 3 contains the character-encoding declaration. WBXML requires an Internet Assigned Numbers Authority (IANA) MIBEnum value instead of a textual character encoding name. For further information about IANA MIBEnum values, refer to the IANA web site (see the resources column). Table 1 shows the MIBEnum value for the UTF-8 encoding used in the example, which is 106 (decimal) or 0x6A.

A string table follows the character encoding. A string table is a reusable sequence of characters (yes, characters and not bytes) that you include once in a WBXML file and can refer to from anywhere else in the file. String tables reduce WBXML file size by using references to avoid inserting any sequence of characters more than once. The fourth byte specifies the total length (number of characters) of the string table. This example doesn’t use a string table; therefore byte 4 of Listing 2 contains 0x00, which means the length of the string table is zero.
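The four header bytes described so far can be assembled as in the following sketch. The constant and function names are assumptions made for illustration. The sketch also assumes every field fits in a single byte, which holds for this example, and it measures the string table in bytes, which equals the character count for single-byte characters.

```python
WBXML_1_3 = 0x03      # version byte for WBXML 1.3
WML_PUBLIC_ID = 0x04  # well-known public identifier used in Table 1
UTF8_MIBENUM = 106    # IANA MIBEnum value for UTF-8 (0x6A)


def build_header(version, public_id, charset, string_table=b""):
    """Return the WBXML header bytes followed by the string table."""
    fields = (version, public_id, charset, len(string_table))
    for value in fields:
        if not 0 <= value < 0x80:
            raise ValueError("this sketch only handles single-byte fields")
    return bytes(fields) + string_table


header = build_header(WBXML_1_3, WML_PUBLIC_ID, UTF8_MIBENUM)
print(header.hex(" "))  # 03 04 6a 00
```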

Byte 5 represents the root element (<wml>). The example uses a standard WML DTD for which the WAP Forum has defined a WBXML encoding table. So in this case you can simply use the WML byte codes that the WAP Forum has defined in their WML specification and you don’t need to include any element names in the WBXML document. The WBXML specification allows you to use byte codes ranging from 0x05 to 0x3F to specify tags, a range called the tag code space.

The WBXML specification also defines a code space for Global Tokens, which have special meaning for WBXML parsers, so you can’t use them as element codes. For example, 0x00 to 0x04 are Global Tokens.

You may need to add a numeric value to the byte code of each element, depending upon one of the following three scenarios:

  • When an element contains content (text nodes or child elements) but no attributes, add 0x40 to the element byte code.
  • When the element contains one or more attributes but no content, add 0x80 to the element byte code.
  • When the element contains both attributes and content, add 0xC0 to the element byte code.

In the example in Listing 1, the root element has a byte code of 0x3F. Therefore byte 5 in Table 1 is 0x7F (0x3F + 0x40).

The Card element has byte code 0x27, so byte 6 is 0xE7 (0x27 + 0xC0).
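The three rules amount to setting two flag bits on the tag code. The sketch below illustrates this with the four WML tag codes that appear in this article; the dictionary and function names are hypothetical, and a real encoder would carry the full WML token table.

```python
HAS_CONTENT = 0x40     # element has text nodes or child elements
HAS_ATTRIBUTES = 0x80  # element has one or more attributes

# Only the WML tag codes used in this article's example.
WML_TAGS = {"wml": 0x3F, "card": 0x27, "p": 0x20, "a": 0x1C}


def tag_byte(name, attributes=False, content=False):
    """Return the byte written for an element's start tag."""
    code = WML_TAGS[name]
    if attributes:
        code |= HAS_ATTRIBUTES
    if content:
        code |= HAS_CONTENT
    return code


print(hex(tag_byte("wml", content=True)))                    # 0x7f
print(hex(tag_byte("card", attributes=True, content=True)))  # 0xe7
```

Because the tag codes stop at 0x3F, the two flag bits never collide with the code itself, so adding 0x40, 0x80, or 0xC0 is the same as setting the bits.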

When you include an element with attributes (such as byte 6 in Table 1), the attribute list follows immediately after the element code. WBXML declares an attribute code space that overlaps with the tag code space; the overlap does not produce any ambiguity, because a parser always knows whether it is reading a tag or an attribute list. The global token 0x01 marks the end of the attribute list (byte number 30 in Table 1). Attribute and element code spaces overlap with each other, but not with the code space for global tokens. Therefore all the content from bytes 7 to 30 in Table 1 belongs to the attribute list of the card element.

The attribute code space consists of attribute/value pairs. All valid WML attributes have specific codes in WML’s attribute code space. For example the byte 7 is 0x55, which is the code for the id attribute.

Attribute values follow the attribute codes. Each attribute value is a string, either an inline string or a reference into the string table. The example uses an inline string, so the eighth byte (0x03) is a global token signifying that the value is a NULL (0x00) terminated string. Bytes 9-12 form the inline string, the value of the id attribute, and byte 13 is the NULL byte specifying the end of the inline string.
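The attribute encoding just described can be sketched as follows. The helper names are made up for this example, only the three attribute codes that appear in Table 1 are listed, and only inline string values are handled.

```python
END = 0x01    # global token: end of attribute list
STR_I = 0x03  # global token: inline NULL-terminated string follows

# Only the WML attribute codes that appear in Table 1.
WML_ATTRS = {"id": 0x55, "title": 0x36, "href": 0x4A}


def encode_attribute(name, value):
    """Encode one attribute as code, STR_I, value bytes, NULL."""
    return bytes([WML_ATTRS[name], STR_I]) + value.encode("utf-8") + b"\x00"


def encode_attribute_list(pairs):
    """Encode all attributes of an element and close the list with END."""
    return b"".join(encode_attribute(n, v) for n, v in pairs) + bytes([END])


card_attrs = encode_attribute_list([("id", "Home"), ("title", "The home page")])
print(card_attrs[:7].hex(" "))  # 55 03 48 6f 6d 65 00
print(len(card_attrs))          # 24
```

The 24 bytes produced here correspond to bytes 7 through 30 of Table 1.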

The attribute/value pairs continue up through byte 30, after which you’ll see the code for a <p> element, 0x60 (0x20 + 0x40).

Byte 30 (0x01) marks the end of an attribute-value pair list, and byte 78 does the same for the <a> element. Now look at bytes 100, 101 and 102. They all contain 0x01 as well, but those signify the end of an element. The global token 0x01 performs both jobs without ambiguity. When 0x01 occurs at the end of an attribute list, it denotes the end of the attribute-value pair list. Otherwise, it denotes the end of an element.

The 0x01 token marks the close of a tag. Unlike standard XML, the closing mark doesn’t specify the name of the element that’s being closed. In effect, that means you can only write WBXML for well-formed XML.
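To see why the two roles of 0x01 never clash, consider the small walker below. It is a deliberately limited sketch: it understands only tag bytes, attribute codes, inline strings, and the END token, and the function name is hypothetical. The parser state alone decides what each 0x01 means.

```python
END = 0x01    # global token: end of attribute list or end of element
STR_I = 0x03  # global token: inline NULL-terminated string follows


def classify_end_tokens(stream: bytes):
    """Return (offset, role) for every 0x01 token in the stream."""
    roles = []
    in_attribute_list = False
    i = 0
    while i < len(stream):
        token = stream[i]
        if token == STR_I:
            # Skip the inline string, including its terminating NULL byte.
            i = stream.index(0x00, i + 1) + 1
        elif token == END:
            if in_attribute_list:
                roles.append((i, "end of attribute list"))
                in_attribute_list = False
            else:
                roles.append((i, "end of element"))
            i += 1
        elif in_attribute_list:
            i += 1  # an attribute code such as 0x4A (href)
        else:
            # A tag byte: bit 0x80 announces that an attribute list follows.
            in_attribute_list = bool(token & 0x80)
            i += 1
    return roles


# Bytes 66 to 100 of Table 1: the <a> element, its href attribute, and its text.
anchor = (bytes([0xDC, 0x4A, STR_I]) + b"#Product\x00" + bytes([END, STR_I])
          + b" Click for Product.\x00" + bytes([END]))
for offset, role in classify_end_tokens(anchor):
    print(66 + offset, role)
# 78 end of attribute list
# 100 end of element
```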

You can follow the same logic pattern to match elements, attributes and values through the end of Table 1 and Listing 2.

Another Conversion Example
Listing 3 (XML) and Listing 4 (WBXML) contain a second conversion example. There is one difference between this example and the first one: this example shows the WBXML for a SyncML file. Unlike the WML file in the first example, there’s no DTD containing a WBXML encoding table. Therefore, the example uses string table references for the tag names. To do this, use the 0x04 global token to refer to the string table, and add 0x40, 0x80, or 0xC0 to this global token for content and attributes as discussed in the first example. Following the string table reference byte you’ll find an offset byte that tells how many bytes to skip from the start of the string table before starting to read. For example, suppose you have the following string table:

   't' 'a' 'g' '1' 'NULL' 't' 'a' 'g' '2' 'NULL' 't' 'a' 'g' '3' 'NULL'

If you write 0x04 0x05 to specify an element name, you’re referring to tag2. The 0x04 means jump to the string table, and 0x05 means skip the first five bytes of the string table and read from there up to the next NULL character.
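The lookup can be sketched in a few lines. The helper names below are assumptions made for illustration, and the sketch handles only offsets that fit in a single byte.

```python
LITERAL = 0x04  # global token: tag name is stored in the string table


def literal_tag(offset, attributes=False, content=False):
    """Return the two bytes that reference a tag name in the string table."""
    if not 0 <= offset < 0x80:
        raise ValueError("this sketch only handles single-byte offsets")
    code = LITERAL | (0x80 if attributes else 0) | (0x40 if content else 0)
    return bytes([code, offset])


def read_string(table: bytes, offset: int) -> str:
    """Read from the given offset up to, but not including, the next NULL."""
    end = table.index(0x00, offset)
    return table[offset:end].decode("utf-8")


string_table = b"tag1\x00tag2\x00tag3\x00"
print(read_string(string_table, 5))           # tag2
print(literal_tag(5, content=True).hex(" "))  # 44 05
```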

In Part II of this article, you’ll see how to use WBXML on the server.

Table 1: WBXML Byte Code for the WML of Listing 1

Byte No. Byte Code. Description. Byte No. Byte Code. Description.
1 0x03 WBXML version 1.3. 91 0x50 P.
2 0x04 Well known public identifier. 92 0x72 r.
3 0x6A Charset = UTF-8 93 0x6F o.
4 0x00 String table length. 94 0x64 d.
5 0x7F <wml> with content. 95 0x75 u.
6 0xE7 <card> with content and attributes. 96 0x63 c.
7 0x55 Id= 97 0x74 t.
8 0x03 Inline string follows. 98 0x2E .
9 0x48 H. 99 0x00 Inline String ends.
10 0x6F o. 100 0x01 <a> ends.
11 0x6D m. 101 0x01 <p> ends.
12 0x65 e. 102 0x01 <card> ends.
13 0x00 Inline String ends. 103 0xE7 <card> with content and attributes.
14 0x36 Title attribute code. 104 0x55 id=
15 0x03 Inline String follows. 105 0x03 Inline string follows.
16 0x54 T. 106 0x50 P.
17 0x68 h. 107 0x72 r.
18 0x65 e. 108 0x6F o.
19 0x20 Space 109 0x64 d.
20 0x68 h. 110 0x75 u.
21 0x6F o. 111 0x63 c.
22 0x6D m. 112 0x74 t.
23 0x65 e. 113 0x00 Inline string ends.
24 0x20 Space 114 0x36 Title attribute code.
25 0x70 p. 115 0x03 Inline String follows.
26 0x61 a. 116 0x54 T.
27 0x67 g. 117 0x68 h.
28 0x65 e. 118 0x65 e.
29 0x00 Inline String ends. 119 0x20 Space
30 0x01 END (of card attribute list). 120 0x50 P.
31 0x60 <p> with content. 121 0x72 r.
32 0x03 Inline String follows. 122 0x6F o.
33 0x20 Space 123 0x64 d.
34 0x57 W. 124 0x75 u.
35 0x65 e. 125 0x63 c.
36 0x6C l. 126 0x74 t.
37 0x63 c. 127 0x20 Space
38 0x6F o. 128 0x43 C.
39 0x6D m. 129 0x61 a.
40 0x65 e. 130 0x72 r.
41 0x2E . 131 0x64 d.
42 0x20 Space 132 0x00 Inline String ends.
43 0x54 T. 133 0x01 End of attribute list.
44 0x68 h. 134 0x60 <p> with content.
45 0x69 i. 135 0x03 Inline String follows.
46 0x73 s. 136 0x20 Space
47 0x20 Space 137 0x57 W.
48 0x69 i. 138 0x65 e.
49 0x73 s. 139 0x6C l.
50 0x20 Space 140 0x63 c.
51 0x74 t. 141 0x6F o.
52 0x68 h. 142 0x6D m.
53 0x65 e. 143 0x65 e.
54 0x20 Space 144 0x2E .
55 0x48 H. 145 0x20 Space
56 0x6F o. 146 0x54 T.
57 0x6D m. 147 0x68 h.
58 0x65 e. 148 0x69 i.
59 0x20 Space 149 0x73 s.
60 0x50 P. 150 0x20 Space
61 0x61 a. 151 0x69 i.
62 0x67 g. 152 0x73 s.
63 0x65 e. 153 0x20 Space
64 0x2E . 154 0x74 t.
65 0x00 Inline String ends. 155 0x68 h.
66 0xDC <a> with content and attributes (0x1C + 0x40 + 0x80). 156 0x65 e.
67 0x4A Href attribute code. 157 0x20 Space
68 0x03 Inline string follows. 158 0x70 p.
69 0x23 # 159 0x61 a.
70 0x50 P. 160 0x67 g.
71 0x72 r. 161 0x65 e.
72 0x6F o. 162 0x20 Space
73 0x64 d. 163 0x66 f.
74 0x75 u. 164 0x6F o.
75 0x63 c. 165 0x72 r.
76 0x74 t. 166 0x20 Space
77 0x00 Inline String ends. 167 0x70 p.
78 0x01 END (of <a> attribute list). 168 0x72 r.
79 0x03 Inline string follows. 169 0x6F o.
80 0x20 Space 170 0x64 d.
81 0x43 C. 171 0x75 u.
82 0x6C l. 172 0x63 c.
83 0x69 i. 173 0x74 t.
84 0x63 c. 174 0x73 s.
85 0x6B k. 175 0x2E .
86 0x20 Space 176 0x20 Space
87 0x66 f. 177 0x00 Inline String ends.
88 0x6F o. 178 0x01 <p> ends.
89 0x72 r. 179 0x01 <card> ends.
90 0x20 Space 180 0x01 <wml> ends.
