Build Your Own Lightweight XML DOM Parser : Page 3

Microsoft's MSXML parser is rich in functionality, but in some cases a full-featured parser is too large for resource-limited environments. Don't count XML out yet though; you can write your own lightweight VB XML parser in fewer than 400 lines of code!

Defining XML Parsing Primitives (cont'd)
Each time the parser reads a tag, it calls the parseTag routine, which creates a new SimpleElement object, obtains the tag name and attributes, and then decides what to do with the new element.

Private Sub parseTag(ByVal s As String)
   Dim tagName As String
   Dim se As SimpleElement
   ' is this a CDATA tag?
   If isCDATA(s) Then
      ' add the text contents of this tag
      ' to the last element on the stack
      Call addCDATA(s)
   Else
      ' get the tag name
      tagName = getTagName(s)
      If tagName <> "" Then
         ' create a new SimpleElement
         Set se = New SimpleElement
         se.name = tagName
         ' get all the attributes for this tag
         Call getAttributes(se, s)
         ' is this a close tag?
         If isCloseTag(s) Then
            Call PopElement(se)
         Else
            ' it's a child tag or root tag
            Call pushElement(se)
         End If
      End If
   End If
End Sub
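
The parseTag routine relies on two predicates, isCDATA and isCloseTag, whose bodies do not appear on this page. The following is only a minimal sketch of how they might look, assuming that s holds the raw tag text beginning at the "<" character; the article's own versions may differ.

```vb
' Assumed implementation: True when the tag text opens a CDATA section.
' Expects s to start at the "<" character, with no leading white space.
Private Function isCDATA(ByVal s As String) As Boolean
   isCDATA = (Left$(s, 9) = "<![CDATA[")
End Function

' Assumed implementation: True when the tag is a close tag
' such as </name>.
Private Function isCloseTag(ByVal s As String) As Boolean
   isCloseTag = (Left$(s, 2) = "</")
End Function
```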
Parsing the tag name and attributes is relatively simple because almost all XML tags follow consistent patterns. The name always follows the opening "<" character (or "</" in a close tag), and white space is not allowed in tag names. A few special tags are exceptions: the XML declaration and processing instructions begin with "<?", while DOCTYPE declarations, entity definitions, and comments begin with "<!". This implementation ignores those tags. CDATA tags also begin with "<!", but the SimpleDOMParser handles them separately.

Private Function getTagName(ByVal s As String) _
   As String
   Dim tagName As String
   Dim i As Long
   For i = 1 To Len(s)
      Select Case Mid$(s, i, 1)
      Case "<", "/"
         ' ignore
      Case ">", "[", " "
         ' stop parsing
         getTagName = tagName
         Exit Function
      Case "!", "?"
         ' ignore this tag
         ' you can add additional checks for the 
         ' xml declaration, DOCTYPE elements, 
         ' and comments
         getTagName = "" ' special tag: return no name
         Exit Function
      Case Else
         tagName = tagName & Mid$(s, i, 1)
      End Select
   Next i
   ' no terminating character was found
   Err.Raise 50000, "SimpleDOMParser.getTagName", _
      "The tag " & s & " is malformed."
End Function
Similarly, all attributes follow the pattern name="value", where the attribute name is always preceded by white space and the value is always quoted, although the quote character may be either a single or a double quote. After extracting the tag name, any remaining text within the tag must consist of attributes. To parse the attributes, you can simply search ahead for the next expected character: first the "=" sign, which must appear between the attribute name and its value, and then the single or double quote characters that delimit the attribute value.
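
The search-ahead approach described above can be sketched as follows. This is an illustration only, not the article's getAttributes routine: parseAttributesSketch is a hypothetical name, it returns a plain Collection rather than filling a SimpleElement, and it assumes well-formed input in which attributes are separated by spaces.

```vb
' Sketch only: returns the attributes of one tag as a Collection of
' values keyed by attribute name. parseAttributesSketch is a
' hypothetical name, not part of the SimpleDOMParser.
Public Function parseAttributesSketch(ByVal s As String) As Collection
   Dim attrs As New Collection
   Dim pos As Long, eqPos As Long
   Dim startPos As Long, endPos As Long
   Dim quoteChar As String, attrName As String

   ' skip the tag name: attributes start after the first space
   pos = InStr(1, s, " ")
   Do While pos > 0
      ' the "=" sign separates the attribute name from its value
      eqPos = InStr(pos, s, "=")
      If eqPos = 0 Then Exit Do
      attrName = Trim$(Mid$(s, pos, eqPos - pos))

      ' the first non-space character after "=" is the quote
      startPos = eqPos + 1
      Do While Mid$(s, startPos, 1) = " "
         startPos = startPos + 1
      Loop
      quoteChar = Mid$(s, startPos, 1)
      If quoteChar <> """" And quoteChar <> "'" Then Exit Do

      ' the value runs up to the matching closing quote
      endPos = InStr(startPos + 1, s, quoteChar)
      If endPos = 0 Then Exit Do
      attrs.Add Mid$(s, startPos + 1, endPos - startPos - 1), attrName

      ' continue searching after the closing quote
      pos = endPos + 1
   Loop
   Set parseAttributesSketch = attrs
End Function
```

For the tag <book id="42" lang='en'>, the returned collection holds the value 42 under the key id and the value en under the key lang. Because VB Collection keys are not case sensitive, this sketch raises an error for a tag whose attribute names differ only by case; a real implementation would need its own storage for that situation.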

After pushing an element on the stack, the parser next looks for text content using the getTagContent() function, passing the main XML string and the current index position to begin searching. The function simply looks for text that occurs between the current index position and the next opening tag character (<), and, if found, adds that text to the last element on the stack.

Private Function getTagContent(s As String, ByRef i As Long)
   Dim endpos As Long
   Dim content As String
   ' find the next opening tag character
   endpos = InStr(i, s, "<")
   ' only act when some text precedes that tag
   If endpos > i Then
      content = Trim$(Mid$(s, i, (endpos - i)))
      ' advance the caller's index to the next tag
      i = endpos
      ' add the text to the last element on the stack
      Call setTextContent(content)
   End If
End Function
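
To try the text-extraction step in isolation, the sketch below uses a hypothetical variant, readTextRun, that returns the text instead of calling setTextContent. The function name and the demo routine are assumptions made for illustration; the search logic mirrors getTagContent.

```vb
' Hypothetical variant of getTagContent: returns the text found
' between position i and the next "<", and advances i to that "<".
Public Function readTextRun(ByVal s As String, _
   ByRef i As Long) As String
   Dim endpos As Long
   endpos = InStr(i, s, "<")
   If endpos > i Then
      readTextRun = Trim$(Mid$(s, i, endpos - i))
      i = endpos
   End If
End Function

Public Sub demoReadTextRun()
   Dim xml As String
   Dim i As Long
   xml = "<title> Dune </title>"
   i = 8   ' first character after <title>
   Debug.Print readTextRun(xml, i)   ' Dune
   Debug.Print i                     ' i is now 14
End Sub
```

Passing the index by reference is what lets the caller resume parsing at the next tag without recomputing the position.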
