Language: VB5,VB6,VBS
Expertise: Intermediate
Jul 7, 2001



Extract words with the RegExp object

The following routine extracts all the words from a source string and returns a collection. Optionally, the result contains only unique words.

This code is considerably simpler than an equivalent "pure" VB solution because it takes advantage of the RegExp object in the Microsoft VBScript Regular Expressions type library.

' Get a collection of all the words in a string
' If the second argument is True, only unique words are returned
' NOTE: requires a reference to the
'       Microsoft VBScript Regular Expressions type library

Function GetWords(ByVal Text As String, Optional DiscardDups As Boolean) As _
    Collection
    Dim re As New RegExp
    Dim ma As Match
    ' the following pattern means that we're looking for a word character (\w)
    ' repeated one or more times (the + suffix), and that occurs on a word
    ' boundary (leading and trailing \b sequences)
    re.Pattern = "\b\w+\b"
    ' search for *all* occurrences
    re.Global = True
    ' initialize the result
    Set GetWords = New Collection
    ' we need to ignore errors, if duplicates are to be discarded
    On Error Resume Next
    ' the Execute method does the search and returns a MatchCollection object
    For Each ma In re.Execute(Text)
        If DiscardDups Then
            ' if duplicates are to be discarded, we use the word itself as
            ' the key of the collection item: the Add method fails on a
            ' duplicate key, and the error is ignored
            GetWords.Add ma.Value, ma.Value
        Else
            ' otherwise just add to the result
            GetWords.Add ma.Value
        End If
    Next
End Function
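
The routine above relies on typed declarations and the Collection class, neither of which is available in VBScript. The following is a minimal late-bound sketch of the same idea, not part of the original tip. The name GetWordsLateBound is hypothetical, a Scripting.Dictionary stands in for the Collection, and the result is assumed to be acceptable as a zero-based Variant array rather than a collection.

```vb
' Late-bound sketch (assumption, not the original routine): runs as VBScript
' and also in VB5/VB6 without a reference to the RegExp type library
Function GetWordsLateBound(Text, DiscardDups)
    Dim re, ma, dict
    Set re = CreateObject("VBScript.RegExp")
    ' same pattern as GetWords: one or more word characters between boundaries
    re.Pattern = "\b\w+\b"
    re.Global = True
    Set dict = CreateObject("Scripting.Dictionary")
    ' 1 = text comparison, so keys are case-insensitive like Collection keys
    dict.CompareMode = 1
    For Each ma In re.Execute(Text)
        If DiscardDups Then
            ' keep only the first occurrence of each word
            If Not dict.Exists(ma.Value) Then dict.Add ma.Value, ma.Value
        Else
            ' a running index is used as the key, so repeated words are kept
            dict.Add dict.Count, ma.Value
        End If
    Next
    ' Items returns the stored words as a zero-based Variant array
    GetWordsLateBound = dict.Items
End Function
```

Because the duplicate test is explicit, this variant does not need On Error Resume Next. The returned array can be walked with For Each in the same way as the collection.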

Here is an example of how you can use the GetWords routine:

' Count how many articles appear in a source string
' held in the txtSource textbox control
Dim v As Variant
Dim count As Long

For Each v In GetWords(txtSource.Text)
    Select Case LCase$(v)
        Case "the", "a", "an"
            count = count + 1
    End Select
Next
MsgBox "Found " & count & " articles."
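
The optional second argument can be illustrated with a short sketch. The sample sentence below is made up for the example, and the words are simply sent to the Immediate window.

```vb
' List each distinct word of a sample string once
Dim w As Variant

For Each w In GetWords("The cat saw the other cat", True)
    Debug.Print w
Next
```

Note that Collection keys are compared without regard to case, so "The" and "the" count as the same word and only the first spelling found is kept.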

Francesco Balena
