Login | Register   
Twitter
RSS Feed
Download our iPhone app
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

Using Gambit-C Scheme to Create Small, Efficient Native Applications : Page 4

When you need to build small, native standalone applications, Gambit-C Scheme provides a high-level dynamic language with good runtime and memory performance, as well as good deployment options.


advertisement

Find Names in Input Text

I use Gambit-C Scheme for the development of a commercial product that extracts semantic information from plain text. The example in this section is a much-simplified version that extracts only people's names (and does so with a simple algorithm so the example will be easy to understand). The source code download for this article contains the file proper-names.scm, which contains the complete example.

The example uses lists, vectors, and hash tables. I'll show you the basics of using hash tables while going through the code.

Hash Tables

I start by defining hash tables for first names and last names. In the following code snippet, notice that I use * characters at the beginnings and endings of global variables. Doing so makes it easier to recognize global variables in code. I will use the function init-hashes (defined in the file people-data-tables.scm) to fill these two hash tables:


(define *first-name-hash* #f) (define *last-name-hash* #f)

The Scheme function for getting a value from a hash table is table-ref, which takes two required arguments and an optional third argument. The first argument is a hash table, the second argument is the key value that you are looking up in the hash table, and the third optional argument is the desired return value if the key is not in the hash table. The file people-data-tables.scm, which was automatically generated from a linguistic database that I use, calls the table-set! function to add key/value pairs to a hash table. This generated file contains only about 5000 names. I generated it requesting only frequently used names.

Lists

In addition to hash tables containing first and last names, I use a list containing commonly used name prefixes:

(define *name-prefix-list* '("Mr" "Mrs" "Ms" "Gen" "General" "Maj" "Major" "Doctor" "Vice" "President" "Lt" "Premier" "Senator" "Congressman" "Prince" "King" "Representative" "Sen" "St" "Dr"))

The main function in this example is find-human-names, which takes a vector of strings as an argument and returns a list of strings that are the names found in the input vector. If the input argument is a string, then it is tokenized into a vector of individual words. The example handles names containing four words, three words, two words, and one word. It starts looking for longer names first, and when it finds one, it skips forward to the end of the name before looking for another name.

Here is the beginning of the function definition with only the code for finding two word names:

(define (find-human-names word-vector) (if (not *first-name-hash*) ; load the first and last name hash table data the first time (let () ; this function is called (set! *first-name-hash* (make-table size: 10000)) (set! *last-name-hash* (make-table size: 1000)) (init-hashes *first-name-hash* *last-name-hash*))) ;; if word-vector is a string, then tokenize it: (if (string? word-vector) (set! word-vector (list->vector (string-tokenize word-vector)))) ;; loop over each word in word-vector, checking to see if it is part of a name: (let ((ret '()) (ret3 '()) (x '()) (len (vector-length word-vector)) (word #f) (found-at-index #f)) (dotimes (i len) (set! found-at-index #f) (set! word (vector-ref word-vector i)) ;; process 2 word names: (if (and (not found-at-index) (< i (- len 1))) (if (or (and (member word *name-prefix-list*) ; see is word is a prefix like "Mr" (table-ref *last-name-hash* (vector-ref word-vector (+ i 1)))) ;
is word at index+1 a last name? (and (table-ref *first-name-hash* word) ; is word a first name? (table-ref *last-name-hash* (vector-ref word-vector (+ i 1))))) ;
is word at index+1 a last name? (let () ; yes, we found a name, so store the indices of the beginning
and ending of the name: (set! ret (cons (list i (+ i 2)) ret)) (set! i (+ i 1)) (set! found-at-index #t))))

The indices for new names are added to the beginning of the list ret (I will reverse the order later). Getting strings for names found in the text (without any duplicates) only takes about 10 lines of code. The code will be easier to understand after you review example results from calling this function. Here is what the contents of the list ret looks like for this example:

> (load "people-data-tables.scm") "/Volumes/Data/ideas/DevX/Gambit-C apps/code/name-finder/people-data-tables.scm" > (load "proper-names.scm") "/Volumes/Data/ideas/DevX/Gambit-C apps/code/name-finder/proper-names.scm" > (find-human-names '#("President" "Bush" "went" "to" "San" "Diego" "to" "meet" "Ms" "." "Jones" "and"
"Gen" "." "Pervez" "Musharraf" "." "President" "Bush" "played" "golf" ".")) ("President Bush" "Ms. Jones" "Gen. Pervez Musharraf") >

Notice that the duplicate name "President Bush" is removed. After the input word vector has been scanned for four-, three-, two-, and one-word names, the list ret that contains starting and ending indices looks like this:

((17 19) (12 16) (8 11) (0 2))

For debugging, you can use the Scheme trace function to "instrument" any other function so its arguments and return value are printed each time the instrumented function is called:

> (trace find-human-names) > (find-human-names '#("President" "Bush" "went" "to" "San" "Diego" "to" "meet" "Ms" "." "Jones" "and"
"Gen" "." "Pervez" "Musharraf" "." "President" "Bush" "played" "golf" ".")) | > (find-human-names '#("President" "Bush" "went" "to" "San" "Diego" "to" "m... | ("President Bush" "Ms. Jones" "Gen. Pervez Musharraf") ("President Bush" "Ms. Jones" "Gen. Pervez Musharraf") > (define (fact n) (if (< n 2) 1 (* n (fact (- n 1))))) > (fact 5) 120 > (trace fact) > (fact 5) | > (fact 5) | | > (fact 4) | | | > (fact 3) | | | | > (fact 2) | | | | | > (fact 1) | | | | | 1 | | | | 2 | | | 6 | | 24 | 120 120 >

Map Function

The code uses the input word vector (the list of starting/ending indices is listed below) and the function map. A pair of short examples in the following code demonstrate two ways to use map.

> (define a-list '(1 2 3 4 5 6 7)) > (map (lambda (n) (+ n 1)) a-list) (2 3 4 5 6 7 8) > (let ((sum 0)) (map (lambda (n) (set! sum (+ sum n))) a-list) sum) 28 >

The first argument to map can be a function or a lambda expression. I first use map in the expected way for functional programming: map creates a new list in which each element is the value of applying the first functional argument to the corresponding element of the input list. My second use of map is to discard the list that map returns as its value and to use the lambda expression to calculate a sum of values in the input list using side effects (changing the value of the locally bound variable sum).

Author's Note: This example is contrived to show you the map function. The appropriate function to use here is for-each, which does not construct a return list.

The use of map in the name finder program is like the second example that you have just seen. The value of ret3 is an empty list that will hold the strings returned as the value for this function:

(map (lambda (index-pair) (let ((str "")) (dolist (s (vector->list (subvector word-vector (car index-pair) (cadr index-pair)))) (if (equal? s ".") (set! str (string-append str s)) (if (equal? str "") (set! str s) (set! str (string-append str " " s))))) (if (not (member str ret3)) (set! ret3 (cons str ret3))))) (reverse ret)) (reverse ret3)))

As you now know, the function map takes two arguments: a function and a list. This example does not use the returned value from map (which is unusual); it just uses the side effects of the anonymous lambda function filling the list ret3. This anonymous lambda function captures the value from the outer variable word-vector. Subvector returns part of the vector that is passed as its first argument. In this case, I pull out the strings for an entire name from the input word vector. Vector->list is a Scheme utility function that converts a vector to a list.

Here is a shorter example to show you how subvector, string-append, and vector->list work:

> (define v '#("the" "dog" "ran" "down" "the" "street")) > (subvector v 1 3) #("dog" "ran") > (string-append "123" "456") "123456" > (string-append "123" "456" "cat") "123456cat" > (vector->list v) ("the" "dog" "ran" "down" "the" "street") >

I use the dolist macro to loop through this list of words in a name and use the function string-append to concatenate them. After a string containing an entire name is constructed, I add it to the beginning of the list ret3. The list ret3 (in reversed order) is the return value for the function find-human-names.

The next section uses this name-finding example in an implementation of a simple web application.



Comment and Contribute

 

 

 

 

 


(Maximum characters: 1200). You have 1200 characters left.

 

 

Sitemap