

A Java Developer's Guide to Ruby : Page 5

Ruby's versatility and flexibility complement Java well. That's why a Java developer who can program in Ruby is more effective and efficient than one who programs only in Java.

Ruby Regular Expressions
Ruby has built-in support for regular expressions through the class Regexp. Java's java.util.regex APIs offer similar functionality, but regular expression support in Ruby has a more native feel because patterns have their own literal syntax and operators. You can create a regular expression object either with a method call like Regexp.new("[a-e]og") or by enclosing the pattern between slash characters like /[a-e]og/. You can find good tutorials on both regular expressions and Ruby's regular expression support on the web; this simple example shows only the =~ operator, which returns the index of the first match or nil if there is no match:
>> "the dog ran" =~ /[a-e]og/
=> 4
>> "the zebra ran" =~ /[a-e]og/
=> nil
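
As a small sketch of the two construction styles mentioned above (the variable names are only for illustration), the following builds the same pattern both ways and also uses String#match, which returns a MatchData object when you need more than the match position:

```ruby
# Two equivalent ways to build the same pattern
literal = /[a-e]og/
built   = Regexp.new("[a-e]og")

# =~ returns the index of the first match, or nil when nothing matches
dog_index   = "the dog ran" =~ literal    # 4
zebra_index = "the zebra ran" =~ built    # nil

# String#match returns a MatchData object (or nil) with more detail
match        = "the dog ran".match(literal)
matched_text = match[0]          # the text that matched: "dog"
before_match = match.pre_match   # everything before the match: "the "

puts dog_index.inspect
puts zebra_index.inspect
puts matched_text
puts before_match.inspect
```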

Ruby Network Programming
Ruby has a great standard library for network programming as well. Please see my previous DevX article on this subject. I frequently use Ruby for collecting data from the Internet, parsing it, and then storing it in XML or a database.

Ruby Document Indexing and Search Using the Ferret Library
By now, you have installed the Ruby gem called ferret. Ferret is the fastest indexing and search library based on Java Lucene (even faster than the Common Lisp version, Montezuma). One interesting fact about the Ferret library is that during development the author David Balmain eventually wrote most of it in C with a Ruby wrapper. The lesson is that if you start to use Ruby and have performance problems, you can always recode the time-critical parts in C or C++. Ferret defines a few classes that you will use in your own applications once you adopt Ruby:

  • Document represents anything that you want to search for: a local file, a web URL, or (as you will see in the next section) text data in a relational database.
  • Field represents data elements stored in a document. Fields can be indexed or non-indexed. Typically, I use a single indexed (and thereby searchable) text field and then several "meta data" fields that are not indexed. Original file paths, web URLs, etc. can be stored in non-indexed fields.
  • Index represents the disk files that store an index.
  • Query provides APIs for search.

Indexing and Searching Microsoft Word Documents
The following is the Ruby class I use for reading Microsoft Word documents and extracting the plain text, which is an example of using external programs in Ruby:

class ReadWordDoc
  attr_reader :text
  def initialize file_path
    @text = `antiword #{file_path}`   # back quotes to run external program
  end
end

The "trick" here is that I use the open source antiword utility to actually process Word document files. You can run any external program and capture its output to a string by wrapping the external command in back quotes. Try the following under Linux or OS X (for Windows try `dir`):

puts `ls -l`

This example prints the result of executing the external ls (Unix list directory) command.
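
A slightly fuller sketch of the same back quote technique follows. It checks the exit status through $? and escapes an interpolated value with Shellwords from the standard library, which matters when a file path contains spaces. The echo commands and the file name are stand-ins chosen for illustration, and the sketch assumes a Unix-like shell.

```ruby
require 'shellwords'

# Back quotes run the command in a subshell and return its standard output
output = `echo hello from the shell`

# $? holds the Process::Status of the most recently finished child process
status = $?
if status.success?
  puts "command printed: #{output.strip}"
else
  warn "command failed with exit status #{status.exitstatus}"
end

# Escape interpolated values so spaces and shell metacharacters stay literal
file_name = "my report.doc"               # hypothetical file name
escaped   = Shellwords.escape(file_name)
echoed    = `echo #{escaped}`.strip
puts echoed
```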

The following Ruby script enters a Word document into an index (plain text files are easier—try that as an exercise):

require 'rubygems'
require 'ferret'
include Ferret
include Ferret::Document
require 'read_word_doc' # read_word_doc.rb defines class ReadWordDoc

index = Index::Index.new(:path => './my_index_dir')  # any path to a directory

doc_path = 'test.doc'                      # path to a Microsoft Word file
doc_text = ReadWordDoc.new(doc_path).text  # get the plain text from the Word file
doc = Document.new
doc << Field.new("doc_path", doc_path, Field::Store::YES, Field::Index::NO)
doc << Field.new("text", doc_text, Field::Store::YES, Field::Index::TOKENIZED)
index << doc

index.search_each('text:"Ruby"') do |doc, score|  # a test search
  puts "result: #{index[doc]['doc_path']} : #{score}"    # print doc_path meta data
  puts "Original text: #{index[doc]['text']}"            # print original text
end

index.close  # close the index when you are done with it

Notice how short this example is. In 24 lines (including the class to use antiword for extracting text from Word documents), you have seen an example that extracts text from Word, creates an index, performs a search, and then closes the index when you are done with it. Using Ruby enabled you to get complex tasks done with very few lines of code. Had you coded this example in Java using the very good Lucene library (which I've done!), the Java program would be much longer. Shorter programs are also easier and less expensive to maintain.

This example uses Word documents, but OpenOffice.org documents are also simple to read. An OpenOffice.org document is a ZIP archive, so with about 30 lines of pure Ruby code you can unzip it and extract the text from the content.xml entry inside. (XML processing is simple in Ruby, but it is beyond the scope of this article.)
