This article includes an implementation of the actor model described above that retrieves recipes from multiple remote sites in multiple formats. Each recipe is listed in one or more index files on the web, and the recipes themselves are in HTML. The program retrieves these silos of information, harvests the meaningful data, indexes it, and makes it available in a graphical user interface.
Author's Note: Read the terms and conditions of any web site before harvesting its contents.
With the stage set, let's introduce the actors:
|RoundRobin||UrlConsumer||Distributes URLs to other actors|
|UrlResolver||UrlConsumer||Retrieves data streams for another actor|
|XhtmlTransformer||StreamTransformer||Formats HTML into XHTML for parsing|
|StyleSheetTransformer||StreamTransformer||Converts remote XML format into local data format|
|RdfParser||StreamConsumer||Parses data stream into data structure|
|SeeAlsoExtractor||RdfConsumer||Extracts URLs from index data|
|IngredientProcessor||RdfConsumer||Applies local processing rules on data|
|RDFInserter||RdfConsumer||Inserts data into a database|
Listing 4 shows how these actors are connected to one another. The manage() methods are typed versions of the ActorManager#manage(Object) in Listing 3.
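Separately from Listing 4, the distribution idea behind the RoundRobin actor in the table can be sketched in a few lines. This is a minimal illustration, not the code from the download archive: the `UrlSink` interface, the `RoundRobinSketch` and `RecordingSink` classes, and the sample URL strings are all hypothetical names introduced here, and the article's real UrlConsumer may have a different signature.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical single-method interface; the article's real UrlConsumer may differ.
interface UrlSink {
    void consume(String url);
}

// Hands each incoming URL to the next delegate in turn.
final class RoundRobinSketch implements UrlSink {
    private final List<UrlSink> delegates;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinSketch(List<UrlSink> delegates) {
        if (delegates.isEmpty()) {
            throw new IllegalArgumentException("need at least one delegate");
        }
        this.delegates = new ArrayList<>(delegates);
    }

    @Override
    public void consume(String url) {
        // floorMod keeps the index valid even if the counter overflows.
        int i = Math.floorMod(next.getAndIncrement(), delegates.size());
        delegates.get(i).consume(url);
    }
}

// Stand-in for a downstream actor such as a UrlResolver: it only records its input.
final class RecordingSink implements UrlSink {
    final List<String> seen = new ArrayList<>();

    @Override
    public void consume(String url) {
        seen.add(url);
    }
}

final class RoundRobinDemo {
    public static void main(String[] args) {
        RecordingSink first = new RecordingSink();
        RecordingSink second = new RecordingSink();
        UrlSink distributor = new RoundRobinSketch(List.of(first, second));
        for (String url : List.of("u1", "u2", "u3")) {
            distributor.consume(url);
        }
        System.out.println(first.seen + " " + second.seen);
    }
}
```

Because the distributor implements the same interface as the actors it feeds, an upstream actor does not need to know whether it is talking to one consumer or to several.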
A ClusterMap and Main class are also provided in the download archive. To run the example, execute the Main class with the following two arguments:
Figure 1. ClusterMap: The tortilla soup recipe is revealed after clicking certain ingredients.
The Main class then opens the ClusterMap and begins harvesting the recipes. After a few recipes are harvested, select the check box on the left to see how many recipes have been harvested, and click the clear button at the top to update the list of words extracted from the ingredients sections. In this way, you can index and search multiple distinct recipe sites. For example, to find recipes that include lemon, cheddar, and garlic (yum), click on these ingredients; among the recipes harvested, the Tortilla Soup recipe is revealed to include all three (see Figure 1).
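The ingredient search described above amounts to intersecting the sets of recipes that mention each selected word. The sketch below shows that idea with a small in-memory inverted index. It is an assumption about how such a lookup could work, not a description of the ClusterMap internals; `IngredientIndexSketch`, its methods, and the sample recipes are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical inverted index: ingredient word -> names of recipes that use it.
final class IngredientIndexSketch {
    private final Map<String, Set<String>> recipesByIngredient = new HashMap<>();

    void add(String recipe, String... ingredients) {
        for (String ingredient : ingredients) {
            recipesByIngredient
                .computeIfAbsent(ingredient.toLowerCase(Locale.ROOT), k -> new TreeSet<>())
                .add(recipe);
        }
    }

    // Returns the recipes that contain every one of the given ingredients.
    Set<String> withAll(String... ingredients) {
        Set<String> result = null;
        for (String ingredient : ingredients) {
            Set<String> hits = recipesByIngredient.getOrDefault(
                ingredient.toLowerCase(Locale.ROOT), Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);   // copy so the index itself is never modified
            } else {
                result.retainAll(hits);         // set intersection
            }
        }
        return result == null ? Collections.<String>emptySet() : result;
    }
}

final class IngredientIndexDemo {
    public static void main(String[] args) {
        IngredientIndexSketch index = new IngredientIndexSketch();
        // Made-up sample data for illustration only.
        index.add("Tortilla Soup", "lemon", "cheddar", "garlic", "tortilla");
        index.add("Garlic Bread", "garlic", "butter");
        System.out.println(index.withAll("lemon", "cheddar", "garlic"));
    }
}
```

Each additional ingredient can only shrink the result, which matches the behavior described above, where every further click narrows the set of recipes.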
On a multi-core system, the program uses more than 30 threads to orchestrate the retrieval and processing of the data, downloading and processing as quickly as the remote host provides it. Despite this multi-threaded execution, there is no need to deal with the typical multi-threading challenges, which frees the developer to concentrate on what each actor should do.
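One way to see why the usual locking concerns fall away is to look at a single actor in isolation. In the sketch below, which assumes a mailbox-per-actor design and is not taken from the article's listings, the actor's state is touched only by the actor's own thread, so many senders can post messages concurrently without any explicit synchronization in the actor logic. `CountingActorSketch` and its methods are hypothetical names.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical actor: one mailbox, one worker thread, state confined to that thread.
final class CountingActorSketch {
    // Unique sentinel instance, compared by identity, that tells the worker to stop.
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private final Thread worker = new Thread(this::drain, "counting-actor");
    private int processed; // written only by the worker thread

    void start() {
        worker.start();
    }

    // Safe to call from any thread; the queue handles the hand-off.
    void send(String message) {
        mailbox.add(message);
    }

    // Asks the worker to finish, waits for it, then reads the final count.
    int stopAndCount() {
        mailbox.add(STOP);
        try {
            worker.join(); // join makes the worker's writes visible to the caller
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return processed;
    }

    private void drain() {
        try {
            for (String m = mailbox.take(); m != STOP; m = mailbox.take()) {
                processed++; // no lock needed: only this thread touches the field
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs several sender threads against one actor and returns the final count.
    static int demo(int senders, int perSender) {
        CountingActorSketch actor = new CountingActorSketch();
        actor.start();
        Thread[] threads = new Thread[senders];
        for (int i = 0; i < senders; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < perSender; j++) {
                    actor.send("msg");
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return actor.stopAndCount();
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 250));
    }
}
```

The only shared object is the queue, and its thread safety comes from `java.util.concurrent`, so the code that decides what the actor does stays free of locks.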
The actor model is a powerful metaphor for building multi-threaded applications, and by assigning remote addresses and enabling remote communication between actors, you can extend the model to distributed problems as well. With life-cycle and dependency management added, and with awareness of their environment, actors can become agents that participate in a self-organizing system. This architecture has worked well for many distributed problems such as on-line trading, disaster response, and modelling social structure. It has also been the source of inspiration for many service-oriented architectures.
In essence, the actor model abstracts the nitty-gritty of multi-processor programming away from the developer. This reduces concurrency issues and improves the flexibility of the system. The model is simple and has a gentle learning curve, so new developers can quickly see how actors are implemented and understand how they fit together. By managing the actors properly, you can carry the same implementations over from multi-processor systems to distributed networked systems gradually, at a pace that matches development demands.