Using SharePoint Portal Server to Index Your Custom Application

SharePoint Portal Server’s search engine is one of its most powerful features, able to search file shares, Exchange public folders, Notes databases, and Web sites; however, the results you get when you index a Web site may be a bit overwhelming. In this article, you’ll see how SharePoint’s search feature operates and how you can leverage its behavior to index your custom application properly.

How SharePoint Search Works
SharePoint search works using a gatherer process that crawls a configurable set of content. While SharePoint crawls sites, it performs two simultaneous processes: first, it collects all the words in the content being indexed; second, it identifies other content (via links) that it should also crawl. These two processes work together to create an index that contains every word from every piece of linked content.

During indexing, a filtering process limits how much of the newly discovered content SharePoint indexes. This filtering process, discussed in more detail later in this article, prevents the crawler from crawling the entire network and the entire Internet as do public search engines such as Google. Typically, you’d set this filtering process so that SharePoint indexes only files in a particular file share or only Web pages on a particular Web server. SharePoint adds any other content encountered during the crawl to the list of pages to be crawled; however, the gatherer rejects content that doesn’t match the filter rules before processing it.

When crawling a Web page, the gatherer collects all the words in the document. It simultaneously gathers all the anchor tags (links) in the page. SharePoint adds the gathered words to the document index and moves the URLs in the anchor tags into the list of Web pages yet to be crawled. When the gatherer is ready for the next page, it retrieves the next item from that list and verifies that the item meets the filter criteria. If it does not, the gatherer rejects the item and reads the next one. When an item does meet the filter criteria, SharePoint first checks to ensure that it has not already crawled that item; if not, it proceeds with the crawl, adding the item’s content to the index and its links to the list of items to be crawled. This process repeats until every piece of content has been read.
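The crawl loop just described can be sketched roughly as follows. This is a simplified illustration, not SharePoint’s actual implementation; the helper methods (MatchesFilterRules, Download, GetWords, GetLinks, AddWordsToIndex) are hypothetical stand-ins:

```csharp
// Simplified sketch of the gatherer's crawl loop. The Queue holds URLs
// waiting to be crawled; the Hashtable tracks URLs already processed.
Queue toCrawl = new Queue();
Hashtable crawled = new Hashtable();
toCrawl.Enqueue(startUrl);

while (toCrawl.Count > 0)
{
    string url = (string)toCrawl.Dequeue();

    // Reject content that doesn't match the filter rules,
    // and skip anything that has already been crawled.
    if (!MatchesFilterRules(url) || crawled.ContainsKey(url))
        continue;
    crawled[url] = true;

    string content = Download(url);

    // First process: add the document's words to the index.
    AddWordsToIndex(url, GetWords(content));

    // Second process: queue the document's links for crawling.
    foreach (string link in GetLinks(content))
        toCrawl.Enqueue(link);
}
```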

The process for indexing a file share is very similar; however, there SharePoint follows a different process for directories than for files. When the gatherer encounters a file, it indexes the content in that file and puts the words into the index as described above. However, when it encounters a directory, no content is added to the index; instead, it adds each file in the directory and all of its subdirectories to the list of content to crawl.

So, why doesn’t this scheme work well for custom applications? The reason is that most custom Web applications don’t have plain anchor links to every possible page in the system and every possible query string parameter. One core task when connecting your custom application to SharePoint for indexing is providing a way for it to identify what content is available.

Protocol Handlers and IFilters
The high-tech way to let SharePoint understand what content is available in your application is to create a protocol handler and, in some cases, an IFilter. The protocol handler is an extension to the gatherer that lets it handle different protocols in the provided URLs. Much as your Web browser understands the http, ftp, mailto, and news (nntp) protocols in URLs, you can teach SharePoint Portal Server search to connect using your own protocol.

With a protocol handler, you can make up your own protocol; for example “mca,” meaning “my custom application.” After installing your protocol handler, SharePoint will call your protocol handler whenever it encounters a URL that begins with your new protocol. Thus, to index your application, you would simply add your protocol handler into SharePoint and then tell SharePoint to index your custom protocol, providing whatever additional information is necessary. The protocol handler can then identify all of the different elements to be added to the list of items to be indexed.

In contrast, IFilters don’t handle protocols; they handle file types returned by the protocol handlers. Natively, SharePoint has IFilters for Web pages, Microsoft Office documents, Adobe Acrobat files, and a variety of other file types. The IFilters are responsible for extracting the words to be indexed from a particular file type. The gatherer determines which IFilter to use by applying MIME type rules to the document returned by the protocol handler, then invokes the right IFilter for that file type. In your high-tech solution, you might also provide an IFilter to process a custom MIME type, allowing you to create custom indexing for a page in your application. Of course, that means making the page return a custom MIME type to the search engine.

Overall, the high-tech solution using a protocol handler and IFilters provides the most precise level of control for searching a custom application. However, the cost of that tight control is a great deal of effort to create these extensions to SharePoint and make the associated changes to the indexed applications. If you’re willing to give up a little of that control and provide some support, there’s a much easier way to index your custom Web application using the built-in SharePoint indexing facilities.

Starting with Content Sources
SharePoint Portal Server indexing always starts with a content source. The content source in SharePoint vernacular is the starting point for the indexing process. For a Web site, the content source identifies the URL from which SharePoint should derive all other URLs. The content source in SharePoint doesn’t hold any of the indexed content; it’s merely a pointer telling SharePoint where to start. Holding the indexed content is the responsibility of the Content Index.

Filtering Content Indexes
Content Indexes are the repositories of the indexed information, but they also serve another very important purpose: they’re the way that SharePoint lets you filter what information the index retains. SharePoint has a set of rules that define what items are included or excluded.

SharePoint processes the rules from the top down; in other words, when indexing content, the first matching rule in the list determines what the gatherer does with the content, either including or excluding it. In most cases, particularly with Web sites, the content may contain links that lead to places you don’t want SharePoint to index. For instance, if your intranet has links to the Microsoft support site, you most likely do not want your intranet index to include the Microsoft support pages.

Setting Up a META ROBOTS Tag
In addition to the Content Index rules, another important way to control the index contents is via a META tag containing a NAME attribute with a value of ROBOTS. This tag tells the crawler what it is supposed to do with the page. The tag takes the form:

   <meta name="ROBOTS" content="index, follow">

In the content attribute, you can specify an index directive, a follow directive, or both, separated by a comma. The index directives are “index” and “noindex”; they tell the crawler whether or not it should include the content of the page in its index. The follow directives are “follow” and “nofollow”; they indicate whether the pages linked from the current page should be followed and indexed according to their own META tags. If no META ROBOTS tag is present, the crawler assumes it can index the page and follow all links.

The META ROBOTS tag is a key part of the strategy that allows SharePoint to index your custom application. To get SharePoint to index your custom application, you link to a special gatherer start page from your home page. This special gatherer start page contains a link to every page in the site you want indexed. Then you set a META ROBOTS tag to read “noindex, follow.” The result is that SharePoint will follow all the links embedded in that page, but won’t index the page itself.
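Assuming the standard META ROBOTS syntax, the tag on the gatherer start page would read:

```html
<meta name="ROBOTS" content="noindex, follow">
```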

Creating Your Own Starting Page
So, to get SharePoint to index your entire application, you need only create a page that contains links to all the site’s pages you want indexed: a list of the pages that together would display all the data in the application to be searched. The listing page simply provides a set of links to each of the pages the crawler needs to visit during the indexing process.

Typically, data pages (those that display the real meat of the application) accept query string parameters that the listing page embeds within the links. The result is that the indexer repeatedly indexes the same page, but each time with different query parameters and, therefore, different content. By linking to pages that display all of the important data in the site, or that at least link further to pages where the crawler can find it, you enable the gathering of all the information in the system.

Creating a Target Page
If your application does not already have a page that will serve to display the data from the starting page, then you will need to create your own target page for each of the links. In most cases, this is not necessary, because there is a suitable target page that displays data and can be modified to support the users’ needs, as well as the needs of the gatherer.

The key feature of the target page is that it emits the data to be indexed directly into the page, and in such a way that the data is not buried in the middle of JavaScript or other HTML tags that the gatherer will not index by default.

Another consideration for the target page, and the reason the target page so often needs to be modified, is that the page should be as free from repetitive data as possible. That is because this extraneous information is not only useless to the gatherer, but actually reduces its effectiveness. Most pages have a menu, and most menus don’t change from page to page, or users would never be able to find their way around the site. So if, for instance, your menu contains the word “administration,” then any time a search is requested with the word “administration,” every page in the site would come up. That’s not exactly the result you are looking for.

The Sample Application
The downloadable code accompanying this article contains a sample Web application that demonstrates the concepts discussed here, and that you can use to practice with. The sample application contains three pages: default.aspx, crawler.aspx, and detail.aspx. Default.aspx is the home page of the application, which includes the link (with no text in the middle) to crawler.aspx. That generally looks something like:

   <a href="crawler.aspx"></a>

Crawler.aspx is the search start page. It contains a loop that adds 100 links to detail.aspx, creating a link with a different query string value for each loop iteration. The core code for Crawler.aspx is:

Figure 1. Link Page: The file crawler.aspx contains a list of links for SharePoint to index.

   for (int looper = 0; looper < 100; looper++)
   {
      HtmlAnchor ctlAnchor = new HtmlAnchor();
      LiteralControl ctlBr = new LiteralControl();
      ctlAnchor.HRef = "detail.aspx?id=" + looper;
      ctlAnchor.InnerText = "Detail " + looper;
      ctlBr.Text = "<br>";
      frmControl.Controls.Add(ctlAnchor);
      frmControl.Controls.Add(ctlBr);
   }

This causes a page to appear as shown in Figure 1.

When the loop shown above completes, crawler.aspx contains a list of links for the gatherer to follow. The content behind those links is simple; for each request, detail.aspx displays a short message, including the query string, in its response. Figure 2 and Figure 3 show the content that the gatherer will index when it follows the first two links.

Figure 2. Crawling Links: The figure shows the content that detail.aspx returns when the gatherer crawls it using a query string ID of 0.

Figure 3. Varying Parameters to Obtain Different Content: The figure shows the content that detail.aspx returns when the gatherer crawls it using a query string ID of 1.

You can install the sample application for this article in a Web site and have SharePoint index it to see the process in action. Note, however, that you cannot install the application in a SharePoint Portal Server or Windows SharePoint Services extended IIS virtual server; indexing does not work properly for applications installed in a SharePoint-extended virtual server.

Differentiating Search User Agents from Humans
When indexing, you generally want to clear the target page of as much information as possible, including items essential for human visitors, so it’s often best to differentiate between a search engine and a user to get the best results. In other words, when the Web request originates with a user, you want to render the page normally; but when the request comes from a search engine, you want to suppress the rendering of standard items such as headers, footers, menus, related items, and announcements, so that they’re not indexed with the page.

You do this by evaluating the UserAgent property of the Request object. Each HTTP request includes a user-agent header that identifies the type of browser making the request. SharePoint includes “SPRobot” in the user agent string when it issues a request. So, if you find “SPRobot” in the user agent string, you then know that the request is coming from the SharePoint indexing engine, and you can suppress unnecessary items.
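A minimal version of this check in a page’s code-behind might look like the following sketch. The pnlMenu and pnlHeader controls are hypothetical Panel controls holding the page’s navigation chrome:

```csharp
private void Page_Load(object sender, System.EventArgs e)
{
    // SharePoint's gatherer includes "SPRobot" in its user-agent header.
    string userAgent = Request.UserAgent == null ? "" : Request.UserAgent;
    bool isCrawler = userAgent.IndexOf("SPRobot") >= 0;

    // Hide the navigation chrome when the gatherer is requesting the
    // page, so that only the data itself gets indexed. Human visitors
    // see the page rendered normally.
    pnlMenu.Visible = !isCrawler;
    pnlHeader.Visible = !isCrawler;
}
```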

Considerations for Indexing Your Site
A few quick considerations when enabling indexing for your custom application:

  • SharePoint supports only HTTP-based authentication. If your application relies upon forms-based authentication to validate users, you will have to devise another means of validation.
  • The gatherer will index the content only one time, and once indexed, that content will be available for everyone via search. You do not want to index information that is sensitive or should be available only to certain users.
  • The gatherer’s job is to gather and index content as quickly as possible. This often creates a load higher than most people can reasonably test. Be prepared for a massive load generated on the custom application when SharePoint is indexing the site. In fact, we’ve used search engines like SharePoint to do initial load testing for us.

Configuring SharePoint
Once you have created your pages, you need to configure SharePoint Portal Server to index them, a process that involves setting up a new content source, content index, and search scope. The following sections list the steps you should follow.

Creating a Content Index
The Content Index is the database that will hold the indexed information. Follow these steps to create the Content Index.

  1. From the portal home page, click Site Settings.
  2. From the site settings page in the Search Settings and Indexed Content section, click Configure Search and Indexing.
  3. If you see an Enable Advanced Search Administration Mode link, click it to enable the features necessary to complete this process.
  4. Scroll down on the Configure Search and Indexing page to the Content Indexes section and click the Add Content Index link.
  5. On the Create Content Index page, enter a meaningful name, description, and a source group. The source group is a grouping for naming purposes; generally, naming this the same as the index name is appropriate. Finally, select the server to perform the indexing and click OK.

Creating a Content Source
The content source is the starting point where SharePoint will begin the gathering process. Follow these steps to create the content source.

  1. Click the Configure Search and Indexing breadcrumb link.
  2. On the Configure Search and Indexing page, in the Other Content Sources section, click the Add Content Source link.
  3. Select the content index you created above in the Select a Content Index drop-down box, and click the Next button.
  4. Enter the URL for the home page of your custom application, enter a description, and select the source group you created for the Content index. Click the Finish button.

After creating the content source, SharePoint returns you to the content index properties page. Now you can configure the content index so that it only indexes your application.

Configuring the Content Index
Configuring the content index is the process of setting rules that tell the indexer how to handle the content that it finds. Follow these steps to configure the content index.

  1. On the Manage Index Properties page, in the Rules to Exclude and Include Content section, click the Manage Rules to Exclude and Include Content link.
  2. Click the New Rule link or icon.
  3. Enter the path of the application to index, ending with a slash (/) and an asterisk (*), click the Include Complex URLs (URLs That Contain Question Marks (?)) link, and click the OK button.
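For example, if your application were installed at http://intranet/customapp/ (a hypothetical server and path), the rule path would be:

```
http://intranet/customapp/*
```

Combined with the Include Complex URLs option, this rule tells the gatherer to accept every URL under that path, including the query-string links the listing page generates.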
  4. Hover over the server name of the newly added URL and then select the context menu from the drop down on the right. Click the Move Up option from the context menu.

Now that the content index and content source have been set up and configured, there is only one more step, which is technically optional: creating a new search scope.

Creating a New Search Scope
SharePoint uses search scopes to display information to users searching through the portal’s Web interface. By creating a search scope, you make it possible to search your custom application separately from any other content that SharePoint has indexed. For example, you would create a new search scope to make your custom application searchable separately from all other documents on the network. Another good reason to create a search scope is if the part numbers, order numbers, or other key identifiers that people use to search your custom application also appear elsewhere on the network. Providing a search scope isolates your application’s search results from that other information, helping prevent confusion as to where the best place is to get information about the items associated with that identifier.

To create a new search scope, follow these steps:

  1. On the Exclude and Include content page, click the Configure Search and Indexing breadcrumb link.
  2. On the Configure Search and Indexing page in the General Content Settings and Indexing Status, click the Manage Search Scopes link.
  3. On the Manage Search Scopes page, click the New Search Scope link or icon.
  4. Enter the name for the search scope, click the Include No Topic Or Area In This Scope radio button, click the Limit The Scope To The Following Groups Of Content Sources radio button, click the check box next to the source group that you created above, and click the OK button.
  5. Start a command prompt and run the IISRESET command to restart IIS. That causes the new search scope to show up immediately.

After creating the new search scope, you’ll be able to search for information within that scope from the portal. Before you can do that, however, you must first perform the indexing process.

Starting the Gathering Process
Before you can search, the gatherer has to index your application. Here are the steps to kick off the gathering process, which will create the index of your application.

  1. On the Manage Search Scopes page, click the Site Settings breadcrumb link.
  2. On the Site Settings page, in the Search Settings and Indexed Content section, click the Configure Search and Indexing link.
  3. On the Configure Search and Indexing page in the Content Indexes section, click the Manage Content Indexes link.
  4. Hover over the content index you created above, select the drop down on the right hand side of the name, and click Start Full Update from the menu.

Over the next few minutes, the page will automatically refresh, showing you the status of your new index. You can search your custom application from within the portal after that page returns to idle again and displays the number of documents in the index.

So Little Time for So Much Information
For smaller applications, you can create a search index corresponding to the information in your application in only a few minutes, enhancing the application by giving users a way to perform a full-text search, or simply giving people already using SharePoint Portal Server a single location from which to search. For larger applications, the effort required to implement SharePoint indexing varies; but in most cases, setting up the index takes only a few hours and can greatly improve people’s ability to find information.
