SharePoint Portal Server's search engine is one of its most powerful features, able to search file shares, Exchange public folders, Notes databases, and Web sites; however, the results you get when you index a Web site may be a bit overwhelming. In this article, you'll see how SharePoint's search feature operates and how you can leverage its behavior to index your custom application properly.
How SharePoint Search Works
SharePoint search works using a gatherer process that crawls a configurable set of content. While SharePoint crawls sites, it performs two simultaneous processes. First, it collects all the words in the content being indexed, and second, it identifies other content (via links) that it should also crawl. These two processes work together to create an index that contains every word from every piece of linked content.
During indexing, a filtering process limits how much of the newly discovered content SharePoint indexes. This filtering process, discussed in more detail later in this article, prevents the crawler from crawling the entire network and the entire Internet as do public search engines such as Google. Typically, you'd set this filtering process so that SharePoint indexes only files in a particular file share or only Web pages on a particular Web server. SharePoint adds any other content encountered during the crawl to the list of pages to be crawled; however, the gatherer rejects content that doesn't match the filter rules before processing it.
When crawling a Web page, the gatherer collects all the words in the document. It simultaneously gathers all the anchor tags (links) in the page. SharePoint adds the gathered words to the document index, and moves the URLs in the anchor tags into the list of Web pages yet to be crawled. When the gatherer is ready for the next page, it retrieves the next item from that list and verifies that the item meets the filter criteria. If it does not, SharePoint rejects it and reads the next item. When an item does meet the filter criteria, SharePoint first checks to ensure that it has not already crawled that item; if not, it proceeds with the crawl, adding the item's content to the index and its links to the list of items to be crawled. This process repeats until every piece of content is read.
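The gatherer loop described above is essentially a filtered breadth-first crawl. The following Python sketch illustrates the idea; the `matches_filter`, `fetch`, `extract_words`, and `extract_links` callbacks are hypothetical stand-ins for the work SharePoint does internally, not actual SharePoint APIs.

```python
from collections import deque

def crawl(start_urls, matches_filter, fetch, extract_words, extract_links):
    """Conceptual sketch of the gatherer's crawl loop (all callbacks are
    hypothetical; SharePoint's real gatherer is not scriptable this way)."""
    index = {}                    # word -> set of items containing it
    to_crawl = deque(start_urls)  # items yet to be crawled
    crawled = set()               # items already processed

    while to_crawl:
        url = to_crawl.popleft()
        # Reject items that fall outside the configured filter rules.
        if not matches_filter(url):
            continue
        # Skip items that have already been crawled.
        if url in crawled:
            continue
        crawled.add(url)
        content = fetch(url)
        # Add every word in the item to the index...
        for word in extract_words(content):
            index.setdefault(word, set()).add(url)
        # ...and queue the item's links for later crawling.
        to_crawl.extend(extract_links(content))
    return index
```

Note that the filter check happens when an item is pulled off the list, not when it is added; this matches the article's description, where links are collected freely but rejected before processing if they fall outside the filter rules.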
The process for indexing a file share is very similar; however, SharePoint handles directories differently from files. When the gatherer encounters a file, it indexes the content in that file and puts the words into the index as described above. When it encounters a directory, however, no content is added to the index; instead, the gatherer adds each file in the directory and all of its subdirectories to the list of content to crawl.
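In a file-share crawl, directories play the role that links play on the Web: they contribute no words themselves, only more items to crawl. A minimal Python sketch, with a hypothetical `index_file` callback standing in for the word-extraction step:

```python
import os

def crawl_share(root, index_file):
    """Sketch of file-share gathering: directories queue their children for
    crawling, while only files contribute content (via the hypothetical
    index_file callback) to the index."""
    to_crawl = [root]
    while to_crawl:
        path = to_crawl.pop()
        if os.path.isdir(path):
            # Directories add no words; their contents are queued instead.
            to_crawl.extend(os.path.join(path, name) for name in os.listdir(path))
        else:
            index_file(path)
```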
So, why doesn't this scheme work well for custom applications? The reason is that most custom Web applications don't have plain anchor links to every possible page in the system and every possible query string parameter. One core task when connecting your custom application to SharePoint for indexing is providing a way for it to identify what content is available.
Protocol Handlers and IFilters
The high-tech way to let SharePoint understand what content is available in your application is to create a protocol handler and, in some cases, an IFILTER. The protocol handler is an extension to the gatherer that lets it handle different protocols in the provided URLs. Much like your Web browser understands http, ftp, mailto, and news (nntp) protocols in URLs, you can enable SharePoint Portal Server search to connect to your own protocol.
With a protocol handler, you can make up your own protocol; for example "mca," meaning "my custom application." Once you install your protocol handler, SharePoint calls it whenever it encounters a URL that begins with your new protocol. Thus, to index your application, you would simply add your protocol handler into SharePoint and then tell SharePoint to index your custom protocol, providing whatever additional information is necessary. The protocol handler can then identify all of the different elements to be added to the list of items to be indexed.
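Conceptually, the gatherer dispatches each URL to the handler registered for that URL's scheme. Real SharePoint protocol handlers are COM components, but the dispatch idea can be sketched in a few lines of Python (the registry, the "mca" scheme, and the handler's return value are all illustrative assumptions):

```python
from urllib.parse import urlsplit

# Hypothetical registry mapping URL schemes to handler callables,
# mirroring how the gatherer hands a URL to the matching protocol handler.
handlers = {}

def register_handler(scheme, handler):
    handlers[scheme.lower()] = handler

def enumerate_items(url):
    """Dispatch a URL to the protocol handler registered for its scheme;
    the handler returns the items that should be added to the crawl list."""
    scheme = urlsplit(url).scheme.lower()
    if scheme not in handlers:
        raise ValueError(f"no protocol handler registered for {scheme!r}")
    return handlers[scheme](url)
```

A custom "mca" handler would then be registered once, after which any `mca://` URL encountered during a crawl would flow through it.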
In contrast, IFILTERs don't handle protocols; they handle file types returned by the protocol handlers. Natively, SharePoint has IFILTERs for Web pages, Microsoft Office documents, Adobe Acrobat files, and a variety of other file types. The IFILTERs are responsible for extracting words to be indexed from their particular file type. The gatherer determines which IFILTER to use by applying MIME type rules to the document returned by the protocol handler and then invoking the right IFILTER for that file type. In your high-tech solution, you might also provide an IFILTER to process a custom MIME type, allowing you to create custom indexing for a page in your application. Of course, that means making the page return a custom MIME type to the search engine.
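The division of labor, then, is MIME type in, word list out. A toy Python sketch of that dispatch (the filter table and extractors are illustrative; a real IFILTER is a COM component implementing a chunk-based extraction interface):

```python
import re

def html_filter(content):
    """Toy word extractor for HTML: strip tags, then split into words."""
    return re.sub(r"<[^>]+>", " ", content).split()

def text_filter(content):
    return content.split()

# Hypothetical mapping of MIME types to word extractors, analogous to
# the gatherer choosing an IFILTER based on a document's MIME type.
filters = {
    "text/html": html_filter,
    "text/plain": text_filter,
}

def extract_words(mime_type, content):
    filt = filters.get(mime_type)
    if filt is None:
        return []  # no filter registered: the item contributes no words
    return filt(content)
```

This is also why a custom MIME type gives you custom indexing: changing the type a page reports changes which filter processes it.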
Overall, the high-tech solution using a protocol handler and IFILTERs provides the most precise level of control for searching a custom application. However, the cost for that tight control is a great deal of effort to create these extensions to SharePoint and make the associated changes to the indexed applications. If you're willing to give up a little of that control and provide some support, there's a much easier way to index your custom Web application using the built-in SharePoint indexing facilities.