Starting with Content Sources
SharePoint Portal Server indexing always starts with a content source. The content source in SharePoint vernacular is the starting point for the indexing process. For a Web site, the content source identifies the URL from which SharePoint should derive all other URLs. The content source in SharePoint doesn't hold any of the indexed content; it's merely a pointer telling SharePoint where to start. Holding the indexed content is the responsibility of the Content Index.
Filtering Content Indexes
Content Indexes are the repositories of the indexed information, but they also serve another very important purpose: they're the way that SharePoint lets you filter what information the index retains. SharePoint has a set of rules that define what items are included or excluded.
SharePoint processes the rules from the top down. In other words, when indexing content, the first matching rule in the list determines what the gatherer does with the content, either including or excluding it. This matters because content, particularly on Web sites, often contains links that lead to places you don't want SharePoint to index. For instance, if your intranet has links to the Microsoft support site, you most likely do not want your intranet index to include Microsoft's support pages.
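The top-down, first-match behavior can be sketched in a few lines of Python. This is only an illustration of the evaluation order, not SharePoint's actual rule engine: the rule list, the wildcard pattern syntax, the `should_index` helper, and the default used when no rule matches are all assumptions made for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical rule list: (URL pattern, include?) pairs, checked top to bottom.
RULES = [
    ("http://support.microsoft.com/*", False),  # exclude the external support site
    ("http://intranet/*", True),                # include the intranet itself
]

def should_index(url, rules=RULES, default=False):
    """Return the decision of the first rule whose pattern matches the URL."""
    for pattern, include in rules:
        if fnmatchcase(url, pattern):
            return include  # first match wins; later rules are never consulted
    return default  # assumption: content that matches no rule is left out

# Order matters: a broad include placed first hides the exclude below it.
MISORDERED = [
    ("http://*", True),
    ("http://support.microsoft.com/*", False),
]
```

With `RULES`, `should_index("http://support.microsoft.com/kb/1")` is `False`, while the same call with `rules=MISORDERED` is `True`, because the broad rule at the top matches first.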
Setting Up a META ROBOTS Tag
In addition to the Content Index rules, another important way to control the index contents is via a META tag containing a NAME attribute with a value of ROBOTS. This tag tells the crawler what it is supposed to do with the page. The tag takes the form:
<META NAME="robots" CONTENT="noindex" />
In the CONTENT attribute, you can specify an index directive, a follow directive, or both, separated by a comma. The index directives are "index" and "noindex". They tell the crawler whether it should or should not include the content of the page in its index. The follow directive is either "follow" or "nofollow". It indicates whether the crawler should follow the links on the current page and index the linked pages according to their own META tags. If no META ROBOTS tag is present, the crawler will assume it can index the page and can follow all links.
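To make the directive rules concrete, here is a minimal Python sketch of how a crawler might interpret the attribute value. The `parse_robots` function is hypothetical, and treating the directives as case-insensitive and ignoring unrecognized ones are assumptions of the sketch, not documented SharePoint behavior.

```python
def parse_robots(content=None):
    """Return (index, follow) booleans for a META ROBOTS content value.

    Pass None when the page has no META ROBOTS tag at all.
    """
    index, follow = True, True  # defaults: index the page and follow its links
    if content:
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive == "noindex":
                index = False
            elif directive == "index":
                index = True
            elif directive == "nofollow":
                follow = False
            elif directive == "follow":
                follow = True
            # any other value is ignored in this sketch
    return index, follow
```

For example, `parse_robots("noindex, follow")` yields `(False, True)`: do not index this page, but do follow its links.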
The META ROBOTS tag is a key part of the strategy that allows SharePoint to index your custom application. To get SharePoint to index your custom application, you link to a special gatherer start page from your home page. This special gatherer start page contains a link to every page in the site you want indexed. Then you set the META ROBOTS tag on the start page to read "noindex, follow". The result is that SharePoint will follow all the links embedded in that page, but won't index the page itself.
Creating Your Own Starting Page
So, to get SharePoint to index your entire application, you need only create a page that contains links to all the site's pages that you want indexed: a list of the pages that together would display all the data in the application to be searched. The listing page simply provides a set of links to each of the pages the crawler needs to visit during the indexing process.
Typically, data pages (those that display the real meat of the application) accept query string parameters that the listing page embeds within the links. The result is that the indexer is told to repeatedly index the same page, but each time with different query parameters and, therefore, with different content. By creating the page with a link to pages that will display all of the important data in the site, or at least further link to pages where the crawler can find the data in the system, you enable the gathering of all of the information in the system.
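As a rough illustration, a listing page of this kind could be generated as follows. The sketch is in Python purely for brevity; the `build_start_page` function, the `/app/detail.asp` target URL, and the `id` query parameter are hypothetical stand-ins for whatever your application actually uses.

```python
from html import escape
from urllib.parse import urlencode

def build_start_page(target_url, record_ids, param="id"):
    """Return HTML for a gatherer start page with one link per record."""
    links = []
    for record_id in record_ids:
        # Same target page each time, but with a different query string.
        href = f"{target_url}?{urlencode({param: record_id})}"
        links.append(f'<a href="{escape(href)}">{escape(str(record_id))}</a>')
    return (
        "<html><head>\n"
        # Tell the crawler to follow the links but not index this page itself.
        '<META NAME="robots" CONTENT="noindex, follow" />\n'
        "</head><body>\n"
        + "<br />\n".join(links)
        + "\n</body></html>"
    )

page = build_start_page("/app/detail.asp", [1, 2, 3])
```

In a real application the record identifiers would come from the application's own data store, so the listing always reflects the current set of pages to index.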
Creating a Target Page
If your application does not already have a page that will serve to display the data from the starting page, then you will need to create your own target page for each of the links. In most cases, this is not necessary, because there is a suitable target page that displays data and can be modified to support the users' needs, as well as the needs of the gatherer.
Another consideration for the target page, and the reason why the target page so often needs to be modified, is that the page should be as free from repetitive data as possible. This extraneous information is not only useless to the gatherer; it actually reduces the gatherer's effectiveness. Most pages have a menu, and most menus don't change from page to page, or users would never be able to find their way around the site. So if, for instance, your menu contains the word "administration," then any time a search is requested with the word "administration," every page in the site would come up. That's not exactly the result you are looking for.
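The effect of a repeated menu is easy to reproduce with a toy word index. The Python sketch below is not how SharePoint stores its index; the `build_index` function, the menu text, and the two sample pages are invented purely to show why shared boilerplate pollutes search results.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page names containing it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

MENU = "Home Administration Reports"  # hypothetical menu repeated on every page
BODIES = {
    "orders": "Open orders for this week",
    "users": "Administration of user accounts",
}

# Index the pages once with the menu text included and once without it.
with_menu = build_index({name: f"{MENU} {body}" for name, body in BODIES.items()})
without_menu = build_index(BODIES)
```

Looking up "administration" in `with_menu` returns both pages, while `without_menu` returns only the `users` page, which is the one that is actually about administration.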