The Sample Application
The downloadable code accompanying this article contains a sample Web application that demonstrates the concepts discussed here and that you can use for practice. The sample application contains three pages: default.aspx, crawler.aspx, and detail.aspx.
Default.aspx is the home page of the application. It includes a link to crawler.aspx that has no text between its opening and closing tags. That link generally looks something like:
<a href="./crawler.aspx"></a>
Crawler.aspx is the search start page. It contains a loop that adds 100 links to detail.aspx, creating a link with a different query string value for each loop iteration. The core code for crawler.aspx is:
for (int looper = 0; looper < 100; looper++)
{
    // Link to detail.aspx with a unique id in the query string
    HtmlAnchor ctlAnchor = new HtmlAnchor();
    ctlAnchor.HRef = "detail.aspx?id=" + looper;
    ctlAnchor.InnerText = "Detail " + looper;

    // Line break so that each link appears on its own line
    LiteralControl ctlBr = new LiteralControl();
    ctlBr.Text = "<br />";

    // frmControl is a container control declared elsewhere in the page
    frmControl.Controls.Add(ctlAnchor);
    frmControl.Controls.Add(ctlBr);
}
This code produces the page shown in Figure 1.
Figure 1. Link Page: The file crawler.aspx contains a list of links for SharePoint to index.
When the loop shown above completes, crawler.aspx contains a list of links for the gatherer to follow. The content behind those links is simple: for each request, detail.aspx displays a short message, including the query string, in its response. Figure 2 and Figure 3 show the content that the gatherer will index when it follows the first two links.
Figure 2. Crawling Links: The figure shows the content that detail.aspx returns when the gatherer crawls it using a query string ID of 0.
Figure 3. Varying Parameters to Obtain Different Content: The figure shows the content that detail.aspx returns when the gatherer crawls it using a query string ID of 1.
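The behavior described for detail.aspx can be sketched as a small helper. This is only an illustration, not the code from the download: the class name `DetailMessage`, the method `Build`, and the message wording are all hypothetical, and the sketch assumes a .NET version that provides `System.Net.WebUtility`.

```csharp
using System;
using System.Net;

// Hypothetical helper: builds a short message of the kind that a page like
// detail.aspx could write into its response for a given "id" value.
public static class DetailMessage
{
    public static string Build(string id)
    {
        // A missing query string value arrives as null; show a placeholder instead.
        string shown = string.IsNullOrEmpty(id) ? "(none)" : id;

        // HTML-encode the value so markup in the query string is not rendered.
        return "This is the detail page for ID " + WebUtility.HtmlEncode(shown) + ".";
    }
}
```

In the page's load handler, the result could be assigned to a label, for example `lblMessage.Text = DetailMessage.Build(Request.QueryString["id"]);`, where `lblMessage` is an assumed control name. Because each query string value yields different text, the gatherer sees each link as distinct content.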
You can install the sample application for this article in a Web site and have SharePoint index it to see the process in action. Note, however, that you cannot install the application in an IIS virtual server that has been extended with SharePoint Portal Server or Windows SharePoint Services, because indexing does not work properly for applications installed in a SharePoint-extended virtual server.
Differentiating Search User Agents from Humans
When indexing, you generally want to strip the target page of as much peripheral information as possible, including elements that human visitors rely on, so it's often best to differentiate between a search engine and a user. In other words, when the Web request originates with a user, you want to render the page normally, but when the Web request comes from a search engine, you want to suppress the rendering of standard items such as headers, footers, menus, related items, and announcements, so that they're not indexed with the page.
You do this by evaluating the UserAgent property of the Request object. Each HTTP request includes a user-agent header that identifies the type of browser making the request. SharePoint includes "SPRobot" in the user agent string when it issues a request. So, if you find "SPRobot" in the user agent string, you know that the request is coming from the SharePoint indexing engine, and you can suppress unnecessary items.
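A minimal sketch of that check follows. The class name `CrawlerDetector` and the method `IsSharePointCrawler` are hypothetical, and the case-insensitive comparison is a defensive assumption rather than something the article specifies.

```csharp
using System;

// Hypothetical helper that decides whether a request came from the
// SharePoint indexing engine, based on the user agent string.
public static class CrawlerDetector
{
    // Token that, per the article, SharePoint includes in its user agent string.
    private const string SharePointToken = "SPRobot";

    public static bool IsSharePointCrawler(string userAgent)
    {
        // Requests without a user-agent header are treated as ordinary requests.
        if (string.IsNullOrEmpty(userAgent))
        {
            return false;
        }

        // Assumption: ignore case in case the token's casing varies.
        return userAgent.IndexOf(SharePointToken, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

A page could call this from its load handler, for example `pnlHeader.Visible = !CrawlerDetector.IsSharePointCrawler(Request.UserAgent);`, where `pnlHeader` is an assumed name for a panel that wraps the header. Keeping the test in one helper means every page that suppresses items for the crawler applies the same rule.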