Wednesday, May 18, 2011

How To Configure SharePoint 2010 Search To Index SQL Server Databases

Chris Domino, Director, Enterprise Architect

I love selling SharePoint as an application platform. When I'm in client conversations or architectural debates regarding SharePoint development, I usually end up referencing my three "tiers" of SharePoint applications, depicted as follows:

The sizes of the boxes are sized relative to what you can deliver feature-set wise. I've been able to file every SharePoint project I've done in the last six years into one of these categories. The vast majority of my work has been in the middle box: branding SharePoint with custom functionality. Even though you get to build a lot of cool stuff in these types of projects, you must be deeply entrenched in SharePoint to do so. How many times have you been writing CAML or provisioning lists or modeling content types and angrily caught a coworker out of the corner of your eye happily hanging out in SQL Management Studio?

Even though our tools are getting better and better, SharePoint development is still hard.

That's why I love SharePoint as a backend. You can leverage all the stuff SharePoint is really good at: search, a "free" image / document repository, security delegation, scalability, Office integration, etc. without having to deal with the inevitable challenges intrinsic to any CMS technology. My last major project followed this very high-level architecture. It was a whiz-bang Silverlight 4 application, with EF 4, SQL 2008 R2, .NET RIA Services, and Prism as supporting characters.

But it was also a SharePoint app! Our performance and branding requirements pointed directly to a custom application. However, the client also wanted to leverage their existing SharePoint environment. The solution? Use both! This opened up a lot of really interesting integration points for my dev team. You can read the Office Integration post for one of them. But nearest to my heart among these was using SharePoint as the search provider for our data-driven Silverlight app.

In order to make this happen, we had to do a lot of work across the application's layouts. On the front end, we needed to define the metadata that would be bound to the various search result UIs in our app. The backend reintroduced me to writing insane SQL views to surface the data SharePoint would index (my t-SQL fortunately hadn't gotten too rusty since I let EF into my heart). And in the middle was not only all the configuration required to make SharePoint able to index our data, but a WCF service as well to make it available down on the client.

This was my general approach to customize search:

  1. Create the database. SQL owns the application data; EF models are generated from it.
  2. Create the views. Views were created for each "search scope." I put search scope in quotes here because they are not SharePoint search scopes...yet. At this point they were just views that served the data required for the different search areas of the app.
  3. Create the External Systems. SharePoint Designer 2010 (Oh yeah!) was used to create a separate "Data Connection" (aka "External System") for each SQL view.
  4. Create the External Content Types. Then, still in designer, I created an External Content Type (ECT) for each External System and configured the necessary ECT Operations for data access. These ECTs are finally beamed up to the default Business Data Connectivity (BDC) service application in Central Admin.
  5. Configure BCS. Thank you very much SharePoint Designer 2010! That would have required a lot of XML if we had been in 2007. Now we log onto the server and fire up Central Administration. The first thing to do here is set up security for the BDC Metadata Store, as well as permissions for each ECT. Also, make sure to set up your crawl account (which SharePoint calls the "Default content access account").
  6. Configure Content Sources. Next I configured a Content Source for each BDC External Content Type created in the last step. This completes our "chain" of mappings: one SQL view to one External Content Type to one External System to one Search Content Source.
  7. Discover Crawled Properties. Now it's time to run a full crawl of all of our sources. This will create a Crawled Property for each column in our SQL views.
  8. Create Managed Properties.When that's done, find an intern and have him or her create Managed metadata properties. Each column in each SQL view surfaced by our ECTs will show up as a Crawled Property in the "Business Data" category of the Search service application; these need to be mapped to new Managed Properties so they can be used as returnable columns in our search query. You could also consider a slick PowerShell script that reads your schema and does this for you. PowerShell is the best intern ever.
  9. Create Search Scopes. Nope, not done yet. The penultimate step is to create a Search Scope (no quotes this time) for each search area in your application. For example, our app is comprised of four tabs: one for the main search, and one each for the three major "pages" of content. Each page has its own search functionality for that content. Therefore, I have a total of four Search Scopes to create, each populated by a Content Source that's querying an External Content Type that's backed by a SQL view.
  10. Configure crawl schedules. Finally, I configured the full and incremental crawl schedules for my Content Sources, and kicked off another full crawl. I like to run a full crawl every night, and an incremental one every few hours, depending on my requirements (essentially how long it's acceptable to have "stale" content in your results). Once the Search Scopes update themselves (you can force this in "Search Administration" which is the "home page" of the Search service application) you are ready to query!

The aim of this post isn't to go into a great amount of "how-to" detail on the above points; plenty of other blogs have really good detailed instructions on the minutia of these configurations. These are the same resources I used to help me come up with the approach. Instead, I wanted to share some of the issues I ran into along the way. The aforementioned blogs are great for "how-to's" but don't address any of the "what-if's." I am all about the "what-if's."

So now I'll go back through the procedure above and annotate them with my $0.02, as well as any problems, headaches, or anxiety I got along the way.

  1. Create the database. Search is usually the last hurdle to clear in my projects, and normally this isn't a problem - especially if it's SharePoint. However, with an application like this, it would have been nice to look at search toward the beginning to get an idea of what the views would look like. We would have designed our data model a little different to relieve the complexity of these queries. Additionally, we had to reshuffle our site hierarchy (by consolidating three site collections into one) to make some other things we left for the end of the project work. It's the same lesson: consider as much as possible before diving in!

  2. Create the views. Even though SharePoint search is indexed behind the scenes on different servers, and has the concept of incremental crawls, performance of your views still matters. We filled some of our tables with over a million rows of test data before release. The views used in search based off these tables easily took over a half hour to churn in Management Studio, and timed out the crawler. I don't have hard numbers (regarding how long the timeout period was or how long the query took or how big the dataset was) to articulate any guidance; try keep your queries sane. If they take longer to run in Management Studio than you're willing to sit and watch and wait for, it's a safe bet SharePoint will be just as impatient.

  3. Create the External Systems. In this step, recall that I created a separate data source (External System) in SharePoint Designer for each ECT. It's possible to create multiple ECTs for a single data source, but doing things this way didn't give me the granularity I needed in Central Admin's BCS service application. Basically, when you create a new Search Content Source based off a BCS model, you can only choose among the External Systems, not the individual ETCs. So I decided to be explicit and create an External System for each External Content Type so that each could be mapped to its own Search Content Source.

  4. Create the External Content Types. A new feature of BCS in 2010 is that it has full support for CRUD operations. But for searching external data, we only need two: ReadItem and ReadList (for each ECT). But why both? When I left off ReadList, I didn't get any results. When I did without a ReadItem operation, I got the following error in my crawl log: "Specific Finder MethodInstance not found in Entity." So just do both.

  5. Configure BCS. As we should be all too familiar with in SharePoint, there's a lot of security around search. You have to configure permissions for the content access account, as well as the metadata, each ECT, and then some. I've almost never not gotten an access denied error at some step along the way of search configuration. If you are getting one and can't figure out, just go through all the screens and grant full permissions to everything until it works. Then go backward, stripping of permissions one at a time, until you break it again. This is our sweet spot for the principle of least permissions.

  6. Configure Content Sources. In addition to having a separate content source for each ECT, we need an additional one for documents. Since our app used SharePoint as a document repository as well, we had to make sure we could index the content of these documents. Then matching content from our document libraries could be welded to our query results from the custom content sources. Since you can't delete the OOTB "Local SharePoint sites" content source, I repurposed it for this.

  7. Discover Crawled Properties. Here's a gotcha: if a column in your SQL view doesn't contain any data for any row (all nulls), a crawled property won't be extracted in SharePoint. Very frustrating. Make sure you create a few test entities before your first full crawl so there is at least one row populating each column.

  8. Create Managed Properties. If you experience the issue above, or find that your metadata gets very cluttered as you incrementally develop your search functionality, you might want to delete properties along the way and start from scratch with a full crawl. Deleting crawled properties is less intuitive than managed properties, because crawled properties are implicitly discovered whereas managed are explicitly created by you. To have deletions succeed for the former, make sure you clear all of the mappings and deselect the "Included in index" box in the property's properties (that's fun to say). The alternative is rebuilding your search service application or writing a PowerShell script to take care of it automatically.

  9. Create Search Scopes. Recall that we hijacked the OOTB "Local SharePoint sites" content source to crawl only our document libraries. To use these search items in your scopes, create a new rule of type "Web Address" (the default) and select "Folder" for the web address (also the default). Then type the beginning of the URL to the document library you want to include in the scope. The custom search service our Silverlight frontend hits has the logic to discern whether search results come from our views or from SharePoint document libraries. I'll have a bit more on these services at the end of this post.

  10. Configure crawl schedules. Finally, when your crawl runs to completion and actually contains items, you might wonder why the number of items in a content source is less than the number of rows returned from the corresponding query when run in Management Studio. This is because only unique rows are added to the index. You can fudge this a little by adding extra uniqueness-enforcing columns (like a uniqueidentifier column with a default value of "newid()"), or a smattering of composite primary keys created in Step #4 in SharePoint Designer.

Finally, I'd like to discuss the service I've alluded to a few times. This is a standard WCF endpoint that lives on the SharePoint server, and is called by the Silverlight frontend to facilitate search. Basically, the app sends the query and the search scope to the service, which then does all the magic needed to query both our database and the content stored within documents, merge the results together into a single array of DTOs, and send everything back down.

Let's look at some of the more interesting parts of this service. First, the following code shows my entry point into the SharePoint search API, and how I parsed the query to support key words (and quoted phrases).

  1. //build query
  2. using (FullTextSqlQuery fq = new FullTextSqlQuery(SPContext.Current.Site))
  3. {
  4. //separate query into single words and quoted phrases
  5. Regex regex = new Regex(@"\x22[^\x22]*\x22|\S+", RegexOptions.IgnoreCase);
  6. foreach (Match m in regex.Matches(query))
  7. sb.Append(string.Format("{0}{1}{0} AND ", m.Value.Contains("\"") ? string.Empty : "\"", m.Value));
  8. //set query
  9. string queryText = sb.ToString().Replace("'", "''");
  10. queryText = queryText.Substring(0, queryText.Length - 5);
  11. ...
  12. }

Next, (and this would be starting at Line #11 above) we add the metadata properties we want to search against. This is the "select" clause of our search query.

  1. sb = new StringBuilder("SELECT ");
  2. //sharepoint metadata
  3. sb.Append("Path, ");
  4. sb.Append("Rank, ");
  5. sb.Append("Title, ");
  6. sb.Append("IsDocument, ");
  7. sb.Append("LastModifiedTime, ");
  8. sb.Append("HitHighlightedSummary, ");
  9. //custom metadata (crawled properties mapped to managed properties)
  10. sb.Append("AppId, ");
  11. sb.Append("AppURL, ");
  12. sb.Append("AppDate, ");
  13. sb.Append("AppTags, ");
  14. sb.Append("AppTitle, ");
  15. sb.Append("AppFilterId, ");
  16. sb.Append("AppImageURL, ");
  17. sb.Append("AppPopularity ");

Now the "from" and "where" statements:

  1. //scope and where
  2. sb.AppendFormat("FROM SCOPE() WHERE \"scope\" = '{0}' AND (", scope);
  3. sb.AppendFormat(" CONTAINS(AppURL, '{0}') OR ", queryText);
  4. sb.AppendFormat(" CONTAINS(AppDate, '{0}') OR ", queryText);
  5. sb.AppendFormat(" CONTAINS(AppTags, '{0}') OR ", queryText);
  6. sb.AppendFormat(" CONTAINS(AppTitle, '{0}') OR ", queryText);
  7. sb.AppendFormat(" CONTAINS(AppImageURL, '{0}') OR ", queryText);
  8. sb.AppendFormat(" FREETEXT(DEFAULTPROPERTIES, '{0}') )", queryText);

Line #2 sets the scope, and Line #8 is what pulls in content from documents in Document Libraries. Now we configure the query and execute it:

  1. //configure query
  2. fq.RowLimit = 100;
  3. fq.TrimDuplicates = true;
  4. fq.EnableStemming = true;
  5. fq.QueryText = sb.ToString();
  6. fq.ResultTypes = ResultType.RelevantResults;
  7. //run query
  8. DataTable data = new DataTable();
  9. data.Load(fq.Execute()[ResultType.RelevantResults], LoadOption.OverwriteChanges);
  10. foreach (DataRow row in data.Rows)
  11. {
  12. ...
  13. }

The final bit goes in the for loop at Line #12. This is the logic that determines if a search result is from our application (the database) or content (SharePoint).