Tuesday, June 29, 2010

Using SharePoint 2007 Search via custom code

Recently, I’ve been working on a SharePoint 2007 farm that has 500+ site collections, each representing a project.  The plan is to potentially have several GB of (scanned) documents in each of these projects, so they were set up as separate site collections.  That way they can be split across different content databases, and each one can be backed up and restored independently if needed.

I’ve been tasked with modifying a custom webpart that rolls up some information (tasks) from these individual site collections.  The original version of this webpart iterated through every site collection on the web application, found the task list, and loaded the tasks.  It worked with 20-30 site collections (but was slow), but would time out with 500+.  A few weeks ago, I made a couple of quick tweaks to reduce the amount of list-scanning and to properly dispose of objects after use, and got it down to a 50s load time on those 500 site collections, but could get no further.  Creating 500+ SPSite objects just isn’t fast, and SPSiteDataQuery doesn’t work across site collections.  So I’ve been tasked with finding an alternate way to do it, and we’ve settled on using MOSS Search.

After a few false starts (initially using WSS search instead of MOSS search), and some issues around dates (use ISO8601 format), my query ran as I expected and pulled back all the tasks.  I switched over to FullTextSqlQuery (to get the range comparison operators) and the core functionality of my custom search piece was done.
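
For reference, here is a minimal sketch of the kind of FullTextSqlQuery call described above.  It is not the actual webpart code.  It assumes it runs on a MOSS 2007 server with references to Microsoft.SharePoint.dll and Microsoft.Office.Server.Search.dll.  The class and method names, the row limit, and the ‘DueDate’ managed property are placeholders I made up for illustration, and the exact ISO8601 variant your farm accepts may differ from the one shown.

```csharp
using System;
using System.Data;
using System.Globalization;
using Microsoft.Office.Server.Search.Query;   // MOSS search, not Microsoft.SharePoint.Search
using Microsoft.SharePoint;

public static class TaskSearchSketch
{
    // Returns task rows due before the cutoff.
    // "DueDate" is a hypothetical managed property; substitute your own.
    public static DataTable FindTasksDueBefore(string siteUrl, DateTime cutoffUtc)
    {
        // ISO 8601 timestamp, e.g. 2010-06-29T00:00:00Z (assumed variant)
        string cutoff = cutoffUtc.ToString(
            "yyyy-MM-dd'T'HH:mm:ss'Z'", CultureInfo.InvariantCulture);

        // Range comparisons like "<" are the reason for FullTextSqlQuery here.
        string sql =
            "SELECT Title, Path, AssignedTo, DueDate " +
            "FROM SCOPE() " +
            "WHERE ContentClass = 'STS_ListItem_Tasks' " +
            "AND DueDate < '" + cutoff + "'";

        using (SPSite site = new SPSite(siteUrl))
        using (FullTextSqlQuery query = new FullTextSqlQuery(site))
        {
            query.QueryText = sql;
            query.ResultTypes = ResultType.RelevantResults;
            query.RowLimit = 500;            // arbitrary cap for the sketch
            query.TrimDuplicates = false;    // tasks often share titles

            ResultTableCollection results = query.Execute();
            ResultTable relevant = results[ResultType.RelevantResults];

            // ResultTable is an IDataReader, so it loads straight into a DataTable.
            DataTable table = new DataTable();
            table.Load(relevant, LoadOption.OverwriteChanges);
            return table;
        }
    }
}
```

One SPSite is created here regardless of how many site collections the index covers, which is where the speedup over iterating 500+ site collections comes from.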

Quick-hit list of lessons from this:

  • Having 500+ separate site collections makes doing anything to all sites rather painful (and rather slow).  Now that 2010 is out, consider whether RBS can let you achieve the scale you’re looking for without death-by-a-million-site-collections.
  • When writing code against search, make sure you know whether you want to use WSS search (Microsoft.SharePoint.Search) or MOSS search (Microsoft.Office.Server.Search).  Their APIs may look similar, but WSS search is less capable.  For one, WSS search doesn’t work with custom managed properties (it’ll throw an “InvalidPropertyException: Property doesn't exist or is used in a manner inconsistent with schema settings”).
  • KeywordQuery does simple queries well.  If you need something more complicated (like greater-than/less-than), you have to go to FullTextSqlQuery.
  • ‘Contains’ clauses (i.e. Contains(“value”) or Contains(field, “value”)) don’t support full wildcards, so “*value*” won’t match what you want it to.  Use ‘like’ instead (though that only works for a single field).
  • Dates for FullTextSqlQuery must be in ISO8601 format.
  • When exposing a new field to search, the procedure is: add the field to the list (if needed), run an incremental crawl (which automatically adds the crawled property), add the managed property, then run a full crawl.  Keep this in mind when scheduling a migration to a new environment, as that full crawl can take some time (4-8 hours in my 500+ site collection setup on dev hardware).
  • AssignedTo behaves oddly when it comes back in a search result.  If it has a single user, it comes back with “Joe Smith”.  If it has multiple, it comes back with “Joe Smith;#41;#Jane Smith;#42;#Roger Brown”.  I had to add logic for this field to remove the ;#ID;# chunks from this (split it up, keep only the 0th, 2nd, etc. items).  This works for my scenario because I don’t actually want those to be links anyway (hence I don’t need the IDs):

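    // Multi-user values look like "Joe Smith;#41;#Jane Smith;#42;#Roger Brown".
    // Split on ";#" and keep the even-indexed tokens (names); the odd ones are user IDs.
    // The indexed Where overload needs "using System.Linq;".
    var assignedTo = Convert.ToString(row["AssignedTo"]);
    var assignedTokens = assignedTo.Split(new string[] { ";#" }, StringSplitOptions.None);
    var assignedToNames = assignedTokens.Where((s, i) => i % 2 == 0).ToArray();
    row["AssignedTo"] = string.Join(", ", assignedToNames);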

  • When your managed property has the ‘Include values from all crawled properties mapped’ flag set, you may get an array for that column instead of a single value (one farm gave me an array, the other didn’t).  None of my search results had multiple values, so I’m guessing the array is used if *any* item in the index has multiple values.  If your managed properties are defined that way, you’ll need to fix it up.  I have a previous blog post that covers dealing with multi-valued managed properties in SharePoint Search.
  • If you’re seeing ‘odd’ values in some default fields, check the managed property definitions.  OOTB (and on my local dev environment), the definition of the managed property ‘Path’ was such that non-document-library items had a value like DispForm.aspx?ID=x, while on our shared test/prod farms, it was ListName/X._000.  It turned out the managed property definition had been modified to try to fix an issue with a third-party library.  Since it ended up not helping that library, we just reverted the change back to the OOTB managed property definition, and all was well again.
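
Two of the points above (using ‘like’ instead of ‘Contains’ for substring matches, and managed properties that sometimes come back as arrays) can be sketched as small helpers.  This is only a rough illustration: the class and method names are hypothetical, doubling single quotes to escape them is my assumption based on standard SQL rather than something I verified against the search syntax, and the “; ” separator is arbitrary.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class SearchResultHelpers
{
    // Builds a single-field LIKE predicate, e.g. Title LIKE '%budget%'.
    // Assumption: single quotes are escaped by doubling them.
    public static string LikeClause(string managedProperty, string fragment)
    {
        string escaped = fragment.Replace("'", "''");
        return managedProperty + " LIKE '%" + escaped + "%'";
    }

    // Collapses a result cell to one display string, whether the farm
    // returned it as a single value or as an array of values.
    public static string Flatten(object cell)
    {
        if (cell == null || cell == DBNull.Value) return string.Empty;

        // string is itself enumerable (of chars), so handle it first.
        string text = cell as string;
        if (text != null) return text;

        IEnumerable items = cell as IEnumerable;
        if (items == null) return Convert.ToString(cell);

        List<string> parts = new List<string>();
        foreach (object item in items)
        {
            if (item != null) parts.Add(Convert.ToString(item));
        }
        return string.Join("; ", parts.ToArray());
    }
}
```

Running every cell through something like Flatten before display means the rest of the webpart doesn’t have to care which farm it is running on.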

Sure, using search for things that aren’t-quite-search has a few challenges, but when it’s all done and working together, it’s a solid solution.  In this case, it changed a feature that took so long to run that it’d time out into a feature that can do sub-second searches, and doesn’t rely on any custom code in event handlers to keep track of everything.