Sunday, March 11, 2012

Getting the first SPMetal queries after application domain startup to not crash

I was recently asked to take a look at some exception stack traces from the production logs of a large intranet site that just went live and is now experiencing a lot of these exceptions calling SPMetal queries that didn’t show up during development and testing, and whenever they go looking for them, they don’t happen, but they regularly appear in the logs.  All of the exceptions had something like this at the top:

System.ArgumentException: An item with the same key has already been added.
at System.ThrowHelper.ThrowArgumentException(ExceptionResource resource)
at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
at System.Collections.Generic.Dictionary`2.Add(TKey key, TValue value)
at Microsoft.SharePoint.Linq.Rules.ToEnumerableProcessor.GetEnumerableOperator(MethodInfo qm)
at Microsoft.SharePoint.Linq.Rules.ToEnumerableProcessor.ConvertMethod(MethodCallExpression mce, Context ctx)
at Microsoft.SharePoint.Linq.Rules.ToEnumerableProcessor.<.cctor>b__13(MethodCallExpression mce, Context ctx)
at Microsoft.SharePoint.Linq.Rules.GuardedRule`4.<>c__DisplayClass3.<.ctor>b__1(TSourceBase src, TContext ctx)
at Microsoft.SharePoint.Linq.Rules.SwitchRule`3.Apply(TSource src, TContext ctx)

So what did I do to try to narrow this down?  Why ILSpy of course (queue the ILSpy theme music – it sounds a lot like the Mission Impossible theme to me, but whatever song plays in your head is fine).  Here’s where I started in ToEnumerableProcessor.GetEnumerableOperator:

image

So, ToEnumerableProcess.enumerableOpsByName, eh?

clip_image004

Yep, that’s a static dictionary.  If “typeof(Enumerable).GetMember(@operator.Name, MemberTypes.Method, BindingFlags.Static | BindingFlags.Public).Cast<MethodInfo>().ToArray<MethodInfo>()” takes a bit of time to run, we could probably end up with two threads inside that if block at the same time, then both do the Dictionary.Add call, and all but the first would get the exception we’re looking for.  The problem with this theory is that that wouldn’t explain why we get this error a number of times during the day – there’s only 50 different method names on Enumerable, and most of them are probably never used by our code, so we’d not expect to see many of these, even right after startup.  After talking to the team, I found out they’ve been making a number of manual tweaks to the web.config file for the site in the last week or so, which would explain why it the application domain is being rebuilt so many times, and why they’re seeing these errors so often.  One idea that has been floated is the idea of running *all* SPMetal queries while holding the same lock

Now that we have a theory, can we reproduce it?  To do this, we’d need a simple SPMetal context, and a way to test the theory – create a couple of threads, synchronize them (to eliminate thread startup time from the equation), then create the context and execute a query and see what happens.  Let’s just do a simple query: “cx.StyleLibrary.ToList()”.  What do we get?

System.ArgumentException: An item with the same key has already been added.
   at System.ThrowHelper.ThrowArgumentException(ExceptionResource resource)
   at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
   at System.Collections.Generic.Dictionary`2.Add(TKey key, TValue value)
   at Microsoft.SharePoint.Linq.Rules.ReflectionTreeRewriter`1.GetRewriter(Type type)
   at Microsoft.SharePoint.Linq.Rules.ReflectionTreeRewriter`1.GetChildren(TNodeBase node)

   at Microsoft.SharePoint.Linq.DataContext.GetList[T](String listName)
   at SPMetalThreadTest.WSP.SPMetalDefinition.SPMetalDefinitionContext.get_StyleLibrary()

Well, not exactly what we were expecting, but the same exception – we appear to be on the right track.  The trick to solving this is probably going to be doing a bunch of ‘fake’ queries once at startup (or before the first query) to get things “warmed up”.  Here’s a first stab at this ‘warmup’ logic (added to the partial DataContext class):

partial void OnCreated()

{

    Warmup();

}



private static bool _WarmedUp = false;

private static object _WarmupLock = new object();

private void Warmup()

{

    if (_WarmedUp)

        return;

    lock (_WarmupLock)

    {

        if (_WarmedUp)

            return;



        DoWarmup();

        _WarmedUp = true;

    }

}



private void DoWarmup()

{
Console.WriteLine("Running warmup logic"); var x = this.StyleLibrary; }

Let’s run it again and see what we get:

System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
   at System.Collections.Generic.Dictionary`2.Add(TKey key, TValue value)
   at Microsoft.SharePoint.Linq.Rules.ToEnumerableProcessor.GetEnumerableOperator(MethodInfo qm)

….
   at Microsoft.SharePoint.Linq.EntityList`1.GetEnumerator()
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)

Sure enough, a different error.  What if I use a different list?  Same error, so it appears using a single one of my lists is enough (though I probably would just add getters for all your lists to be safe – this code won’t run very often).  Ok, now to fix that error – maybe if I just add a .ToList() call to my Warmup logic?

Waiting for all threads to report ready
All threads reported ready - releasing
Running warmup logic
All threads exited

Yep, that worked.  What if I add a select clause to my test query?  No change.  A where?  No change.  A where and a select?  Bingo – another error:

System.ArgumentException: An item with the same key has already been added.
   at System.ThrowHelper.ThrowArgumentException(ExceptionResource resource)
   at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
   at System.Collections.Generic.Dictionary`2.Add(TKey key, TValue value)
   at Microsoft.SharePoint.Linq.Rules.ToEnumerableProcessor.GetEnumerableOperator(MethodInfo qm)
   at Microsoft.SharePoint.Linq.Rules.ToEnumerableProcessor.ConvertMethod(MethodCallExpression mce, Context ctx)
….

   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)

And this time, it even looks like the trace from our bug report.  We’re definitely on the right track here.  Adding a where and select to my ‘warmup’ query got rid of that error.  If I add a few more operators to my test query (Distinct, OrderBy, Max), I start getting the error again, so it seems my ‘warmup’ method needs to:

  1. Reference each list I’m going to use.
  2. Populate ToEnumerableProcessor.enumerableOpsByName by:
  3. Run one of every query style to be used (just calling everything from System.Linq.Enumerable in a query won’t be enough – some map to different operators or are combined)
    or
  4. Use Reflection to make a bunch of calls to Microsoft.SharePoint.Linq.Rules.ToEnumerableProcessor.GetEnumerableOperator to trick it into setting up everything we need.
  5. After taking a few runs at calling all the methods via #2.1, it’s becoming clear that LINQ-to-SP does some operator combinations that are fouling things up, so I’m trying to reflection approach, which seems to be working reliably:

    private void DoWarmup()
    
    {
    
        Console.WriteLine("Running warmup logic");
    
        var x = this.StyleLibrary.Count();
    
    
    
        var asm = typeof(DataContext).Assembly;
    
        var opUtilType = asm.GetType("Microsoft.SharePoint.Linq.OpUtil");
    
        var operatorFields = opUtilType.GetField("operatorFields", BindingFlags.NonPublic | BindingFlags.Static);
    
        var methods = ((Dictionary<MethodInfo, FieldInfo>)operatorFields.GetValue(null)).Keys.ToList();
    
        var toEnumerableProcessorType = asm.GetType("Microsoft.SharePoint.Linq.Rules.ToEnumerableProcessor");
    
        var getEnumerableOperator = toEnumerableProcessorType.GetMethod("GetEnumerableOperator", BindingFlags.NonPublic | BindingFlags.Static);
    
        foreach (var method in methods)
    
        {
    
            getEnumerableOperator.Invoke(null, new object[] { method });
    
        }
    
    
    
        Console.WriteLine("Warmup logic complete");
    
    }

    Great, problem solved, right?  Not quite.  What happens if I change my test query to be something that runs query operators against integers, something like “cx.StyleLibrary.Where(x => x.Id == 0).Select(x => x.Id).Distinct().OrderBy(x => x).Max()”?  Yep, it breaks again…

    System.ArgumentException: An item with the same key has already been added.
       at System.ThrowHelper.ThrowArgumentException(ExceptionResource resource)
       at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
       at System.Collections.Generic.Dictionary`2.Add(TKey key, TValue value)
       at Microsoft.SharePoint.Linq.Rules.ReflectionTreeRewriter`1.GetRewriter(Type type)
       at Microsoft.SharePoint.Linq.Rules.ReflectionTreeRewriter`1.GetChildren(TNodeBase node)
       at Microsoft.SharePoint.Linq.Rules.ChildRule`2.Apply(TNode src, TContext ctx)

       at Microsoft.SharePoint.Linq.Rules.ToEnumerableProcessor.Process(Expression e, List`1& assumptions)
       at Microsoft.SharePoint.Linq.SPLinqProvider.RewriteAndCompile[T](Expression expression, List`1& assumptions)
       at Microsoft.SharePoint.Linq.SPLinqProvider.System.Linq.IQueryProvider.Execute[T](Expression expression)
       at System.Linq.Queryable.Max[TSource](IQueryable`1 source)

    Not good. It appears I’m going to need to call Microsoft.SharePoint.Linq.Rules.ReflectionTreeRewriter<T>.GetRewriter(Type) for every type we’re dealing with.  This could get messy fast…  

    image

    image

    image

    Luckily, if ReflectionTreeRewriter<T>.GetRewriterForType(Type) returns null, an exception gets thrown, the default returns null, and the only subclass is the non-generic Microsoft.SharePoint.Linq.Rules.ExpressionTreeRewriter, so this may be doable.  ExpressionTreeRewriter exposes a singleton instance of itself via it’s “Singleton” field, so let’s try adding this to the ‘DoWarmup’ method:

    var expressionTreeRewriterType = asm.GetType("Microsoft.SharePoint.Linq.Rules.ExpressionTreeRewriter");
    
    var singletonField = expressionTreeRewriterType.GetField("Singleton", BindingFlags.Static | BindingFlags.Public);
    
    var expressionTreeRewriter = singletonField.GetValue(null);
    
    var getRewriterMethod = expressionTreeRewriterType.BaseType.GetMethod("GetRewriter", BindingFlags.NonPublic | BindingFlags.Instance);
    
    
    
    foreach (var type in types)
    
    {
    
        getRewriterMethod.Invoke(expressionTreeRewriter, new object[] { type });
    
        if (type.IsGenericType && type.GetGenericTypeDefinition() == typeof(Nullable<>))
    
        {
    
            var baseType = type.GetGenericArguments()[0];
    
            getRewriterMethod.Invoke(expressionTreeRewriter, new object[] { baseType });
    
        }
    
    }

    Still no dice.  Ok, let’s try a simple retry of the query then.  Add this extension method, and call it instead of ToList():

    public static List<T> ToListSafe<T>(this IEnumerable<T> enumeration)
    
    {
    
        try
    
        {
    
            return enumeration.ToList();
    
        }
    
        catch (ArgumentException)
    
        {
    
            // add the sleep to work for complicated queries too
    
            System.Threading.Thread.Sleep(250);
    
            return enumeration.ToList();
    
        }
    
    }

    And it works!  (Adding in a “System.Threading.Thread.Sleep(250)” in the catch block makes it even handle some really complicated queries).  In fact, we can get rid of the reflection calls in DoWarmup – just leave the list references there, make sure all of the calls that actually cause queries to execute (such as .ToList, .ToArray, .ToDictionary, .FirstOrDefault(), etc.) are called with a retry if ArgumentException is thrown (like the ToListSafe extension method above), and those problems are history.

    Ok, so we’ve come up with hacks to work around this – the obvious next question is…. WHY?  Well, I can’t answer that.  It appears that the SharePoint LINQ libraries try to heavily take advantage of the performance boost of caching expression trees and the other components that make up query execution.  That’s not problematic by itself, if done correctly, but unfortunately, it is done in a way that isn’t thread-safe, since Dictionary<K,V> is *not* thread-safe, and no measures are taken to protect against problems caused by that.  If you find yourself in a situation like this in .Net 3.5, the best approach is probably to use a System.Collections.Hashtable (which *is* thread-safe, even if it’s not generic), and make sure you don’t call the .Add method (use the indexer instead).  If you’re in .Net 4.0 (which SharePoint 2010 obviously isn’t), ConcurrentDictionary<K,V> is another option, since that *is* thread-safe.

     

    Well, thanks for coming along on yet another dive into the internals of SharePoint to find out why something isn’t behaving like I want it to.  The final sample code for this test scenario with the final fix is attached for your experimentation (SPMetalThreadTest.zip).  Now, if only I can get iPhones to behave on my site… ah… a story for another day.