Thursday, March 31, 2016

Workflow Monitoring in SharePoint Online

Steve Tonkin, Solution Architect

A major concern in the transition from workflows with SharePoint On-Premises to SharePoint Online is that the monitoring features that had previously been available are now limited. This leave us without a way to know when and if a workflow has been suspended. Let’s walk through a quick and dirty way to add workflow monitoring to our cloud-based workflow.

One small note before we dive in, this will work for both Nintex Online and SharePoint Designer Workflows. I’ve chosen to write and show it in Nintex because it’s both easier to create and easier to read.

Why do we have to do this?

The internal status field in workflows is only available by querying the database directly. Since this can’t be done in due to Office 365 restrictions on database access, we cannot set workflows or alerts based on this field. The internal status field, however, is the only way we really know for sure if a workflow has been suspended.

The nice lookup column that Nintex uses On-Premises (where the workflow name is the column title, then creates that lookup to the internal status field) is also not available in SharePoint Online for this reason.

So, what can we use?

Remember the old phrase, “When SharePoint closes a door, it opens an extremely small window, with steel bars and barbed wire and rat traps waiting on the floor on the other side?”

Probably not, because I just made it up, but it’s true.

First, we need to locate the hidden Workflow History List. You can do this by navigating to your site where the workflows are running, then appending this URL:

https://yourcompany.sharepoint.com/yoursite/Lists/Workflow%20History/

Then we need to create a view that allows us to better track the status of incoming history items. I like to see the Parent Instance, Association ID, Date Occurred, and Description:

clip_image002

The important field here is Description. This logs messages when workflows enter a suspended state with an error message (most of the time).

Remember that awesome door/window phrase I made up? It’s true here too. Because these history items are ‘special’ in SharePoint, we cannot start a workflow on them when they are created. This means we can’t just run workflows every time something is logged to this list.

So, how do we build it?

Start by taking all workflows currently running and whenever a description is logged, you’ll need to append a token to the end of it. So, for instance if I log that a workflow started, we need to add this unique token to the end of the text, like so:

clip_image003

Now for every time something is entered into Workflow History that we have control over it will have this, RP?WoRkf1oW, token.

So, now we need to create a monitoring list, which can be on the site running workflows or it its own separate site. Create a Custom List and leave everything default.

Now let’s start the workflow build… here we log that the workflow started first, then we’re sending an email to tell us that it started (more on that later).

clip_image004

Next, we need to get a few different times as variables, like the current time and the time when our log started. For this example, we are logging hourly, so we need to get the current time minus one hour.

clip_image006

Depending on where you are in the world, you’ll need to make more adjustments based on UTC, which is the default record for each timestamp. So, for us, we’re logging the start time as a variable (to use in email), then adjusting for UTC and setting a second variable (to use in queries):

clip_image007

Now that we have our log’s timeframe correct, we can run our queries against the Workflow History list. Please note that we need to use “Output as an ISO 8601 date string” to define the UTC variable to work in queries.

clip_image008

Because we cannot use “Does not contain” as a variable in these list queries, we need to run it twice. Here is what the setup for the query looks like:

We define the Workflow History list:

clip_image010

We provide credentials:

clip_image012

We then build our query:

clip_image014

You can see in this first query, we’re getting only items that contain the token. Then we are also running a filter on our adjusted UTC time variable (remember, this is the ISO time output):

clip_image015

Now we log the total number of items returned with a token into ‘Result Count’:

clip_image017

Then we repeat that query but for all items in that timeframe, configure all parts of the second query the same, except for:

1. Add query fields for ID and Date Occurred:

clip_image019

2. Remove the filter field for the token, and just filter by UTC time:

clip_image020

3. Save the ‘Result Count’ into a different variable and save the ‘Fields XML’ into a Collection variable:

clip_image022

Ok, now that we know the total number of items in Workflow History in the past hour AND the number of items with our unique token, we can determine if there are any potential items of concern. We run a conditional that asks if the counts are equal and if yes we just add a log and move on.

clip_image024

But if they are not equal, we start to worry. First, let’s get the difference in the two counts. Then we are going to start crafting an email notification to show us the problem items.

If you build it…

You can really craft an email any way you want, but here’s what works for us. First, this is how we get our variables:

clip_image025

I’m going to use a single variable to build an HTML table inside an email. First thing I need to do is to set up my table headings in the variable:

clip_image027

Now that I have that variable set, we will set a run each action that will go through our collection that we set.

clip_image029

Then for each of the fields that we queried from the Workflow History list (Description, Date Occurred, ID), we’re going through that output variable for each item in the collection. We’ll run that in an XPath query and get out plain text:

clip_image031

Repeat that XPath query for both ID and Date Occurred and drop them into different text variables.

So now we have text variables for all items in the past hour. FINALLY, we can run ‘does not contain’ statements to weed out those that we have cleared with our token. Create a ‘Run If’ statement that pulls out items with the token:

clip_image033

Now we’ll start to build the email table to send for review. It now only contains our problem items thanks to the ‘Run If’ removing those with our unique token.

clip_image034

Up first, we need to get back our timestamps for each item in Workflow History. Since these are logged in ISO Date/Time format, we need to get a substring of only the date and save it as a text string.

clip_image036

Then we can take that plain text variable and transform it into a Date/Time:

clip_image038

Then we adjust again for UTC, since our Occurred date is UTC we reverse and get it back to our local time to display:

clip_image040

So now we have all our variables set the way we want for our notification. We’ll take the variable we set for the table header and append table rows with our variables that we pulled in:

clip_image042

To break this down we:

1. Start with the table variable first, since we loop through this for each item.

2. Add a column with a link:

a. URL: Hardcode the location to your Workflow History list, then append [../DispForm.aspx?ID=] and then add your ID text variable.

b. Then set the text of the link to the Description text variable and close out the link.

3. Then add a column with the Date/Time variable we adjusted in the previous step and close the row.

This will loop through for all items without tokens.

 

Now we get out of the ‘Run If’ and out of the ‘For Each’:

clip_image043

We’ll close out the HTML table variable:

clip_image045

Then we will build the email notification with all the items missing tokens:

clip_image047

Again, you can craft this email to meet your needs, but this is a nice way to get the important information. Here is what it looks like once we get items coming through:

clip_image048

Now we have pushed that email through. Since we are logging for the past hour, now we set a stop to pause until the next log needs to be started.

clip_image049

Then we set our end time variable to collect after that pause. That will help us name the log item we are creating in the next step that will continue this process again:

clip_image051

The workflow ends, but a new log item in the monitoring list has been created to keep this going. Simple right?

It is important to note for this to work, the settings need to allow start when items are created:

clip_image052

Publish and then create a first item in the list and this should start up the workflow and then it will keep itself going.

clip_image053

But wait, what if the monitoring workflow gets suspended?

That’s why we are creating new items on each workflow!

Remember when I said we would have more on that first email notification we sent when the workflow started? No? It’s ok since that was like 2000 words ago. To refresh memories, when we start the workflow, we are sending an email:

clip_image004[1]

Using that email, what we did was create a rule that moves just that email to a folder. That way I don’t get bothered with an ‘Everything is OK’ email once an hour.

Rule: When [Subject] equals [Monitoring Email Subject] move to folder.

We do get to see the unread email count in that folder. So after setting up that rule we then set a custom retention policy on the folder that deletes anything older than 24 hours.

clip_image054

This way our unread count will always be at 24 UNLESS the monitoring workflow has stalled. It’s passive, but it is a way to monitor a workflow that is monitoring other workflows.

clip_image055

Anything else?

So, you made it all the way to the end and I think we can agree this is not a perfect solution. However, since there is no built in monitoring for SharePoint Online yet, this at least provides a way to be notified if your important workflows fail. Hopefully there is an out of the box way to get this functionality coming soon but at least now we have a way to get alerted in a semi-timely, semi-passive way.