Problem:
Recently I have been troubleshooting some production workflow (WF) issues. Everything works fine in dev and staging but not in production. We use code to trigger the workflow asynchronously (i.e. the SharePoint workflow timer job needs to pick it up and invoke the workflow; the code is shown below). The workflow failed the first time and then started working 10 minutes later (see the screenshot below). It also works fine when the workflow is triggered manually. The first thing we did was enable workflow-related logging, as described in my previous post, but we found nothing.
Here is the code we use to invoke the workflow:
SPWorkflow wf = CurrentSite.WorkflowManager.StartWorkflow(item, workflowAssociation, "<Data></Data>", SPWorkflowRunOptions.Asynchronous);
Solution:
After some more research, we found that the topology of the production farm is different from staging. In staging we have 2 app servers and 2 web front-end servers. The Microsoft SharePoint Foundation Workflow Timer Service (SFTS) is started by default on all the servers in the farm (we have 6 servers: 2 WFE, 2 App/Crawl (Index), 2 DB), and we observed that this service, running on the Crawl and App servers, was most likely causing the workflow failures. After stopping the service on the App servers, the workflow works like a charm.
Here is a summary of the issue and its solution, taken from an MSDN Forum thread:
Problem:
· A state machine workflow is deployed on a multi-server SharePoint Server 2010 farm.
· The workflow uses DelayActivity in multiple states.
· The workflow logs an error in the workflow history list as “<workflow name> failed to run” (randomly, with no specific pattern).
· No errors are logged in the ULS logs or Event Viewer.
Analysis:
· As I understand it (I would love to be corrected if this is not the case), when a workflow processes a DelayActivity, a timer job is created and scheduled on (or perhaps picked up by) the server(s) running the Microsoft SharePoint Foundation Workflow Timer Service.
· When the timer job comes due, the server (WFE/Crawl/App) tries to process the instruction, which in this case is rescheduling the workflow execution, and this requires the workflow assembly to be available on that server. (Do read this very interesting post if you want to understand how workflows are executed: http://www.the14folder.com/2010/07/25/migrating-workflows-question)
· Now you may be wondering:
Should the workflow assembly be present on these server roles?
If yes, then how did the workflow assembly go missing from the Crawl and App servers?
· Well, the culprit was the value ‘WebFrontEnd’ of the ‘DeploymentServerType’ attribute in the solution manifest file. This caused the solution deployment process to copy the workflow assembly only to the WFEs and not to the Crawl and App server roles (http://msdn.microsoft.com/en-us/library/ms412929.aspx).
· Since the Microsoft SharePoint Foundation Workflow Timer Service was running on the Crawl and App servers as well, timer job execution there was failing with “Feature not found” and/or “Assembly cannot be loaded” messages in the ULS logs (you will find these only if you enable verbose-level logging: http://technet.microsoft.com/en-us/library/ee748656.aspx).
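For illustration, the manifest setting described in the analysis above would look roughly like the fragment below. This is a minimal sketch, not the actual manifest from the affected solution: the SolutionId GUID and the assembly name MyCompany.Workflows.dll are hypothetical placeholders. The only part taken from the analysis is the DeploymentServerType="WebFrontEnd" attribute on the Solution element.

```xml
<Solution xmlns="http://schemas.microsoft.com/sharepoint/"
          SolutionId="00000000-0000-0000-0000-000000000000"
          DeploymentServerType="WebFrontEnd">
  <!-- SolutionId above is a placeholder GUID, not a real solution. -->
  <!-- With DeploymentServerType="WebFrontEnd", the analysis above found that
       the assembly below was copied to the WFE servers only, so timer jobs
       running on the Crawl and App servers could not load it. -->
  <Assemblies>
    <!-- Hypothetical workflow assembly name, deployed to the GAC. -->
    <Assembly Location="MyCompany.Workflows.dll"
              DeploymentTarget="GlobalAssemblyCache" />
  </Assemblies>
</Solution>
```

The other value the schema documents for this attribute is ApplicationServer. The fix described here was to stop the timer service on the non-WFE roles, not to change the manifest, so treat any change to this attribute as something to verify on your own farm.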
Solution:
· We stopped the Microsoft SharePoint Foundation Workflow Timer Service on the Crawl and App server roles since, as Lily pointed out, it is not recommended to have this service running on App and Crawl server roles.
Take Aways:
· Be sure of the attributes you choose in your solution manifest file.
· Enable verbose-level logging if you do not see any errors in the ULS logs or Event Viewer.
· Make sure that every service running on each server role is a must-have, and that you know why you chose it that way.
· Stop the Microsoft SharePoint Foundation Workflow Timer Service on the server roles where you do not intend to deploy solutions that require this service.