Health Service Heartbeat Failure

I’ve seen plenty of questions come up in the forums and from customers regarding the Health Service Heartbeat Failure monitor, and its associated diagnostics and recoveries. I spent a little time digging further into these workflows and thought I’d share what I found here. Hope this helps those curious about what’s happening under the hood.

Communication Channel Basics

After an Operations Manager Agent is installed on a Windows computer, and after it is approved to establish a communication channel with an Operations Manager 2007 management group, the communication channel is maintained by the Health Service. If this communication channel is interrupted or dropped between the Agent and its primary Management Server (MS) for any reason, the Agent will make three attempts to re-establish communication with its primary MS, by default.

If the Agent is not able to re-establish the channel to its primary MS, it fails over to the next available MS. Failover configuration and the order of failover is another topic, and will not be covered here.

While the Agent is failed over to a secondary MS, it will attempt to re-establish communication with its primary MS every 60 seconds, by default. As soon as the Agent can establish communication with its primary MS again, it will disconnect from the secondary MS and fail back to its primary MS.

Health Service Heartbeat Failure Monitor

To briefly summarize the Heartbeat process, there are two configurable settings that control Heartbeat behavior: the Heartbeat interval and the number of missed Heartbeats. If the MS fails to receive a Heartbeat from an Agent for more consecutive intervals than the number specified, the Health Service Heartbeat Failure monitor will change to a critical state and generate an alert.
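For example, with the defaults of a 60-second Heartbeat interval and 3 allowed missed Heartbeats, the monitor changes to a critical state roughly three to four minutes after the Agent's last successful Heartbeat, once a fourth consecutive interval passes without one.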

Read more about Heartbeat behavior and configuration in the Operations Manager documentation.

Diagnostic and Recovery Tasks

There are a couple of diagnostic tasks that run when the Health Service Heartbeat Failure monitor changes to a critical state: Ping Computer on Heartbeat Failure and Check If Health Service Is Running.

Ping Computer on Heartbeat Failure

This diagnostic is defined in the Operations Manager 2007 Agent Management Library and is enabled by default. This workflow uses the Automatic Agent Management Account, which runs under the context of the Management Server Action Account by default, to execute a probe action named WmiProbe, which is defined in the Microsoft System Center Library.

This probe is initiated on the Health Service Watcher. Since the Health Service Watcher is a perspective class hosted by the Root Management Server, this is where the WMI query is executed when the Health Service Heartbeat Failure monitor changes to a critical state. Even though the Agent may be reporting to another MS, it is the RMS that sends the ICMP packet to the Agent.

Unlike the traditional Ping.exe program we are all accustomed to, which sends four ICMP packets to the target host by default, the WMI query is executed only once and sends a single ICMP packet, so there is none of the lost-packet percentage calculation one would expect to see with Ping.exe.

Following is the WMI query executed on the RMS.

SELECT * FROM Win32_PingStatus WHERE Address = '$Config/NetworkTargetToPing$'
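If you want to reproduce this probe by hand, a rough equivalent can be run from a command prompt with wmic (the Agent name below is hypothetical):

wmic path Win32_PingStatus where "Address='agent01.contoso.com'" get Address,StatusCode

As in the workflow, a StatusCode of 0 indicates the single ICMP echo succeeded; any other (or empty) value indicates a failure.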

To verify the number of ICMP packets sent, I ran a traditional Ping.exe test and the WMI query used in this workflow and traced these using Netmon. The first two entries in the image below were captured from the WMI query, and the last eight entries captured were from a Ping.exe test using default parameters (four packets).

[Image: WMI query vs. Ping.exe Netmon capture]

The WMI query results are passed to condition detection modules, which filter on StatusCode and execute the appropriate write action. If StatusCode <> 0, the write action ComputerDown will set state to reflect that the computer is down. If StatusCode = 0, the write action ComputerUp will set state to reflect that the computer is up.

The condition detection modules that filter StatusCode are actually the recovery tasks shown in the Health Service Heartbeat Failure monitor. These are the reserved recoveries, Reserved (Computer Not Reachable – Critical) and Reserved (Computer Not Reachable – Success), respectively.

Under the covers, these reserved recoveries are actually setting state of the Computer Not Reachable monitor, which is defined in the System Center Core Monitoring MP. Ultimately, if StatusCode <> 0, the Computer Not Reachable monitor will change to a critical state and generate the Failed to Connect to Computer alert.

Since this is a diagnostic task that runs during a degraded state change event, the Agent will only be pinged once, when the Health Service Heartbeat Failure monitor changes to a critical state. If any network-related problems occur after this monitor has changed to critical and the diagnostic task has run, there will be no further monitoring of the ping status of this Agent and no "Failed to Connect to Computer" alert will be generated.

We can better understand the root cause based on whether the Health Service Heartbeat Failure alert was generated along with the Failed to Connect to Computer alert. If the Health Service Heartbeat Failure alert was generated without the Failed to Connect to Computer alert, logic tells us the issue is not a loss of network connectivity, and the server has not shut down or become unresponsive; the ping succeeded, so the problem is more likely with the Health Service itself. Both alerts together generally indicate the server is completely unreachable due to a network outage, or the server is down or unresponsive.

Check If Health Service Is Running

This diagnostic is defined in the Operations Manager 2007 Agent Management Library and is enabled by default. This workflow uses the Automatic Agent Management Account, which runs under the context of the Management Server Action Account by default, to initiate a probe action named QueryRemoteHS, which is also defined in the Operations Manager 2007 Agent Management Library.

Specifically, this probe is initiated on the Health Service Watcher and queries Health Service state and configuration on the Agent when the Health Service Heartbeat Failure monitor changes to a critical state. This probe module type is further defined in the Windows Core Library. It takes a computer name and a service name as configuration, passes the query results through an expression filter, and returns the startup type and current state of the Health Service.

If the service doesn't exist or the computer cannot be contacted, the state will reflect this. Depending on the output of the diagnostic task, optional recovery workflows may be initiated (i.e., reinstall the Agent, enable and start the Health Service, and continue the Health Service if paused), but these recoveries are not enabled by default.
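To check the same information by hand, sc.exe gives a rough equivalent of what this probe gathers (the computer name below is hypothetical):

sc \\agent01 query HealthService
sc \\agent01 qc HealthService

The first command returns the current state of the Health Service, and the second returns its configuration, including the startup type.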

SCOM SPNs

There's a lot of confusion about SPNs (service principal names) when it comes to OpsMgr. How are SPNs registered? When are SPNs registered? Why aren't SPNs registering?

The purpose of this post is to give a snapshot of all the SPNs that should be in your environment so you know you've got them all right. Here's a bird's-eye view.

Attention: Please read the last couple of paragraphs in this post, and my other post about SDK SPN Not Registered, to gain a full understanding of the SDK SPN.

Just to clarify the list of SPNs below:

* The SDK SPN is registered on the SDK service account in Active Directory. It references the RMS.
* The Health Service SPN is registered on each management server's computer object in Active Directory, and it references that server's own computer object.

Note: SDK SPNs do not have any operational impact on SCOM. The only SPN that operationally affects SCOM is MSOMHSvc. In a clustered RMS configuration, the MSOMHSvc SPN needs to be set on the RMS virtual cluster object in Active Directory ONLY (and the MS computer objects, of course).

Root Management Server (non-clustered)

servicePrincipalName: MSOMSdkSvc/<RMS fqdn>
servicePrincipalName: MSOMSdkSvc/<RMS netbios name>
servicePrincipalName: MSOMHSvc/<RMS fqdn>
servicePrincipalName: MSOMHSvc/<RMS netbios name>

Root Management Server (clustered)

servicePrincipalName: MSOMSdkSvc/<RMS virtual fqdn>
servicePrincipalName: MSOMSdkSvc/<RMS virtual netbios name>
servicePrincipalName: MSOMHSvc/<RMS virtual fqdn>
servicePrincipalName: MSOMHSvc/<RMS virtual netbios name>

* See the additional information at the bottom of this article about the clustered RMS SDK SPN.

Management Server(s)

servicePrincipalName: MSOMHSvc/<MS fqdn>
servicePrincipalName: MSOMHSvc/<MS netbios name>

Management Server with ACS

servicePrincipalName: AdtServer/<MS fqdn>
servicePrincipalName: AdtServer/<MS netbios name>
servicePrincipalName: MSOMHSvc/<MS fqdn>
servicePrincipalName: MSOMHSvc/<MS netbios name>

Database Servers (including ACS DB)

servicePrincipalName: MSSQLSvc/<database netbios name>:1433
servicePrincipalName: MSSQLSvc/<database fqdn>:1433

Registering SPNs with SETSPN

Non-Clustered RMS (SDK only)

SETSPN -A MSOMSdkSvc/<RMS netbios name> <your domain>\<sdk domain account>
SETSPN -A MSOMSdkSvc/<RMS fqdn> <your domain>\<sdk domain account>

Clustered RMS (SDK and Health Service)

SETSPN -A MSOMSdkSvc/<RMS virtual netbios name> <your domain>\<sdk domain account>
SETSPN -A MSOMSdkSvc/<RMS virtual fqdn> <your domain>\<sdk domain account>

SETSPN -A MSOMHSvc/<RMS virtual netbios name> <RMS virtual netbios name>
SETSPN -A MSOMHSvc/<RMS virtual fqdn> <RMS virtual netbios name>
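Along the same lines, if SQL Server runs under a domain account and has not registered its SPNs automatically, they can be registered manually (this is standard SETSPN usage rather than anything OpsMgr-specific):

SETSPN -A MSSQLSvc/<database netbios name>:1433 <your domain>\<sql service account>
SETSPN -A MSSQLSvc/<database fqdn>:1433 <your domain>\<sql service account>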

Verifying SPNs with SETSPN

SDK: SETSPN -L <your domain>\<sdk domain account>

HealthService: SETSPN -L <servername> (run this for each MS)

SQL Service: SETSPN -L <your domain>\<sql service account>

Verifying SPNs with LDIFDE

SDK and HealthServices: Ldifde -f c:\ldifde.txt -t 3268 -d DC=domain,DC=COM -r "(serviceprincipalname=MSOM*)" -l serviceprincipalname -p subtree

SQL Service: Ldifde -f c:\ldifde.txt -t 3268 -d DC=domain,DC=COM -r "(serviceprincipalname=MSSQLSvc*)" -l serviceprincipalname -p subtree

Note: You'll most likely find multiple SPNs for the SQL service. Just be sure there's one for each of your OpsMgr DB role servers. If SQL runs under Local System, it will automatically register its SPNs each time the service starts.

A little more interesting information about the clustered RMS SDK SPN

The SDK SPN is relative to the active node. Because of this, it is best to register the SDK SPN to the cluster network name, since this will always be associated with the active node. If we register only the physical node SPNs and not the cluster network name, we would need to connect to the active physical node in the console. Obviously, this is not very convenient and can involve guesswork.

With that said, we really don't even need both the NetBIOS and FQDN SPNs registered; we could live with one or the other. When we launch the console and enter the RMS name, that is the name used to establish the session with the SDK service. If we always supply only a NetBIOS name, we don't even need the FQDN SPN registered (and vice versa). But, again, it makes sense to create both NetBIOS and FQDN SPNs, especially in multi-domain environments.

Here's another neat fact that you may not be aware of. We could create an alias in DNS, pointing it to the RMS cluster network name, and then register that alias as the SDK SPN and use it in our console connection settings. Now it doesn't matter whether we use the cluster network name, either of the physical nodes, or the NetBIOS or FQDN form.
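As a rough sketch (all of the names below are hypothetical), the alias and its SPNs might be created like this, using a CNAME that points to the cluster network name:

dnscmd dns01 /RecordAdd contoso.com scomsdk CNAME rmscluster.contoso.com
SETSPN -A MSOMSdkSvc/scomsdk CONTOSO\sdkaccount
SETSPN -A MSOMSdkSvc/scomsdk.contoso.com CONTOSO\sdkaccount

The console connection settings would then always use scomsdk.contoso.com, regardless of which node currently owns the RMS cluster group.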

SCOM: seal a management pack

Sealing a Management Pack is easy, although it can be frustrating the first time through. It's a process that requires a few different pieces to interact, so preparation is key. Going through some simple steps now will save time in the future.

  • Create a directory somewhere on a workstation where you'll be sealing MPs. For this example, I created the directory c:\MPS
  • I also created four directories within c:\MPS
    • \Input – this directory will contain the MP to be sealed (the xml file)
    • \Output – this directory will contain the sealed MP (the final mp file)
    • \MP – this directory will contain all referenced MPs
    • \Key – this directory will contain the key pair file
  • Copy MPSeal.exe from the installation media "SupportTools" directory to the c:\MPS directory.
  • Copy sn.exe (from the .NET Framework SDK or Windows SDK) to the c:\MPS directory
  • Copy your unsealed MP (xml file) into the \Input directory
  • Copy all the *.mp files from the RMS installation directory into the \MP directory
    • Usually "%ProgramFiles%\System Center Operations Manager 2007\"
  • Also, copy all *.mp files that you'll be referencing to the \MP directory
    • TIP: I'd just keep this directory updated with all available current MPs (e.g., Active Directory, Exchange, etc.)

Finally, the c:\MPS directory will look like this.

[Image: contents of the c:\MPS working directory, with Command.txt and MPResources.resources highlighted]

The two files highlighted:
Command.txt is just a file I created that contains the commands needed to seal the management pack. The MPResources.resources file is created automatically while sealing management packs; it is not anything you'll need to copy into the directory.

Now, we’re ready to seal our Management Pack.

Open a command prompt and navigate to your work directory (c:\MPS). Run these commands in sequence. (beware of word wrap with these commands)

  • sn -k c:\mps\key\PairKey.snk (generates a new key pair)
  • sn -p c:\mps\key\PairKey.snk c:\mps\key\PubKey (extracts the public key from the key pair)
  • sn -tp c:\mps\key\PubKey (displays the public key and its token, so you can verify it)
  • mpseal c:\mps\input\unsealed_mp.xml /I "c:\mps\mp" /Keyfile "c:\mps\key\PairKey.snk" /Company "Your Company" /Outdir "c:\mps\output" (seals the MP)

You should now have your sealed MP in the Output directory. And, you’ll have a working directory for later use. Just remember to keep the MP versions in the c:\MPS\MP directory current with your Management Groups. Otherwise, you’ll get version errors while attempting to run the MPSeal tool.

Hint: Once you've created the key the first time around, it's not necessary to create a new key each time you seal an MP. The current key may be reused. So, the only step you'll actually need to run after the first time is the last one. How's that for easy!

A note to developers: I've had some questions about where the MPResources.resources file mentioned above is created. Specifically, if two build flavor threads (x64 and x86, for example) compile at the same time and try to create this file under the sources directory, one build thread will break.

To solve that problem, execute MPSeal from a different working directory for each flavor; the MPResources.resources file is created in the directory MPSeal is run from. For example, running it from the user's %temp% directory creates the file there, and running it from separate x86 and x64 directories creates the file in each of those directories.
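Here's a minimal sketch of that approach, assuming the c:\MPS layout above and hypothetical per-flavor build directories:

cd /d c:\build\x86
c:\mps\mpseal c:\mps\input\unsealed_mp.xml /I "c:\mps\mp" /Keyfile "c:\mps\key\PairKey.snk" /Company "Your Company" /Outdir "c:\build\x86"

cd /d c:\build\x64
c:\mps\mpseal c:\mps\input\unsealed_mp.xml /I "c:\mps\mp" /Keyfile "c:\mps\key\PairKey.snk" /Company "Your Company" /Outdir "c:\build\x64"

Each invocation now writes its MPResources.resources file into its own working directory, so parallel builds no longer collide.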

Move the Health Service State directory on the RMS

1. Ensure the destination volume is formatted with a 64 KB allocation unit size for best performance. (A scripted sketch of these steps appears after the list.)
To partition a drive: http://technet.microsoft.com/en-us/library/cc722475.aspx

2. If the Health Service State directory will not be located at the root of the new volume, create the parent folder that will host it.
Note: The Health Service State folder itself is created automatically when the services are restarted, so you only need to create the parent path, not the folder itself.

3. Stop all SCOM services (Health Service, Config Service and SDK Service).

4. Open REGEDIT and navigate to HKLM\System\CurrentControlSet\Services\HealthService\Parameters

5. Modify the State Directory string value with the path to the new location of the Health Service State directory.
Example: H:\Health Service State

6. Close the Registry Editor.

7. Start all SCOM services (Health Service, Config Service and SDK Service).

8. Open the new location that hosts the Health Service State folder and verify the contents have been created successfully.

9. Open the Operations Manager event log and verify there are no errors from the following sources: Health Service, Health Service Modules, or Health Service ESE Store.
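For reference, here's a rough scripted equivalent of steps 1 and 3 through 7, run from an elevated command prompt. The drive letter, volume label, and destination path are hypothetical, the service names (HealthService, OMCFG, OMSDK) are the OpsMgr 2007 short service names, and the registry value type should match whatever the existing State Directory value shows in REGEDIT.

:: Step 1: format the new volume with a 64 KB allocation unit size (destroys any data on H:)
format H: /FS:NTFS /A:64K /V:OpsMgrState

:: Step 3: stop all SCOM services
net stop OMSDK
net stop OMCFG
net stop HealthService

:: Steps 4-6: point the Health Service at the new State Directory location
reg add "HKLM\System\CurrentControlSet\Services\HealthService\Parameters" /v "State Directory" /t REG_EXPAND_SZ /d "H:\Health Service State" /f

:: Step 7: start all SCOM services
net start HealthService
net start OMCFG
net start OMSDK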