Category Archives: Administrative Tasks

Agent Management–List Primary and Failover Configuration

Something I don’t like about using the SDK (PowerShell) to manage agents is the get* cmdlets used to return information – large-scale queries take too long! The SDK is typically pretty slow in this regard, and that’s a shame, because I find myself writing T-SQL to accomplish tasks that the SDK should be able to handle promptly.

Recently I wrote some T-SQL that returns all agents with their associated primary and failover management servers. This is very informative when the question "where does this agent fail over to?" comes up, and it’s a speedy way to implement some sort of automation process to expedite agent assignment.

 

Here you go!

 

-- List each agent with its primary and failover management servers
SELECT rgv.SourceObjectPath AS [Agent], rgv.TargetObjectPath AS [ManagementServer], 
       CASE
              WHEN rtv.DisplayName = 'Health Service Communication' THEN 'Primary'
              ELSE 'Failover'
       END AS [Type]
FROM ManagedTypeView mt INNER JOIN
       ManagedEntityGenericView AS meg ON meg.MonitoringClassId = mt.Id INNER JOIN
       RelationshipGenericView rgv ON rgv.SourceObjectId = meg.Id INNER JOIN
       RelationshipTypeView rtv ON rtv.Id = rgv.RelationshipId
WHERE mt.Name = 'Microsoft.SystemCenter.Agent' AND
       rtv.Name like 'Microsoft.SystemCenter.HealthService%Communication' AND
       rgv.IsDeleted = 0
ORDER BY rgv.SourceObjectPath ASC, rtv.DisplayName ASC

 

Something like this would take several minutes in small to medium-sized environments, and maybe upwards of 15-30 minutes in larger environments. This little bit of T-SQL returns in 1-2 seconds. Eat that!

 

🙂

Troubleshooting network device discovery–snmputil

UPDATE (09/04/2015): SNMPUtil is just one method for checking connectivity, but these days I prefer to use SNMPWalk or some other utility. I’m keeping this post here for archival purposes.

One of the first things you might want to check while troubleshooting network device discovery in OpsMgr 2012 is whether the network discovery server can connect to the SNMP agent on the device. There are a few reasons why a network discovery server cannot connect to the device via SNMP, and one of the easiest ways to test this is to use the SNMPUtil tool. This tool was included with the Windows 2000 Resource Kit (which is becoming increasingly difficult to find).

Here is a simple command to use to test whether a device is reachable from the network discovery server:

snmputil getnext <IP Address or FQDN> <community string> .1.3

Just replace IP Address or FQDN and Community String with your values.

If the device is reachable via SNMP, you will receive a message similar to the following:

[image: snmputil output for a reachable device]

If the device is not reachable, you will receive a message similar to the following:

[image: snmputil output for an unreachable device]

Health Service Heartbeat Failure

I’ve seen plenty of questions come up in the forums and from customers regarding the Health Service Heartbeat Failure monitor, and its associated diagnostics and recoveries. I spent a little time digging further into these workflows and thought I’d share what I found here. Hope this helps those curious about what’s happening under the hood.

Communication Channel Basics

After an Operations Manager Agent is installed on a Windows computer, and after it is approved to establish a communication channel with an Operations Manager 2007 management group, the communication channel is maintained by the Health Service. If this communication channel is interrupted or dropped between the Agent and its primary Management Server (MS) for any reason, the Agent will make three attempts to re-establish communication with its primary MS, by default.

If the Agent is not able to re-establish the channel to its primary MS, it fails over to the next available MS. Failover configuration and the order of failover is another topic, and will not be covered here.

While the Agent is failed over to a secondary MS, it will attempt to re-establish communication with its primary MS every 60 seconds, by default. As soon as the Agent can establish communication with its primary MS again, it will disconnect from the secondary MS and fail back to its primary MS.
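The retry-then-failover behavior described above can be sketched in a few lines. This is purely illustrative (the function and server names are mine, not OpsMgr's), assuming the default of three reconnect attempts:

```python
# Illustrative model of the agent's channel-recovery behavior described
# above; not actual OpsMgr code. Assumes the default of 3 retry attempts.
def choose_management_server(primary_reachable, failover_servers, attempts=3):
    """Return the server the agent settles on after a channel drop."""
    for _ in range(attempts):          # the agent retries its primary first
        if primary_reachable():
            return "primary"
    for ms in failover_servers:        # then it takes the first available failover MS
        return ms
    return None                        # nowhere left to go

print(choose_management_server(lambda: False, ["MS2", "MS3"]))  # MS2
```

Failback works the same way in reverse: once the primary answers again, the agent drops the secondary channel and returns home.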

Health Service Heartbeat Failure Monitor

To briefly summarize the Heartbeat process: there are two configurable mechanisms that control Heartbeat behavior, the Heartbeat interval and the number of missed Heartbeats. If the MS fails to receive a Heartbeat from an Agent computer for more than the specified number of intervals, the Health Service Heartbeat Failure monitor changes to a critical state and generates an alert.
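As a quick worked example of the math (hedged: 60 seconds and 3 missed Heartbeats are the commonly cited defaults, and both are configurable):

```python
# Worst-case time from the last received heartbeat until the monitor
# turns critical. The defaults shown are the commonly cited OpsMgr 2007
# values; verify them in your own management group.
def seconds_until_heartbeat_alert(interval=60, missed_threshold=3):
    return interval * missed_threshold

print(seconds_until_heartbeat_alert())  # 180 -> about 3 minutes at defaults
```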

Read more about Heartbeat and configuration here.

Diagnostic and Recovery Tasks

There are a couple of diagnostic tasks that run when the Health Service Heartbeat Failure monitor changes to a critical state. Ping Computer on Heartbeat Failure and Check If Health Service Is Running.

Ping Computer on Heartbeat Failure

This diagnostic is defined in the Operations Manager 2007 Agent Management Library and is enabled by default. This workflow uses the Automatic Agent Management Account, which will run under the context of the Management Server Action Account by default, to execute a probe action which is defined in the Microsoft System Center Library named WmiProbe.

This probe is initiated on the Health Service Watcher. Since the Health Service Watcher is a perspective class hosted by the Root Management Server, this is where the WMI query is executed when the Health Service Heartbeat Failure monitor changes to a critical state. Even though the agent may be reporting to another MS, it is the RMS that sends the ICMP packet to the agent.

Unlike the traditional Ping.exe program we are all accustomed to, which sends four ICMP packets to the target host by default, the WMI query is executed only once and sends a single ICMP packet, so there is no calculation of percentage of lost packets one would expect to see with Ping.exe.

Following is the WMI query executed on the RMS.

SELECT * FROM Win32_PingStatus WHERE Address = '$Config/NetworkTargetToPing$'

To verify the number of ICMP packets sent, I ran a traditional Ping.exe test and the WMI query used in this workflow and traced these using Netmon. The first two entries in the image below were captured from the WMI query, and the last eight entries captured were from a Ping.exe test using default parameters (four packets).

WMI query vs. Ping.exe
[image: Netmon capture showing two ICMP entries from the WMI query and eight from Ping.exe]

The WMI query results are passed to a condition detection module, which filters on StatusCode and executes the appropriate write action. If StatusCode <> 0, the ComputerDown write action sets state to reflect that the computer is down. If StatusCode = 0, the ComputerUp write action sets state to reflect that the computer is up.

The condition detection modules that filter StatusCode are actually the recovery tasks shown in the Health Service Heartbeat Failure monitor. These are the reserved recoveries, Reserved (Computer Not Reachable – Critical) and Reserved (Computer Not Reachable – Success), respectively.

Under the covers, these reserved recoveries are actually setting state of the Computer Not Reachable monitor, which is defined in the System Center Core Monitoring MP. Ultimately, if StatusCode <> 0, the Computer Not Reachable monitor will change to a critical state and generate the Failed to Connect to Computer alert.
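The StatusCode branching can be summarized in a small sketch (the function name and return values here are illustrative, not the actual MP module names):

```python
# StatusCode 0 from Win32_PingStatus means a reply was received; any
# other value (e.g. 11010, request timed out) means no reply.
def ping_diagnostic_result(status_code):
    """Map a Win32_PingStatus StatusCode to the write action fired and
    the resulting Computer Not Reachable monitor state."""
    if status_code == 0:
        return ("ComputerUp", "Healthy")     # Reserved (... Success) path
    return ("ComputerDown", "Critical")      # Reserved (... Critical) path

print(ping_diagnostic_result(0))      # ('ComputerUp', 'Healthy')
print(ping_diagnostic_result(11010))  # ('ComputerDown', 'Critical')
```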

Since this is a diagnostic task which runs during a degraded state change event, the Agent will only be pinged once when the Health Service Heartbeat Failure monitor changes to a critical state. If there are any network-related problems after this monitor has changed to critical and the diagnostic task has run, there will be no further monitoring of the ping status of this Agent, and no “Failed to Connect to Computer” alert will be generated.

We can better understand the root cause based on whether the Health Service Heartbeat Failure alert was generated along with the Failed to Connect to Computer alert. If the Health Service Heartbeat Failure alert is generated without the Failed to Connect to Computer alert, logic tells us the issue is not a loss of network connectivity, nor has the server shut down or become unresponsive. Both alerts together generally indicate the server is completely unreachable due to a network outage, or that the server is down or unresponsive.
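That triage reasoning can be expressed as a small decision helper (purely illustrative, not part of OpsMgr):

```python
# Rough triage based on which of the two alerts were raised together,
# per the reasoning above. Illustrative only.
def likely_cause(heartbeat_failure_alert, failed_to_connect_alert):
    if heartbeat_failure_alert and failed_to_connect_alert:
        return "server unreachable: network outage, or down/unresponsive"
    if heartbeat_failure_alert:
        return "Health Service problem on a reachable server"
    return "no heartbeat issue"

print(likely_cause(True, False))  # Health Service problem on a reachable server
```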

Check if Health Service is Running

This diagnostic is defined in the Operations Manager 2007 Agent Management Library and is enabled by default. This workflow uses the Automatic Agent Management Account, which will run under the context of the Management Server Action Account by default, to initiate a probe action which is defined in the Operations Manager 2007 Agent Management Library named QueryRemoteHS.

Specifically, this probe is initiated on the Health Service Watcher and queries Health Service state and configuration on the Agent, when the Health Service Heartbeat Failure monitor changes to a critical state. This probe module type is further defined in the Windows Core Library. It takes computer name and service name as configuration, and passes the query results through an expression filter and returns the startup type and current state of the Health Service.

If the service doesn’t exist or the computer cannot be contacted, state will reflect this. Depending on output of the diagnostic task, optional recovery workflows may be initialized (i.e., reinstall agent, enable and start Health Service, and continue Health Service if paused), but these recoveries are not enabled by default.

SCOM SPNs

There’s a lot of confusion about SPNs (service principal names) when it comes to OpsMgr. How are SPNs registered? When are SPNs registered? Why aren’t SPNs registering?

The purpose of this post is to give a snapshot of all the SPNs that should be in your environment, so you know you’ve got them all right. Here’s a bird’s-eye view.

Attention: Please read the last couple paragraphs in this post and my other post about SDK SPN Not Registered to gain a full understanding of the SDK SPN.

Just to clarify the list of SPNs below:

* The SDK SPN is registered on the SDK service account in Active Directory. It references the RMS.
* The Health Service SPN is registered on the management server computer objects in Active Directory. It references its own computer object.

Note: SDK SPNs do not have any operational impact on SCOM. The only SPN that operationally affects SCOM is MSOMHSvc. In a clustered RMS configuration, the MSOMHSvc SPN needs to be set on the RMS Virtual Cluster object in Active Directory ONLY (and the MS computer objects, of course).

Root Management Server (non-clustered)

servicePrincipalName: MSOMSdkSvc/<RMS fqdn>
servicePrincipalName: MSOMSdkSvc/<RMS netbios name>
servicePrincipalName: MSOMHSvc/<RMS fqdn>
servicePrincipalName: MSOMHSvc/<RMS netbios name>

Root Management Server (clustered)

servicePrincipalName: MSOMSdkSvc/<RMS virtual fqdn>
servicePrincipalName: MSOMSdkSvc/<RMS virtual netbios name>
servicePrincipalName: MSOMHSvc/<RMS virtual fqdn>
servicePrincipalName: MSOMHSvc/<RMS virtual netbios name>

*Read the additional information at the bottom of this article about the clustered RMS SDK SPN

Management Server(s)

servicePrincipalName: MSOMHSvc/<MS fqdn>
servicePrincipalName: MSOMHSvc/<MS netbios name>

Management Server with ACS

servicePrincipalName: AdtServer/<MS fqdn>
servicePrincipalName: AdtServer/<MS netbios name>
servicePrincipalName: MSOMHSvc/<MS fqdn>
servicePrincipalName: MSOMHSvc/<MS netbios name>

Database Servers (including ACS DB)

servicePrincipalName: MSSQLSvc/<database netbios name>:1433
servicePrincipalName: MSSQLSvc/<database fqdn>:1433
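The per-role lists above are mechanical enough to generate. Here is a hedged sketch (the helper and role names are mine, not part of any Microsoft tooling) that builds the expected SPN set for a given role:

```python
# Build the expected SPN list per the tables above. The role names and
# the helper itself are illustrative, not from any Microsoft tool.
def expected_spns(role, fqdn):
    netbios = fqdn.split(".")[0]
    both = lambda svc: [f"{svc}/{fqdn}", f"{svc}/{netbios}"]
    if role == "rms":      # non-clustered RMS: SDK + Health Service
        return both("MSOMSdkSvc") + both("MSOMHSvc")
    if role == "ms":       # plain management server
        return both("MSOMHSvc")
    if role == "ms-acs":   # management server with ACS collector
        return both("AdtServer") + both("MSOMHSvc")
    raise ValueError(role)

print(expected_spns("ms", "ms01.contoso.com"))
# ['MSOMHSvc/ms01.contoso.com', 'MSOMHSvc/ms01']
```

For a clustered RMS, substitute the virtual cluster name for the FQDN, per the clustered list above.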

Registering SPNs with SETSPN

Non-Clustered RMS (SDK only)

SETSPN -A MSOMSdkSvc/<RMS netbios name> <your domain>\<sdk domain account>
SETSPN -A MSOMSdkSvc/<RMS fqdn> <your domain>\<sdk domain account>

Clustered RMS (SDK and Health Service)

SETSPN -A MSOMSdkSvc/<RMS virtual netbios name> <your domain>\<sdk domain account>
SETSPN -A MSOMSdkSvc/<RMS virtual fqdn> <your domain>\<sdk domain account>

SETSPN -A MSOMHSvc/<RMS virtual netbios name> <RMS virtual netbios name>
SETSPN -A MSOMHSvc/<RMS virtual fqdn> <RMS virtual netbios name>

Verifying SPNs with SETSPN

SDK: SETSPN -L <your domain>\<sdk domain account>

HealthService: SETSPN -L <servername> (run this for each MS)

SQL Service: SETSPN -L <your domain>\<sql service account>

Verifying SPNs with LDIFDE

SDK and HealthServices: Ldifde -f c:\ldifde.txt -t 3268 -d DC=domain,DC=COM -r "(serviceprincipalname=MSOM*)" -l serviceprincipalname -p subtree

SQL Service: Ldifde -f c:\ldifde.txt -t 3268 -d DC=domain,DC=COM -r "(serviceprincipalname=MSSQLSvc*)" -l serviceprincipalname -p subtree

Note: You’ll most likely find multiple SPNs for the SQL service. Just be sure there’s one for each of your OpsMgr DB role servers. If SQL runs under Local System, it will automatically register its SPNs each time the service starts.
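If you script the verification, a quick parse of setspn -L output works. The output format assumed here (a header line followed by one indented SPN per line) matches typical setspn behavior, but check it against your own environment:

```python
# Compare setspn -L output against a required SPN set. The sample
# output format below is assumed/typical; verify against your setspn.
def missing_spns(setspn_output, required):
    registered = {ln.strip() for ln in setspn_output.splitlines()[1:] if ln.strip()}
    return set(required) - registered

sample = """Registered ServicePrincipalNames for CN=MS01,OU=Servers,DC=contoso,DC=com:
    MSOMHSvc/ms01.contoso.com
    MSOMHSvc/MS01
"""
print(missing_spns(sample, ["MSOMHSvc/ms01.contoso.com", "MSOMHSvc/MS01"]))
# set() -> nothing missing
```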

A little more interesting information about clustered RMS SDK SPN

The SDK SPN is relative to the active node. Because of this, it is best to register the SDK SPN to the cluster network name, since that name is always associated with the active node. If we register only the physical node SPNs and not the cluster network name, we would need to connect to the active physical node in the console. Obviously, this is not very convenient and can involve guesswork.

With that said, we don’t really even need both the NetBIOS and FQDN SPNs registered; we could live with one or the other. When we launch the console and enter the RMS name, that is the name used to establish the session with the SDK service. If we always supply only a NetBIOS name, we don’t even need the FQDN SPN registered (and vice versa). But, again, it makes sense to create both NetBIOS and FQDN SPNs, especially in multi-domain environments.

Here’s another neat fact that you may not be aware of. We could create an alias in DNS, pointing it to the RMS clustered network name and both physical nodes. We can then register that alias as the SDK SPN and use that in our console connection settings. Now it doesn’t matter if we use the cluster network name, either of the physical nodes, or either netbios or FQDN.